mr : a tiny relational database for minneron

1. overview
- 1.1. GOAL a simple graph database with a strong relational bias.
- 1.2. project breakdown
2. implementation
3. Extensions
- 3.1. Version Control

1 overview

1.1 GOAL a simple graph database with a strong relational bias.

We want to store data and documents in one place.
We want to use the relational model when possible.
- It's always possible.
  - Some things, like source code or HTML, might even benefit.
  - Imagine querying for all the level 2 headlines in posts by a certain author between some dates.
- But sometimes we don't actually need the relational model and it can be more efficient to just use a raw graph.
  - Example: storing the NFA/DFA for a grammar.

1.2 project breakdown

1.2.1 storing nodes, relations, arbitrary graphs

variable-length content gets stored as nodes:
- strings / text
- images
- audio
- video
- compiled code
- raw binary data
fixed width tables are stored as relations
1. really just a special kind of blob that grows
2. each row is a fixed-size array of bytes
3. these can be converted to record structures when needed
4. booleans are stored as octets unless they can be bit packed.
default 3-column table (triplestore) for arbitrary graphs
semi-structured data can go either way
- html files
- source code
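
The fixed-width row idea above can be sketched as follows. This is only an illustration, not a settled format: the `Relation` class, the `ROW` layout (two 32-bit integers plus one boolean stored as an octet), and the column names are all assumptions made up for the example.

```python
import struct

# Hypothetical row layout: two little-endian 32-bit ints and one boolean octet.
ROW = struct.Struct("<iiB")

class Relation:
    """A blob that grows, where each row is a fixed-size array of bytes."""

    def __init__(self):
        self.blob = bytearray()

    def append(self, a, b, flag):
        """Pack one row onto the end of the blob and return its row index."""
        self.blob += ROW.pack(a, b, flag)
        return len(self) - 1

    def row(self, i):
        """Convert row i back into a record-like tuple."""
        a, b, flag = ROW.unpack_from(self.blob, i * ROW.size)
        return (a, b, bool(flag))

    def __len__(self):
        return len(self.blob) // ROW.size
```

Because every row has the same size, finding row `i` is a single multiplication, with no per-row descriptor needed.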

1.2.2 indexing the data

use a b+ tree to store locations of table/string descriptors
Individual columns can be indexed, just like strings.

1.2.3 paged model for database layout

linked list of pages
Links are stored in their own descriptor pages, separate from the pages.
Mostly this just seems cleaner to me, but also:
- Pages can use an entire power-of-2-sized block for data (hopefully making pointer arithmetic slightly more efficient).
- The linked list can be walked and examined without loading the data pages, so you can figure out which blocks contain a particular blob from the links alone. ¹
These index pages are just sequences of integers.

These would act as a sort of parallel array to the pages themselves (the pages are contained in a virtual "array", in that they're numbered sequentially within the database file). On the first index page, item[0] would contain the .next link for page[0], and so on.
The link index can grow like any other table, and thus contains itself.

Suppose the index grows large enough that it needs to be extended, but the next block is already in use.
In the linked list, 0 represents null.

The pages are arranged in linked lists, or trees, but never in a loop. Therefore, there would never be a back-link to page 0, and it's safe to use 0 to represent a link to null (and thus the end of a chain).

Zero is a nice number to compare to because many CPUs make it easy to branch when things are zero.

Negative values might represent unused pages, or pages in need of cleanup.
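
A minimal sketch of walking the link index, assuming it has been loaded into a plain list of integers. The names `chain` and `unused_pages` are hypothetical, and treating every negative link as "unused" is just one reading of the note above.

```python
def chain(next_link, start):
    """Return the page numbers in a chain, following links until 0 (null).

    Only the link index is consulted; no data page is loaded.
    """
    pages = [start]
    page = next_link[start]
    while page != 0:
        if page < 0:
            raise ValueError("chain runs into an unused page")
        pages.append(page)
        page = next_link[page]
    return pages

def unused_pages(next_link):
    """List pages whose link is negative (assumed here to mean unused)."""
    return [i for i, link in enumerate(next_link) if link < 0]
```

For example, with `next_link = [3, 0, -1, 5, -1, 0]`, the chain starting at page 0 covers pages 0, 3 and 5, and pages 2 and 4 are reported as unused.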
relations with fixed-width values always start on a page boundary

this just makes it slightly simpler to do arithmetic
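
To illustrate the arithmetic, here is a small sketch assuming 4096-byte pages (the page size is not fixed anywhere in these notes, so `PAGE_BITS` is a placeholder). Because the relation starts on a page boundary and the page size is a power of two, locating a row needs only a multiply, a shift, and a mask.

```python
PAGE_BITS = 12                # assumption: 4096-byte pages
PAGE_SIZE = 1 << PAGE_BITS
PAGE_MASK = PAGE_SIZE - 1

def locate(row_size, row_index):
    """Return (n, offset): the row starts `offset` bytes into the n-th page
    of the relation's chain, counting the relation's first page as 0."""
    byte_offset = row_index * row_size
    return byte_offset >> PAGE_BITS, byte_offset & PAGE_MASK
```

Note that a row only stays inside one page when `row_size` divides `PAGE_SIZE` evenly; otherwise some rows straddle a page boundary and need two page loads.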

1.2.4 correlating data in ram : the relational algebra

1.2.5 the ACID properties

For working in RAM, though, we may just want to append entries to a temporary table. This should help with ACID.

2 implementation

2.1 implementing nodes

2.1.1 Nodes are just blobs of text or binary data.

2.1.2 NIds are assigned sequentially, starting at 1

2.1.3 NId → Str

Find the string entry in the descriptor table. (Easy since they're in order.)
Use this to find the start page of the string.
Given length, it's easy to figure out which pages to load, and then load the whole string.
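
The lookup steps above might look like this. It is a sketch under several assumptions: descriptors are `(start_page, length)` pairs, pages are held in a list indexed by page number, the link index is a list of integers with 0 as null, and the page size is shrunk to 8 bytes so the example is easy to follow. `load_string` is a hypothetical name.

```python
PAGE_SIZE = 8  # tiny page size, for illustration only

def load_string(nid, descriptors, next_link, pages):
    """Resolve an NId to its string."""
    # NIds start at 1 and descriptors are stored in order, so this is an index.
    start_page, length = descriptors[nid - 1]
    # The length tells us how many pages to load (ceiling division).
    n_pages = -(-length // PAGE_SIZE)
    data = bytearray()
    page = start_page
    for _ in range(n_pages):
        data += pages[page]
        page = next_link[page]
    # Drop the padding in the last page before decoding.
    return bytes(data[:length]).decode("utf-8")
```

With pages `[b"\0" * 8, b"hello, w", b"tiny\0\0\0\0", b"orld\0\0\0\0"]`, links `[0, 3, 0, 0]`, and descriptors `[(1, 12), (2, 4)]`, NId 1 resolves to `"hello, world"` across pages 1 and 3, and NId 2 resolves to `"tiny"`.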

2.1.4 Str → NId

This uses the string descriptor table in conjunction with a b+ tree as an index. The string is used as the comparison key for the lookup.
Since we want the string index to be ordered for sorting, the first 4-8 characters should probably be cast as an int, and if that doesn't bring us to a leaf node, use the next set of characters.
The value stored in the b+ tree is just a pointer to the string table.
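
One way to cast a run of characters to an int while keeping sort order is to read the bytes big-endian, so that integer comparison agrees with byte-wise string comparison. The sketch below assumes UTF-8 text, an 8-byte chunk width, and zero padding for short chunks; `prefix_key` is a hypothetical helper, not part of any existing code.

```python
def prefix_key(s, depth=0, width=8):
    """Return the depth-th `width`-byte chunk of s as an unsigned integer.

    depth=0 is the first chunk; if two strings tie there, compare depth=1,
    and so on. Big-endian order keeps the integers sorted like the bytes.
    """
    chunk = s.encode("utf-8")[depth * width:(depth + 1) * width]
    return int.from_bytes(chunk.ljust(width, b"\0"), "big")
```

For instance, `"database"` and `"databases"` produce the same key at depth 0, and the tie is broken at depth 1, where the shorter string yields 0. One caveat of zero padding is that a string ending in NUL bytes is indistinguishable from its unpadded prefix, so a real index would also need to compare lengths.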

2.2 Indexing : B+ Trees

2.3 Storing Tuples

2.3.1 Records should (could?) be stored in sequence, and updated in-place.

3 Extensions

3.1 Version Control

3.1.1 For version control, it probably makes sense to keep a running log of transactions as triples.

3.1.2 For branching, we would annotate each triple with a `.prev` pointer, to create a linked list.

3.1.3 As an optimization, the current state of the working copy would be cached as a more traditional database.

3.1.4 The old value for each change could also be stored.
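
A rough sketch of how items 3.1.1 through 3.1.4 could fit together, with every name (`Entry`, `Log`, `commit`, `checkout`) invented for illustration. Each log entry is a triple plus a `prev` link and the old value; entry 0 is reserved so that `prev == 0` can mean "no parent", matching the use of 0 as null elsewhere in these notes.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "prev subject relation value old")

class Log:
    def __init__(self):
        self.entries = [None]  # slot 0 reserved: prev == 0 means "no parent"

    def checkout(self, head):
        """Rebuild the working copy at `head` by replaying its chain."""
        chain = []
        while head != 0:
            chain.append(self.entries[head])
            head = self.entries[head].prev
        state = {}
        for e in reversed(chain):  # oldest first, so later writes win
            state[(e.subject, e.relation)] = e.value
        return state

    def commit(self, prev, subject, relation, value):
        """Append a change on top of `prev`, recording the old value."""
        old = self.checkout(prev).get((subject, relation))
        self.entries.append(Entry(prev, subject, relation, value, old))
        return len(self.entries) - 1
```

Two commits that share the same `prev` form a branch. Replaying the whole chain on every `checkout` is deliberately naive; caching the current working copy, as suggested in 3.1.3, would avoid that cost.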

Footnotes:

¹ I don't know if this is useful or not yet, but it seems like it might come in handy for garbage collection/defragmentation.
