The Problem with B+Trees
Practically every transactional database allows users to index values using B+trees or their moral equivalent. B+trees make excellent use of both disk caches and CPU caches and are one of the best known ways to access data by keys when the key values are unpredictable. However, updating B+trees is a problem.
Consider what happens when you insert, say, a thousand rows into a table with an index. Laying down the data is fairly quick: you fill disk pages with the on-disk format of the data and then write the pages to disk. The number of disk pages that you have to access is typically little more than the number of pages actually required to hold the data. If the rows average 100 bytes and you are using typical 4KB pages, then you have to access on the order of 20 pages.
But let's suppose that there is a single B+tree index on this data, that the index uses 4KB pages, and that each key takes about 16 bytes, including overhead and reservation space. A single page then holds 256 keys. If there are 1 million distinct keys, that means we have about 4,000 leaf pages. Now, some of those 1,000 new values are going to have the same key, and some of the ones with different keys will land in the same page, but many won't. With a thousand inserts you can expect to access hundreds of disk pages to do the same database operation that took only around 20 pages without the index.
This is important because accessing disk is orders of magnitude slower than anything happening in memory and CPU. And it's even worse than that: in most systems, those 20 pages of data can be written mostly with a streaming write, while accessing the index pages will often require random-access reads and writes. Random-access disk access is a couple of orders of magnitude slower than streaming access, so keeping indexes updated can be hugely expensive.
This cost is mitigated in many systems by caching disk pages in memory. Many of those hundreds of index pages will already be in the buffer cache, so they do not have to be read. But in some systems they still have to be written on transaction commit, so you are only saving about half of a cost that is orders of magnitude too large. Saving half of tens of thousands of times too much still leaves you with tens of thousands of times too much.
One way that ADS tried to avoid this cost was by using large disk pages. With 64KB pages instead of 4KB pages, there would be only about 250 leaf pages, and so one sixteenth as many random disk accesses. Of course ADS paid the corresponding costs of larger pages.
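The leaf-page counts above can be checked with a quick sketch (the figures, 1 million distinct keys and 16 bytes per key, are the ones assumed in the text):

```python
# Rough arithmetic from the text: how many leaf pages does the index need?
KEYS = 1_000_000      # distinct keys, per the example above
KEY_BYTES = 16        # bytes per key, including overhead and reservation space

def leaf_pages(page_bytes):
    keys_per_page = page_bytes // KEY_BYTES   # 4KB -> 256 keys, 64KB -> 4096 keys
    return -(-KEYS // keys_per_page)          # ceiling division

print(leaf_pages(4 * 1024))    # 3907, i.e. about 4,000 leaf pages
print(leaf_pages(64 * 1024))   # 245, i.e. about 250 leaf pages
```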
Another way that ADS reduced the costs of maintaining indexes was by keeping new index entries in memory. New index items were recorded in a special memory-based index structure based on a hybrid of B+trees and binary trees. The root was basically a B+tree root, but the entries of the B+tree were binary trees. I don't recall the details of how these structures were maintained.
Inserting a new index entry did not always require access to a disk page. Instead, the value would simply be added to the memory-based index. If the value was later overwritten, it might never be written to disk, which means that the page in the correct range to hold that value might never be read from disk at all. The update was still logged in the transaction log, though, so it could be recovered in case of a crash.
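A minimal sketch of that insert path, with all names invented for illustration (the text describes a B+tree/binary-tree hybrid; a plain sorted list stands in for it here):

```python
import bisect

class MemoryIndex:
    """Toy stand-in for the in-memory index: new entries go into an
    ordered in-memory structure and the transaction log, and no on-disk
    index page is read or written on the insert path."""
    def __init__(self):
        self.entries = []   # sorted (key, row_id) pairs held in memory
        self.log = []       # stand-in for the transaction log

    def insert(self, key, row_id):
        self.log.append(("insert", key, row_id))    # logged for crash recovery
        bisect.insort(self.entries, (key, row_id))  # no disk page touched

idx = MemoryIndex()
idx.insert(42, "row-1")
idx.insert(7, "row-2")
# The on-disk pages covering keys 7 and 42 were never loaded.
```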
For a delete, ADS didn't do so well. It would always search for and load the index page containing the key to be deleted and mark the entry as deleted, which was done by moving the value to the memory-based index. Even if that key were added back before the index page had to be written to disk, the page would still have to be written, because in general the mapping would have changed: the key used to point to one row, and now it points to a different one.
An update was just treated as a delete of the old key and an insert of the new one.
When an updated index page was written back to disk, ADS would first go through the associated memory-based index and fold in any values in the page's range. Index pages were never overwritten in place; instead, new pages were written, much as in an append-only B+tree. This meant that ADS also had to write the interior-node pages when it might not otherwise have had to (when writing a page back to the same place, there is no need to update the pointers to it). In practice, though, ADS kept most of its interior nodes in memory most of the time, so those pages typically only had to be written out during checkpoints.
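As I understand the description, the write-back step might be sketched like this; the structures and names are my invention for illustration, not ADS's actual code:

```python
# Hedged sketch of write-back: before an index page goes to disk, entries
# from the memory-based index that fall in the page's key range are folded
# in, and the result goes to a fresh page (never overwritten in place), so
# the caller must also update the interior node to point at the new page.

def write_back(page_entries, page_range, mem_index, allocate_page):
    lo, hi = page_range
    in_range = [(k, v) for (k, v) in mem_index if lo <= k < hi]
    folded = sorted(page_entries + in_range)     # fold memory entries into the page
    new_page_no = allocate_page()                # e.g. taken from a free list
    return new_page_no, folded                   # caller repoints the parent node

free_list = iter([17, 18, 19])                   # invented free-list contents
page_no, entries = write_back([(5, "a"), (9, "b")], (0, 10),
                              [(7, "m")], lambda: next(free_list))
```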
Append-only B+trees tend to use a lot of disk space. ADS got some of the advantages of the approach without the ever-expanding space by using a free list of pages that were no longer needed. This is similar to the strategy used by Lightning DB (LMDB), which currently seems to be the fastest Berkeley-DB-like database.
ADS could have tried to arrange things so that the new index pages were all written to adjacent disk pages, which would allow streaming writes, but that was not done for regular inserts. Instead, ADS just used an unordered list of free pages. I think part of the reason for this was the belief that, in the workloads we were optimizing for, there would not be many cases where a large number of index pages had to be written out at once. ADS did have a bulk-loading mode that would create indexes using streaming writes.
Thoughts on ADS Indexes
I don’t believe that the modified B-trees that ADS used to cache index updates are the best data structure for concurrent access to an ordered list. I’m not familiar with current research on this topic, but I’ll bet skip lists would work well in this context.
Skip lists have average O(log n) performance like the various binary-tree-based data structures, though usually with slightly worse constant factors. But skip lists can be made thread-safe very efficiently, and I think it's likely that thread-safe skip lists would be faster than any thread-safe tree data structure that could be used to represent memory-based indexes.
A skip list is just a stack of linked lists, and inserting a value into a skip list is just a matter of inserting into each of those lists. Inserting into a linked list can be done using common atomic machine instructions (compare-and-swap) with very little overhead. Also, as long as you do the inserts from the lowest level to the highest, the skip list will always be correct, so there is no need to lock anything while doing an insert: just atomically insert the element at each level, from lowest to highest.
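A minimal single-threaded sketch of that bottom-up insert; in a concurrent version, each link installed at the end of `insert` would be done with a compare-and-swap loop rather than a plain store, and the bottom-up order is what keeps the list searchable after every intermediate step:

```python
import random

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height   # next[i] is the level-i successor

class SkipList:
    """Minimal skip list: a stack of linked lists over the same keys."""
    MAX_HEIGHT = 16

    def __init__(self):
        self.head = Node(None, self.MAX_HEIGHT)   # sentinel head node

    def _random_height(self):
        h = 1
        while h < self.MAX_HEIGHT and random.random() < 0.5:
            h += 1
        return h

    def insert(self, key):
        # Find the predecessor of the new node at every level.
        preds = [None] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            preds[level] = node
        new = Node(key, self._random_height())
        # Link from the lowest level up; the list is consistent after each
        # step, so a concurrent version could CAS each link without a lock.
        for level in range(len(new.next)):
            new.next[level] = preds[level].next[level]
            preds[level].next[level] = new

    def items(self):
        node, out = self.head.next[0], []
        while node is not None:       # level 0 holds every key in order
            out.append(node.key)
            node = node.next[0]
        return out
```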
Another possible improvement to the ADS indexing scheme is to cache deletes just like inserts, so that disk pages do not have to be accessed when doing deletes. This would be fairly easy to do, and it would work like this: when an update or delete requires deleting an index entry, if the index page is already in memory, do the delete as usual. If the index page is not in memory, just add a delete node to the in-memory portion of the index without loading the page.
When an index page gets loaded, you have to reconcile it with the delete nodes that already exist for that range and make sure readers never see an index entry from the on-disk page when there is a corresponding delete node in the in-memory index. There are various ways to do this, depending on how the disk-based and memory-based indexes are related to each other.
Eventually, you have to load the index pages into memory to get the cached deletes onto disk, but by caching the deletes in memory, you may not need to keep the disk page in memory for as long, taking up less space in the page cache.
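A toy sketch of this delete-caching scheme, with invented structures: deletes for pages still on disk become delete nodes (tombstones), and loading a page consumes any matching tombstones so readers never see the deleted entries:

```python
class CachedIndex:
    """Illustrative only: separates entries from loaded pages, in-memory
    inserts, and cached deletes for pages that have not been loaded."""
    def __init__(self):
        self.mem = {}            # key -> row_id, in-memory inserts
        self.tombstones = set()  # keys deleted while their page was on disk
        self.loaded = {}         # key -> row_id, entries from loaded pages

    def delete(self, key):
        if key in self.loaded:
            del self.loaded[key]      # page is in memory: delete as usual
        else:
            self.tombstones.add(key)  # page on disk: just record a delete node

    def load_page(self, page_entries):
        for key, row_id in page_entries:
            if key in self.tombstones:
                self.tombstones.discard(key)  # cached delete consumes the entry
            else:
                self.loaded[key] = row_id

    def lookup(self, key):
        if key in self.tombstones:    # deleted, even though still on disk
            return None
        if key in self.mem:
            return self.mem[key]
        return self.loaded.get(key)
```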