Sunday, November 24, 2013

Secrets of the ANTs Data Server 04 --Indexes

The Problem with B+Trees

Practically every transactional database allows users to index values using B+trees or their moral equivalent. B+trees make excellent use of both disk caches and CPU caches and are one of the best known ways to access data by keys when the key values are unpredictable. However, updating B+trees is a problem.

Consider what happens when you insert, say, a thousand rows into a table with an index. Laying down the data is fairly quick. You fill up disk pages with the on-disk format of the data and then write the pages to disk. The number of disk pages that you have to access is typically little more than the space actually required to hold the data. If the rows average 100 bytes and you are using the typical 4KB pages, then you have to access on the order of 20 pages.

But let's suppose that there is a single B+tree index on this data, that the index uses 4KB pages, and that it uses about 16 bytes per key, including overhead and reservation space. A single page holds 256 keys. If there are 1 million distinct keys, this means that we have about 4,000 leaf pages. Now, some of those 1,000 new values are going to have the same key, and some of the ones with different keys will go in the same page, but many won't. With a thousand inserts you can expect to have to access hundreds of disk pages to do the same database operation that took only around 20 pages without the index.

This is important because accessing disk is orders of magnitude slower than anything happening in memory and CPU. And it's even worse than that: in most systems, the 20 pages needed to lay out the data can be written mostly as a streaming write, while accessing the index pages will often require random-access reads and writes, and random disk access is a couple of orders of magnitude slower than streaming access. Keeping indexes updated can therefore be hugely expensive.

This cost is mitigated in many systems by caching disk pages in memory. Many of those hundreds of index pages will already be in the buffer cache, so they do not have to be read. But in some systems they do have to be written on transaction commit, so you are only saving about half of a cost that is orders of magnitude too large. Saving half of tens of thousands of times too much still leaves you with tens of thousands of times too much.

One way that ADS tried to avoid this cost was by using large disk pages. With 64KB pages instead of 4KB pages, there would be only about 250 leaf pages, so one sixteenth as many random disk accesses. Of course, ADS paid the corresponding costs of larger pages.
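The arithmetic above is easy to check with a few lines of Python. The inputs (100-byte rows, 16 bytes per index entry, 1 million distinct keys, 4KB and 64KB pages) come from the text; the formulas are just the obvious back-of-the-envelope divisions, ignoring interior nodes and fill factor.

```python
# Back-of-the-envelope page counts from the text: 1,000 inserted rows of
# ~100 bytes each, one index with ~16 bytes per key, 1M distinct keys.

def data_pages(rows, row_bytes, page_bytes):
    """Pages touched when laying down raw row data (streaming write)."""
    return (rows * row_bytes + page_bytes - 1) // page_bytes

def leaf_pages(distinct_keys, key_bytes, page_bytes):
    """Leaf pages in a B+tree index, ignoring interior nodes and fill factor."""
    keys_per_page = page_bytes // key_bytes
    return (distinct_keys + keys_per_page - 1) // keys_per_page

print(data_pages(1_000, 100, 4 * 1024))       # 25 pages of row data
print(leaf_pages(1_000_000, 16, 4 * 1024))    # 3,907 leaf pages at 4KB
print(leaf_pages(1_000_000, 16, 64 * 1024))   # 245 leaf pages at 64KB
```

With a thousand scattered inserts against ~4,000 leaf pages, most inserts land on distinct pages, which is where the "hundreds of page accesses" figure comes from.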

ADS Indexing

Another way that ADS reduced the costs of maintaining indexes was by keeping the index in memory. New index items were recorded in a special memory-based index structure based on a hybrid of B+trees and binary trees. The root was basically a B+tree root, but the entries of the B+tree were binary trees. I don't recall the details of how these structures were maintained.

Inserting a new index entry did not always require access to a disk page. Instead, the value would simply be added to the memory-based index. If the value were to be overwritten again, it might never be written to disk, which means that the page in the correct range to hold that value might never be read from disk. However, the update was logged in the transaction log, so it could be recovered in case of a crash.

For a delete, ADS didn't do so well. It would always search for and load the index page containing the key to be deleted and mark the entry as deleted. This was done by moving the value to the memory-based index. Even if that key were added back before the index page had to be written to disk, the page would still have to be written because, in general, the mapping would have changed. That is, the key used to point to one row; now it points to a different row.

An update was just treated as a delete of the old key and an insert of the new one.

When an updated index page was written back to disk, ADS would first go through the associated memory-based index and fold in any values in the page's range. Index pages were never overwritten; instead, new pages would be written, similarly to an append-only B+tree. This meant that ADS also had to write the interior-node pages when it might not have had to otherwise (if a page is just written back to the same place, there is no need to update the pointers to it). However, in practice, ADS kept most of its interior nodes in memory most of the time, so those pages typically only had to be written out during checkpoints.
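The fold-in step can be sketched in a few lines. This is a hypothetical illustration, not ADS's actual code: the real in-memory structure was the B+tree/binary-tree hybrid described above, not a sorted Python list, and the class and method names here are invented. Pending inserts accumulate in memory; when a leaf page is written back, any pending entries in its key range are merged into the page image and dropped from the cache.

```python
import bisect

# Hypothetical sketch of write-back with fold-in. Pending index inserts
# are cached in a sorted in-memory list until some page in their range
# has to be written out anyway.

class CachedIndex:
    def __init__(self):
        self.pending = []   # sorted list of (key, row_id) not yet on disk

    def insert(self, key, row_id):
        # No disk page is touched here; the change is assumed to be
        # recorded in the transaction log for crash recovery.
        bisect.insort(self.pending, (key, row_id))

    def fold_into_page(self, page_entries, lo, hi):
        """Merge pending entries with keys in [lo, hi) into a leaf page's
        entries, returning the new page image and dropping them from memory."""
        start = bisect.bisect_left(self.pending, (lo,))
        end = bisect.bisect_left(self.pending, (hi,))
        folded = self.pending[start:end]
        del self.pending[start:end]
        return sorted(page_entries + folded)

idx = CachedIndex()
idx.insert(5, "row-a")
idx.insert(42, "row-b")
idx.insert(200, "row-c")
# Writing back the leaf page covering keys [0, 100) folds in two entries:
page = idx.fold_into_page([(10, "row-x")], 0, 100)
print(page)          # [(5, 'row-a'), (10, 'row-x'), (42, 'row-b')]
print(idx.pending)   # [(200, 'row-c')] stays cached in memory
```

The point of the structure is visible in the last line: the entry for key 200 never forces a disk access until its own page has to be written for some other reason.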

Append-only B+trees tend to use a lot of disk space. ADS got some of the advantages of this approach without the ever-expanding space by using a free list of pages that were no longer needed. This is similar to the strategy used by Lightning DB (LMDB), which currently seems to be the fastest Berkeley-DB-like database.

ADS could have tried to arrange things so that the new index pages were all written to adjacent disk pages which would allow streaming writes, but that was not done for regular inserts. Instead, ADS just used an unordered list of free pages. I think part of the reason for this was the belief that in the workloads that we were optimizing for, there would not be a lot of cases where a lot of index pages would have to be written out at once. ADS did have a bulk loading mode that would create indexes using streaming writes.

Thoughts on ADS Indexes

I don’t believe that the modified B-trees that ADS used to cache index updates are the best data structure for concurrent access to an ordered list. I’m not familiar with current research on this topic, but I’ll bet skip lists would work well in this context.

Skip Lists

Skip lists have average O(log n) performance like the various binary tree-based data structures, but their overall performance is usually slightly worse. However, skip lists can be made thread-safe very efficiently, and I think it's likely that thread-safe skip lists would be faster than any thread-safe tree data structure that could be used to represent memory-based indexes.

A skip list is just a stack of linked lists, and inserting a value into a skip list is just a matter of inserting into a stack of linked lists. Inserting into a linked list can be done using common machine atomic instructions with very little overhead. Also, as long as you do the inserts from the lowest level to the highest, the skip list will always be correct, so there is no need to lock anything while doing an insert. Just atomically insert the elements of the stack from lowest to highest.
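A minimal sketch of that insertion order, with one caveat: Python has no compare-and-swap, so the pointer swings below are plain assignments standing in for the atomic CAS (with retry on failure) that a real lock-free implementation would use. The structural point survives the simplification: after each level is linked, bottom level first, the list is already a valid skip list for any concurrent reader.

```python
import random

MAX_LEVEL = 4

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level   # next[i] is the successor at level i

class SkipList:
    def __init__(self):
        self.head = Node(float("-inf"), MAX_LEVEL)

    def _random_level(self):
        # Classic geometric level distribution (p = 1/2).
        lvl = 1
        while lvl < MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        return lvl

    def insert(self, key):
        # Find the predecessor at every level, searching top-down.
        preds = [None] * MAX_LEVEL
        node = self.head
        for i in reversed(range(MAX_LEVEL)):
            while node.next[i] is not None and node.next[i].key < key:
                node = node.next[i]
            preds[i] = node
        new = Node(key, self._random_level())
        # Link bottom-up: each assignment here would be a CAS in a
        # lock-free version, retried if the predecessor's pointer had
        # changed underneath us.
        for i in range(len(new.next)):
            new.next[i] = preds[i].next[i]
            preds[i].next[i] = new

    def items(self):
        node, out = self.head.next[0], []
        while node is not None:
            out.append(node.key)
            node = node.next[0]
        return out

sl = SkipList()
for k in [7, 3, 9, 1]:
    sl.insert(k)
print(sl.items())   # [1, 3, 7, 9]
```

Because the bottom level always contains every key, a reader that races with an insert sees either the old list or the new one at each level, never a broken structure.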

Cached Deletes

Another possible improvement in the ADS indexing scheme is to cache deletes just like inserts so that disk pages do not have to be accessed when doing deletes. This would be fairly easy to do and would work like this: when an update or delete causes a need to delete an index entry, if the index page is already in memory, then do the delete as usual. If the index page is not in memory, then just add a delete node to the in-memory portion of the index without loading the page.

When an index page gets loaded, you have to manage the delete nodes that already exist for the index and make sure readers do not see an index entry from the on-disk page when there is a corresponding delete node in the in-memory index. There are various ways to do this, depending on how the disk-based and memory-based indexes are related to each other.

Eventually, you have to load the index pages into memory to get the cached deletes onto disk, but by caching the deletes in memory, you may not need to have the disk page in memory for so long, taking up space in the page cache.
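One way to realize the scheme above is with tombstones. This is a hypothetical sketch with invented names (the text only says a "delete node" goes into the in-memory index when the target page is not resident); the dicts stand in for the real on-disk leaves and the hybrid in-memory structure. Lookups must consult the tombstones before trusting an entry from an on-disk page, and loading a page consumes any tombstones in its range.

```python
# Hypothetical sketch of cached deletes using tombstones.

TOMBSTONE = object()   # sentinel marking a cached, not-yet-applied delete

class DeleteCachingIndex:
    def __init__(self, disk_pages):
        self.disk = disk_pages   # key -> row_id, simulating on-disk leaves
        self.mem = {}            # key -> row_id, or TOMBSTONE

    def delete(self, key):
        # The page holding `key` is not loaded; just record a delete node.
        self.mem[key] = TOMBSTONE

    def lookup(self, key):
        # Readers must check the in-memory index first so a tombstone
        # hides the stale entry still sitting on disk.
        if key in self.mem:
            v = self.mem[key]
            return None if v is TOMBSTONE else v
        return self.disk.get(key)

    def load_page(self, keys):
        """When a page is loaded, apply and discard its cached tombstones;
        the entry is gone when the page is rewritten."""
        for k in keys:
            if self.mem.get(k) is TOMBSTONE:
                self.disk.pop(k, None)
                del self.mem[k]

idx = DeleteCachingIndex({"a": 1, "b": 2})
idx.delete("a")             # no disk page touched
print(idx.lookup("a"))      # None: tombstone hides the on-disk entry
print(idx.lookup("b"))      # 2
idx.load_page(["a", "b"])   # tombstones folded in when the page loads
print(idx.lookup("a"))      # still None, now because the entry is gone
```

The payoff matches the text: the delete itself costs no page access, and the page only has to occupy cache space during the eventual fold-in.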


Sunday, November 17, 2013

Secrets of the ANTs Data Server 03 -- Multi-Version Concurrency Control

This section contains drawings and equations that Blogger can't handle, so I'll just give a link to it:

Secrets of the ANTs Data Server 03 -- Multi-Version Concurrency Control

This is the first really fun section that talks about ADS optimization. I think my next posting will be my own take on how the MVCC system could have been improved.


Saturday, November 16, 2013

Secrets of the ANTs Data Server 02 -- Lock-Free


ADS was billed as lock-free, and there is an important sense in which it was lock-free. I’ll explain that in a minute, but first I want to explain how it was not lock-free. The word “lock” seems to mean something different to database people than it does to general programmers who work on parallel systems. To non-database programmers there is really no such thing as a truly lock-free program with parallel threads of execution (I’m using “thread” in a generic sense in this section). There are always going to be critical sections --places in the code where two threads might access the same data-- and you have to do something to lock out other threads when one thread is in the critical section.

ADS makes extensive use of lock-free data structures, but it still has several critical sections, and it uses spinlocks to guard them. Spinlocks can waste processing time if not used carefully and under some conditions can cause starvation --a situation where one thread is trying to get in, but other threads keep jumping in first. However, used judiciously, spinlocks can be very effective, and they were used properly in ADS.

There is another kind of lock that you can't avoid in a SQL engine like ADS that allows arbitrary user-supplied queries. In such systems, you need some sort of flag that tells you when one transaction has modified a particular value, so that if another transaction tries to modify the same value before the first transaction has committed, it is prevented from doing so.

However, what SQL database people typically call locks is more complicated than this. In many SQL databases there is a complex hierarchy of locks that are used during normal UPDATE, INSERT and DELETE statements to prevent inconsistent updating of data and to do so while also avoiding the two typical problems that locks cause: starvation and deadlock.

Starvation is what happens when multiple threads of execution are waiting on the same lock and one of the threads never manages to get the lock. Every time the thread that has the lock releases it, some other thread grabs it, and the poor starving thread --for some reason-- is never fast enough or never in the right position to get it. Traditional database locks prevent this from happening by using a first-in-first-out queue of threads that are waiting on one particular lock. Threads get in line and take the lock in the same order that they originally requested it.

Deadlock happens when there is a cycle of locks like this: thread A has lock 1 and is blocked on lock 0 so thread A can never make progress until lock 0 becomes available. Thread B has lock 2 and is blocked on lock 1. And thread C has lock 0 and is blocked on lock 2. Every one of the threads is blocked, waiting for another thread to release a lock, and every one of the threads is holding the lock that another thread needs to continue. No thread will ever make progress because they are all blocked on a lock that will never be released because the thread that holds it is also blocked.

Traditional database locks typically avoid deadlock by having a set of locks at different levels of exclusion such as “read lock”, “write lock”, “intended write lock”, etc. and a set of rules for taking locks and promoting locks arranged so that if all of the threads follow the rules, then deadlock is impossible.
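The simplest member of that family of rules is a global lock order: if every thread acquires its locks in the same agreed order, the waits-for cycle described above cannot form. This sketch is not ADS's actual rule set (which involved multiple lock levels and promotion); it's the standard illustration, with two threads that request the same pair of locks in opposite orders but sort them before acquiring.

```python
import threading

# Two locks with a fixed global order. Each worker wants both locks but
# names them in a different order; sorting by the global order before
# acquiring makes the deadlock cycle (A waits for B while B waits for A)
# impossible.

lock_a, lock_b = threading.Lock(), threading.Lock()
ORDER = {id(lock_a): 0, id(lock_b): 1}
done = []

def worker(first, second, name):
    lo, hi = sorted((first, second), key=lambda l: ORDER[id(l)])
    with lo:          # always lock_a first...
        with hi:      # ...then lock_b, in both threads
            done.append(name)

t1 = threading.Thread(target=worker, args=(lock_a, lock_b, "t1"))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a, "t2"))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(done))   # ['t1', 't2']: both threads complete, no deadlock
```

Without the `sorted` step, the same two threads could each grab their first lock and block forever on the other's.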

ADS does not have these sorts of traditional database locks for normal database operations. ADS does use a shared lock/exclusive lock strategy on various schema objects and internal data structures to keep them from being changed while someone is using them, but this isn’t normal database work. During normal database operations (UPDATE, INSERT, DELETE) ADS does block changes to data cells that have already been changed by another transaction (a cell is one column in one row) but it does not have multiple lock levels, does not have any kind of lock data structure (for database updates), and does not queue up transactions that are waiting on a cell. I’ll describe that in the section on MVCC.

ADS also does not use a traditional lock structure to guarantee exclusive access when updating internal data structures like indexes that need to be updated when a cell is updated, nor does it use a lock-free data structure for on-disk indexes. I’ll describe how it does that in the section on phased execution.


Saturday, November 9, 2013

Secrets of the ANTs Data Server 01 --Introduction

The ANTs Data Server was a SQL database that I worked on from 2001 to 2007. At the time it was arguably one of the most technologically advanced databases in the world. The brainchild of Cliff Hersch, ADS used lock-free multiversion concurrency control, an internal architecture based on cooperative multitasking, and a compiler that compiled user queries to machine code. It used a set of threads (typically one per core) that each concentrated on doing small tasks and only swapped tasks at points where there was very little context to record, so there was never a full-scale context switch, and the more parallel work you gave it, the better it performed. The storage management was designed so that most allocations and deallocations were just a few instructions, and each task thread had its own allocation regions to minimize the critical sections associated with memory management.

In future posts I'll get into details of how all of this worked. I'd like to thank Achim Bode (whose tenure at ANTs was pretty much coextensive with mine) for reviewing these documents and pointing out quite a few errors in my memory of the technology.

About the Server

We had benchmarks for ANTs Data Server that showed us running up to 80 times faster than other commercial databases of the day. Of course, those were for special workloads that ADS did especially well on, but for typical high-volume transactional workloads, we were typically 10 times faster.

Performance numbers like that may make you think “in-memory database”, but ADS was a full ACID database that stored all data on disk and had the usual persistence guarantees of disk-based databases (although for performance it did rely on having enough RAM to keep most of the working set in memory). ADS had read-committed and serializable transaction isolation levels. It had fully functional logging, checkpointing, on-line backup, and transactionally safe read-only hot standby servers with automatic failover from the clients.

ADS supported pretty much all of SQL 92 with additions from later SQL standards, as well as many compatibility features for Oracle, Microsoft, Sybase, and Informix SQL, and a stored-procedure language inspired by PL/SQL. It had B+tree indexes, a cost-based planner, and a facility to gather statistics for the planner. ADS used ODBC as its native driver and could cache query prepares on the client to offload work from the server.

Some History

Unfortunately, ANTs Data Server never sold well in the commercial space to the great disappointment of the team who created it --one of the most talented and personable groups of people I have ever worked with.

Some time in the year 2000, Eugene Malik created a project in Microsoft Visual Studio to begin the implementation of Cliff’s revolutionary idea for Asynchronous Non-blocking Threads. I wouldn’t join the team until about a year later. In those days, all of the cool kids wrote enterprise software on Linux or Unix, but the choice to develop in Windows was a critical factor in our rapid progress. There were no development platforms on Unix or Linux that could compare with Microsoft Visual Studio for convenience or ease of use.

In fact, even today, a decade later, I’ve never found a development platform on Linux that is as good as the Visual Studio of 2001, although Delphi was also a tremendous development environment in those days and I never got a chance to try Kylix, the version of Delphi that ran on Linux. I haven’t used Visual Studio since then, but given Microsoft’s propensity over the last decade to favor novelty over consistency and usability, I expect that modern iterations of Visual Studio are probably not as good.

Leading the team at that point were Cliff Hersch and Jeff Spirn. Jeff was the only one on the initial team with real experience in database internals. It was hard for a small company to hire experienced database engineers from real companies like Oracle, IBM and Microsoft, and Cliff and Jeff had the philosophy that experience was less important because they were using a novel implementation. Consequently, their hiring strategy was just to look for smart people that they thought they would enjoy working with. Jeff used to say that when interviewing candidates he tried to answer these three questions: “Can they do it? Will they do it? Can I stand them while they are doing it?” Frankly, that’s a risky policy, but it worked out very well at ANTs.

What neither Cliff nor Jeff had was experience in managing a commercial software product. For that the company hired Girish Mundata who came on board a few months after I did, sometime in 2002, I believe. Girish had enormous energy and enthusiasm and managed to single-handedly change the culture from that of a research project to that of a project to produce a commercial product. Personally, I preferred the research-project environment that Cliff and Jeff had fostered, but at some point you do have to put out a product if you want investors to keep giving you money. It’s a cruel and heartless world.

We went through the usual pains of a startup and spent a lot of time in the endless rounds of the sales-driven development cycle:

1. We tell the sales team that we have X.
2. The sales team goes out to sell X.
3. The sales team comes back and says that they can't sell X but could sell Y if we had it.
4. The engineering team creates Y.
5. Set X=Y and go to 1.

If you ever find yourself in that loop in a startup, then I advise you to run, not walk, to the nearest exit. That goes whether you are an employee or an investor.

I believe our first release was in 2003, and sales were practically non-existent. After five years of the sales-driven development cycle, our investors had had enough (they shouldn't have put up with it that long) and the CEO responded by getting out of the server business and selling off the intellectual property in ADS to two different companies. In 2008, ANTs sold a source code license to Sybase to use ADS in their database products, but inside sources tell me it was never used. Later that year, ANTs sold ADS to Four J's, a company that writes database applications based on Informix. Four J's sold a descendant of ADS as "Genero db" and a lot of the original team went to Four J's to continue working on the product.

The total revenue that ANTs saw from those sales was about 4 or 4.5 million --I don't recall exactly. I estimated at the time that we sold ADS for about 1/10 of the amount that we had invested in creating it. Our team, which was never very large to begin with, was split into three groups: some went to Sybase, some went to Four J's, and some stayed on at ANTs to work on another database-related product (several of the Sybase and remaining ANTs team eventually ended up at Four J's).

Those were some unpleasant times. Some of the team didn't want to leave, but ANTs was required by the terms of the sale to send to each of Sybase and Four J's a team with the talent and knowledge to independently continue work on ADS. I was a manager at the time and had to help divide up the teams, which was not fun since so many of the team were friends. And naturally there were hard feelings from some of those we chose to send off because they felt like they were being coerced. Upper management didn't handle it well, and Sybase, at least, made no effort to help people feel better about it. Sybase later sued ANTs because so many of the people we sent there quickly left. I have no idea what their legal argument was for holding ANTs responsible for Sybase's failure to motivate its own employees to stay.

I stayed on at ANTs and became the Director of Development but soon left because I really didn’t enjoy management, wasn’t excited about the new product, and had an opportunity to work elsewhere on a shared-nothing distributed big-data database. For political reasons I couldn’t tell anyone where I was going, so I unfortunately couldn’t explain to my friends why I didn’t want to join them at Four J’s when I left ANTs. I think I made up some lame excuse about wanting to take a long vacation before deciding.

Genero DB, the Four J’s rebranding of ADS, is still apparently available, although almost none of the original crew is still working there and my understanding is that they stopped all development work around 2010 or so. Most of the people who developed ADS are now at Oracle, Sybase/SAP, Google, or other big companies and about half are still working on databases.

In a surprise move (to me, anyway), ANTs has bought ADS back from Four J’s. I don’t know what their plans are with the product.

Revealing the Secrets

Now, for the first time ever (not counting lots of casual conversations over the years) I shall reveal the secrets of the ANTs Data Server: how the non-locking architecture worked, how the compiler worked, how storage management worked, etc. I’ll publish in a series of posts about specific features or components.