Sunday, January 26, 2014

A Compositional SELECT Statement for SQL

In a previous post I discussed the advantages of breaking queries up into smaller parts, as Pig Latin does, and suggested how one might do this in a more SQL-like language than Pig Latin. This post is a further development of that idea, but here I'll try to define a compositional language that still looks exactly like SQL when you put the parts together in the right way. SQL will be a subset of this language in the sense that any legal SQL query is a valid query in this language and does the same thing.

This document uses mathematics that Blogger can't handle, so please go here to see it.

Sunday, January 12, 2014

Thoughts on Improving ADS 04 --A Compound Proposal

Transaction Management and Concurrency Control in DDS

Here is a scheme that makes use of the three techniques I suggested in the previous sections. I’ll refer to this system as the DDS to distinguish it from ADS. DDS stands for Dave’s Data Server.

The history array

Each page in DDS has a single history pointer in the header. When the page is on disk the history pointer is NULL. When the page is in memory but there are no updates to the page, the history pointer is NULL. When a writer wants to modify any value in the page, it begins by setting up a history array and assigning the page’s history pointer to point to it.

The history array is an array of history slots, one history slot for each row in the page:

class RowHistoryBlock; // defined below

class HistorySlot {
 TransactionID tid;
 RowHistoryBlock* next; // history created by tid
};
typedef HistorySlot* HistoryArray;

The tid is the transaction ID of the transaction that put the current value in the row and created the first block in the history chain.

This may seem like a lot of overhead for what may be an update of a single cell in a single row, but the memory used is just 16 bytes per row. That is the same amount of memory that ADS uses in row headers, which DDS doesn’t need. In DDS, we only need to allocate the 16 bytes per row if there is an update in the page, whereas ADS essentially always has to allocate that space. And although we do need to zero out the array after allocation this is much less expensive than reading the same amount of memory from disk and then writing it back to disk which is what ADS did.
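To make the lazy-allocation point concrete, here is a minimal standalone sketch of how a writer might attach the history array on first update. The struct and function names are illustrative, not from any actual ADS/DDS source:

```cpp
#include <cstdlib>

struct Transaction;                // defined later in this post
typedef Transaction* TransactionID;
struct RowHistoryBlock;            // per-row history chain, defined below

struct HistorySlot {
    TransactionID tid;             // transaction that wrote the row's current value
    RowHistoryBlock* next;         // history created by tid
};

struct TablePage {
    unsigned rowCount;
    HistorySlot* history;          // NULL while the page has no in-memory updates
};

// Called by a writer before its first modification to any row in the page.
void ensureHistoryArray(TablePage* page) {
    if (page->history == NULL) {
        // calloc zeroes the slots, so every row starts out with no history;
        // this is the only per-row cost until a row is actually updated
        page->history = (HistorySlot*)calloc(page->rowCount, sizeof(HistorySlot));
    }
}
```

Pages read without ever being written never pay the 16 bytes per row at all, which is the advantage over ADS described above.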

Row history

The row history uses the strategy of a single chain of history for each row. A single history block looks like this:

class RowHistoryBlock : public HistorySlot {
 uint32_t* cols; // array mapping column number to byte offset in buf[]
 uint32_t buf[0]; // the previous values before this update
};

This class inherits tid and next from HistorySlot. Here the tid is the transaction ID of the transaction that created the next block in the history list. The tid is NULL if there is no next block, indicating that there is no history older than this block. Defining tid this way lets us delete history in chunks once the history is so old that no current transaction will ever access it.

cols is an array of unsigned integers of a type large enough to enumerate all of the columns (I assume 2^32-1 columns is enough). Columns are numbered from 1, and we use 0 as a reserved special value. Each member of cols gives the byte offset of the corresponding column's old value in buf. For example, if column n is updated and the column has type T, then the historic value of column n is

*(T*)((char*)buf + cols[n-1])

If cols[n-1] is 0 then the history block does not contain history for that column. In other words, the update that this history block records did not touch that column.

cols is shared by all of the related history blocks. If an UPDATE statement updates 1000 rows in a table, there will be 1000 different versions of buf but only one copy of cols, shared by all of the RowHistoryBlocks. So if multiple rows are accessed in the table, cols will usually be loaded into the CPU cache once at the beginning and will not cause further cache misses.
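Here is a small standalone sketch of that sharing, assuming 4-byte integer columns and a fixed-size buf for simplicity (in DDS, buf would be allocated per block). All names are illustrative:

```cpp
#include <cstdint>

// Two rows updated by one UPDATE share a single cols array.
struct RowHistoryBlock {
    uint32_t* cols;   // shared: column n -> byte offset of its old value (0 = none)
    uint32_t buf[4];  // previous values; byte offset 0 is never used (reserved)
};

// Read the historic value of column n (1-based) from a block; returns
// true and stores the value only if the block has history for that column.
bool historicValue(const RowHistoryBlock* b, unsigned n, int32_t* out) {
    uint32_t off = b->cols[n - 1];
    if (off == 0) return false;  // 0 is the reserved "no history" value
    *out = *(const int32_t*)((const char*)b->buf + off);
    return true;
}
```

Because real offsets start at a nonzero byte, the reserved 0 never collides with a valid offset.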

Transaction IDs

There are two global variables

// tid to assign to the next transaction
TransactionID nextTransaction;
// tid of the last transaction that committed
TransactionID lastCommittedTransaction;

Each time a transaction starts, it sets two transaction state variables. Ignoring critical-section issues:

TransactionID transactionID = nextTransaction++;
TransactionID historyID = lastCommittedTransaction;

When a transaction commits, lastCommittedTransaction is set to the ID of the committing transaction, so every transaction that starts before the next commit will use that transaction ID for its history. In effect, each transaction T1 sees the world frozen at the commit of the transaction T2 such that T1's historyID was T2's transactionID.

However, transactions do not necessarily commit in the order in which they started. We can't say that one transaction is older than another based on the order of the transaction IDs, because that order reflects when transactions start, while "older than" has to be based on the order in which transactions commit: a transactionID x is older than a transactionID y if y's transaction committed after x's.

To implement this, we define a transaction ID as a pointer to a Transaction struct:

typedef struct Transaction {
 Timestamp commit;
} *TransactionID;

bool older(TransactionID tid1, TransactionID tid2) {
 return tid1->commit < tid2->commit;
}

We only use the older() function to compare transactions that have committed (recall that for uncommitted transactions we compare historyID, not transactionID).
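Pulling the pieces above into one self-contained sketch of how begin and commit might work (ignoring critical-section issues, as above; the clock and function names are illustrative assumptions):

```cpp
#include <cstdint>
#include <cstdlib>

typedef uint64_t Timestamp;

struct Transaction {
    Timestamp commit;          // 0 until the transaction commits
    Transaction* historyID;    // snapshot: last committed txn when we started
};
typedef Transaction* TransactionID;

static Timestamp commitClock = 0;                    // stand-in for a real clock
static TransactionID lastCommittedTransaction = NULL;

// begin: freeze this transaction's view of the world
TransactionID beginTransaction() {
    TransactionID t = (TransactionID)calloc(1, sizeof(Transaction));
    t->historyID = lastCommittedTransaction;
    return t;
}

// commit: stamp the commit order and publish it to later transactions
void commitTransaction(TransactionID t) {
    t->commit = ++commitClock;
    lastCommittedTransaction = t;
}

bool older(TransactionID tid1, TransactionID tid2) {
    return tid1->commit < tid2->commit;
}
```

Note how a transaction that starts first but commits second comes out younger under older(), which is exactly the commit-order semantics described above.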


The procedure for reading a row is complex enough that giving code is probably the clearest way to describe it. This code has not been tested, but it gives an idea of how things might be implemented in C++ and how much work is involved in reading values from a row.

void readRow(
 TablePage* page, // the page containing the data
 unsigned rowNum, // the number of the row in the page
 uint32_t* columnList, // list of column numbers to read, 0-terminated
 uint32_t* bufOffsets, // offsets of the columns in colBuf
 char* colBuf) // a buffer for the column values
{
 Table* table = lookupTable(page);
 char* rowBuf = page->buf + rowNum * table->rowSize;

 // read the current values from the row
 for (unsigned i = 0; columnList[i]; i++) {
   unsigned colNum = columnList[i];
   char* bufPtr = colBuf + bufOffsets[i];
   char* rowPtr = rowBuf + table->offsets[colNum];
   size_t colSize = table->columnSizes[colNum];
   memcpy((void*)bufPtr, (void*)rowPtr, colSize);
 }

 if (page->history == 0) return; // no history for this page
 HistorySlot* slot = &page->history[rowNum];
 // go through the history list, looking for historic values.
 // If the transaction that created the next history block is
 // older than the age of this thread (historyID), then stop.
 for (;;) {
   TransactionID tid = slot->tid;
   RowHistoryBlock* b = slot->next;

   // this condition is true if there is no history for the row
   // or if we are at the end of the history list
   if (tid == 0 || older(tid, historyID)) return;

   // for each column that the reader needs, get any history in
   // this block
   for (unsigned i = 0; columnList[i]; i++) {
     unsigned colNum = columnList[i];
     unsigned histOffset = b->cols[colNum - 1];
     if (histOffset == 0) continue; // no history for this column
     // There is history for this column so get the value,
     // overwriting the value that was extracted from the row.
     // There may also be older history later in this list
     // which will overwrite this value.
     char* bufPtr = colBuf + bufOffsets[i];
     char* histPtr = (char*)b->buf + histOffset;
     memcpy((void*)bufPtr, (void*)histPtr, table->columnSizes[colNum]);
   }
   slot = (HistorySlot*)b;
 }
}


Since a writer updates the cell in the row before the transaction commits, this implies that on rollback, the transaction has to go through all of the updates it made and undo them. One of the big motivations for this entire section was to avoid traversing all of the updates, but it is a lot less of a problem to do this for rollbacks than for commits because in most properly functioning systems, commits are a lot more common than rollbacks. Because of this, just avoiding the traversal on commits still might give a big advantage over ADS.

But why do anything at all for rollbacks? Why not just mark the transaction as rolled back? Readers would treat the history of a rolled-back transaction just like that of an active uncommitted transaction, meaning that they would ignore the rolled-back values and take the previously committed value instead. Writers would be similar, except that some minor changes have to be made so that a writer can write a change over the history of a rolled-back transaction, which it could not do for a running uncommitted transaction.

The problem with this strategy is that it turns history blocks into records permanently holding current data. For example, suppose a particular cell contains the value 4 and transaction T1 updates the cell to 5. In DDS, 5 is immediately written into the cell and 4 is saved in the history block. If T1 now gets rolled back, then the value 5 in the cell is no longer the current data because that update is not valid. The current value, 4, is now in the history block. Until some process moves the 4 back into the cell, that history block contains live data.

This is a problem because history gets freed automatically by date. When could we safely free the block containing the 4? At some point, the oldest transaction in the system will be younger than the time of the rollback, but even at this point, every transaction needs to use the history block with a 4 because the value in the cell is still not the current value. The history block with the 4 never expires.

Still, there is a time when the 4 has to be written back into the cell and that is the time at which the page is written to disk. So DDS keeps a list of all pages updated by a given transaction. When the transaction is rolled back, DDS starts writing all of those pages to disk. Then DDS only has to hold onto the history of the transaction until all of the pages updated by the transaction have been written to disk.

The storage allocated for the history of rolled-back transactions cannot be recovered until the page with the update is written to disk. This is no problem for small updates, but if the transaction failed and had to roll back because, say, the transaction’s history used up so much memory that the system ran out, then this could be a problem. But that’s an exceptional case that will slowly correct itself as the pages are written to disk.

Note that to free the memory, the pages don’t actually have to be written to disk; they only have to be prepared for writing to disk. To prepare a page for writing, DDS has to go through the page history and copy any still-live rolled-back values back into their cells.

Writing Pages to Disk

  • Never overwrite existing pages; keep existing pages as the checkpoint.
  • Page buffering tries to avoid writing pages with active history.
  • History is written in serialized form in a separate page so it can be freed independently.

Thursday, January 9, 2014

Thoughts on an Expression-Based Query Language

If you have ever done serious work in SQL, you know how unwieldy and unreadable the queries can get with deeply nested subqueries and other complex structures. You can create temporary tables to make the query look cleaner, but then it might run slower. Pig Latin is a query language that addresses this problem by breaking out the individual query elements and letting you assign them to variables. For example, instead of the SQL query

SELECT DISTINCT x, y FROM <table1> WHERE x < y ORDER BY x, y LIMIT 10;

in Pig Latin you could write something like this

t1 = LOAD <table1>;
t2 = FILTER t1 BY x<y;
t3 = DISTINCT t2;
t4 = FOREACH t3 GENERATE x,y;
t5 = ORDER t4 BY x,y;
t6 = LIMIT t5 10;
DUMP t6;

The table variables t1, t2, etc. are not temporary tables. They are more like SQL views that get optimized together with the final DUMP query.

When queries are deeply nested and complex, this ability to break out the components can be very useful. But why restrict the language to such simple constructs? This reminds me of old FORTRAN, where you could not have expressions in an IF statement. An IF statement could only do a simple relational test (less than, equals, etc.) and could only test variables, not expressions. For example, you had to write

i = k+1
IF (i .EQ. j)

instead of

IF (i+1 .EQ. j)

One of the big advantages of other languages like Algol was the ability to put a full boolean expression in that IF statement:

IF i+1=j OR i+2=j THEN

This allowed for much cleaner, less wordy programming. Yet Pig Latin seems to be following the old FORTRAN mistake of limiting what the arguments to a construct can be.

Another thing about Pig Latin that I think could be improved is to design the operators so that they can be run together to make an expression that looks something like a SQL query. For non-nested queries, SQL gives a nice linear expression except for the SELECT clause. The SELECT is an odd duck that combines projection (which logically goes at the end, after ORDER BY) and aggregation (which logically goes with the GROUP BY clause). Its actual position doesn’t fit either function it serves. And with analytic functions, the SELECT is even more overloaded in what it does.

Here is an alternative way to design a query language that lets the query writer choose whether to break things out like Pig Latin or combine components to look like SQL.

<table> PROJECT <expression-list>
Projects <expression-list> from <table>. In other words, acts like the SQL query
SELECT <expression-list> FROM <table>
However, no aggregate functions are allowed.

<table> WHERE <expression>
Returns the rows from <table> for which <expression> is true. In other words, behaves like the SQL query
SELECT * FROM <table> WHERE <expression>.

<table> GROUP BY <aggregation-list>
Groups <table> by the expressions in <aggregation-list> that do not contain aggregate functions, and calculates the group aggregates with the aggregate functions. In other words, behaves like the SQL query
SELECT <aggregation-list> FROM <table> GROUP BY <columns>
where <columns> is a list of all of the expressions in <aggregation-list> that do not contain aggregate functions. If <aggregation-list> is “*” then this does a DISTINCT on the table. If <columns> is empty, then the aggregate functions are calculated over the whole table returning a 1-row result.
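For instance, under these rules (table and column names are purely illustrative):

```
sales GROUP BY region, SUM(amount)
```

behaves like the SQL query `SELECT region, SUM(amount) FROM sales GROUP BY region`, and `sales GROUP BY *` behaves like `SELECT DISTINCT * FROM sales`.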

<table> ORDER BY <order-list>
Orders the rows of <table> according to <order-list>, producing an ordered table as output.

<table> LIMIT <expression>
Returns the first <expression> rows of <table>.

Now you can write compound expressions like this:

<table1> WHERE x<y GROUP BY * ORDER BY x PROJECT x,y LIMIT 10;

which looks a lot like SQL. Or you can break out the individual parts and assign them to table variables like Pig Latin does.
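Broken out Pig Latin-style, that same one-line query could be written as (the variable names are illustrative):

```
t1 = <table1> WHERE x<y;
t2 = t1 GROUP BY *;
t3 = t2 ORDER BY x;
t4 = t3 PROJECT x,y;
t5 = t4 LIMIT 10;
```

Each table variable is a view-like binding, as in Pig Latin, so the optimizer can still treat the whole thing as one query.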

A few additional comments:
  • In order to make the above 1-line query work, you have to define the precedence of the various keyword operators appropriately.
  • There is no need for a HAVING operator because the WHERE operator will work just as well after a GROUP BY as before it.
  • There are various ways to deal with the SQL analytic functions, but I haven’t come up with a really good solution yet. I’ll write more when I do.
  • A complete language of this type also needs some way to do joins, for example an operator like <table1> JOIN <table2> ON <expression>.
  • We also need UNION, EXCEPT, and INTERSECT, or some equivalent.
  • In order to make this work well, we need to define an ordered table, output by the ORDER BY clause, such that the ordering carries through some subsequent operators (such as analytic functions).
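As a sketch of the HAVING point above: assuming the language also lets an aggregate expression be named (say with an AS clause, which is an assumption not defined here), a post-grouping filter could be written as

```
sales GROUP BY region, SUM(amount) AS total WHERE total > 1000
```

which would behave like the SQL query `SELECT region, SUM(amount) AS total FROM sales GROUP BY region HAVING SUM(amount) > 1000`, with no separate HAVING operator needed.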