Transaction Management and Concurrency Control in DDS
Here is a scheme that makes use of the three techniques I suggested in the previous sections. I’ll refer to this system as the DDS to distinguish it from ADS. DDS stands for Dave’s Data Server.
The history array
Each page in DDS has a single history pointer in the header. When the page is on disk the history pointer is NULL. When the page is in memory but there are no updates to the page, the history pointer is NULL. When a writer wants to modify any value in the page, it begins by setting up a history array and assigning the page’s history pointer to point to it.
The history array is an array of history slots, one history slot for each row in the page
class HistorySlot {
TransactionID tid;
HistoryBlock* next; // history created by tid
};
typedef HistoryBlock* HistoryArray;
The tid is the transaction ID of the transaction that put the current value in the row and created the first block in the history chain.
This may seem like a lot of overhead for what may be an update of a single cell in a single row, but the memory used is just 16 bytes per row. That is the same amount of memory that ADS uses in row headers, which DDS doesn’t need. In DDS, we only need to allocate the 16 bytes per row if there is an update in the page, whereas ADS essentially always has to allocate that space. And although we do need to zero out the array after allocation this is much less expensive than reading the same amount of memory from disk and then writing it back to disk which is what ADS did.
Row history
The row history uses the strategy of a single chain of history for each row. A single history block look like this:
class RowHistoryBlock : public HistorySlot {
char* cols; // array mapping column number to offset in buf[]
uint32_t buf[0]; // the previous values before this update
};
This class inherits tid and next from HistorySlot. The tid is the transaction ID of the transaction that created the next block in the history list. This field is NULL if there is no next block in the history list, indicating that there is no history older than this. Defining tid this way lets us delete history in chunks when the history is so old that no current transaction will ever access it.
cols is an array of unsigned integers of type large enough to enumerate all of the columns beginning at 0 (I assume 2^32-1 columns is enough). We use 0 as a reserved special value. Each member of cols gives the byte offset of the corresponding column in buf. For example if column n is updated and the column has type T then the historic value of the column n is
*(T*)(buf + (cols[n-1])
If cols[n-1] is 0 then the history block does not contain history for that column. In other words, the update that this history block is for did not update that column.
cols is shared by all of the related history blocks. If an UPDATE statement updates 1000 rows in a table, there will 1000 different versions of buf, but only one copy of cols shared by all of the RowHistoryBlocks. So if there are multiple rows accessed in the table then cols will usually be loaded into the CPU cache once at the beginning and will not cause further cache misses.
Transaction IDs
There are two global variables
// tid to assign to the next transaction
TransactionID nextTransaction;
// tid of the last transaction that committed
TransactionID lastCommittedTransaction;
Each time a transaction starts, it sets two transaction state variables. Ignoring critical-section issues:
TransactionID transactionID = nextTransaction++;
TransactionID historyID = lastCommittedTransaction;
When a transaction commits, lastCommittedTransaction is set to the ID of the committing transaction so all transactions that start before another transaction commits will use that transaction ID for history. In effect each transaction T1 sees the world frozen at the time of the commit of the transaction T2 such that T1’s historyID was T2’s transactionID.
However transactions do not necessarily commit in the order in which they started. We can’t say that one transaction is older than another based on the order of the transaction IDs, because that order is based on when transactions start rather than on when they commit but “older than” has to be based on the order in which transactions commit. A transactionID x is older than transactionID y if the transaction of y committed after x.
To implement this, we define a transaction ID as a pointer to a Transaction struct:
typedef struct Transaction {
Timestamp commit;
...
} *TransactionID;
bool older(TransactionID tid1, TransactionID tid2)
{
return tid1->commit < tid2->commit;
}
We only use the older() function to compare transactions that have committed (recall that for uncommitted transactions we compare historyID, not transactionID).
Reads
The procedure for reading a row is complex enough that giving code is probably the clearest way to say it. This code has not been tested but it does give an idea of how things might be implemented in C++ and how much of work would be involved in reading values from a row.
void readRow(
TablePage* page, // the page containing the data
unsigned rowNum, // the number of the row in the page
uint32_t* columnList, // list of column numbers to read
uint32_t* bufOffsets; // offsets of the columns in colBuf
char* colBuf) // a buffer for the column values
{
Table* table = lookupTable(page);
char* rowBuf = page->buf + rowNum * table->rowSize;
// read the values from the row
for (unsigned i=0; columnList[i]; i++) {
unsigned colNum = columnList[i];
char* bufPtr = colBuf+bufOffsets[i];
char* rowPtr = rowBuf+table->offsets[colNum];
size_t colSize = t->columnSizes[colNum];
memcpy((void*)bufPtr, (void*)rowPtr, colSize);
}
if (page->history == 0) return; // no history for this page
HistorySlot* slot = &page->history[rowNum];
// go through the history list, looking for historic values.
// If the transaction that created the next history block is
// older than the age of this thread (historyID), then stop.
for (;;) {
RowHistoryBlock* b;
TransactionID tid;
// this condition is true if there is no history for the row
// or if we are at the end of the history list
if (older(tid,historyID)) return;
// for each column that the reader needs, get any history in
// this block
for (unsigned i = 0; columnList[i]; i++) {
unsigned colNum = columnList[i];
unsigned histOffset = b->cols[colNum];
if (histOffset == 0) continue; // no history for this column
// There is history for this column so get the value,
// overwriting the value that was extracted from the row.
// There may also be earlier history older on in this list
// which will overwrite this value
char* bufPtr = b->buf+histOffset;
char* rowPtr = rowBuf+table->offsets[colNum];
memcpy((void*)bufPtr, (void*)rowPtr, colSize);
}
slot = (HistorySlot*)b;
}
}
Rollbacks
Since a writer updates the cell in the row before the transaction commits, this implies that on rollback, the transaction has to go through all of the updates it made and undo them. One of the big motivations for this entire section was to avoid traversing all of the updates, but it is a lot less of a problem to do this for rollbacks than for commits because in most properly functioning systems, commits are a lot more common than rollbacks. Because of this, just avoiding the traversal on commits still might give a big advantage over ADS.
But why do anything at all for rollbacks? Why not just mark the transactions as rolled-back? Readers would treat the history of the rolled-back transaction just like an active uncommitted transaction, meaning that they would ignore the rolled-back values and take the previously committed value instead. Writers would be similar except that some minor changes have to be made so that a writer can write a change over the history of a rolled-back transaction, which they could not do for a running uncommitted transaction.
The problem with this strategy is that it turns history blocks into records permanently holding current data. For example, suppose a particular cell contains the value 4 and transaction T1 updates the cell to 5. In DDS, 5 is immediately written into the cell and 4 is saved in the history block. If the T1 now gets rolled back, then the value 5 in the cell is no longer the current data because that update is not valid. The current value, 4, is now in the history block. Until some process moves the 4 back into the cell, that history block contains live data.
This is a problem because history gets freed automatically by date. When could we safely free the block containing the 4? At some point, the oldest transaction in the system will be younger than the time of the rollback, but even at this point, every transaction needs to use the history block with a 4 because the value in the cell is still not the current value. The history block with the 4 never expires.
Still, there is a time when the 4 has to be written back into the cell and that is the time at which the page is written to disk. So DDS keeps a list of all pages updated by a given transaction. When the transaction is rolled back, DDS starts writing all of those pages to disk. Then DDS only has to hold onto the history of the transaction until all of the pages updated by the transaction have been written to disk.
The storage allocated for the history of rolled-back transactions cannot be recovered until the page with the update is written to disk. This is no problem for small updates, buf if the transaction failed and had to roll back because, say, the transaction’s history used up so much memory the system ran out, then this could be a problem. But that’s an exceptional case that will slowly correct itself as the pages are written to disk.
Note that to free the memory, the pages don’t actually have to be written to disk, they only have to be prepared for writing to disk. To prepare for writing a page to disk, DDS has to go through the page history
Writing Pages to Disk
Never overwrite existing pages, keep existing pages as the checkpoint
Page buffering tries to avoid writing pages with active history
History is written in serialized form in a separate page so it can be freed independently