This is a continuation of my series on log-based replication for text editors.
Let’s focus on the checkpoint instruction for now. Immediately after successful execution of the checkpoint instruction on both the client and the server, we should be in a safe state where both client and server have exactly the same data. So what are the kinds of things that can go wrong with the checkpoint instruction?
The client might fail to save the file. Although in the previous post I said that the client (like the server) would apply the redo log when it gets the checkpoint message, that isn’t really true. The client has been “applying” the redo log all along as the user edits the in-memory buffer. The current checkpoint on the client is in the in-memory editor buffer, so all the client has to do is save this buffer to disk. (I’m assuming that the file is not too large to reasonably keep in memory; in a later post I’ll discuss extending this method to arbitrary-sized files.) The client can fail because it can’t save the file to disk or because it crashes before saving the file.
In this case, the updates on the server are correct (because we are assuming a single failure), so when the problem with the local machine is resolved, we can recover the changes from the server. We are leaving it up to the designers of the local machine to make sure that this failure does not go unnoticed by the user. Presumably the local machine either crashes or gives the user a message that it failed to save the file. If there is a failure of the local storage device, the user might even want to continue working as long as the network and server are stable and reliably saving changes.
The network might fail to deliver the checkpoint message. In this case no real damage is done, as long as the checkpoint protocol is designed to handle the case and the entire redo log has been received and properly stored: the server can keep receiving the redo log and later apply the longer redo log to an earlier checkpoint. However, this complicates the recovery process, so I will probably decide to treat it as a server failure.
The server might receive the checkpoint message but fail to save it. Again, we are relying on the system to ensure that either the remote machine crashes or it gives the server a message telling it that the storage device has failed. If the storage device fails then the server can return a message to the client, telling it that there has been a failure. If the remote machine crashes, then there should be some sort of timeout mechanism so that the client can detect this situation and alert the user. However, if there isn’t a second failure, the user can continue working without loss of data.
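To make the timeout idea concrete, here is a minimal sketch of how a client might wait for the server to acknowledge a checkpoint. The acknowledgement queue, the function name `wait_for_ack`, and the idea that the server acknowledges by serial number are all assumptions for illustration; the protocol in this series does not specify them.

```python
import queue
import time


def wait_for_ack(acks: queue.Queue, serial: int, timeout_seconds: float) -> bool:
    """Return True if an acknowledgement for `serial` arrives before the deadline.

    `acks` is a hypothetical queue that the network layer fills with the
    serial numbers the server has acknowledged.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # timed out: alert the user, start recovery later
        try:
            acked = acks.get(timeout=remaining)
        except queue.Empty:
            return False  # nothing arrived in time
        if acked == serial:
            return True
        # Otherwise it was an acknowledgement for another checkpoint; keep waiting.
```

If the function returns False, the client knows only that it did not hear back in time. It cannot tell a crashed server from a dead network, which fits the decision above to treat both the same way.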
There is another possible way that the server can fail to save the checkpoint. There may be a software bug in the server itself. The procedure to apply an edit script to a buffer is relatively complex, and the code base in the server to do this is probably different from the code base in the client/editor to do it. Whenever two different pieces of code are trying to implement exactly the same function it’s a good idea to take some steps to detect bugs. In our case, we will consider the actions of the client/editor to always be correct since the user sees what it is doing in real time. If the server produces a checkpoint different from the checkpoint that the client produces, then this is treated as a failure in the server.
We will handle all of these cases by sending a bit more information in the checkpoint instruction, and then describing a recovery process.
There are two extra pieces of information that we will associate with each checkpoint:
- The checksum is a number calculated from the text of the checkpoint. The interesting property of a checksum is that if two different files are related by some sort of editing history, it is extremely unlikely that they will have the same checksum.
- The serial number is a number that increases for each checkpoint so that we can identify which of two checkpoints for the same file is later.
We can create the checksum on the client from the in-memory buffer, which is very fast. We then send the checksum and the serial number to the server as part of the checkpoint instruction.
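As a rough illustration, the client side of this could look like the sketch below. The `CheckpointMessage` type, its field names, and the choice of SHA-256 are my assumptions here; any checksum with a low enough collision rate would serve.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointMessage:
    """Hypothetical payload of the checkpoint instruction."""
    serial: int    # increases with every checkpoint of this file
    checksum: str  # hex digest of the checkpoint text


def checksum_of(text: str) -> str:
    # SHA-256 is an assumed choice, not something the protocol requires.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_checkpoint(buffer_text: str, previous_serial: int) -> CheckpointMessage:
    # The checksum is computed directly from the in-memory editor buffer.
    return CheckpointMessage(serial=previous_serial + 1,
                             checksum=checksum_of(buffer_text))
```

The message carries only the serial number and the checksum, not the text itself, since the server is expected to rebuild the text from the redo log.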
If the client fails after sending the checkpoint message, then the server still creates the new checkpoint, verifies the checksum, and saves it with the new serial number. Since we are only handling single failures, we don’t worry about the case where the server or network messes up during this process. After such a failure, the latest checkpoint is on the server and is labeled with a serial number later than any serial number on the client. Recovery consists of copying that checkpoint back to the client.
If the network or server fails after the client sends the checkpoint instruction, then the client successfully saves the latest checkpoint to disk (again, because only a single failure is handled). In this case, recovery consists of copying the latest checkpoint from the client to the server. If the server fails the checksum test, then it discards the checkpoint that it generated, and this case is treated like a failure of the network or server: the latest checkpoint is on the client.
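The server's side of the checksum test might be sketched as follows. The edit format (position, number of characters to delete, text to insert) is a placeholder I made up for this example, not the redo log format from the earlier posts, and SHA-256 is again an assumed checksum.

```python
import hashlib
from typing import List, Optional, Tuple

# Hypothetical edit format: (position, characters to delete, text to insert).
Edit = Tuple[int, int, str]


def apply_redo_log(text: str, redo_log: List[Edit]) -> str:
    # Apply each edit in order to the previous checkpoint text.
    for pos, delete_count, inserted in redo_log:
        text = text[:pos] + inserted + text[pos + delete_count:]
    return text


def server_checkpoint(old_text: str, redo_log: List[Edit],
                      client_checksum: str) -> Optional[str]:
    """Return the new checkpoint text, or None if the checksum test fails."""
    candidate = apply_redo_log(old_text, redo_log)
    digest = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
    if digest != client_checksum:
        # Server failure: discard the result; recovery will copy from the client.
        return None
    return candidate
```

Returning None rather than saving a mismatched checkpoint keeps the older checkpoint, with its older serial number, as the latest one on the server, so the recovery step will see that the client is ahead.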
To recover from any of these errors, the client sends a message to the server with the serial number of the latest checkpoint on the client. The server compares this to the serial number of the latest checkpoint on the server. If the server’s checkpoint is later, then the server initiates a copy from the server to the client. If the client’s checkpoint is later, then the server initiates a copy from the client to the server.
The recovery operation can be initiated by detection of a failure or just by starting up the client to edit an existing file.
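The serial-number comparison that drives recovery is small enough to sketch directly. The names below are hypothetical, and the "equal serial numbers means nothing to copy" branch is my assumption; the description above only covers the two cases where one side is ahead.

```python
from enum import Enum


class CopyDirection(Enum):
    SERVER_TO_CLIENT = "server_to_client"
    CLIENT_TO_SERVER = "client_to_server"
    NONE = "none"  # assumed: both sides already hold the same checkpoint


def recovery_direction(client_serial: int, server_serial: int) -> CopyDirection:
    # Runs on the server after the client reports its latest serial number.
    if server_serial > client_serial:
        return CopyDirection.SERVER_TO_CLIENT
    if client_serial > server_serial:
        return CopyDirection.CLIENT_TO_SERVER
    return CopyDirection.NONE
```

Because the same check runs whenever the client opens an existing file, a normal startup and a recovery after failure go through one code path.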