In a previous post, I discussed how to handle failures of the commit message. This post will discuss failures of the editing process.
The editing process is just the time between commit messages. It involves the user typing, the client/editor updating the screen, the client sending redo log messages to the server, the client saving redo log messages, and the server receiving and saving redo log messages.
The failures that can occur during this process are:
The client might crash. If the client crashes then it either loses data or it doesn’t. If the client does not lose data then we can recover the editing state by starting with the latest checkpoint and applying the redo log up to the point of the crash. If the client loses data then we recover from the log on the server. The recovery process has to detect whether there was data loss on the client and then apply the redo log from either the client or the server, as appropriate.
The network might not deliver a message. In this case, the client has the correct data on it and the user can continue to work but the server gets out of sync. This situation has to be detected and corrected before applying a checkpoint.
The server might crash. If the server loses data in the crash, then the data has to be recovered from the client as part of the recovery process. If the server does not lose data, then it still needs to be brought up to date with any work that the user continued to do after the crash.
One way to handle network failures is to use a reliable connection-based network protocol like TCP. These protocols use their own techniques, designed, implemented, and tested by experts to ensure that no information gets lost, and if we could assume a moderately reliable network then that would be my choice. But this application is intended to work on poor networks with unreliable connections, and such networks give fits to TCP.
Instead, I’m going to use TCP only for short special sessions like going through the recovery process, and the rest of the time I’ll assume a network protocol that does not guarantee that all packets get delivered but does offer some assurance that if a packet gets delivered then it is correct; that is, the packet that gets delivered is the one that the sender sent.
To handle the above situations, we are adding a serial number to the redo log messages and also attaching the serial number of the previous checkpoint. In other words, each redo log message comes with two serial numbers. One is a checkpoint serial number and the other is a message serial number that increases with each log message after a checkpoint. The message serial number gets reset to 0 after a checkpoint. The checkpoint message also gets a new field: the largest message serial number in the list of log messages for the checkpoint. This way the server can verify that it has all of the redo log that it needs for the checkpoint.
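The numbering scheme can be sketched roughly as follows. This is only an illustration under some assumptions of mine: the class and field names are made up, the payload is a plain string standing in for whatever an edit looks like, the first log message after a checkpoint gets serial number 1 (so a counter of 0 means nothing has been logged yet), and checkpoint serial numbers simply count upward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedoLogMessage:
    checkpoint_serial: int  # serial number of the previous checkpoint
    message_serial: int     # 1, 2, 3, ... since that checkpoint (assumed)
    payload: str            # stand-in for the edit; real format not specified


@dataclass(frozen=True)
class CheckpointMessage:
    checkpoint_serial: int    # checkpoint the log messages were tagged with
    last_message_serial: int  # largest message serial number in that log


@dataclass
class ClientLog:
    checkpoint_serial: int = 0
    message_serial: int = 0  # 0 means nothing logged since the checkpoint

    def log(self, payload: str) -> RedoLogMessage:
        # Each new log message gets the next message serial number.
        self.message_serial += 1
        return RedoLogMessage(self.checkpoint_serial, self.message_serial, payload)

    def checkpoint(self) -> CheckpointMessage:
        cp = CheckpointMessage(self.checkpoint_serial, self.message_serial)
        # A new checkpoint begins: reset the message serial number to 0.
        self.checkpoint_serial += 1
        self.message_serial = 0
        return cp


def server_has_full_log(received_serials: set, cp: CheckpointMessage) -> bool:
    """True if every serial from 1 to last_message_serial has arrived."""
    return all(n in received_serials for n in range(1, cp.last_message_serial + 1))
```

With this shape, the server only needs the set of message serial numbers it has seen for the current checkpoint in order to decide whether the checkpoint can be applied.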
The client increments the message serial number for each log message and sends it to the server, recording the message itself in a list. The server also keeps a list of log messages that it receives so that it knows if a message is missing. Periodically, the server will send the client a list of log messages that it is missing. The client will receive these messages and resend the missing log messages. Notice that since we are dealing with a network failure situation, the server’s message back to the client asking for missing messages can get lost and the client’s response to such a message can get lost. We can’t assume that any of it is reliable.
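A minimal sketch of the gap detection and resend step, within a single checkpoint, might look like this. The names (`ServerLog`, `missing`, `resend`) are hypothetical, message serial numbers are assumed to start at 1, and the network itself is left out: the caller decides which messages "arrive".

```python
class ServerLog:
    """Server-side record of the log messages received since the checkpoint."""

    def __init__(self):
        self.messages = {}  # message serial number -> payload

    def receive(self, serial, payload):
        # A duplicate (for example a resend that was not needed) is harmless:
        # it just overwrites the identical entry.
        self.messages[serial] = payload

    def missing(self, last_serial=None):
        # Without a checkpoint message the server can only see gaps below the
        # highest serial it has received; a lost tail is invisible until the
        # checkpoint message supplies last_serial.
        top = last_serial if last_serial is not None else max(self.messages, default=0)
        return [n for n in range(1, top + 1) if n not in self.messages]


def resend(client_sent, requested):
    """Client side: look up the requested serials in its own list of sent messages."""
    return [(n, client_sent[n]) for n in requested if n in client_sent]


# Example: messages 2 and 3 were lost on the way to the server.
client_sent = {1: "a", 2: "b", 3: "c", 4: "d"}
server = ServerLog()
for n in (1, 4):
    server.receive(n, client_sent[n])

request = server.missing()  # the periodic "I am missing ..." list
for n, payload in resend(client_sent, request):
    server.receive(n, payload)
```

Because both the request and the resend can themselves be lost, the server would simply compute `missing()` again on its next periodic pass rather than assume the gap was filled.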
I’m not an expert on designing protocols for unreliable networks, so that’s as far as I’m going to go with fixing packet losses. When the server receives a checkpoint message, if it does not have all of the packets it needs for the checkpoint, it will wait a few seconds for more packets to come in. If it still does not have all of the messages it needs, then it will send back a failure message to the client.
At this point, the client will attempt to open a TCP connection to the server to fix the problem. If it can’t open the connection or the connection drops while the fix is taking place, then the client acts as if there has been a total network failure. The user can continue to work without the backup of having a server and the client will still keep trying to contact the server every few minutes, but now it is sending recovery messages, not log messages.
A recovery message is a message from the client to the server that says basically: “I am on checkpoint X, log message Y”. The server has five potential replies:
- I am up to date.
- I am missing log messages A,B,C,...
- My checkpoint is older than yours.
- You are missing log numbers from Y to Z.
- My checkpoint is newer than yours.
If the client receives the first reply (“I am up to date”) then it resumes sending redo log messages. If it receives any other reply or the reply times out, then it tries to open a TCP connection to resolve the issue. It resolves the issue by copying the missing data from the side with the newer data to the side with the out-of-date data. If it ends up with a message serial number greater than 0 (that is, there are redo log messages that have not been applied to a checkpoint) then it sends a checkpoint message.
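To make the recovery exchange concrete, here is a rough sketch of how the server might pick its reply and how the client might react. Everything named here (`Reply`, `server_reply`, `client_action`) is hypothetical, and the order in which the server checks the cases is my own assumption; the only inputs are the two numbers in the recovery message plus what the server has stored.

```python
from enum import Enum
from typing import Optional


class Reply(Enum):
    UP_TO_DATE = "I am up to date"
    SERVER_MISSING_LOG = "I am missing log messages"
    SERVER_CHECKPOINT_OLDER = "My checkpoint is older than yours"
    CLIENT_MISSING_LOG = "You are missing log numbers"
    SERVER_CHECKPOINT_NEWER = "My checkpoint is newer than yours"


def server_reply(client_cp, client_msg, server_cp, server_serials):
    """Answer "I am on checkpoint client_cp, log message client_msg".

    server_serials is the set of message serial numbers the server holds for
    its current checkpoint. Returns (reply, list of serial numbers involved).
    """
    if server_cp < client_cp:
        return Reply.SERVER_CHECKPOINT_OLDER, []
    if server_cp > client_cp:
        return Reply.SERVER_CHECKPOINT_NEWER, []
    # Same checkpoint: compare the logs (serials assumed to start at 1).
    missing = [n for n in range(1, client_msg + 1) if n not in server_serials]
    if missing:
        return Reply.SERVER_MISSING_LOG, missing
    ahead = sorted(n for n in server_serials if n > client_msg)
    if ahead:
        return Reply.CLIENT_MISSING_LOG, ahead
    return Reply.UP_TO_DATE, []


def client_action(reply: Optional[Reply]) -> str:
    """reply is None when the recovery message timed out."""
    if reply is Reply.UP_TO_DATE:
        return "resume sending redo log messages"
    return "open a TCP connection"
```

For example, `server_reply(3, 5, 3, {1, 2, 5})` gives `(Reply.SERVER_MISSING_LOG, [3, 4])`, and the client would then open a TCP connection to copy messages 3 and 4 across.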
Every time the user opens an existing file, the client begins by sending the server a recovery message to make sure things are up to date.