Thursday, May 30, 2013

Log Forwarding for Text Editors

Log forwarding replication or log-based replication is a technique used in databases to keep two different databases in sync with the same data. This is used for redundancy in the event of failure of one of the database machines and also to spread out work into a cluster.

I have no idea how Google Docs keeps the server up to date with the client window, but I would do it with log forwarding. I'm going to write a series of posts describing how to do this. First I'll start with the basic idea and then in later posts talk about how to handle failures and conflicts.

First a few definitions:
  1. The client is a program running on the local machine --the machine that the user is interacting with. The client handles all of the user interaction and communicates with the server.
  2. The server is a program running on the remote machine that is used to manage a central data repository.
  3. checkpoint is a physical copy of the data. In the case of a replicating the output of a text editor you can think of it as a text file on disk.
  4. redo log --or sometimes just a log-- is a set of instructions that start at a specific checkpoint and make modifications to the data. In editor language this sort of log is sometimes called an edit script. In the software field it is often called a patch.
  5. failure is when the hardware or software does not work as it is supposed to. This can include crashes or the network going down among other things.
  6. conflict is when two clients send updates for the same data to the server and neither update happens strictly before the other one.
In the simple case where there are no failures or conflicts, a client can update the server using the following method:
  1. When the user starts an editing session, he starts with a checkpoint called the current checkpoint. The checkpoint is the same on both the client and the server either because it is a new, empty file, or because it was copied using some reliable copy mechanism.
  2. As the user edits the file, the client constructs a redo log of his changes based on the current checkpoint. The redo log is streamed both to local storage and to the server at the same time.
  3. There is a special checkpoint instruction that can go in the log. At a checkpoint instruction, both the client and server will apply the preceding log to the current checkpoint to create a new checkpoint and the log will be truncated to just after the checkpoint instruction. Then these new checkpoints will become the current checkpoint on both the local and remote machines.
  4. Checkpoint instructions are issued when the user ends an editing session and possibly during the session if the redo log starts getting too large.
The point of this strategy is that you can keep replication going continuously with very little traffic. Most user editing is in small bits --much smaller than the size of the entire file-- so this requires less bandwidth than copying the entire file back to the server. That is particularly important on slow, unreliable networks like you often encounter in hotels and coffee shops.

In future posts, I'll discuss what to do about failures. In the case of arbitrary failures, there isn't really anything that can be done, so we will follow the usual custom of restricting our efforts to the case of a single failure. There is a good theoretical reason for this, namely that the probability that a failure occurs at any time is very small, so the possibility that two failures occur at the same time is very, very small.

There are three basic things that can fail: the client, the server, or the network, and there are several critical steps where one of these failures can happen:
  1. When the client first starts up
  2. When the user is editing a file
  3. When a checkpoint is created
Handling these potential failures will require us to send more data back and forth and will create more steps where failures can happen. Hopefully we will eventually reach a fix point where we can properly handle and recover from all single failures.

No comments:

Post a Comment