Sunday, June 16, 2013

Log Forwarding for Text Editors III


In a previous post, I discussed how to handle failures of the checkpoint message. This post will discuss failures of the editing process.

The editing process is just the time between checkpoint messages. It involves the user typing, the client/editor updating the screen, the client sending redo log messages to the server, the client saving redo log messages, and the server receiving and saving redo log messages.

The failures that can occur during this process are:

The client might crash. If the client crashes, then it either loses data or it doesn’t. If the client does not lose data, then we can recover the editing state by starting with the latest checkpoint and applying the redo log up to the point of the crash. If the client loses data, then we recover from the log on the server. The recovery process has to detect whether there was data loss on the client and apply the redo log from the correct source, either the client or the server.
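
Here is a rough sketch of that replay step. The names (RedoEntry, recover) and the idea that a log entry is a simple insert/delete edit at a buffer offset are my own illustration, not a fixed part of the design:

import java.util.List;

// Hypothetical sketch: recover the editing state by starting from the latest
// checkpoint text and replaying the surviving redo log entries in order.
class RecoverySketch {
  // One redo log entry, modeled as "delete deleteLength characters at offset,
  // then insert insertText there".
  record RedoEntry(int offset, int deleteLength, String insertText) {}

  static String recover(String checkpointText, List<RedoEntry> redoLog) {
    StringBuilder buffer = new StringBuilder(checkpointText);
    for (RedoEntry e : redoLog) {
      buffer.delete(e.offset(), e.offset() + e.deleteLength());
      buffer.insert(e.offset(), e.insertText());
    }
    return buffer.toString();
  }
}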

The network might not deliver a message. In this case, the client has the correct data and the user can continue to work, but the server gets out of sync. This situation has to be detected and corrected before applying a checkpoint.

The server might crash. If the server loses data in the crash, then the data has to be recovered from the client as part of the recovery process. If the server does not lose data, then it still needs to be brought up to date with any work that the user continued to do after the crash.

One way to handle network failures is to use a reliable connection-based network protocol like TCP. These protocols use their own techniques, designed, implemented, and tested by experts to ensure that no information gets lost, and if we could assume a moderately reliable network then that would be my choice. But this application is intended to work on poor networks with unreliable connections, and such networks give fits to TCP.

Instead, I’m going to use TCP only for short special sessions like going through the recovery process, and the rest of the time I’ll assume a network protocol that does not guarantee that all packets get delivered but does offer some assurance that if a packet gets delivered then it is correct; that is, the packet that gets delivered is the one that the sender sent.

To handle the above situations, we are adding a serial number to each redo log message and also attaching the serial number of the previous checkpoint. In other words, each redo log message carries two serial numbers: a checkpoint serial number and a message serial number that increases with each log message after a checkpoint. The message serial number is reset to 0 after each checkpoint. The checkpoint message also gets a new field --the largest message serial number in the list of log messages for that checkpoint. This way the server can verify that it has all of the redo log that it needs for the checkpoint.
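
A minimal sketch of what these two message shapes might carry. The field names are my own invention, and the checksum field is the one introduced in the previous post:

// Illustrative message shapes only; the real wire format is not specified here.
record RedoLogMessage(
    long checkpointSerial,  // serial number of the previous checkpoint
    int messageSerial,      // reset to 0 after each checkpoint, then incremented
    String edit) {}         // the edit itself, in whatever form the editor uses

record CheckpointMessage(
    long checkpointSerial,  // serial number of the new checkpoint
    int lastMessageSerial,  // largest message serial in the log for this checkpoint
    long checksum) {}       // checksum of the checkpoint text, from the previous post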

The client increments the message serial number for each log message and sends it to the server, recording the message itself in a list. The server also keeps a list of log messages that it receives so that it knows if a message is missing. Periodically, the server will send the client a list of log messages that it is missing, and the client will resend those missing log messages. Notice that since we are dealing with a network failure situation, the server’s message back to the client asking for missing messages can get lost, and the client’s response to such a message can get lost. We can’t assume that any of it is reliable.
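
One way the server-side bookkeeping might look, as a sketch only (RedoLogTracker and its methods are hypothetical names):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: the server records every message serial number it has seen since the
// last checkpoint so that it can periodically ask the client for the gaps.
class RedoLogTracker {
  private final Map<Integer, String> received = new HashMap<>(); // serial -> edit
  private int highestSeen = -1;

  void receive(int messageSerial, String edit) {
    received.put(messageSerial, edit);
    highestSeen = Math.max(highestSeen, messageSerial);
  }

  int highestSeen() { return highestSeen; }

  // Serial numbers at or below the highest one seen that never arrived.
  List<Integer> missingSerials() {
    List<Integer> missing = new ArrayList<>();
    for (int s = 0; s <= highestSeen; s++) {
      if (!received.containsKey(s)) missing.add(s);
    }
    return missing;
  }
}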

I’m not an expert on designing protocols for unreliable networks, so that’s as far as I’m going to go with fixing packet losses. When the server receives a checkpoint message, if it does not have all of the packets it needs for the checkpoint, it will wait a few seconds for more packets to come in. If it still does not have all of the messages it needs, then it will send a failure message back to the client.
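
Continuing the sketch above, the server’s handling of a checkpoint message might look roughly like this (the five-second wait is just a placeholder value, and CheckpointHandler is another hypothetical name):

// Sketch: when a checkpoint message arrives, check that every log message up
// to the checkpoint's lastMessageSerial has been received; if not, wait
// briefly for stragglers, then report failure back to the client.
class CheckpointHandler {
  boolean handleCheckpoint(CheckpointMessage cp, RedoLogTracker tracker)
      throws InterruptedException {
    if (incomplete(cp, tracker)) {
      Thread.sleep(5_000);                  // wait a few seconds for late packets
      if (incomplete(cp, tracker)) {
        return false;                       // tell the client the checkpoint failed
      }
    }
    // otherwise apply the redo log to the previous checkpoint and save the result
    return true;
  }

  private boolean incomplete(CheckpointMessage cp, RedoLogTracker tracker) {
    return tracker.highestSeen() < cp.lastMessageSerial()
        || !tracker.missingSerials().isEmpty();
  }
}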

At this point, the client will attempt to open a TCP connection to the server to fix the problem. If it can’t open the connection or the connection drops while the fix is taking place, then the client acts as if there has been a total network failure. The user can continue to work without the backup of having a server, and the client will keep trying to contact the server every few minutes, but now it is sending recovery messages, not log messages.

A recovery message is a message from the client to the server that says basically: “I am on checkpoint X, log message Y”. The server has five potential replies:
  1. I am up to date.
  2. I am missing log messages A,B,C,...
  3. My checkpoint is older than yours.
  4. You are missing log numbers from Y to Z
  5. My checkpoint is newer than yours.
If the client receives reply 1 then it resumes sending redo log messages. If it receives any other reply, or if the request times out, then it tries to open a TCP connection to resolve the issue. It resolves the issue by copying the missing data from the side with the newer data to the side with the out-of-date data. If it ends up with a message serial number greater than 0 (that is, there are redo log messages that have not been applied to a checkpoint), then it sends a checkpoint message.
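
A sketch of how the client might dispatch on those replies. The enum values just mirror the numbered list above, and the method names are placeholders:

// Sketch of the client's reaction to a recovery reply. Reply 1 resumes normal
// log forwarding; replies 2-5 (and a timeout) fall through to the TCP repair path.
enum RecoveryReply {
  UP_TO_DATE,                   // 1. I am up to date.
  SERVER_MISSING_MESSAGES,      // 2. I am missing log messages A, B, C, ...
  SERVER_CHECKPOINT_OLDER,      // 3. My checkpoint is older than yours.
  CLIENT_MISSING_MESSAGES,      // 4. You are missing log numbers from Y to Z.
  SERVER_CHECKPOINT_NEWER       // 5. My checkpoint is newer than yours.
}

class RecoveryClientSketch {
  void handle(RecoveryReply reply) {
    if (reply == RecoveryReply.UP_TO_DATE) {
      resumeSendingRedoLog();
    } else {
      repairOverTcp();   // copy the missing data from the newer side to the older side
    }
  }

  void resumeSendingRedoLog() { /* resume normal log forwarding */ }
  void repairOverTcp() { /* open a TCP session and resolve the difference */ }
}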

Every time the user opens an existing file, the client begins by sending the server a recovery message to make sure things are up to date.

Saturday, June 15, 2013

Definitions, Declarations, and Constraints

If you know anything about programming languages, the title of this section probably strikes you as strange. It is common to group definitions and declarations together, but constraints are usually considered an entirely different topic. However, declarations in common programming languages are rather complex creatures and a part of what they do is very much like a constraint.

This section will analyze the three concepts and see how they are related. All of the concepts involve namespaces, so first let’s talk about namespaces.

Namespaces

A namespace is any sort of mapping from names to denotations. It may be a package, module, class, function body, or block, among other things. Anything that maps names to denotations is a namespace. The mapping is also called a binding.

Here is an example from Java

class C {
  int j;
  int f(int x) { int y; y = j; return x + y; }
  static int g() { return 0; }
}

In this example there are many namespaces:
  1. The global namespace (or more properly the package namespace) maps C to a class. The phrase beginning with “class C” introduces that binding into the global namespace.
  2. The class C itself is a namespace that maps g to a static function. The phrase beginning with “static int g” introduces that name.
  3. When the class is instantiated into an object obj, then obj will be a namespace that maps j to a location. Note that the phrase “int j” introduces the name j into a runtime namespace when it gets executed, not a static namespace like the previous two definitions.
  4. When f is called, it will create a runtime namespace to hold the parameter x.
  5. When control enters the body of f, it will create another runtime namespace in the block, and this namespace is where y is bound.
  6. When g is called it will create a runtime namespace with no members.

There are two ways to view f in the above example. In Java, C++ and many other languages one would say that f is a function that is a member of the class namespace C. When you call the function in an expression like obj.f(3), this is treated as a call to a function that looks like it has only one argument but really has two: obj and 3. You could rewrite the call to something like f(obj,3), and change the definition of f to something like

int f(C o, int x) { int y; y = o.j; return x + y; }

All I did was add o as an argument and replace the unbound variable j with o.j.

That is the C++ and Java way to view member functions. There is another way to view them and that is to say that when the class is created, it creates a closure for each member function and binds the function name to the closure in the object namespace. More on that when I get around to discussing SL5.
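
To make the closure view concrete, here is a rough Java stand-in that uses a lambda to play the role of the per-object closure. This is only an illustration of the idea, not how Java actually implements member functions:

import java.util.function.IntUnaryOperator;

// Illustration of the "closure per member function" view: constructing the
// object builds a closure that captures the object's own binding of j.
class Obj {
  int j;
  final IntUnaryOperator f;

  Obj(int j) {
    this.j = j;
    this.f = x -> { int y = this.j; return x + y; };  // the closure captures this
  }
}

With this view, obj.f.applyAsInt(3) plays the role of the original obj.f(3).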

Definitions

A definition is a phrase that gives a name a denotation in a namespace. We say that the definition binds the name in the namespace. For example

const x=100;

binds the name x to the number 100 and the definition

function sum(x,y,z) {return x+y+z;}

binds the name sum to a function.

Variable declarations are also definitions because they bind a name to a location:

var x,y;
var z:integer;

Constraints

There is a class of programming languages called constraint programming languages that use constraints as the primary programming paradigm (a logic programming language is a type of constraint programming language where the constraints look a bit like predicate logic). Constraints can be useful even if they don’t form the primary programming paradigm, but I don’t know of any languages that use general constraints in the more limited way that I will discuss in this subsection.

A constraint is a condition that is enforced by the language implementation. For example, in a language with constraints you might be able to write something like this:

constraint x < 0;

and this would cause the compiler to always ensure that x is a negative number. It might do this by checking the value every time x changes, or it might use some sort of logic to prove that the algorithm can never assign a non-negative number.
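
The runtime-checking strategy amounts to routing every change to x through a check, roughly like this hypothetical sketch:

// Sketch of enforcing `constraint x < 0` by checking on every assignment.
class ConstrainedX {
  private int x = -1;   // initial value must already satisfy the constraint

  void setX(int newValue) {
    if (newValue >= 0) {
      throw new IllegalStateException("constraint violated: x < 0");
    }
    x = newValue;
  }

  int getX() { return x; }
}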

Constraints like the one above primarily apply to variables. After all, what’s the point of constraining the value of a constant?

But there are other constraints like this:

constraint y = 2*x;

where the compiler could implement the constraint by treating y as another name for 2*x. In this case y would not be a variable (a name that denotes a location) because it denotes twice the value of whatever x denotes.
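
In other words, the implementation could treat y as a computed read rather than a location, roughly:

// Sketch of implementing `constraint y = 2*x` by making y a computed name:
// reading y just evaluates 2*x, so y has no location of its own.
class AliasSketch {
  int x;

  int y() { return 2 * x; }
}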

As you can see, a constant definition looks a lot like a special case of a constraint. For example

constraint y = 2;

seems to have the same effect as

const y = 2;

The only difference is that the definition introduces a new name while a constraint does not. A constraint works on names that are already in scope. Otherwise, consider the problem of deciding what names the following constraint would introduce into scope:

constraint x=y and z=2 and y<3

Declarations

A declaration is a phrase that describes the features of a name that a language needs in order to use the name. In statically typed languages, the language cannot use a name unless its type is known, so a declaration states the type of the name. In C, for example, you have declarations like this:

extern int x;
extern void f(int);

These declarations do not define anything; all they do is tell the following code how to use the names. They tell the following code that x is an integer variable and that f is a function that takes one integer argument and returns no value.

In other cases, a declaration is combined with a definition:

var x:integer;

This phrase does four different things:

  1. Introduces a new name, x, into the namespace of the innermost block.
  2. Binds x to a new location.
  3. Attaches the signature, integer, to x.
  4. Constrains x so that it can only reference an integer.

The difference between 3 and 4 involves overloaded functions and operators. In a language with static types and overloading, the particular function or operator that is chosen for an expression depends on the signature of the expression. But there is nothing inherent in a signature that constrains the type of values put into a variable. There is nothing logically contradictory about telling the compiler “always use floating point computations on x” and then putting an integer value in x. It may be error-prone and machine-dependent, but it isn’t contradictory. In fact, C and C++ allow doing just that.
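
A small Java illustration of the signature side of that distinction: the overload that gets chosen is decided by the declared signature of the expression, not by the value that happens to be stored at run time.

// The call to pick(x) is resolved at compile time from the signature of x.
class OverloadSketch {
  static String pick(int v)    { return "int version"; }
  static String pick(double v) { return "double version"; }

  public static void main(String[] args) {
    double x = 4;                  // an integral value stored under a double signature
    System.out.println(pick(x));   // prints "double version": the signature decides
  }
}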

In C and C++, a declaration

int x;

assigns a signature to a variable and that signature means that an assignment to the variable such as

x = 4.0;

uses a branch that converts the value to an integer before assigning it to the location. This is one way that a language might enforce a constraint, but it isn’t reliable enforcement when the language has other ways to assign a value to a location. In the ultimate programming language, a constraint must be fully enforced. So in C and C++, a type declaration only attaches a signature to a name and does not impose a constraint.

In other languages such as Common Lisp, a declaration only imposes a constraint and does not attach a signature. In Common Lisp if you declare a variable:

(declare (type integer x))

Then try to assign a non-integer to the variable:

(setf x "foo")

it raises an error. This is the constraint-checking in action. However, such a declaration does not attach a signature to x. Signatures have no meaning in Common Lisp because there is no overloading of functions. When you add x to another number:

(+ x 1)

there is only one branch of + and it uses that branch regardless of any declarations (although + is defined to dynamically check the types of its arguments and do the right thing, and this type checking might be in-lined and then the in-lined tests might be solved at compile-time and then removed).

Sunday, June 2, 2013

Log Forwarding for Text Editors II


This is a continuation of my series on log-based replication for text editors.


Let’s focus on the checkpoint instruction for now. Immediately after successful execution of the checkpoint instruction on both the client and the server, we should be in a safe state where both client and server have exactly the same data. So what are the kinds of things that can go wrong with the checkpoint instruction?


The client might fail to save the file. Although in the previous post I said that the client (like the server) would apply the redo log when it gets the checkpoint message, that isn’t really true. The client has been “applying” the redo log all along as the user edits the in-memory buffer. The current checkpoint on the client is in the in-memory editor buffer, so all the client has to do is save this buffer to disk (I’m assuming that the file is not too large to reasonably keep in memory. In a later post I’ll discuss extending this method to arbitrary-sized files). The client can fail because it can’t save the file to disk or because the client crashes before saving the file.

In this case, the updates on the server are correct (because we are assuming a single failure), so when the problem with the local machine is resolved, we can recover the changes from the server. We are leaving it up to the designers of the local machine to make sure that this failure does not go unnoticed by the user. Presumably the local machine either crashes or gives the user a message that it failed to save the file. If there is a failure of the local storage device, the user might even want to continue working as long as the network and server are stable and reliably saving changes.

The network might fail to deliver the checkpoint message. In this case, no real damage is done, as long as the checkpoint protocol is designed to handle the case and all of the redo log has been received and properly stored: the server can continue to receive the redo log and can apply the longer redo log to an earlier checkpoint. However, this complicates the recovery process, so I will probably decide to treat it as a server failure.

The server might receive the checkpoint message but fail to save it. Again, we are relying on the system to ensure that either the remote machine crashes or it gives the server a message telling it that the storage device has failed. If the storage device fails then the server can return a message to the client, telling it that there has been a failure. If the remote machine crashes, then there should be some sort of timeout mechanism so that the client can detect this situation and alert the user. However, if there isn’t a second failure, the user can continue working without loss of data.

There is another possible way that the server can fail to save the checkpoint. There may be a software bug in the server itself. The procedure to apply an edit script to a buffer is relatively complex, and the code base in the server to do this is probably different from the code base in the client/editor to do it. Whenever two different pieces of code are trying to implement exactly the same function it’s a good idea to take some steps to detect bugs. In our case, we will consider the actions of the client/editor to always be correct since the user sees what it is doing in real time. If the server produces a checkpoint different from the checkpoint that the client produces, then this is treated as a failure in the server.

We will handle all of these cases by sending a bit more information in the checkpoint instruction, and then describing a recovery process.

There are two extra pieces of information that we will associate with each checkpoint:
  1. The checksum is a number calculated from the text of the checkpoint. The interesting property of a checksum is that if two different files are related by some sort of editing history, it is extremely unlikely that they will have the same checksum.
  2. The serial number is a number that increases for each checkpoint so that we can identify which of two checkpoints for the same file is later.
We can create the checksum on the client from the in-memory buffer, which will be very fast. Then we send the checksum and the serial number to the server in the checkpoint instruction.
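
For example, the client could run a standard checksum over the buffer. The sketch below uses java.util.zip.CRC32 purely for illustration; the actual choice of checksum algorithm is left open here, and the class and method names are placeholders.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch: compute a checksum of the in-memory buffer and bump the serial
// number, then send both in the checkpoint instruction (transport not shown).
class CheckpointSender {
  private long serialNumber = 0;

  long checksumOf(String buffer) {
    CRC32 crc = new CRC32();
    crc.update(buffer.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }

  void sendCheckpoint(String buffer) {
    serialNumber++;
    long checksum = checksumOf(buffer);
    // send(serialNumber, checksum, ...) to the server
  }
}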

If the client fails after sending the checkpoint message, then the server still creates the new checkpoint, verifies the checksum, and saves it with the new serial number. Since we are only handling single failures, we don’t worry about the case where the server or network messes up during this process. After such a failure, the latest checkpoint is on the server and is labeled with a serial number later than any serial number on the client. Recovery consists of copying that checkpoint back to the client.

If the network or server fails after the client sends the checkpoint instruction, then the client successfully saves the latest checkpoint to disk (again, because only a single failure is handled). In this case, recovery consists of copying the latest checkpoint from the client to the server. If the server fails the checksum test then it discards the checkpoint that it generated and this case is treated like a failure of the network or server --the latest checkpoint is on the client.

To recover from any of these errors, the client sends a message to the server with the serial number of the latest checkpoint on the client. The server compares this to the serial number of the latest checkpoint on the server. If the server’s checkpoint is later, then it initiates a copy from the server to the client. If the client’s checkpoint is later, then the server initiates a copy from the client to the server.
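
The server’s side of that comparison might look roughly like this (a sketch with illustrative names only):

// Sketch: compare the client's latest checkpoint serial number against the
// server's and copy the newer checkpoint over the older one.
class CheckpointRecovery {
  long serverSerial;   // serial number of the latest checkpoint on the server

  void onRecoveryRequest(long clientSerial) {
    if (serverSerial > clientSerial) {
      copyCheckpointToClient();        // server is newer: push its checkpoint down
    } else if (serverSerial < clientSerial) {
      requestCheckpointFromClient();   // client is newer: pull its checkpoint up
    }
    // equal serial numbers: both sides already have the latest checkpoint
  }

  void copyCheckpointToClient() { /* not shown */ }
  void requestCheckpointFromClient() { /* not shown */ }
}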

The recovery operation can be initiated by detection of a failure or just by starting up the client to edit an existing file.


Continued here.