Thursday, May 30, 2013

Log Forwarding for Text Editors


Log forwarding replication, or log-based replication, is a technique used in databases to keep two different databases in sync with the same data. It is used for redundancy in the event of failure of one of the database machines and also to spread work across a cluster.

I have no idea how Google Docs keeps the server up to date with the client window, but I would do it with log forwarding. I'm going to write a series of posts describing how to do this. First I'll start with the basic idea and then in later posts talk about how to handle failures and conflicts.

First a few definitions:
  1. The client is a program running on the local machine --the machine that the user is interacting with. The client handles all of the user interaction and communicates with the server.
  2. The server is a program running on the remote machine that is used to manage a central data repository.
  3. A checkpoint is a physical copy of the data. In the case of replicating the output of a text editor, you can think of it as a text file on disk.
  4. A redo log --or sometimes just a log-- is a set of instructions that start at a specific checkpoint and make modifications to the data. In editor language this sort of log is sometimes called an edit script. In the software field it is often called a patch.
  5. A failure is when the hardware or software does not work as it is supposed to. This can include crashes or the network going down, among other things.
  6. A conflict is when two clients send updates for the same data to the server and neither update happens strictly before the other one.
In the simple case where there are no failures or conflicts, a client can update the server using the following method:
  1. When the user starts an editing session, he starts with a checkpoint called the current checkpoint. The checkpoint is the same on both the client and the server either because it is a new, empty file, or because it was copied using some reliable copy mechanism.
  2. As the user edits the file, the client constructs a redo log of his changes based on the current checkpoint. The redo log is streamed both to local storage and to the server at the same time.
  3. There is a special checkpoint instruction that can go in the log. At a checkpoint instruction, both the client and server will apply the preceding log to the current checkpoint to create a new checkpoint and the log will be truncated to just after the checkpoint instruction. Then these new checkpoints will become the current checkpoint on both the local and remote machines.
  4. Checkpoint instructions are issued when the user ends an editing session and possibly during the session if the redo log starts getting too large.
The point of this strategy is that you can keep replication going continuously with very little traffic. Most user editing is in small bits --much smaller than the size of the entire file-- so this requires less bandwidth than copying the entire file back to the server. That is particularly important on slow, unreliable networks like you often encounter in hotels and coffee shops.
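
To make this concrete, here is a minimal sketch in Python of the log-and-checkpoint cycle described above. The Replica class, the tuple-based log format, and the operation names are all invented for illustration; a real client would stream each operation over the network to the server rather than calling it directly.

class Replica:
    """Holds the current checkpoint (the full text) plus a redo log of edits."""

    def __init__(self, checkpoint=""):
        self.checkpoint = checkpoint   # last materialized copy of the text
        self.log = []                  # edit instructions made since that checkpoint

    def record(self, op):
        """Append one edit, e.g. ("insert", 5, "abc") or ("delete", 5, 3)."""
        self.log.append(op)

    def apply_log(self):
        """Replay the redo log against the checkpoint to get the current text."""
        text = self.checkpoint
        for kind, pos, arg in self.log:
            if kind == "insert":
                text = text[:pos] + arg + text[pos:]
            elif kind == "delete":
                text = text[:pos] + text[pos + arg:]
        return text

    def checkpoint_now(self):
        """The checkpoint instruction: fold the log into a new checkpoint, truncate the log."""
        self.checkpoint = self.apply_log()
        self.log = []

# The client and server each keep a Replica; every edit is recorded locally
# and sent to the server, so both sides stay in sync when nothing fails.
client, server = Replica(), Replica()
for op in [("insert", 0, "hello"), ("insert", 5, " world")]:
    client.record(op)
    server.record(op)          # stand-in for sending the op over the network
client.checkpoint_now()
server.checkpoint_now()
assert client.checkpoint == server.checkpoint == "hello world"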

In future posts, I'll discuss what to do about failures. In the case of arbitrary failures, there isn't really anything that can be done, so we will follow the usual custom of restricting our efforts to the case of a single failure. There is a good theoretical reason for this, namely that the probability that a failure occurs at any given time is very small, so the probability that two failures occur at the same time is very, very small.

There are three basic things that can fail: the client, the server, or the network, and there are several critical steps where one of these failures can happen:
  1. When the client first starts up
  2. When the user is editing a file
  3. When a checkpoint is created
Handling these potential failures will require us to send more data back and forth and will create more steps where failures can happen. Hopefully we will eventually reach a fixed point where we can properly handle and recover from all single failures.

Monday, May 27, 2013

Trouble with Chrome

When I started this blog, I bought a Samsung Chromebook and have done a lot of my writing in coffee shops and hotel rooms using it. The hardware is very nice (except for the missing DELETE key) but the software has been somewhat disappointing.

I've been doing most of my writing in Google Drive and then copy-pasting into Blogger to post it. This has been less than ideal because I can't find a way to get diagrams and equations to carry over. I've tried embedding with HTML, but that hasn't worked out well either.

In any case, I've been making do because this Chromebook is so nice that I was willing to put up with problems, but then one day a couple of weeks ago I spent five hours writing a post in Google Drive off-line because the coffee shop I was in didn't have reliable internet. When I went on-line and opened the document, I got a blank page and a message --something to the effect that "your update could not be saved and you should copy-and-paste to save it".

What this obscure error message seems to actually mean is "your document is just gone and you are screwed". I looked at the revision history and it showed two revisions --both blank. I tried clicking on the page and using undo --nothing. It looks like Google Drive just got confused and lost my five hours of work.

To add insult to injury, I reported the problem and Google never got back to me. When a company that wants to be your on-line data storage solution can't even send you an email when you report a bug in their software that lost some of your data, it's kind of scary. I don't know if my bug just fell through the cracks or if Google corporate policy just doesn't place a high priority on the integrity of user data, but I suspect it's the latter.

Well, it's not like I paid them for the service and I can't really complain too much that I got what I paid for, but this has left me unwilling to trust Google Drive with my future work. Five hours may not sound like all that much, but that was five hours of my free time when I could have been saving the world from the Zerg or doing something else productive and, frankly, it discouraged me from more writing.

So I've been looking for an alternative way to use my Chromebook rather than going through Google Drive. I haven't found anything very promising, and at this point I'm thinking of abandoning the Chromebook for another platform that gives me more reliable data storage.

One thing I should clarify is that this was probably a bug in the off-line feature of Google Drive so if I didn't have to work off-line it wouldn't be an issue. But what else is the point of a cheap, small, lightweight computer? Being small and light makes it easy to transport, and being cheap makes it less of a problem if you lose or damage it while traveling. The purpose of these computers is to use them in hotels, coffee shops and other places where internet service is often unreliable or missing.

That's the point of these computers, so I don't understand why Google seems to view off-line mode as an unimportant side feature. Off-line mode and fast, reliable synchronization to the internet over slow, unreliable connections should be one of the primary features of the Chrome infrastructure.

Oh, well. I guess I've just found my next project --an editor that runs on Chromebooks and does sync right.

Friday, May 17, 2013

Assertions and Assumptions


An assertion is a boolean expression claiming that some condition is true at a particular point in the program. For example:

assert x > 0;
do while x > 0;
 ...
od

This piece of code has a while loop, and it asserts that just before entry to the loop, x will be greater than 0, which implies that the loop always executes at least once. The assertion should be understood as a claim by the programmer that the condition is true, along with some sort of verification by the language that the claim is correct.

Typically the assertion is verified at runtime by a runtime test. This test will slow down the program, so it is also typical to allow these tests to be turned off for an optimized build. Unobtainabol ought to do a better job than that. If the programmer has some sort of proof in mind that x > 0 at that point in the program, then why can’t the language figure it out at compile time? Of course in the real world there are limits to what can be done with automated theorem proving, but even in the real world compilers ought to do better than they do.

Automatic Verification

One way to do a better job of proving assertions at compile time is to get help from the programmer. The compiler should be able to ask the programmer to explain his reasoning and then check the explanation. This might involve an iterative process where the compiler would try to compile a program with assertions and then output warnings for each one that it can’t prove. The programmer would then add more assertions that would enable the compiler to follow his reasoning.

For example assume the code fragment above appears in this function:

function f(x: int)
 x := x+1;
 assert x > 0;
 do while x > 0;
   ...
 od
end

The compiler would attempt to compile the above code and produce a warning: “file foo, line 25: can’t prove assertion x > 0”.

We could then go in and add a precondition to the function:

function f(x: int)
 precondition x >= 0;
 x := x+1;
 assert x > 0;
 do while x > 0;
   ...
 od
end

A precondition is a special assertion that goes at the beginning of a function. When we add this condition, we are hoping that the compiler is smart enough to figure out that if you start in a state where x >= 0 and then you add one to x, then in the new state, x > 0. This may seem pretty trivial, but it’s a very special case. In general, proving even simple things about arithmetic is quite difficult. But since Unobtainabol is an ideal language rather than a real language, this is not a problem; it is a research opportunity.

So why did we use the keyword “precondition” rather than “assert”? To see what is special about preconditions, we can do another build and get the warning: “file bar, line 122: can’t prove precondition x >= 0 on function f”. What’s going on here? The function f is defined in file foo. How does adding a precondition to function f create a warning in a completely different file, bar?

Well, we look at the line with the warning and we see this:

f(y);

The precondition on function f gets tested, not in the function, but at the point of the call. The compiler sees a condition x >= 0 on the formal parameter x and it tests that condition on the actual parameter y. The test fails because the compiler can’t prove that y >= 0. Preconditions make it possible for the compiler to prove conditions across function calls without doing interprocedural analysis.

So let’s say that the function call occurs in this sequence:

y := g(z);
f(y);

We can fix the warning if we have a way to say that g always returns a nonnegative number. We do this with a postcondition on g:

function g(z)
 postcondition result >= 0;
 ...
end

A postcondition, like a precondition, gets evaluated at the point of the call. The name “result” is a special variable that we can use in postconditions to assert things about the return value of a function.
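
Unobtainabol is supposed to discharge these conditions at compile time, which an ordinary runtime cannot do, but the call-site behavior of preconditions and postconditions can be sketched with Python decorators. The decorator names and the use of result as the postcondition’s argument are my own invention for illustration; the point is that both checks run at the point of the call.

import functools

def precondition(test):
    """Check a condition on the arguments when the function is called."""
    def wrap(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            assert test(*args, **kwargs), "precondition failed for " + f.__name__
            return f(*args, **kwargs)
        return wrapper
    return wrap

def postcondition(test):
    """Check a condition on the return value ("result") when the call returns."""
    def wrap(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            result = f(*args, **kwargs)
            assert test(result), "postcondition failed for " + f.__name__
            return result
        return wrapper
    return wrap

@precondition(lambda x: x >= 0)
def f(x):
    x = x + 1
    assert x > 0                 # now guaranteed by the precondition
    while x > 0:
        x = x // 2               # stand-in for the real loop body
    return x

@postcondition(lambda result: result >= 0)
def g(z):
    return abs(z)                # always nonnegative, so the postcondition holds

y = g(-3)
f(y)                             # the precondition on f is checked here, at the call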

Checks and Assumptions

One problem with the iterative process described above is that in some cases it may prove too much trouble. The programmer may want just a C-style assert that is checked during testing and then assumed to be true during production runs. In other cases, the programmer may want the condition to always be tested, even in production.

For these purposes, Unobtainabol has two new forms of assert statements. An assertion followed by assume tells the compiler to simply assume that the condition is true, but to add a runtime check for debug code if the condition can’t be proven. An assertion followed by check tells the compiler to always verify the condition in both debug and production code if the compiler can’t verify the condition at compile-time. Here are some examples:

precondition check x>0;
postcondition assume x>0;
assert check x>0;
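
There is no direct analogue of check and assume in Python, but the intended difference can be sketched. A plain assert is removed when the interpreter runs with the -O flag, so it behaves like an assume-style test that exists only in debug runs, while an explicit test that raises behaves like check and is always executed. The helper names below are hypothetical.

def check(condition, message="check failed"):
    """Always verified, in debug and production runs alike."""
    if not condition:
        raise AssertionError(message)

def assume(condition, message="assumption violated"):
    """Verified only in debug runs: plain assert is stripped under python -O."""
    assert condition, message

def scale(x):
    check(x > 0)        # like "precondition check x > 0"
    assume(x < 1000)    # like "assert assume x < 1000"
    return x * 2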

Optimization Based on Assertions

Let’s look at that original assertion again:

assert x > 0;
do while x > 0;
 ...
od

What is interesting about this example is that if the compiler knows that x > 0, then it ought to be able to skip the test the first time through the loop. In fact, the compiler ought to be able to change this loop into a test-at-end loop.

I don’t know of any real-world languages where this is done because in real languages assertions are viewed as either runtime tests (in debugging code) or as non-entities (in optimized code). In order for the compiler to take advantage of this optimization opportunity, it would have to know that the condition is true either by proving it or because the programmer used assume.

It might seem reckless to let the compiler do optimizations based on assume assertions because the programmer might be wrong. However if the programmer is wrong, then the program is not working correctly anyway, so the danger is not all that much greater.

However, the danger is greater in an important way that has to do with undefined behavior. Undefined behavior is what you get when the language does not specify what happens in certain circumstances. The C library standard, for example, specifies that if you call free() twice on the same pointer without getting it back from malloc() in between, then the result is undefined. The language doesn’t say what will happen. Maybe the system will catch the error and give you a nice error message. Maybe the program will immediately crash or go into an infinite loop. Maybe everything will seem to be working but you get mysteriously wrong answers out. You just don’t know.

In higher-level languages there is typically no undefined behavior. Everything has a defined result, even errors. This is, in fact, a big part of the reason why higher-level languages tend to be slower than lower-level languages. There are always runtime tests needed to make sure that nothing undefined happens.

Now, if the compiler optimizes based on an assumption and that assumption turns out to be false, then sometimes the result will be undefined. There is generally no way to add tests to control what happens in this case because if you put in runtime tests to verify the assumption, then it defeats the purpose of the optimization.

So the question of whether to optimize based on assumptions depends on whether Unobtainabol will have any undefined behavior or not --and that’s a good question with arguments on both sides. I think the perfect language should fulfill both the performance needs of those who want undefined behavior and the security needs of those who do not. For this reason, Unobtainabol should have a compile-time option to say whether it will have any undefined behavior or not. Each compiled module must have a flag saying whether it has undefined behavior so the decision of whether to link that module can be made based on the build flags.

In general, we might want to have two versions of some libraries where the presence of undefined behavior can make a significant performance difference.

Saturday, May 11, 2013

The Super Branch


In the following I talk as if branches are strictly statements (phrases executed for their side effects that do not return values) but in many languages such as Lisp and SQL branches are expressions (phrases that return values). In Unobtainabol they are expressions also, but it is easier to talk about them as statements so I’ll keep using the statement terminology.

Branches are statements that let the user choose which of a set of statements to execute. They include the traditional if statement and case statement among others. Multibranches are branch statements that can branch more than two ways in a single statement. The traditional if statement is not a multibranch, but it is often extended with an elseif (or sometimes elif) feature that turns it into a multibranch. That is, a multibranch if statement would look something like this:

if condition1 then stmt1
elseif condition2 then stmt2
elseif condition3 then stmt3
...
else stmtn
fi

In the multibranch if statement, the computer goes down the list of conditions, choosing the first one that is true and executing the corresponding statement. If none of the conditions is true, then it executes stmtn.

Case statements have a less consistent form across languages. There are a wide variety of keywords used and some languages restrict the matching expressions in various ways. I’ll show a slightly modified form of the SQL version here because it is one of the least restrictive:

case expr
when match1 then stmt1
when match2 then stmt2
...
else stmtn
esac

In this sort of construct, the computer evaluates expr and then goes down the list of match expressions to find the first one that matches (evaluates to the same value as) expr. When it finds a match, it executes the corresponding stmt. If no match is found, it executes stmtn.

Since SQL allows any expression for the match expressions, you can get the effect of a multibranch if statement by using

case true
when condition1 then stmt1
when condition2 then stmt2
...
else stmtn

In fact SQL supports this usage by not requiring you to write true. You can just write case and then a sequence of when clauses.

Now, Unobtainabol is not really designed with the goal of mashing as much functionality as possible into single control structures, despite what you may think after superloops. The point of superloops is that combining that functionality added a lot of features that would not be there if we did just traditional loops. With multibranches, the situation is similar. There are many types of branches, and by combining them all into one construct, we hope to get a feature that is more powerful than the individual features, while still having the simple forms that users expect.

There is a complication in talking about multibranches because there are at least two kinds. With a multibranch if statement, the branches have boolean expressions and the one that succeeds is the one that is true. With a case statement, the branches are expressions and the one that succeeds is the one that matches the control expression. To make it easier to talk about branches in general, I’m going to refer to them all as having conditions that succeed or fail.

Up to now, I’ve been saying that the computer goes down the conditions in order and executes the statement associated with the first successful one. That is the most common sort of multibranch but there are other possibilities. Here are some options for what order to do the testing in:

  1. the conditions are tested in the given order
  2. the conditions are tested in some arbitrary fixed order (possibly for performance reasons)
  3. the conditions are tested in parallel
  4. the conditions are searched for the “best” matching condition if several can match

Each of these evaluation methods has certain advantages. Method 1 is the simplest. Method 2 avoids putting arbitrary constraints on the compiler that the algorithm doesn’t require. Method 3 can  be much faster on certain hardware and is useful for expressing certain kinds of algorithms. Method 4 is useful for lots of things, including as a way to implement polymorphism.

In addition, there are different possibilities for which statement to execute:

  1. Execute just one statement: If some condition succeeds then execute its corresponding statement. Otherwise execute the else statement.
  2. Execute the corresponding statements of all conditions that succeed. If no condition succeeds then execute the else statement.
  3. Execute the corresponding statement of the earliest condition that succeeds and then the following statements in the given order.

In addition, there are different kinds of checks that you might want a compiler to test for you in some circumstances:

  1. Are the cases all distinct so that only one can succeed?
  2. Do the cases cover all possibilities or is it possible that none of them will succeed?

And finally, there are different kinds of matching such as unification and typecase, but that is a whole different topic that I’ll skip for now.

So here is what a multibranch might look like in the perfect programming language:

superbranch-stmt:
 case superbranch-body esac |
 if superbranch-body fi

superbranch-body:
branch-control branch-clause*

branch-control:
order-control result-control condition-control

order-control: ordered | unordered | (empty: default=ordered for if, unordered for case)

result-control: first | set-function | (empty: default=first for ordered, any for unordered)

condition-control: expression | term | (empty: default=true)

Here are some notes on the syntax:
  • A multibranch that uses the if ... fi syntax is called an if statement and one that uses the case ... esac syntax is called a case statement.
  • Case statements are checked by the compiler. If the compiler cannot prove that at least one of the conditions will succeed, then it is an error. You can always avoid this error by adding an else.
  • Case statements that do not have an explicit result-control are checked by the compiler to see if at most one condition can succeed. If the compiler cannot prove that at most one condition can succeed it is an error. You can avoid this error by adding an explicit result-control or by changing it to an if statement.
  • If there is no branch-control in a branch-body then the first when is optional. This means that you can use a more traditional syntax like this:

if expression then stmt1 else stmt fi

For a multibranch if statement, instead of elseif, you use when:

if expression then stmt
when expression then stmt
when expression then stmt
else stmt
fi

  • “else” acts exactly like “when e then” where e is some expression that always succeeds in that particular superbranch. Depending on the controls, it may be allowed and useful to have multiple else clauses.
  • A set-function is a function like max, min, sum, etc. that takes a set of arguments and returns a result. The idea is inspired by SQL. Any is a set function that just returns some arbitrary member of the set. If you use a set function that picks the “best” result in some sense and if the order-control is unordered, then the compiler can try to optimize the output by doing a direct search for the best result.
  • first is different from any or any other set function because first prevents execution of any more conditions after a success, while with a set function, all conditions would be executed. The compiler could not always optimize any in this way because if there are side effects in the conditions, it would alter the behavior.
  • With unordered and a set function, there is no restriction on the order of evaluating the conditions, and all conditions will be evaluated, so this combination allows for parallel execution of the conditions. That’s why the default result-control for unordered is any rather than first. And since the default order-control for case statements is unordered, a regular case statement with unspecified result-control and unspecified order-control can be executed in parallel. And of course, the compiler can also use parallel execution whenever the conditions have no side effects.
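
As a rough illustration of how order-control and result-control interact, here is a small Python sketch of a multibranch evaluator. Every name in it is made up; the point is only that an ordered branch with first stops at the first succeeding condition, while a branch with a set function evaluates every succeeding condition and combines the values (which is also what makes parallel evaluation possible when the conditions are pure).

def superbranch(clauses, result="first", default=None):
    """clauses is a list of (condition, action) pairs of zero-argument functions.

    result="first"    : run only the action of the first succeeding condition
    result=<function> : run every succeeding action and combine the values
                        (this sketch still tests in the given order; "unordered"
                        only means the order is not semantically constrained)
    """
    if result == "first":
        for cond, action in clauses:
            if cond():
                return action()
        return default() if default else None
    values = [action() for cond, action in clauses if cond()]
    return result(values) if values else (default() if default else None)

x = 7
# ordered / first: behaves like a multibranch if statement
print(superbranch([(lambda: x < 0,  lambda: "negative"),
                   (lambda: x < 10, lambda: "small"),
                   (lambda: True,   lambda: "large")]))      # -> "small"

# unordered / max: every succeeding clause runs and max picks the "best" value
print(superbranch([(lambda: x % 7 == 0, lambda: 7),
                   (lambda: x > 5,      lambda: 5),
                   (lambda: x > 100,    lambda: 100)],
                  result=max))                               # -> 7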

Tuesday, May 7, 2013

Parameter Passing III


In previous postings, I have dealt with the processing-ordered sequence of parameter passing mechanisms in excruciating detail. In this post I will discuss a number of other factors besides how much processing is done to the arguments.

Pass-by-value-result

First, let’s look at aliasing. Aliasing is when two different variables denote the same location. It is what happens when you call a function with pass-by-reference:

var y;
function f(ref x)
 x := 1;
 do while x <= y;
   x := 2*x;
 od
end
f(y);

The variables x and y will alias each other during the call to f. This can have unexpected consequences for the writer of the function. The apparent purpose of  f is to set its actual parameter to the least power of 2 greater than the global variable y. Instead, it produces an infinite loop when the user calls f(y).
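
Python has no pass-by-reference for plain variables, but the hazard can be simulated by passing a one-element list as a mutable cell; the cell trick below is only for illustration, and it reproduces the infinite loop described above.

def f(x):                   # x is a one-element list standing in for a ref parameter
    x[0] = 1
    while x[0] <= y[0]:     # if x and y are the same cell, doubling x also doubles y
        x[0] = 2 * x[0]

y = [10]
f([0])                      # terminates: the cell ends up holding 16
# f(y) would never terminate, because x aliases y and the condition stays true.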

Because of various problems like this, pass-by-reference has a bad reputation, and deservedly so. Fortunately, there is an alternative that offers nearly the same functionality without the unexpected interactions. The alternative is pass-by-value-result. Actually, pass-by-value-result is a combination of pass-by-value and pass-by-result.

For a pass-by-result parameter, the value of the actual parameter is ignored and the formal parameter is considered undefined at the beginning of the function (unless it is declared with an initial value in the formal parameter list). The formal parameter should be assigned a value during the function call, but this value does not get put back into the actual parameter until the end of the function. Pass-by-result is usually indicated with the out keyword.

So let’s rewrite the function above to use pass-by-result:
var y;
function g(out x)
 x := 1;
 do while x <= y;
   x := 2*x;
 od
end
g(y);

Just changing that keyword from ref to out has a dramatic result. Now the formal parameter x is strictly a local variable, not aliased to the global y. When you call g(y), the function will assign to its local variable, x, the least power of 2 greater than y. Then, since x was declared as pass-by-result, at the end of the function, it will assign this value back into y, changing y into the least power of 2 greater than its previous value. Which is probably exactly what the programmer intended.

Pass-by-value-result uses a value copy in both directions: pass-by-value when the function is called and pass-by-result when it returns. In other words, with pass-by-value-result, you use the value of the actual parameter to initialize the formal parameter, and then at the end of the function, you assign the value of the formal parameter back to the actual parameter.

Languages with pass-by-value-result typically use the keywords in for pass-by-value, out for pass-by-result, and in out for pass-by-value-result. The keyword in is usually optional when there is no out keyword. Therefore you get the following:

function f(x)        // pass-by-value
function f(in x)     // pass-by-value
function f(out x)    // pass-by-result
function f(in out x) // pass-by-value-result
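
Python has none of these modes; arguments are always passed by object reference and a callee cannot rebind the caller’s variable. The usual way to get the effect of out and in out parameters is to return the new values and have the caller assign them back, which is essentially pass-by-value-result done by hand. The function below is a hypothetical rewrite of g from above.

def g(y_old):
    """Emulates "function g(out x)": the out parameter becomes the return value."""
    x = 1                    # the incoming value of an out parameter is ignored
    while x <= y_old:        # y_old plays the role of the global y
        x = 2 * x
    return x                 # copied back into the actual parameter by the caller

y = 10
y = g(y)                     # the explicit assignment is the "result" copy-back
print(y)                     # 16, the least power of 2 greater than the old y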

Pass-by-copy

To reiterate, pass-by-value means that the formal parameter gets assigned the dereferenced value of the actual parameter just like you would do in a normal assignment. Pass-by-reference means that the formal parameter becomes an alias for the actual parameter which must evaluate to a location. But there has been an unfortunate tendency for programmers to think that pass-by-value means “the function cannot change the actual parameter” and pass-by-reference means “the function can change the actual parameter”.

Because of this misunderstanding, a lot of programmers think that there is something confusing about passing objects by value. Objects are mutable values, values that can change over time. Because objects can change, it makes sense that a routine that can access the object can change the object. But this doesn’t make it pass-by-reference. Consider

function f(o:Object1)
 o.do_something();
end
var o = new Object1();
f(o);

The call to do_something() has some effect on the object o. It changes it in some way. But when you return from the function f, o still references the same object that it did before. What f(o) did was make a change in the object referenced by o. The function call did not change the variable o to reference a different object. Compare to

function g(ref o:Object1)
 o := new Object1();
end
var o = new Object1();
g(o);

Now g(o) actually changes the reference of o to make it refer to a different object. That’s what you can do with pass-by-reference and what you cannot do with pass-by-value on an object.
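
Python has only the first behavior: a function can mutate the object its parameter refers to, but it cannot rebind the caller’s variable the way g does with ref, because Python has no pass-by-reference. A short sketch, with a made-up Thing class standing in for Object1:

class Thing:
    def __init__(self):
        self.touched = False
    def do_something(self):
        self.touched = True

def f(o):
    o.do_something()        # mutates the object the caller also sees

def g(o):
    o = Thing()             # rebinds only the local name; the caller is unaffected

obj = Thing()
f(obj)
print(obj.touched)          # True: f changed the object itself

before = obj
g(obj)
print(obj is before)        # True: g could not make obj refer to a different object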

However, many languages such as C and C++ don’t make a clear distinction between objects and what I call abstractions --immutable values like numbers. In particular, a struct is a mutable value, an object, but when you pass a struct to a function, it can’t be changed. This is because C and C++ copy the struct rather than passing a reference to it. The truth is that C and C++ have a lower-level implementation-inspired concept of values that is based on memory, representations, and pointers, rather than dividing the world of values up into objects and abstractions as we do in the real world.

However, we can interpret the C behavior using the concepts of object and abstraction as follows: we say that C structs are objects that don’t use pass-by-value like numbers and arrays do. Structs use pass-by-copy instead.

Pass-by-copy is a parameter passing mechanism that only applies to objects, not to abstractions. It makes no sense to copy an abstraction; you can’t make a copy of the number five. There is just one number five and it isn’t the same as anything else. You can make a copy of a representation of the number, but the representation is not the number. So if you pass an abstraction to a formal parameter that is declared as pass-by-copy, it should be treated as pass-by-value.

Pass-by-copy is actually a useful mechanism to consider for Unobtainabol. It can be used in situations where you do not want to change the object and it is not convenient or not possible to send the object itself. For example, in a remote procedure call to another machine, you might want to copy the arguments that are objects.

Since pass-by-copy is a sort of variation of pass-by-value, it suggests the existence of similar variations of pass-by-result, where the value of the formal parameter at the end of the call would be copied back into the actual parameter. We might call such mechanisms pass-by-copy-result and pass-by-copy-value-result.
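
In Python terms, pass-by-copy is what you get if the argument is deep-copied before the callee touches it, and copy-result is the same trick applied on the way back; the sketch below is only an illustration of the idea, not a language feature.

import copy

def f_by_copy(o):
    o = copy.deepcopy(o)    # work on a private copy; the caller's object is untouched
    o.append(42)
    return o

data = [1, 2, 3]
result = f_by_copy(data)    # copy-result would be spelled: data = f_by_copy(data)
print(data)                 # [1, 2, 3]  -- the original is unchanged
print(result)               # [1, 2, 3, 42]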

Pass-by-reference-const

In C++, objects are structs. And as I said above, structs are a special case; they do not use pass-by-value as other C++ types do; instead they use pass-by-copy. This doesn’t work very well if you want the called function to modify the object, because it can only modify a copy of the object that you pass it, leaving the original untouched.

C gets around this unfortunate circumstance by using pointers. In C++ you can do the same, but pointers are a rather low-level and hazardous feature so the makers of C++ wanted to reduce the places where they had to be used. So C++ has call-by-reference, and when you want to pass an object to be modified in C++, you can pass it by reference.

In many cases, a function takes an object as a parameter but does not change the object in any way. It is useful, both to compiler writers and to users of the function, to be able to guarantee this to the caller of the function. You can do this with pass-by-copy, but pass-by-copy is expensive for large objects. So C++ encourages the use of pass-by-reference-const, which is pass-by-reference with a const declaration to prevent changing the thing referenced.

In fact, it is common among programmers who know what they are doing to use pass-by-reference-const as an optimization, not only for objects but also for abstractions with a large representation. For example if you wrote an immutable set class to represent fixed sets, you would have to make sure that no callee ever changed it but you also would want an efficient way of passing these things around. So you could pass them by reference-const.

This is the sort of low-level hackery that is not necessary in Unobtainabol. In Unobtainabol, you don’t pass something by reference unless you really want to create an alias for some reason. You don’t worry about efficient parameter passing because the language translator handles that. An abstraction that has large representations will be passed around as a pointer under the covers and the programmer never has to think about it; he just passes it by value.

Unobtainabol does have a way to declare that a formal parameter that takes an object will not modify the object --such things are useful for writing clean code. But the programmer doesn’t have to worry whether it gets passed by copy or by pointer in those cases. The language translator picks a good way to do it.

Unobtainabol might even implement under the covers a form of pass-by-copy-on-write, which passes an object by value, but with the protocol that if the callee decides to change the formal parameter, a copy is made first.
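
A rough sketch of that idea: wrap the argument in a proxy that serves reads from the shared object but makes a private copy the first time the callee writes to it. The CowList class below is hypothetical and only intercepts one mutating method, to keep the sketch short.

import copy

class CowList:
    """A list proxy that copies the underlying list on the first write."""
    def __init__(self, shared):
        self._data = shared
        self._owned = False
    def __getitem__(self, i):          # reads go to the shared object
        return self._data[i]
    def append(self, value):           # the first write triggers the copy
        if not self._owned:
            self._data = copy.copy(self._data)
            self._owned = True
        self._data.append(value)

shared = [1, 2, 3]
param = CowList(shared)
print(param[0])        # 1: reading does not copy
param.append(4)        # the copy happens here
print(shared)          # [1, 2, 3]  -- the caller's list is untouched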