## Friday, September 14, 2018

### the probability of the unknown

John. C. Wright (incidentally one of the finest science fiction and fantasy writers of our time) discusses a problem called the Carter Catastrophe. Mr. Wright argues that the problem statement is nonsense. I argue in opposition that the reasoning of the problem is correct as far as it goes, but that it offers almost no information. Mr. Wright comes to his conclusion from the position that probability is about repeatable events, and if he were right about that, his conclusion would be correct, but although he has a lot of brilliant scientists and mathematicians who agree with him, I do not.

I am one of a minority who believe that probability is, in fact, a form of logic. Traditional logic is based on two possibilities called truth values: True and False. Although this is by far the most common kind of logic, there are others. Some logics use three truth values: True, False, and Unknown. Others use four: True, False, Unknown, and Contradictory. Probability is just another form of logic that has a range of values from True (probability 1) to False (probability 0) and all shades of unknown in between.

Mr. Wright actually wrote a great science fiction book involving a non-two-valued logic. The book is The Null-A Continuum, and "Null-A" stands for non-Aristotlean logic. Aristotle's logic was the usual True/False variety, and the name was no doubt intended for similarity to non-Euclidean geometry.

So, given the view of probability as a form of logic, what is the probability of a statement x that you have absolutely no information about? Clearly you have no reason to prefer one value over another, so the probability is 1/2, which means completely unknown. Now let me give you a little more information about x:
x is the statement: when I throw this 6-sided die, I will roll a 1
Now you have the additional information that there are six mutually exclusive possibilities. Which one should you favor? Well, you have no  reason to prefer one possibility over the other, so the probability is 1/6. If you are a frequentist like Mr. Wright, you can't actually give a number until someone tells you that the die is a fair one, but as a matter of logic, if all you know is that one of six possibilities is true, and you have no other information at all, then the probability of any of the possibilities is 1/6.

Notice, however, that once we have the additional information, we changed the probability from 1/2 to 1/6. The 1/2 is no longer relevant because we have more information (this is one of the ways that probability is less nice than traditional logic; traditional logic is monotonic in the sense that adding more information cannot change your answer).

With this background, let's get back to the Carter Catastrophe, which makes sense if logic is probability. Suppose you have no more information than this:
Event e has not happened in the last hour but will happen eventually.
Let's say that the time from one hour ago to the next time e happens is t. We have no information at all about where now sits in the time range from 0 to t, so any time in that range is equally likely. This means that, for example, the chance that were are in the top half (or the bottom half) of the range is 1/2. In fact, the chance that we are in any 1/n of the range is 1/n.

This is the sort of reasoning that the Carter Catastrophe is based on, and it is logically sound, but the issue is that it is logically sound given that we have almost no information. Any other probability based on any more information at all will be a better estimate. You can't argue that something is likely based on almost no information when there is actual information to look at.

Frequentists like Mr. Wright object to assigning probabilities like this for non-repeatable events, but there is a semi-frequentist interpretation: if someone were to pick a bunch of random problems of this low-information type and offer you bets based on the problem, then reasoning with logical probability is the best strategy you can use to maximize your wins (or minimize your loses), assuming the questions were not deliberately chosen to defeat probability logic.

## Sunday, March 4, 2018

### Schema Evolution

A schema is a collection of data definitions. There are three different places where schemas are used: to define the data structures within a single program, to define the data structures within a database, and to define the data structures that are communicated between different programs, typically over a network.

In the two external cases: databases and communications, there is an interesting problem associated with schemas: how do you change them? Software is evolving all the time and the schemas have to evolve with it. In the case of databases this creates a problem when a new version of software comes out with a new schema. The existing data is in the old schema but the software wants to use the new schema, what do you do? In the case of communications, there is a similar problem. It often isn't practical to change all software at once. A single entity may not even own all of the software. When two different pieces of software are using two different versions of the schema, how do they communicate?

I'll be focusing on Protocol Buffers which is oriented towards communication and Avro which is oriented towards databases, although either can be used for both purposes. If you don't know what these systems are, I suggest you read this excellent but somewhat dated article which discusses how they each treat schema evolution. What I want to discuss here is not how actual systems work, but how an ideal system would work. Since Unobtainabol is the perfect programming language, it has to have perfect schema evolution.

Typically, a schema is a collection of record definitions where a records is a collection of named fields and each field can have a set of properties. For example here is one record definition:
record Person {
name: string title_case default 'Anonymous',
phone,
phone_type: string,
sex allowed: {'M', 'F'},
children: int optional,
}
Each fields is defined with a name followed by a (possibly empty) list of properties, followed by a comma. The properties illustrated above include type, default value, input conversions (name is converted so that each word in the name is capitalized), a list of allowed values, and an indication that a field is optional.

We can consider the following four categories of changes:add a field, remove a field, change the properties of a field, or change the name of a field. In the following V1 is a program that uses the record definition above, and V2 is a program that uses a modified definition. We'll discuss what happens when V2 sends messages to V1 and receives messages from V1. The case where V1 or V2 is a database is more complicated so we will defer it.

### Deleting a Field

When V2 receives a field from V1, it receives a new field that it does not know what to do with. With Protocol Buffers, it will just ignore the unknown field. The problem with that solution is that getting a field that you don't recognize may indicate other problems, and the existence of those problems is hidden from you. It would be better to get some sort of warning when this happens. The simplest way to handle this is for V2 to know that a field has been deleted, and we can encode this in the record definition by adding a version tag like this:
record Person v2 {
name: string title_case default 'Anonymous',
phone,
phone_type: string,
sex: allowed {'M', 'F'},
children: int optional deleted v2,
}
The changes are highlighted in bold. This record definition should be read as saying that it is version two of the record, and that this version differs from the previous version by deletion of the children field. When V2 is compiled, it not only knows the current version of the record, it also knows that the previous version had another field that is no longer present. When it then sees this field, it can safely ignore it.

The situation when V2 sends a record to V1 is more complicated. When V2 sends a record, it does not send the children field because it does not have a value for the field. Since V1 thinks the field is optional, this situation might have a trivial solution if we represent records in such a way that it is impossible to tell the difference between a missing optional field and a field that simply does not exist. This is essentially how Protocol Buffers works.

However, this solution only works for fields that were optional before they were deleted. What if we deleted the phone_type field? Within Google, they strongly recommend making all fields optional for just this reason. In fact, in proto3, all fields are optional--there are no required fields. Google's attitude on this was formed from a single incident where deleting a non-optional field led to significant downtime for the production search engine. Careers were in jeopardy, Panic ensued. Since that frightening experience, Google has become so committed to this model that proto3 doesn't even provide a way to know when a value was missing; it just silently adds a zero or an empty string. I hardly think it's worth pointing out the disasters that can occur from using bad data like this. Even worse, there is nothing to indicate where the problem is. The results of using bad data can percolate up the system for a long ways before causing a problem that seems completely unrelated to the message.

The ideal schema evolution does not silently insert bad data into your programs. Suppose we do delete phone_type and suppose V1 is the component responsible for sending SMS messages to the users, and that it needs the phone type to know if the phone can receive SMS. Google's strategy would turn phone_type into an empty string, which could cause V1 to crash. A better response is a reply message to V2, warning it that there is a problem in the communication. Maybe it's actually a bug in V2 that needs to be fixed.

To really solve the problem, we need to be able to supply a default value for required fields that have been deleted. There are two cases, either the field already had a default value, in which case it is very unlikely that there can be a problem using it, or you have to add a default value. Adding a default value at least requires the programmer to think about how older programs are using the field.

I'll also discuss three different places to generate the default value:

1. in V2 which was the last component compiled, so it can know about all previous schemas.
2. in the protocol itself, relying on a registry of schemas.
3. in V1, from information provided by V2 or by a registry.

Before we discuss these options, let's go over what sorts of defaults we need to support. Sometimes you can just pick a constant for a default. Suppose we are removing phone_type because we are dropping the SMS service from our application. Then we want to just always say that a phone is a land line so that V1 does not send any messages:
record Person v2 {
name: string title_case default 'Anonymous',
phone,
phone_type: string default 'landline' deleted v2,
sex: allowed {'M', 'F'},
children: int optional,
}
In this example, I assume that adding a default is always allowed, so you can add a default in the same change where you delete the field.

But what if there is no constant value that  makes a proper default? What if we are removing the phone type because we now have a way to figure out whether or not we can send an SMS based on the phone number? We need an expression to calculate the default, and the expression has to reference another field of the record:
record Person v2 {
name: string title_case default 'Anonymous',
phone,
phone_type: string
default {get_phone_type(phone)} deleted v2,
sex: allowed {'M', 'F'},
children: int optional,
}
V1 may not have the function get_phone_type. This and similar issues make it impractical to try to get V1 to calculate its own default values. That leaves us with possibilities 1 and 2: either there is a schema registry and the protocol itself supplies the default value, or V2 supplies the default value.

The problem with a schema registry is that it is a very heavy-weight solution. Network services are hard to set up and maintain--always much harder than the documentation seems to suggest. Also, they are another point of failure in your system. There are applications for a schema registry, but it should be possible to solve schema evolution without one.

The only choice left is to make V2 responsible for everything. V2 has to have the code necessary to send and receive messages for every past version of the schema that might still be in operation. When V2 and V1 set up communication, they have to exchange their schema versions. The one with the older version then relies on the one with the newer version to do all necessary translation (except for possible optimizations that might push work to V1 to reduce message size).

to be continued