Sunday, March 4, 2018

Schema Evolution

A schema is a collection of data definitions. There are three different places where schemas are used: to define the data structures within a single program, to define the data structures within a database, and to define the data structures that are communicated between different programs, typically over a network.

In the two external cases: databases and communications, there is an interesting problem associated with schemas: how do you change them? Software is evolving all the time and the schemas have to evolve with it. In the case of databases this creates a problem when a new version of software comes out with a new schema. The existing data is in the old schema but the software wants to use the new schema, what do you do? In the case of communications, there is a similar problem. It often isn't practical to change all software at once. A single entity may not even own all of the software. When two different pieces of software are using two different versions of the schema, how do they communicate?

I'll be focusing on Protocol Buffers which is oriented towards communication and Avro which is oriented towards databases, although either can be used for both purposes. If you don't know what these systems are, I suggest you read this excellent but somewhat dated article which discusses how they each treat schema evolution. What I want to discuss here is not how actual systems work, but how an ideal system would work. Since Unobtainabol is the perfect programming language, it has to have perfect schema evolution.

Typically, a schema is a collection of record definitions where a records is a collection of named fields and each field can have a set of properties. For example here is one record definition:
record Person {
  name: string title_case default 'Anonymous',
  phone,
  phone_type: string,
  sex allowed: {'M', 'F'},
  children: int optional,
}
Each fields is defined with a name followed by a (possibly empty) list of properties, followed by a comma. The properties illustrated above include type, default value, input conversions (name is converted so that each word in the name is capitalized), a list of allowed values, and an indication that a field is optional.

We can consider the following four categories of changes:add a field, remove a field, change the properties of a field, or change the name of a field. In the following V1 is a program that uses the record definition above, and V2 is a program that uses a modified definition. We'll discuss what happens when V2 sends messages to V1 and receives messages from V1. The case where V1 or V2 is a database is more complicated so we will defer it.

Deleting a Field

When V2 receives a field from V1, it receives a new field that it does not know what to do with. With Protocol Buffers, it will just ignore the unknown field. The problem with that solution is that getting a field that you don't recognize may indicate other problems, and the existence of those problems is hidden from you. It would be better to get some sort of warning when this happens. The simplest way to handle this is for V2 to know that a field has been deleted, and we can encode this in the record definition by adding a version tag like this:
record Person v2 {
  name: string title_case default 'Anonymous',
  phone,
  phone_type: string,
  sex: allowed {'M', 'F'},
  children: int optional deleted v2,
}
The changes are highlighted in bold. This record definition should be read as saying that it is version two of the record, and that this version differs from the previous version by deletion of the children field. When V2 is compiled, it not only knows the current version of the record, it also knows that the previous version had another field that is no longer present. When it then sees this field, it can safely ignore it.

The situation when V2 sends a record to V1 is more complicated. When V2 sends a record, it does not send the children field because it does not have a value for the field. Since V1 thinks the field is optional, this situation might have a trivial solution if we represent records in such a way that it is impossible to tell the difference between a missing optional field and a field that simply does not exist. This is essentially how Protocol Buffers works.

However, this solution only works for fields that were optional before they were deleted. What if we deleted the phone_type field? Within Google, they strongly recommend making all fields optional for just this reason. In fact, in proto3, all fields are optional--there are no required fields. Google's attitude on this was formed from a single incident where deleting a non-optional field led to significant downtime for the production search engine. Careers were in jeopardy, Panic ensued. Since that frightening experience, Google has become so committed to this model that proto3 doesn't even provide a way to know when a value was missing; it just silently adds a zero or an empty string. I hardly think it's worth pointing out the disasters that can occur from using bad data like this. Even worse, there is nothing to indicate where the problem is. The results of using bad data can percolate up the system for a long ways before causing a problem that seems completely unrelated to the message.

The ideal schema evolution does not silently insert bad data into your programs. Suppose we do delete phone_type and suppose V1 is the component responsible for sending SMS messages to the users, and that it needs the phone type to know if the phone can receive SMS. Google's strategy would turn phone_type into an empty string, which could cause V1 to crash. A better response is a reply message to V2, warning it that there is a problem in the communication. Maybe it's actually a bug in V2 that needs to be fixed.

To really solve the problem, we need to be able to supply a default value for required fields that have been deleted. There are two cases, either the field already had a default value, in which case it is very unlikely that there can be a problem using it, or you have to add a default value. Adding a default value at least requires the programmer to think about how older programs are using the field.

I'll also discuss three different places to generate the default value:

  1. in V2 which was the last component compiled, so it can know about all previous schemas.
  2. in the protocol itself, relying on a registry of schemas.
  3. in V1, from information provided by V2 or by a registry.

Before we discuss these options, let's go over what sorts of defaults we need to support. Sometimes you can just pick a constant for a default. Suppose we are removing phone_type because we are dropping the SMS service from our application. Then we want to just always say that a phone is a land line so that V1 does not send any messages:
record Person v2 {
  name: string title_case default 'Anonymous',
  phone,
  phone_type: string default 'landline' deleted v2,
  sex: allowed {'M', 'F'},
  children: int optional,
}
In this example, I assume that adding a default is always allowed, so you can add a default in the same change where you delete the field.

But what if there is no constant value that  makes a proper default? What if we are removing the phone type because we now have a way to figure out whether or not we can send an SMS based on the phone number? We need an expression to calculate the default, and the expression has to reference another field of the record:
record Person v2 {
name: string title_case default 'Anonymous',
phone,
phone_type: string
  default {get_phone_type(phone)} deleted v2,
sex: allowed {'M', 'F'},
children: int optional,
}
V1 may not have the function get_phone_type. This and similar issues make it impractical to try to get V1 to calculate its own default values. That leaves us with possibilities 1 and 2: either there is a schema registry and the protocol itself supplies the default value, or V2 supplies the default value.

The problem with a schema registry is that it is a very heavy-weight solution. Network services are hard to set up and maintain--always much harder than the documentation seems to suggest. Also, they are another point of failure in your system. There are applications for a schema registry, but it should be possible to solve schema evolution without one.

The only choice left is to make V2 responsible for everything. V2 has to have the code necessary to send and receive messages for every past version of the schema that might still be in operation. When V2 and V1 set up communication, they have to exchange their schema versions. The one with the older version then relies on the one with the newer version to do all necessary translation (except for possible optimizations that might push work to V1 to reduce message size).

to be continued

No comments:

Post a Comment