Saturday, January 4, 2014

Secrets of the ANTs Data Server 08 --Query Preparation

The ADS had a fast engine but another reason for its speed was that it compiled queries down to machine code rather than using an interpreter like most other SQL engines of the day.

C Compiler

The basic tradeoff between a compiler and an interpreter is that compilers tend to produce faster runtimes, but they are less portable because you have to write a machine-specific back end for each machine that you want to compile code for. Normally, this would have restricted a small company like ANTs to running on just one or two architectures but ADS got around this by compiling queries to C and then compiling the C using a portable C compiler. In the end, as I recall, we had Windows and Linux on x86 and x86-64, Solaris on SPARC and AIX on PowerPC. Six architectures is not bad for an application like ADS and a company with less than 15 programmers.

ADS needed a special version of the C compiler and ended up having to port the compiler to a couple of other architectures, so in hind sight the savings of using C may not have been that great over direct code generation. However, the C compiler was designed to have a highly portable back end so this work was not as expensive as it had traditionally been.

The reason that ADS needed a special version of the C compiler was that a regular C compiler was too slow for compiling user-entered queries. When the user types in a query to the client and then hits Enter, he expects to get the answer back quickly. It was just not tolerable to wait for a normal C compiler where the intermediate results would be written to disk one or more times and then read back by later phases.

The ADS version of the C compiler worked entirely in memory. It read the C program from an in-memory buffer, did the entire compilation in memory, and then wrote the resulting executable code to an in-memory buffer.

Prepare Caching

Although the in-memory compiler was very fast for a C compiler, it was too slow for interactive queries in a high-performance engine. ADS supported the ODBC notion of preparing queries where the user would compile a query once and then call it many times, perhaps with different arguments. A prepared query is called a “prepare”.

Prepares solved the problem of too-slow compilation for queries that used it, but this isn’t really a complete solution. Some SQL programmers may not use prepares when they should. In other cases, there is an intermediate between the SQL programmer and the SQL engine (a report generator, for example) and these intermediate systems may not be able to effectively use prepares.

Therefore, ADS would cache prepares. There were two prepare caches: one in the client driver and one on the server. The driver would prepare every query. If the client program called for immediate execution, then the driver would prepare the query and then execute it immediately. In all cases, the driver would create a hash of the query text and cache it with the prepare token (the token returned by the server to name the prepare for the driver). If a textually identical query came in again, the driver would just use the previous prepare.
Similarly, the server would cache prepares and share them between clients. If two different clients would send the same query, it would only be compiled once.

The trick to this scheme is that the exact same textual query is seldom sent more than once, but the same query format is often repeated. For example a ticketing application might do multiple queries to look up flights that look like this:

select * from flights where time between D1 and D2

Each query would have different values for D1 and D2, so they would not be textually identical. ADS dealt with this by having the driver extract literals from the queries before preparing them and turning the literals into parameters.

One aspect of this approach was that if you have two queries like

select * from t where x = 1

and

select * from t where x = 2.3

then they would not map to the same query. The first one has an integer argument and the second one has a float argument (ADS did not have a proper decimal type). In general, you can’t just eliminate the first query and assume it into the second by converting the integer to a float because float arithmetic can produce different results from integer arithmetic.

No comments:

Post a Comment