Saturday, April 27, 2013

Dave's Foreign Data --generating foreign SQL and optimizing subqueries

Dfd is a foreign-table facility that I wrote for Postgresql back before Postgresql had Foreign Data Wrappers (fdws). It is based on table functions and mostly operates as a plugin between the parser and the planner. See previous posts for an overview, a description of single-table optimization, and a description of whole-query optimization.

In this post I'm going to describe the generation of SQL code for foreign servers from Postgresql parse trees. This is a non-trivial problem because the foreign server is not Postgresql in my application. This means that I'm generating programs from one language using parse trees from a slightly different language.

Translating Clauses and Literals

I found a Postgresql module for generating programs in the source code (I believe it was for saving triggers, constraints, and such things in system tables). At the query level, SQL is well-enough standardized that the code that Postgresql generated was pretty close to what I needed, at least at the level of clauses. Every SQL has the basic SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY form. This saved me a lot of work by already handling all of the basic clauses. All I had to do was trim out some of the generating functions for nodes that the foreign server does not support and do a bit of target-specific code generation.

I also had to write specialized code to generate some of the literals using the special rules required by the foreign server.

If I were doing this again, I believe I would instead extend the table-driven method that I used for functions and operators (described below) to handle clauses and literals. Such a mehod would have been more extensible and would have made converting to later versions of Postgresql easier.

Translating Functions and Operators

The rewrite-Postgres-source-code strategy works for syntax because SQL is so well-standardized, but it works less well for functions and operators which are much less standardized. To handle functions and operators, I wrote a table-driven translation facility. There are dfd functions that you call to tell dfd how to translate various functions and operators. The most general are
dfd_translate_function(_schema text, _name text,
  _sig text[], _precedence int, _translation text)
 dfd_translate_binop(_name text,
  _argtype1 text, _argtype2 text,
  _precedence int, _translation text)
dfd_translate_unop(_name text,
  _argtype1 text, _precedence int, _translation text)
These functions specify translations for functions, binary operators, and prefix unary operators in order.
The parameters are

  • _schema --the schema that a function is defined in
  • _fname --function name or operator symbol
  • _sig, _argtype1, _argtype2 --parameter type information as type names.
  • _precedence --a small integer representing the precedence of the operator on the foreign server that is used to properly insert parenthesis. Precedence doesn't matter for funcion-call syntax so we just use 0 in those cases. Note that this is the precedence on the foreign server --we don't care about precedence on the local server because the Postgresql parser has already handled that for us.
  • _translation --a string representing how to translate the function. The translation uses % as an escape symbol for parameters. So %1 is the first argument to a function, the left operand to a binary operator, or the single operand to a unary operation. %2 is the second argument to a function or the right operand to a binary operator. Higher numbers are higher-numbered function parameters.

Here are a couple of examples:
dfd_translate_binop('+', 'int8', 'int8', 5, '%1 + %2')
dld_translate_unop('|/', 'float', 0, 'sqroot(%1)')
Notice that you can translate operators into function calls. You can also translate function calls into operators (which is why there is a precedence argument for dfd_translate_function). Also, there are no parenthesis around %1 and %2 or around the entire expression as you might expect. The _precedence argument is used to handle parenthesizing automatically.

For my particular application, I had no need to handle trinary operators  (like between) or suffix operators but it would be fairly easy to add them.

Notice that these translation tables also handle decisions about whether or a particular Postgresql expression is exportable (can be translated and sent to the foreign server). If there is a translation registered, then it can be translated, otherwise it can't. For functions and operators, there was no need to write special code to determine whether a given expression could be exported or not as was done with the clauses.

Translating Foreign Subqueries

If a foreign subquery appears in another query, nothing special really needs to be done. The query is simply generated in place just like it would be for a main query.

There is the special case for a correlated subquery on a foreign table. Dfd considers this sort of query an error by default, although there is an option that will cause it to simply generate the subquery in-line as it would for an uncorrelated subquery.

This is a judgment call and one that I'm still not sure of. My thought was that a query like that is going to call the foreign server for each row returned by the local query and that is almost certainly a mistake. Usually the programmer doesn't really want to do that. On the other hand, I'm not a big fan of systems that try to protect the user from himself.

Translating Local Subqueries inside Foreign Queries

Sometimes an expression would be exportable to a foreign server except that it contains a subquery on a local Postgresql table. Dfd handles this depending on several things. A correlated subquery is just considered non-exportable so it gets evaluated locally instead of sent to the foreign server. For example, suppose that t1 is a foreign table with integer column x1 and and t2 is a local table with integer column x2. Then
select x1 from t1
  where x1>0 and exists (select * from t2 where x1=x2)
gets translated as
select x1
  from dfd_call('f', 'select x1 from t1 where x1>0')
    as tmp(x1 int)
  where exists (select * from t2 where x1=x2)
If the subquery is uncorrelated then it is evaluated before the foreign call and included in the foreign call as a literal. For example
select x1 from t1
where x1=(select x2 from t2 where y2=0)
would get translated to
select x1 from dfd_call('f',
  'select x1 from t1 where x1=' ||
   dfd_literal((select x2 from t2 where x2=0)))
  as tmp(x1 int)
The subquery is evaluated as part of a table-creating-expression and the result gets translated into a string representing a literal for the foreign server. This literal then gets concatenated into the query string producing a foreign query something like this
select x1 from t1 where x1=7
A predicate subquery such as EXISTS would be translated similarly except that it would be converted to a boolean literal.

If the subquery is uncorrelated and set-returning, then the same sort of translation is used, but the set returned by the subquery is turned into an IN-list. For example
select x1 from t1
  where t1 in (select x2 from t2 where x2=0)
 is translated as
select x1 from dfd_call('f',
  'select x1 from t1 where x1 in (' ||
  (select string_agg(dfd_literal(x2)) from t2
    where x2=0) || ')')
  as tmp(x1 int)
The subquery is modified to return an IN-list. After evaluation of the string expression, the query sent to the foreign server will look something like this
  select x1 from t1 where x1 in (2,5,7,10)
where the numbers in (2,5,7,10) are the results of the local subquery.

In the example above, I use the new string_agg function from Postgres, although in my implementation I had to write my own.

Notice that this form of sending the IN subquery is analogous to choosing a join order. The above translation chooses to execute the local query first and then "join" to the foreign query. It could instead have been treated like a correlated subquery and simply not considered exportagble. This would produce the translation
select x1 from dfd_call('f', 'select x1 from t1')
  where x1 in (select x2 from t2 where x2=0)
This translation would send less data to the foreign server but bring more rows back from the foreign server. The choice of which translation to use should really be made in the planner rather than in a pre-planning plugin, but that option was not available and still is not available with the new fdw facility.

Translating Joins

As described in the post on foreign-n queries, there is no effort to optimize joins. The tables are just treated as individual foreign queries. However some kinds of joins could be optimized like uncorrelated subqueries, sending an IN list to the foreign server. That's for the next version.

No comments:

Post a Comment