Sunday, April 21, 2013

Dave's Foreign Data --Introduction

Postgresql (aka Postgres) is a really impressive open-source project. It is a relational database management system that supports a large subset of standard SQL and also has lots of cool extra features as well as amazing extensibility. For example, you can write stored procedures in about a dozen different languages. You can create new aggregate functions, types and index implementations among other things. The latest versions also have a foreign-table interface which lets you create a table in a Postgresql database which is a sort of alias for data that exists somewhere else.

One of the cool features that Postgres has is table functions. You can define a function that returns a sequence of rows rather than just a value. You can call a table function in a contexts where a table is expected as long as the table is not being modified (as in an update statement, for example). The function then gets called when Postgresql needs rows and it returns the results of the function.

 I wrote a foreign-table interface for Postgresql a few years ago before Postgresql had a foreign-table feature so I had to fake it by writing a plugin. I'll call my project dfd (Dave's Foreign Data) to distinguish it from the new official Postgresql feature fdw (Foreign Data Wrappers). Dfd is different from fdw in a number of ways. First, dfd only works for a foreign SQL database. It would be a major effort to make dfd work for a non-SQL data source. By contrast, fdw is designed to be more general and there are wrappers written for such things as flat files and web sites. Second, when you install a foreign database in dfd, it automatically creates Postgresql foreign tables for all of the tables in the foreign database (because dfd was intended to let Postgresql be the query front end for a SQL data warehouse). In fdw, the user has to manually create foreign tables. Third, dfd contains optimizations that are not available by default or not even possible with fdw.

Dfd works as follows. There are two table functions
dfd_call(connection text, query text)
dfd_stub(connection text, tablename text)
The dfd_call() function sends a query to a remote database and retursn the results like any normal table function. This function is used for implementing foreign tables as well as allowing the Postgresql user to send pass-thru queries to the foreign database. The dfd_stub() function is a dummy that should never be called. It does not do anything other than report an error.

To represent a foreign table, dfd creates a view that calls dfd_stub(). For example, suppose the foreign database db1 has the following table:
create table foo(x int, y text)
then dfd links to this table by creating roughly the following view:
create view foo as
  select * from  dfd_stub('db1','foo')
   as tmp(x int, y text)
The part following the second "as" specifies the row type of the foreign function. This is required by Postgresql when a table function was not declared with a row type (as an aside, this is an interesting way to support late binding in a language with static types and I'll probably steal the idea for Unobtainabol).

Now, since dfd_stub() is a dummy function, you can't call the view as it is. Dfd contains a pluggin that goes into Postgresql between the parsing phase and the planning phase. This plugin looks for calls to dfd_stub() and modifies the syntax trees that contain such calls. One of the things it does is replace the call
dfd_stub('db1','foo')
  as tmp(x int, y text)
with a call like
dfd_call('db1', 'select ... from foo ...')
  as tmp(...)
Where the ...'s represent whatever special code is generated to implement the foreign call. For example, if x is not used in the outer query, it will leave x out of the projection list and the corresponding type list.

The first implementation of dfd only used a plugin (written in C) and some PL/PGSQL code. Unfortunately, it turns out that Postgresql table functions don't return a row at a time. In order to support cursors, Postgresql evaluates a table function all at once, storing the rows in a buffer then returns them one at a time as required by the plan executor. In an application like dfd, this creates too much first-row latency to be tolerable, so we had to modify Postgresql to return rows one at a time for table functions.

Of course modifying the Postgresql source code causes problems whenever you want to upgrade to the next version of Postgres, so for the next version of dfd, I'm hoping to exploit the fdw mechanism. To do that, I'm going to have to find a way to do some optimizations that fdw does not support.

More on dfd here.

No comments:

Post a Comment