Tuesday, April 30, 2013

Parameter Passing I

Parameter passing is a remarkably complex subject for something that appears so simple on the surface. First there is the terminology. When you call a function as in f(x+1,5), you pass arguments or actual parameters. When you define a function, as in

function f(x,y) return x+y; end

the names x and y are called formal parameters or sometimes just parameters when ambiguity is not a problem. So

argument = actual parameter = expressions you see at the function call
parameter = formal parameter = names you see at the function definition

Usually I will use “argument” and “parameter” unless it is especially important to avoid ambiguity and then I will use “actual parameter” and “formal parameter”.

So with that terminology under our belt, let’s see what’s so hard about parameter passing. Suppose you have a definition of a function

function f(x,y)
 var i;
 i := x+x;
 return i+y;

and you have a call of that function f(3,4). What could be more simple than to substitute 3 in place of each x and 4 in place of the y to get

i := 3+3
return i+4;

which returns 10. How can that be complicated?

Well, let’s start with that 3. What if instead of 3, we declare another variable i and the function call is

i := 3;

The ++ operation is from C. It first increments i and then returns its value. Now if we follow the proposed substitution, we get the first line of the function as

i := ++i+++i;

The intention is that this should be parsed as (++i) + (++i) but if we are using pass-by-text (which is the method used by the C preprocessor) then we might instead parse it as ((++i)++)+i because ++ can be a suffix operator as well as a prefix operator.

But let’s assume that we don’t have pass-by-text, instead we have pass-by-phrase. A phrase is a parsed syntactic unit. You can think of it as an abstract syntax tree. With prhases the parsing is done before the call so we don’t have to worry about weird parsing interactions. This is similar to the parameter-passing method used in the macros of Common Lisp. With this mechanism, the parsing would be correct and the function body, after substituting parameters is

var i;
i := (++i) + (++i);
return i+4;

But this bit of code is declaring i as a new variable then calling ++i before assigning to the variable. That’s a reference to an undefined variable. What happened to the 3 that was assigned to i before the call? That assignment doesn’t count in the function body because i in the function body is a different variable than it was at the point of the call.

OK, what we need is something like a phrase but where the names have already been bound so they can’t be captured by the declarations in the function body. What we need is a thunk. A thunk can be thought of as a bit of compiled code along with some sort of namespace, but in abstract it is a phrase where the variables are already bound and cannot be captured in another context.

Several languages starting with Algol 68 have had a pass-by-thunk parameter passing mechanism, although most of them called it pass-by-name. I prefer “pass-by-thunk” because “pass-by-name” doesn’t make any sense.

So let’s take a look at what would happen with pass-by-thunk. The arguments would be converted into thunks which means into something like a function call with no arguments:

function thunk1() return ++i; end
function thunk2() return 4; end

except that i in thunk1 is an alias for the i at the calling point. Now the body of the function would be something like

var i;
i := thunk1() + thunk1()
return i+thunk2();.

This body evaluates thunk1() twice, which increments i twice so the result is 4+5+4=13. But surely what we really expected f(++i,4) to do is increment i once and return 4+4+4=12, so we aren’t done yet.

The problem seems to be that the thunk evaluated ++i twice. What if we defined the thunk to only evaluated it once? This is pass-by-lazy-thunk because the thunk is only evaluated when it’s need and then only once. We define it like this:

function lazythunk1()
 static var tmp := ++i;
 return i;

What this function does is execute ++i the first time it is called, but after that it just keeps returning i. This solution gives us the answer we want. It increments i once and returns 4+4+4=12.

But we aren’t done yet. Let’s try out another function

function g(x,y)
 if y < 0 then return x;
 else return y;

and lets call this function with the same arguments g(++i,4). Using the lazy thunk, this will produce the following body

if y < 0 then return lazythunk1();
else return lazythunk2();

In this code y is greater than 0 so lazythunk1() never gets evaluated and i is never incremented. But surely when the user wrote g(++i,4) he expected that after this call i would be greater by one.

So here’s another try. Let’s avoid all of this confusion by just saying that we will evaluate the arguments at the point of the call and pass to the function the results of that evaluation. The formal parameters of the function, then would not become names for the argument expressions of the function; they would become names for whatever those expressions evaluate to. This seems to handle both of the problem cases above. In both cases, the ++i gets evaluated once at the point of the function call and x gets bound to 4, the value returned by the expression. But now consider

function h(x,y)
 x := x+1;
 return x+y;

When the function is called as h(i,4), what will happen? Well, if x becomes a name for the result of the expression i, this is ambiguous. Since i is a variable it denotes a location rather than a value. So we could say either that i evaluates to a location or that it evaluates to the contents of the location. The issue is whether the evaluation that we do on the parameter stops before or after dereferenicng. If the evaluation stops before dereferencing, then the formal parameter x becomes an alias for the location denoted by the actual parameter i. This is called pass-by-reference or pass-by-var.

If we use pass-by-reference, we have a problem again. When the user calls h(i,4) the formal parameter x becomes an alias for the actual parameter i so when x is incremented by 1, so is i. But the user is not going to expect i to be modified when he calls f(i,4).

The final alternative is to say that the evaluation of the arguments goes all the way through to dereferencing any reference results so that the formal parameters get bound to value only. This is called pass-by-value and it turns out to give the expected semantics.

Since Unobtainabol wants to give users the expected results, should it just use pass-by-value? The problem with that is that pass-by-value is not powerful enough. Pass-by-value is great for calling a function that takes a set of argument values and manipulates them in some way, but sometimes you want something more powerful.

I'll discuss that in a future post.

Monday, April 29, 2013

the super loop

I have a confession to make. I really prefer the C syntax with braces over the keyword-bracketed do...od type of syntax that I chose to talk about Unobtainabol. The only reason that I went with the keyword-bracketed syntax was to support this notion of a super loop that has been tasking me for several years now. Even more embarrassing, the super loop is really more of a syntactic curiosity than a real language feature and I believe that language design really ought to have moved on from syntax by this point. However, the super loop is neat so I'm throwing out principle and adding it to Unobtainabol.

Here is the basic intuition behind the super loop. There are various versions of while and until loops in various languages, some of which test at the top and others of which test at the bottom. Of course, since you sometimes need to test in the middle, these languages also include a special break statement for breaking out of loops from the middle. In addition, many languages also have a special iterator loop to iterate through collections. My thought is, why not make the whole thing more elegant (one of the goals of Unobtainabol) by having a unified do-anything loop? So here is the Unobtainabol super loop:
super-loop: do  loop-statementod
The "+" following loop-statement should be read as "one or more of loop-statement". A super-loop will loop endlessly through the list of loop-statement+ unless a loop-statement causes an exit. Any declarations made in the body of the super-loop is active from the point of declaration to the end of the super-loop. A loop-statement is a regular statement or one of the following special statements called a loop-controller:
loop-statement: statement | loop-controller
  loop |
  then |
  finally |
  while condition; |
  until condition; |
  for generator-assignment+; |
A loop-controller has a special significance to the execution of the super-loop as follows:
loop: any statement that appears before loop is executed only once before the first iteration begins. The statements before loop typically include declarations and initializations. No other loop-controller may appear before loop. Notice that loop is optional. It is only needed if you have some initialization.
then: any statement that appears after then and before another loop-controller is executed only once after the last iteration of the loop and only on a normal exit from the loop (that is, not a thrown exception). The statements after then typically include finalization and cleanup code. No loop-controller other than then or finally may appear after a then.
finally: any statement that appears after finally and before another loop-controller is executed only once after the last iteration of the loop on either a normal or abnormal exit. The statements after finally typically include finalization and cleanup code. No loop-controller other than then or finally may appear after a finally. If there are multiple instances of then and finally the statements following them are all executed in order on a normal exit, but on an abnormal exit, only the statements following a finally are executed.
while condition: execution of a while statement evaluates condition. If condition is false, then a normal exit from the loop occurs immediately.
until condition: execution of an until statement evaluates condition. If condition is true, then a normal exit from the loop occurs immediately.
Before describing the effect of the for statement, I have to explain what a generator-assignment is:
generator-assignment: var-declaration = generator | variable := generator
A generator is an expression that can return a sequence or set of results (more on that in a future post). Basically, a generator assignment is a declaration of a variable that is initialized with a generator or a variable declared previously that is assigned a generator expression. In the following description any such variable is called a generated variable. The for works like this:
for generator-assignment+: on the first iteration of the super-loop, the generated variables are set to the first generator value. At each subsequent iteration, the values are updated by the generators using a first-in-first-out cross-product evaluation of the generator-assignment expressions. When all generators are exhausted, a normal exit from the super-loop occurs.
Notice that the for can give you the effect of a nested loop because it acts like a nested loop over its generators, but no other loop-controller gives this effect. You can have multiple while, until, and for statements but they are all evaluated sequentially. To get the effect of a nested loop other than with multiple generator-assignment expressions in a single for, you have to nest the loops.
Here are some examples of the super-loop in action. I'll use ... to indicate a list of arbitrary statements. Here is how you loop forever (say as the main loop of a daemon process):
These three all do the same thing:
  var i:int=0
loop while i < 10;
  i := i+1;
  var i:int=0
loop until i >= 10;
  i := i+1;
do for var i := 0 to 9;
These two do the same thing again, but with the test at the end
  var i:int=0;
while i < 10;
  i := i+1;
  var i:int=0;
until i >= 10;
  i := i+1;
This one does something similar to the test-at-end loops, but i is not visible in the body of the loop
for var i := 0 to 9;
This loop works like a nested loop, iterating for every combination of i, j, and k having the values from 1 to 10
  var i,j,k
loop for i := 1 to 10, j := 1 to 10, k := 1 to 10
Don't pick on me for not showing the commas between the generator-assignments in the syntax. I didn't want to complicate it. Notice that you need the loop statement even though the only thing that comes before it is a declaration with no initialization. Otherwise the variables would be declared anew each time through the loop and the generators would loose the variables they are generating into.

The real power of super loops comes with combinations of loop controllers:
  var x :=
<something that has to be cleaned up>;
loop for const i := 1 to 10;
  const y := f(i,x);
while y < 10;
  const z := g(i,y,x);
until z > 10;
<clean up x>
By the way, Unobtainabol still has a break statement. This is needed since while and until are part of the super-loop syntax and can only occur at the top level. Sometimes you need to break out of a loop from inside a nested compound statement.

Sunday, April 28, 2013

static and dynamic dispatch

Cecil is a very cool language from the University of Washington. Cecil has had a big impact on my view of classes, types and dispatching, some of which I’ll discuss here and some of which will be left for a future post. In this post I’ll talk about static and dynamic dispatch, starting with the venerable concept of overloaded operators.

Many languages have overloaded operator symbols. Typically, the symbol + can refer to either an integer operation or a floating-point operation depending on the types of the operands (I’m going to ignore the difference between single-precision and double-precision in this discussion). In the expression x+y, if both x and y are int, then the compiler generates an integer arithmetic instruction. If both are float, then it generates a floating-point arithmetic instruction. Most languages also handle mixed operands; iIf one operand is an int and the other is a float then the compiler generates a floating-point instruction and generates code to convert the int to a float.

There are two ways to look at such operators. Either we can say that there are two different operations with the same operator symbol, or we can say that there is one operation with two different implementations or branches. I’ll use the second terminology because it makes clear that there is a special relationship between the different branches --they aren’t independent things that just happen to have the same name.

Some languages such as C++ also allow you to declare overloaded functions such as:

float power(float,float);
int power(int,int);

Given these declarations, when C++ sees an expression power(x,y) it will chose an implementation based on the signatures of of x and y.

The decision of which branch of a multiply-defined function to use is called dispatch. When dispatching is done at compile time based on the statically-defined signatures of the arguments, then this is called static dispatch.

Dynamic dispatch is possible as well. With dynamic dispatch you choose the branch of the function based on the runtime types of the arguments. Some object-oriented languages such as Cecil have multi-methods rather than overloaded methods. Multi-methods are similar to overloaded methods but they are dynamically dispatched. The choice of which branch to call is based on the run-time types of the arguments rather than the signatures. In a strongly, statically typed language where signatures (the types of expressions) correspond directly to sorts (the types of values) this just means that you choose the branch at compile-time rather than run-time. You pick the same branch either way, you just choose it at a different time.

But in languages where a signatures can match several sorts (including any language with subclassing), this is no longer true. Consider the following example: there are two classes X and Y where Y is a subclass of X. There is a function defined with two branches:


Now consider

x:X = new Y;

where x is declared with signature X but signature. The signature of x doesn’t tell you what it’s runtime sort will be because the signature X matches two different sorts: X and Y. At runtime x will actually contain a value of sort Y. So what branch will the compiler call for f(x)? If it uses static dispatching then it will choose the first branch f(a:X) because the statically-declared signature of x is X. If it uses dynamic dispatching then it will choose the second branch f(a:Y) because the dynamic type of x is Y at the point of the call. C++ has static dispatching; Cecil has dynamic dispatching.

Cecil also has optional type declarations. The type declarations are primarily there to provide type safety rather than optimization because in general the declarations do not allow the Cecil compiler to do static dispatching. Cecil has inheritance, and as discussed above, in a language with inheritance (or other ways that a signature can match multiple sorts) static dispatch does not chose the same branches as dynamic dispatch.

Unobtainabol has both dynamicness and verifiability as goals, so it uses dynamic dispatching with optional type declarations like Cecil. However, Cecil treats all parameters symmetrically and treats the message receiver as just another parameter. This violates another goal of Unobtainabol --the principle of least surprise. In a method call like this:

obj.foo(x, y)

doesn’t it look like obj should be treated differently from x and y in deciding what branch gets executed? Doesn’t it look like obj is a more important determiner? I’m not sure exactly how Unobtainabol is asymmetric because branch-choosing algorithms are a complex subject. But I’m pretty sure that I want the message receiver to be a dominating factor in the algorithm because that is what most users will expect.

Unobtainabol has another goal to consider here --low-levelness. The user ought to be able to control whether dynamic or static dispatch is used. For a clue on how to do this, let’s look at the partial support that C++ has for dynamic dispatch. In C++ you can declare a method (i.e.: member function) as virtual. When a method is virtual the compiler uses dynamic dispatch on the message receiver. For example:

struct X {virtual f(int i); virtual f(float i);};
struct Y : public X {virtual f(int i); virtual f(float i);};
X* x = new Y;
int i = 0;

Even though the static type of x is X, the compiler will call the branch of f associated with Y because f is virtual and the dynamic type of x is Y.

Dynamic dispatch only works for the message receiver, x. C++ still uses static dispatching on all of the other arguments (such as i in this example).

Unobtainabol has a similar feature except that it is more flexible, potentially applying to any collection of parameters. Also, the Unobtainabol feature takes dynamic dispatch as the default rather than static dispatch. In Unobtainabol, you can declare a method as static which means that it uses static dispatch on the message receiver. Similarly, for any routine you can declare any collection of arguments as static, meaning that the compiler uses static dispatch on that argument.

Unobtainbol has another goal: predictability. Since dynamic dispatch is the default, a function that uses static dispatch has to be called with the curly-brace syntax: f{x,y}. This syntax is used in Unobtainabol to warn that user that something unusual is going on and he better look at the signature of the function.

As I mentioned above, dispatch algorithms are complex and an algorithm that selects a branch in a way that is human-understandable and human-predictable based on an arbitrary combination of static and dynamic dispatch may be a bit hard to come up with. But since Unobtainabol is a perfect language rather than a real one, this is not a problem; it’s a research topic.

Saturday, April 27, 2013

Dave's Foreign Data --generating foreign SQL and optimizing subqueries

Dfd is a foreign-table facility that I wrote for Postgresql back before Postgresql had Foreign Data Wrappers (fdws). It is based on table functions and mostly operates as a plugin between the parser and the planner. See previous posts for an overview, a description of single-table optimization, and a description of whole-query optimization.

In this post I'm going to describe the generation of SQL code for foreign servers from Postgresql parse trees. This is a non-trivial problem because the foreign server is not Postgresql in my application. This means that I'm generating programs from one language using parse trees from a slightly different language.

Translating Clauses and Literals

I found a Postgresql module for generating programs in the source code (I believe it was for saving triggers, constraints, and such things in system tables). At the query level, SQL is well-enough standardized that the code that Postgresql generated was pretty close to what I needed, at least at the level of clauses. Every SQL has the basic SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY form. This saved me a lot of work by already handling all of the basic clauses. All I had to do was trim out some of the generating functions for nodes that the foreign server does not support and do a bit of target-specific code generation.

I also had to write specialized code to generate some of the literals using the special rules required by the foreign server.

If I were doing this again, I believe I would instead extend the table-driven method that I used for functions and operators (described below) to handle clauses and literals. Such a mehod would have been more extensible and would have made converting to later versions of Postgresql easier.

Translating Functions and Operators

The rewrite-Postgres-source-code strategy works for syntax because SQL is so well-standardized, but it works less well for functions and operators which are much less standardized. To handle functions and operators, I wrote a table-driven translation facility. There are dfd functions that you call to tell dfd how to translate various functions and operators. The most general are
dfd_translate_function(_schema text, _name text,
  _sig text[], _precedence int, _translation text)
 dfd_translate_binop(_name text,
  _argtype1 text, _argtype2 text,
  _precedence int, _translation text)
dfd_translate_unop(_name text,
  _argtype1 text, _precedence int, _translation text)
These functions specify translations for functions, binary operators, and prefix unary operators in order.
The parameters are

  • _schema --the schema that a function is defined in
  • _fname --function name or operator symbol
  • _sig, _argtype1, _argtype2 --parameter type information as type names.
  • _precedence --a small integer representing the precedence of the operator on the foreign server that is used to properly insert parenthesis. Precedence doesn't matter for funcion-call syntax so we just use 0 in those cases. Note that this is the precedence on the foreign server --we don't care about precedence on the local server because the Postgresql parser has already handled that for us.
  • _translation --a string representing how to translate the function. The translation uses % as an escape symbol for parameters. So %1 is the first argument to a function, the left operand to a binary operator, or the single operand to a unary operation. %2 is the second argument to a function or the right operand to a binary operator. Higher numbers are higher-numbered function parameters.

Here are a couple of examples:
dfd_translate_binop('+', 'int8', 'int8', 5, '%1 + %2')
dld_translate_unop('|/', 'float', 0, 'sqroot(%1)')
Notice that you can translate operators into function calls. You can also translate function calls into operators (which is why there is a precedence argument for dfd_translate_function). Also, there are no parenthesis around %1 and %2 or around the entire expression as you might expect. The _precedence argument is used to handle parenthesizing automatically.

For my particular application, I had no need to handle trinary operators  (like between) or suffix operators but it would be fairly easy to add them.

Notice that these translation tables also handle decisions about whether or a particular Postgresql expression is exportable (can be translated and sent to the foreign server). If there is a translation registered, then it can be translated, otherwise it can't. For functions and operators, there was no need to write special code to determine whether a given expression could be exported or not as was done with the clauses.

Translating Foreign Subqueries

If a foreign subquery appears in another query, nothing special really needs to be done. The query is simply generated in place just like it would be for a main query.

There is the special case for a correlated subquery on a foreign table. Dfd considers this sort of query an error by default, although there is an option that will cause it to simply generate the subquery in-line as it would for an uncorrelated subquery.

This is a judgment call and one that I'm still not sure of. My thought was that a query like that is going to call the foreign server for each row returned by the local query and that is almost certainly a mistake. Usually the programmer doesn't really want to do that. On the other hand, I'm not a big fan of systems that try to protect the user from himself.

Translating Local Subqueries inside Foreign Queries

Sometimes an expression would be exportable to a foreign server except that it contains a subquery on a local Postgresql table. Dfd handles this depending on several things. A correlated subquery is just considered non-exportable so it gets evaluated locally instead of sent to the foreign server. For example, suppose that t1 is a foreign table with integer column x1 and and t2 is a local table with integer column x2. Then
select x1 from t1
  where x1>0 and exists (select * from t2 where x1=x2)
gets translated as
select x1
  from dfd_call('f', 'select x1 from t1 where x1>0')
    as tmp(x1 int)
  where exists (select * from t2 where x1=x2)
If the subquery is uncorrelated then it is evaluated before the foreign call and included in the foreign call as a literal. For example
select x1 from t1
where x1=(select x2 from t2 where y2=0)
would get translated to
select x1 from dfd_call('f',
  'select x1 from t1 where x1=' ||
   dfd_literal((select x2 from t2 where x2=0)))
  as tmp(x1 int)
The subquery is evaluated as part of a table-creating-expression and the result gets translated into a string representing a literal for the foreign server. This literal then gets concatenated into the query string producing a foreign query something like this
select x1 from t1 where x1=7
A predicate subquery such as EXISTS would be translated similarly except that it would be converted to a boolean literal.

If the subquery is uncorrelated and set-returning, then the same sort of translation is used, but the set returned by the subquery is turned into an IN-list. For example
select x1 from t1
  where t1 in (select x2 from t2 where x2=0)
 is translated as
select x1 from dfd_call('f',
  'select x1 from t1 where x1 in (' ||
  (select string_agg(dfd_literal(x2)) from t2
    where x2=0) || ')')
  as tmp(x1 int)
The subquery is modified to return an IN-list. After evaluation of the string expression, the query sent to the foreign server will look something like this
  select x1 from t1 where x1 in (2,5,7,10)
where the numbers in (2,5,7,10) are the results of the local subquery.

In the example above, I use the new string_agg function from Postgres, although in my implementation I had to write my own.

Notice that this form of sending the IN subquery is analogous to choosing a join order. The above translation chooses to execute the local query first and then "join" to the foreign query. It could instead have been treated like a correlated subquery and simply not considered exportagble. This would produce the translation
select x1 from dfd_call('f', 'select x1 from t1')
  where x1 in (select x2 from t2 where x2=0)
This translation would send less data to the foreign server but bring more rows back from the foreign server. The choice of which translation to use should really be made in the planner rather than in a pre-planning plugin, but that option was not available and still is not available with the new fdw facility.

Translating Joins

As described in the post on foreign-n queries, there is no effort to optimize joins. The tables are just treated as individual foreign queries. However some kinds of joins could be optimized like uncorrelated subqueries, sending an IN list to the foreign server. That's for the next version.

Thursday, April 25, 2013

Functions, Procedures, Predicates, and Methods

Many programming language have what they call functions. These are pieces of code with formal parameters that you can call with arguments from different places in the code. Functions can return a value but in many languages it is also possible for a function to return no value. It can be called just to generate side effects.

In some languages such as Pascal they distinguish between functions and procedures. Functions return values and procedures do not.

Other languages have neither functions nor procedures but do have methods. Methods are like functions attached to an object. You call a method on an object as a way of sending message to the object.

Still other languages such a Prolog have none of the above but do have predicates. A predicate does not return a value or have a traditional side-effect. What is does is impose constraints on its arguments.

What these things all have in common is that they are ways to encapsulate a bit of computation or evaluation. Sometimes it is useful to distinguish between them, but in a lot of cases, I want to make a point about all of those things at once or about several of them at once, and it just gets confusing trying to say what I am talking about. In the past I've used the word "operation" to refer to all of those things at once, but that seems a bit awkward and can be confused with "operator".

Which reminds me, some language have a fifth kind of encapsulated computation that they call operators. In C++ operators are just functions that use operator syntax instead of function syntax.

From here on I'm going to try to use "routine" to refer to all of these things for encapsulating computation --the five I've mentioned and any others that come up. Sometimes it may not include predicates which are a bit different from other kinds of routines, but I'll try to make it clear in context. "Routine" is short, convenient, and ambiguous, which is what I'm looking for. To purists, it will imply an imperative style of programming, which isn't what I intend, but unless someone can come up with something better, that's what I'm going with.

Wednesday, April 24, 2013

Why Isn't Everything in Unobtainabol an Object?

You may have noticed that in a previous post I allow for sorts that are not classes. And of course this implies that there are values that are not objects. So you may be wondering why everything isn't an object like it is in a lot of the newer sexier programming languages. Well, making everything an object may be good enough for  mere mortal programming languages, but it just isn't good enough for the Ultimate Programming Language.

Object-oriented programming has a lot to recommend it, especially for large complex projects with multiple programmers, but it's not perfect and Unobtainabol is perfect.

Other programming languages have gone for purity in other directions. There are languages where everything is a list or everything is a set or everything is a relation or everything is a map or everything is an array. Well, lists, sets, relations, maps, arrays, and objects are all very powerful and general data abstractions. All of those things are very powerful, and technically capable of representing everything all of the others. But each one of those abstractions has it own strengths and weaknesses and when you commit to a pure solution, you get the corresponding strengths but also the corresponding weaknesses. By contrast, if you allow for a multitude of abstractions, you get all of the strengths and the weaknesses tend to cancel out.

Now you could argue that it is possible to actually implement all of those other things using objects. And that's true (just as it's possible to implement objects using those other things). You have two choices when you do this. Either you hide the object-oriented implementation from the user, in which case you have just added those things to the language, or you left the object-oriented implementation showing, in which case you haven't gotten away from the weaknesses of the object-oriented paradigm.

Unobtainabol fully embraces all paradigms that have proven widely useful in programming languages and mathematical notation. It doesn't try to restrict the programmer to some tiny subset of problem-solving techniques on the grounds that those techniques are formally adequate. Merely "adequate" isn't good enough for Unobtainabol.

Dave's Foreign Data --translating foreign-1 queries

As I described in a previous post, foreign-1 queries are queries that contain just one table, it being a foreign table. Foreign-1 queries allow for much more optimization than do foreign-n queries. With foreign-1 queries, dfd potentially sends all of the work to the foreign database.

Suppose there is a simple foreign table t(c1 int, c2 text) and dfd sees the query
select c2, sum(c1) from t where c1 > 0
  group by c2 having c2 like 'foo%'
If this were optimized like foreign-n queries then the following query would be sent to the foreign server:
select c2, c1 from t where c1 > 0
and this would bring back over the network all of the rows in the table where c1 > 0. But the entire query can be sent to the foreign server to execute the GROUP BY and aggregation on the cluster and only send back the groups and sums.

It isn't always possible to send the whole query, so dfd breaks down queries into a sequence of stages:

  1. projection list and WHERE clause
  2. aggregation and GROUP BY
  4. LIMIT

Dfd will send the longest initial sequence of stages that the foreign server can support. If at any stage, it cannot handle the whole stage, then it does as much of that stage as it can and then bails on further optimization. For example, in the query
select c2, sum(c1) from t where c1&1=1
  group by c2 having c2 like 'foo%' order by 1
The foreign server does not have the bitwise-& operator so it cannot send the entire projection list and WHERE clause to the foreign server. Since it cannot do a complete job at stage one, it doesn't even try to optimize further stages. Dfd will send the following query to the remote server:
select c2, c1 from t
and let the GROUP BY, HAVING, and ORDER BY all be executed locally. If the query had been
select c2, sum(c1) from t where c1>0
  group by c2 having c2<'fff' and c2~='f*g*' order by 1
where the foreign server can handle the projection list and WHERE clause as well as the aggregation and GROUP BY, but not the entire HAVING clause, it would send the query as follows
select c2, sum(c1) from t where c1>0
  group by c2 having c2<'fff'
and give up on the rest of the HAVING clause and the ORDER BY clause.

The foreign server does not handle OFFSET at all, so basically if it can handle everything up to stage 4 and then it sees an OFFSET, it has to bail on the OFFSET and ORDER BY. Actually, it's a bit more complicated than that. There are some interesting relationship between LIMIT, OFFSET, and ORDER BY that dfd exploits to get the most work from the foreign cluster.

Clearly, fdw cannot handle these foreign-1 optimizations because fdw works at the level of a single plan node while these foreign-1 optimizations all completely rework the plan. This sort of optimizations has to be handled as a query rewrite, which happens before the planning phase.

For a discussion how SQL is generated and how subqueries are optimized, go here.

Tuesday, April 23, 2013

Types, Sorts, and Signatures

The original purpose of types was to restrict expressions in such a way as to prevent Russel’s Paradox. Bertrand Russel discovered his paradox just before publication of his book Principles of Mathematics, and not wanting to delay publication too much, he introduced the mechanism of types as a solution in a hastily written appendix to the book (by the way, “Principles of Mathematics” is not just the English translation of “Principia Mathematica”. Principia Mathematica was a later work written by Russell and Whitehead).

Ever since that hastily-written appendix there has been some ambiguity over whether types belong to expressions or to values.The difference is not an insignificant one. Expressions are syntactic objects, part of a language. Values are the things that the language is about. So if types go with expressions then they are things that can be determined statically by examining the text of the program. If types go with values then you can’t assign types statically in any powerful language. You have to figure it out dynamically.

The ambiguity arises from the fact that a kind of type really applies to each kinds of thing. A value has a sort, which is the kind of thing that it is. A variable has a signature which describes the sorts of values that the variable can hold. Lets expand this notion of a signature to apply to any expression so that the type of a value is called a sort and the type of an expression is called a signature. We’ll let “type” retain its current ambiguous status.

There is more to the difference between a sort and a signature than just what they apply to. Consider a class fifo that is a subclass of the abstract class aggregate. An expression can have the signature fifo or the signature aggregate, but there can be no values of sort aggregate because it is abstract.

Expressions can have even more complex types. Unobtainabol has type expressions with the “or” operator, |. A type expression “int|float” would be read “the sort is either int or float”. This feature lets you implement linked lists without explicit pointers. For example:

class Link {
data: int;
next: Link|Null;

This says that the next element in the list is either another Link or is the special value Null. There is no need to refer to pointers, you can just use Link and let the type system handle pointer issues for you.

So in Unobtainabol we can declare a variable with any of a set of types and an expression that uses that variable can also have any of a set of types. For example in

var x: int|float;
y := x+1;

the expression x+1 can be either an int or a float but the value of the expression doesn’t have an ambiguous type. Whatever is in the variable x is one or the other. There are no values that are “int or float” in some undecided state (well ... maybe in advanced Unobtainabol). So the type of an expression can be broader than the type of a value can be.

Here is some new terminology for Unobtainabol:

sort: the type of a value. A sort consists of a set of possible values and set of basic operations on those values. Every non-virtual class is a sort as is every primitive type such as int and float.

signature: the type of an expression. Signature is a statically determined restriction on the sorts of values that an expression can evaluate to.

class: The type of a namespace (more on that in a future post).