If you have ever done serious work in SQL, you know how unwieldy and unreadable the queries can get with deeply nested subqueries and other complex structures. You can create temporary tables to make the query look cleaner, but then it might run slower. Pig Latin is a query language that addresses this problem by breaking out the individual query elements and letting you assign them to variables. For example instead of the SQL query
SELECT DISTINCT x,y FROM <table1> WHERE x<y ORDER BY x LIMIT 10
in Pig Latin you could write something like this
t1 = LOAD <table1>;
t2 = FILTER t1 BY x<y;
t3 = DISTINCT t2
t4 = FOR EACH t3 GENERATE x,y;
t5 = ORDER t4 BY x,y;
t6 = LIMIT t5 10;
The table variables t1, t2, etc. are not temporary tables. They are more like SQL views that get optimized in with the final DUMP query.
When queries are deeply nested and complex, this ability to break out the components can be very useful. But why restrict the language to such simple constructs? This reminds me of the old FORTRAN where you could not have expressions in an IF statement. An IF statement could only do a simple relational test such as less than, equals, etc. and could only test against variables, not against expressions, for example
i = k+1
IF i .EQ. j
IF i+1 .EQ. j
One of the big advantages of other languages like Algol was the ability to put a full boolean expression in that IF statement:
IF i+1=j OR i+2=j THEN
This allowed for much cleaner, less wordy programming. Yet Pig Latin seems to be following the old FORTRAN mistake of limiting what the arguments to a construct can be.
Another thing about Pig Latin that I think could be improved is to design the operators so that they can be run together to make an expression that looks something like a SQL query. For non-nested queries, SQL queries give a nice linear expression except for the SELECT clause. The SELECT is an odd duck that combines projection --which logically goes at the end, after ORDER BY-- and aggregation which logically goes with the GROUP BY clause. It’s actually position doesn’t fit either function that it serves. And with analytic functions, the SELECT is even more overloaded in what it does.
Here is an alternative way to design a query language that lets the query writer choose whether to break things out like Pig Latin or combine components to look like SQL.
<table> PROJECT <expression-list>
Projects <expression-list> from <table>. In other words, acts like the SQL query
SELECT <expression-list> FROM <table>
However, no aggregate functions are allowed.
<table> WHERE <expression>
Returns the rows from <table> for which <expression> is true. In other words, behaves like the SQL query
SELECT * FROM <table> WHERE <expression>.
<table> GROUP BY <aggregation-list>
Groups <table> by the expressions in <aggregation-list> that do not contain aggregate functions, and calculates group aggregates with the aggregate function. In other words, behaves like the SQL query
SELECT <aggregation-list> FROM <table> GROUP BY <columns>
where <columns> is a list of all of the expressions in <aggregation-list> that do not contain aggregate functions. If <aggregation-list> is “*” then this does a DISTINCT on the table. If <columns> is empty, then the aggregate functions are calculated over the whole table returning a 1-row result.
<table> ORDER BY <order-list>
Orders the rows of <table> according to <order-list>, producing an ordered table as output.
<table> LIMIT <expression>
Returns the first <expression> rows of <table>.
Now you can write compound expressions like this:
<table1> WHERE x<y GROUP BY * ORDER BY x PROJECT x,y LIMIT 10;
which looks a lot like SQL. Or you can break out the individual parts an assign them to table variables like Pig Latin does.
A few additional comments:
- In order to make the above 1-line query work, you have to define the precedence of the various keyword operators appropriately.
- There is no need for a HAVING operator because the WHERE operator will work just as well after a GROUP BY as before it.
- There are various ways to deal with the SQL analytic functions, but I haven’t come up with a really good solution yet. I’ll write more when I do.
- A complete language of this type also needs something way to do joins, for example an operator like <table1> JOIN <table2> ON <expression>.
- We also need UNION, EXCEPT, and INTERSECT, or some equivalent.
- In order to make this work well, we need to define an ordered table which is output by the ORDER BY clause such that the ordering continues through some subsequent operators (such as analytic functions).