Note added March 11, 2015: This post is a very speculative stream-of-consciousness,sort of plan for a parsing system. The ideas have been modified and refined and implemented as Classp--a classier way to parse. One major change came from reading Daniel Yellin's dissertation. He cleared up a major confusion that I had in this post. I was thinking that when you invert a formatter, you have to invert all of the attribute-evaluation functions. But no, you only have to invert the functions that actually write out the language. The attributes should have the same values no matter which direction you do the translation. So constraints are no longer needed; I think we can get by with just attribute evaluation, or maybe the more general concept of lazy evaluation of single-assignment variables.
If you have ever used Yacc or other parser generators, you have probably seen code like this:
Expression ::= Expression '+' Expression
{
$0 = makeNode(PLUS, $1, $2)
}
This expresses a context-free grammar production and an “output” for the production. After you run these productions, you end up with a parse tree and you typically write functions that traverse the parse tree setting various values like the types of the node, the meanings of names, and eventually the translation of the entire tree.
Attribute grammars are an extension of this concept, making it more general and more formal. An attribute grammar consists of a context-free grammar, a set of attributes associated with the productions, and a set of rules for assigning values to the attributes. Instead of explicitly creating a parse tree, with an attribute grammar you can use attributes to calculate fields. For example
ATTRIBUTE LanguageType type
ATTRIBUTE CodeSegment code
...
expr:Expression ::= expr1:Expression '+' expr2:Expression
{
expr.type = calculateType(expr1.type, expr2.type);
expr.code = "...";
}
This has always seemed to me to be a very nice idea, but it hasn’t proven very popular as a way to create programming languages. There are probably various reasons for this, but in my own experience, the reason I have chosen not to use an attribute grammar system is because they were just too complicated. Even if I could figure out how to make a translator based on an attribute grammar work, I wasn’t confident that people who inherited the project from me would be able to figure it out.
The main difficulty with attribute grammars revolves around the way that it is intertwined with parsing. The grammar productions are the basic entity and all else is organized around them.
I think there is a way to reorganize the idea of attribute grammars, making them much more accessible and also making them useful for problems that have nothing to do with parsing. I call this idea the constraint attribute graph.
A constraint attribute graph consists of a set of nodes with attributes. A node is very much like an object in any object oriented language except that it is immutable. The attributes of a node are very much like the members of an object, except that they are constraint variables (similar to single-assignment variables). Nodes classes can inherit from or extend other node classes just as in object-oriented languages.
Unlike the productions of an attribute grammar, the nodes of a constraint attribute graph are not bound to a grammar production. For example:
class Expression {
LanguageType type;
CodeSegment code;
Expression args[];
Syntax source;
}
...
class Plus extends Expression {
expr.type = calculateType(arg0.type, arg1.type);
expr.code = "...";
source = "$arg0.source + $arg1.source";
}
This may look more complicated than the attribute grammar, but that’s because I left a lot out of detail in the attribute-grammar example. In general, this way of specifying things is more compact and more understandable to programmers familiar with C++, Java and similar languages. Since array references with constant indexes such as arg[0] occur so often, I use the abbreviation arg0 for such expressions.
The source attribute is especially interesting. I’m using a string notation much like in shell scripts, where a string surrounded by double quotes (") can have embedded expressions proceeded by dollar signs ($). The source attribute can be used to write a pretty-printed version of the tree, and can be used to generate a context-free parser for the language. Of course to generate a parser, the field has to follow certain restrictions which I won’t detail here.
What is especially interesting is that since source is just another attribute rather than a part of the overarching structural framework, it is possible to have multiple attributes of type Syntax and use these attributes to translate from one syntax to another.
Attribute grammars have different kinds of attributes (inherited and synthesized) with certain restrictions on how attributes can be assigned in order to compile efficient attribute evaluators. Constraint attribute graphs can take advantage of such attribute evaluators when they apply but I think of the constraints much more generally. The expression
X = Y
is not an assignment of Y to X; it is a constraint that X and Y are equal. In actual operation there are several possibilities:
- If X is unknown and Y is known, then this sets X to Y.
- If X is known and Y is unknown then set Y to X.
- If both X and Y are known, then this verifies that they are equal. It is an error if they are not equal.
- If both X and Y are unknown then this constraint has essentially no effect. Internally, it constrains X and Y to be equal, but since there are no other constraints on either variable it has no outward effect.
Notice that X is known if anywhere in the program there is a constraint setting X to a value. Constraint programs are not evaluate sequentially, but all at once. Also, I’m fudging by assuming that the only kind of constraint is setting one thing equal to another. There may be more complex constraints, although I can’t think of a need for them at the moment.
I’ll call the language of constraint attribute graphs CAG just because it is convenient to have a name. To give a flavor of how CAG works, here is a fragment of a compiler/translator written in CAG for a language like C with a (reversible) translation to a Lisp-like syntax. If you have ever written a compiler, you will be surprised at how easily some annoying problems can be solved in CAG.
We begin with declaring the notion of a clause in the language. A clause is a phrase that is syntactically complete. That is, it would be parsed as a unit.
class Clause {
Syntax lispSource, cSource;
Instruction asm[]; // generated code
String errors[]; // list of error messages
SymbolTable names = Default(..names);
};
Although in principle, any attribute can be either an input or an output, we will see that asm and errors have some one-way constraints --constraints that CAG only knows how to evaluate in one direction-- so that those attributes can only be outputs. The attributes lispSource and cSource can be either inputs or outputs. The type Syntax is like a string but it has some extra features to make it convenient as a parsing string. Among other things, CAG ensures that only parseable constraints are used on attributes of type Syntax.
The symbol table attribute names is given a default value. This means that if a descendent class does not give it a value, then it takes the value ..names which is names in the parent object.
If a node does not have a parent, then it is an error not to explicitly set names.
class Program extends Clause {
Declaration decls[];
cSource = concat(a:decls){"$a.cSource\n\n"};
lispSource = concat(a:decls){"$a.lispSource\n\n"};
errors = concat(a:decls){a.errors};
asm = concat(a:decls){a.asm};
names = New;
};
Declaration is a class representing declarations. The reduction operator concat takes a list arr and for each element a, concatenates the expression in the braces. The attribute decls is a list of declarations which, in both C and Lisp, is a concatenation of the children separated by \n\n. Due to some additional magic of Syntax, CAG knows that it can ignore whitespace on input so the \n\n is only for output.
The errors and asm definitions are similar except that they are lists rather than strings, so concat uses list concatenation rather than string concatenation.
The constraint names = New basically just says that names is not inherited from the parent but is an individual symbol table. This can be thought of as assigning a new empty table, but keep in mind that all objects are immutable so if names were really empty, then it would always be empty because it can’t change. What it is is a symbol table with that is, at this point, unconstrained.
To translate from C to Lisp, do this:
Program p;
p.cSource = inputStream;
errorStream = p.errors;
if (p.errors == []) {outputStream = p.lispSource;}
to compile, replace the last line with
if (p.errors == []) {outputStream = p.asm;}
Other combinations are possible. Both cSource and lispSource can be inputs or outputs.
class Declaration extends Clause {
TypeExpression typeExpr;
Expression initExpr;
LanguageType type = typeExpr.type;
Symbol name = typeExpr.name;
Symbol result;
names[name] = SymbolTableEntry(type, result);
if (initExpr) {
result = initExpr.result;
asm = initExpr.asm;
cSource = "$typeExpr.cSource = $initExpr.cSource;"
lispSource = "(declare $typeExpr.cSource $initExpr.cSource)"
} else {
result = newVar();
asm = [];
cSource = "$typeExpr.cSource;"
lispSource = "(declare $typeExpr.cSource)"
}
};
A Declaration has two children, typeExpr and initExpr. The type is parsed in typeExpr and the initialization expression, if any in initExpr. We pull the parsed type and name from typeExpr and add a name to the symbol table by expressing a constraint on an entry in the symbol table. This way of managing symbol tables and other set-like structures will require some care in the implementation because we have to make sure that no lookup in the symbol table fails if it the name might still be added later.
We constrain the syntax and assembly based on whether there is an initialization expression or not. The attribute result is the assembly language variable where the named value is kept (I’m glossing over some differences between variable declarations and function declarations here). This result is put into the symbol table along with the type. If there is no initialization then the address of the result is just a newly allocated variable. Otherwise we output the assembly to initialize the variable and return the result of that.
class Expression extends Clause {
children Expression arg[];
// type is the derived type of the expression.
LanguageType type;
// outType is the type that the parent expects.
LanguageType outType = Default(type);
String localAsm[];
if (type == outType) {
asm = localAsm;
else {
asm = append(localAsm, convert(result, type, outType));
}
// the output destination in generated code.
Symbol result = newVar();
// The lisp function symbol for the expression
Symbol lispFunction;
lispSource
= "($lispFunction "
+ concat(a:arg, ' '){a.lispSource}
+ ')';
};
The children declaration tells CAG that the members of arg are to be treated as children of the current node. It has various effects.
Implicit type conversions are often a pain. Here they are handled in a few lines by using constraints to determine the result type and the expected type and automatically add the conversion. We only have to declare the conversion here, not in any of the subclasses. However, in the subclasses we have to be careful to constrain localAsm rather than asm.
We give the syntax of Lisp expressions at this level because all Lisp expressions have the same syntax (except for special forms), differing only in the function symbol. We can’t do this for C because each individual expression has different syntax
The concat operation takes a second argument for strings that it uses as a separator.
class IntegerLiteral extends Expression {
integer value;
// Two constraints. Typically one ends up being an input
// and one an output.
value = parseCInt32(cSource);
value = parseLispInt32(lispSource);
type = int32Type;
localAsm = ["LoadI($result, $value)"];
};
Here we have two different constraints on value. This is not a problem because the functions are reversible. In other words, in the constraint
value = parseCInt32(cSource);
Either value can be the unknown and cSource the known, or value can be the known and cSource the unknown. In normal operations, Either cSource or lispSource will be known and the other will be either unknown.
A reversible function is not quite the same as an invertible one. If g is the inverse of f, then by definition, g(f(x)) = x and f(g(y)) = y. Reversible funcitons have a weaker requirement to handle the case where there can be multiple values, say x1 and x2 that are all equivalent in some sense and all map to the same value: f(x1) = f(x2) = y. This function doesn’t have an inverse because g(y) can only have one value, so it can’t be the case that g(y) = x1 and g(y) = x2. But if g always maps to some value that maps back to g, then it is reversible. It means that maybe g(f(x1)) = x2 instead of x1, but x1 and x2 are “sort of” the same so it doesn’t matter. The requirement for a reversible function f is that f(g(f(x))) = f(x) and f(g(y)) = y.
There are two ways to define a reversible function. One way is to encode the function in a reversible language. The other is to define it in a non-reversible language such as C++ or Java but with multiple parts: one that evaluates the function as usual, one that evaluates the reverse function, and one that just ensures the implied constraint is true if both values are known.
class Plus extends Expression {
type = arg0.outType = arg1.outType
= widen(arg0.type, arg1.type)
switch (type) {
case typeInt32:
localAsm = "addI($result, $0, $1)";
case typeDouble;
localAsm = "addF($result, $0, $1)";
}
cSource = "$0 + $1" LEFT 4
lispFunction = '+'
};
The Plus node shows an all-equal constraint in action. Since widen is not a reversible function, the function call will always be the input and the variables all outputs. We set asm to an integer operation or a float operation based on the types of the operands. Recall that if automatic type conversion is needed then it will be handled by the code in Expression.
We also see more features of the Syntax type. In the syntax for attribute <a>, if there is an array <arr> declared as children, then we can use $<n> to refer to <arr>[<n>].<a>. In other words it gives us the numbered child and the same attribute that we are defining.
Another thing that we get from Syntax is the ability to specify precedence and associativity for a production using notation like LEFT 4.
Notice that we don’t have to declare the syntax for Lisp, just the function symbol.
One of the ugly things in most parsers is how functions and operators are defined. It would be nice if you could just list all of the functions and operators in one list with all of their properties, but operators have have at least some of their properties in the grammar and for functions you don’t want to do that in most cases because it would blow up the the grammar too much. So you have a separate list of functions and their properties and sometimes a list of operator properties that are not in the syntax. And sometimes an extra list of operator precedence and associativity.
With CAG, you can achieve the ideal of just one list that gives all properties of all functions and operators, from precedence to grammar to generated code. First, we’ll need a node to define function syntax:
class FunctionCall extends Expression {
Symbol cFunction;
Symbol asmFunction;
cSource
= "$cFunction("
+ concat(a:arg, ', '){a.cSource}
+ ')';
lispFunction = Default(cFunction);
localAsm = Default([pushArg(arg0.result),
pushArg(arg1.result),
call(asmFunction)]);
};
Now we can list built-in functions like this:
class strcmpCall extends FunctionCall {
arg.length = 2;
arg0.type = arg1.type = typeCharStar;
type = typeInt32;
cFunction = 'strcmp';
asmFunction = '_strcmp';
};
If we have an expression that looks like a function call but has an asm instruction to execute it, then we can override the default localAsm.
We can even separate overloaded functions. To show this, I’ll split the above Plus node into it’s overloaded components:
class FloatPlus extends Expression {
arg0.type == typeDouble || arg1.type == typeDouble;
type = arg0.outType = arg1.outType = typeDouble;
localAsm = "addF($result, $0, $1)";
cSource = "$0 + $1" LEFT 4
lispFunction = '+'
};
class IntPlus extends Expression {
arg0.type != typeDouble && arg1.type != typeDouble;
type = arg0.outType = arg1.outType = typeInt64;
localAsm = "addF($result, $0, $1)";
cSource = "$0 + $1" LEFT 4
lispFunction = '+'
};
Wouldn’t it make things easy if all of the functions and operators and all of their properties were listed in this form in a single file? Imagine adding a new optimization where you need some new information about certain operators. Wouldn’t it be nice to find them all in one place and in one form?
In a real compiler, I believe the optimizations can also be encoded using these sorts of constraints. That’s why I call these things constraint attribute graphs rather than constraint attribute trees --because the optimization phases of a compiler usually work on general graphs rather than just trees.