Thursday, March 19, 2015

Lexical Analysis in Classp

Is there a higher-level way to specify lexical analysis comparable to the Classp way of specifying parsing? I don't have a good answer for that yet, but here are some thoughts: first, Classp allows the specification of parsers without writing a grammar. I'm not sure it makes sense to want the corresponding feature for lexical analysis: specification of scanners without writing a grammar or regular expression.

Second, the class patterns of a Classp program specify both a parser and its inverse, a formatter. The lexical analyzer should also use some sort of invertable notation.

Third, the result of parsing in Classp is an abstract syntax tree. I think that the result of scanning should be a primitive type rather than an AST (even a leaf node).

Rascal

There is a very interesting domain-specific language named Rascal, specialized for the writing of parsers and other meta-programming tasks. Rascal has goals similar to those of Classp, but unlike Classp, Rascal is based on grammars. I have been arguing that grammars are a poor formalism for designing a parser around, but if you are going to start with grammars, Rascal seems like a good solution. They have added some very nice features to get around the problems that grammars cause.

In Rascal, they have several different kinds of syntax definition including: syntax, lexical, and layout. The layout syntax is where you specify white space, comments, and other items that should be ignored when reading syntax. The difference between syntax and lexical is that syntax skips over the layout and lexical does not. Lexical does simple character-by-character matching.

Morphology

I like the syntax/lexical distinction but I'm not crazy about the terminology because "syntax" is a noun but "lexical" is an adjective. "Morphology" is the noun that refers to the structure of lexical elements so I'll use that instead. Now we can define a string literal like this:

class StringLiteral {
  char chars[];
  morphology('"' (chars*) '"');
}

There are a couple of problems with the above definition. First, there are no escape characters allowed by the morphology. Second, a StringLiteral is an abstract syntax tree node, but we really just want it to be a string. Let's deal with the second first.

Rules

We use "class" to introduce a declaration of an AST node and the related pattern. If we want an arbitrary type, let's use the word "rule". A rule declaration looks somewhat like a class declaration except that it doesn't have attributes. Names declared in the body are more like local variables; they only exist while the rule is being evaluated. The value formatted or parsed by a rule is not an AST node but a general type. Let's modify the string literal class into a rule:

rule StringLiteral: string {
  char chars[] = explode(self);
  morphology('"' (chars*) '"');
}

The difference between classes and rules in Classp is much the same as the difference between classes and functions in C++. Just as a class syntax says how to format or parse the class, the rule morphology says how to format or parse a lexical element (although you could also have a class with morphology or a rule with syntax).

The above rule should be read as declaring a formatting function that  takes a string as argument, "explodes" the string into an array of characters, and then prints out a double quote, the array of characters, and then another double quote.

At the same time, the rule declares a parsing function that is the inverse of the formatting function. To parse a string literal, recognize a double quote, then read an array of characters, then recognize another double quote. Finally, implode the array of characters into self as the result.

Escape Characters

We can handle escape characters by making a special rule to format characters in a string literal:

rule StringLiteralCharacter: char {
  morphology(self
    {'"' -> '\\"'
    |'\n' -> '\\n'
    |...});
}

The three dots indicate that we have to list every single character and its translation (or we could introduce a special syntax to say "every other character gets mapped to itself"). The rule is fairly clear: to write a double quote, you write a backslash followed by a double quote. To parse a double quote, you reverse the process.

With this definition, we can now modify string literals to handle escape characters:

rule StringLiteral: string {

  StringLiteralCharacter chars[] = explode(self);
  morphology('"' (chars*) '"');
}

You may be wondering how the system knows how to explode a string into an array of StringLiteralCharacter. The answer is that it doesn't have to. A StringLiteralCharacter is just a char with formatting and parsing code attached, so Classp only has to know how to explode a string into an array of characters.

Numbers

rule Digit: unsigned {
  morphology(self{0->'0'|1->'1'|2->'2'...|9->'9'});
}

This rule represents both a formatting function and parsing function for digits. As a formatting function it takes an integer as an argument and outputs the string representation of the integer as specified by the the case pattern. As a parsing function, it takes an input stream and looks for one of the 10 characters, returning the integer (reverse) associated with the character.

Keep in mind that syntax declarations in Classp are viewed primarily as representing a formatter for the class rather than a parser. We get the parser by inverting the formatter. So lets continue to use that same model with lexical rules as we try to define whole numbers (non-negative integers) in terms of digits.

rule WholeNumber: unsigned {
  Digit d = self % 10;
  optional WholeNumber prefix = self / 10;
  morphology({self >= 10 -> prefix} d);
}

Let's go through this: we have a whole number n. In the decimal representation for n, the least significant digit (the one on the right) will be n%10 and the rest of the number (that is, the number that you get if you strip off the right-most digit) will be n/10.

We've introduced a new conditional pattern here:

{condition1 -> pattern1 | condition2 -> pattern2 ...} 

outputs the pattern that goes with the first true condition in the list. If no condition is true then it outputs nothing (This kind of pattern is not currently in Classp, but probably will be soon).

With this background we can read the above rule as follows: "If self >= 10 then write the portion of self other than the last digit. Write the last digit."

This technically works; but whole reason that Classp exists is to improve on grammar-based methods, and at first blush this doesn't look to me like an improvement on grammar-based methods. We don't want a new form of lexical analysis just to be different. If it's not better than the current grammar-based methods, we should just include some sort of grammar-based lexical analysis into Classp.

So while I think the proposed morphology declaration is a good way to go, I'm not sure about rule declarations yet.

No comments:

Post a Comment