Sunday, March 29, 2015

A Set is Not a Collection

The usual introduction to sets says that a physical collection or a list of items is a good example of a set, maybe even the archetype of a set. But that is not really what mathematical sets are. A physical collection is a very atypical sort of set, if it is a set at all, and a list is an even worse example.

These descriptions can make set theory seem arbitrary and capricious. What is the point of the intersection operation? If you have that bag of stuff over there and I have this bag of stuff over here, then it doesn’t make sense to talk about the bag of stuff that is in both bags. There isn’t anything that is in both bags. If you think of a set as a list of items, then union seems odd. If you have a list of things over there and I have a list of things over here, and we combine the lists, then some things might appear in the combined list twice. Why take out the duplicates? Why doesn’t it matter that something was originally in both lists?

If you understand what a set really is, then the answers to these questions are obvious. Before I say what a set really is, here is a little history of logic: Classical logic was based on a form of argument called a syllogism. The propositions of a syllogism are called categorical propositions because they express a relationship between two categories.

A category is a sort of general classification usually expressed as a plural noun or an adjective. For a given category, each individual either fits the category or does not. For example, “Men” is a category that all men fit. “Blue” is a category that a thing fits if and only if it is blue.

A categorical proposition expresses a relationship between categories:

  1. No men are blue.
  2. Some men are wise.
  3. Some wise are not blue.

As you can see, sentence 3 is a logical conclusion from sentences 1 and 2. This is called a syllogism. A syllogism consists of three categorical propositions where the third is a conclusion drawn from the first two. If A and B are categories, then there are exactly four categorical sentences comparing them:

  1. All A are B
  2. Some A are B
  3. No A is B
  4. Some A are not B

There are 256 possible syllogisms, of which only 24 are traditionally counted as valid. The literature on this topic, from Aristotle to Boole, is voluminous. Around the time of Boole, mathematicians were starting to chafe under the limits of this system of logic.

One of the problems is that there is no place for logical connectives. You can have a compound category such as “blue and wise”, “blue or man”, etc., but the form of the syllogism didn’t really allow for exploring the relationships of these compounds. Perhaps an even more pressing problem was the issue of extensionality.

So what is extensionality? Well, if you know set theory you may read sentence 1 as “A is a subset of B”, sentence 2 as “the intersection of A and B is non-empty”, etc. But that is not always the right interpretation because categories are not extensional.

A set is entirely defined by its members, but a category is not. There are two aspects to a category: the things that fit the category, and what the category says about those things. Consider the sentence:

  1. The chairman of the board is the president of the company.

This sentence is equating two categories: “chairman of the board” and “president of the company”. If you think in terms of set theory then you would say that for this statement to be true, there has to be an individual who is both chairman of the board and president of the company. That is the extensional interpretation, but it is not the only interpretation. Suppose the company charter has a rule that says that whoever is president of the company is automatically chairman of the board. If the current president/chairman leaves office, does the sentence become untrue until they appoint a new president? Of course not. In this intensional interpretation the sentence is about the categories themselves and not about what fits the categories.

When there is an individual who fits the category “the chairman of the board”, this individual is called the extension of the category. When there is no individual who fits the category we say that the extension of the category is empty. If there are multiple individuals who fit a category, then they are all in the extension of the category. And that is what a set is: a set is the extension of a category.

All of the features of set theory come from this interpretation. If you have two categories, C1 and C2, with corresponding extensions, S1 and S2, then

S1 union S2 is the extension of the category “C1 or C2”
S1 intersect S2 is the extension of the category “C1 and C2”
complement S1 is the extension of the category “not C1”

Mathematical propositions express categories, so if F(x) is a mathematical proposition about x then { x : F(x) } is an extension-taking mechanism. It says, “give me the extension of F(x)”.
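
To make this concrete, here is a little C++ sketch (the universe and the two categories are invented for illustration) checking that the union of two extensions really is the extension of the disjunction of the categories:

#include <set>
#include <algorithm>
#include <iterator>

int main() {
    std::set<int> universe = {1, 2, 3, 4, 5, 6};

    // Two categories, expressed as predicates.
    auto c1 = [](int x) { return x % 2 == 0; };  // "even"
    auto c2 = [](int x) { return x > 3; };       // "greater than three"

    // The extension of a category: the set of things that fit it.
    auto extension = [&](auto category) {
        std::set<int> s;
        for (int x : universe)
            if (category(x)) s.insert(x);
        return s;
    };

    std::set<int> s1 = extension(c1);  // {2, 4, 6}
    std::set<int> s2 = extension(c2);  // {4, 5, 6}

    // S1 union S2 ...
    std::set<int> u;
    std::set_union(s1.begin(), s1.end(), s2.begin(), s2.end(),
                   std::inserter(u, u.begin()));

    // ... equals the extension of the category "c1 or c2": {2, 4, 5, 6}.
    bool same = (u == extension([&](int x) { return c1(x) || c2(x); }));
    return same ? 0 : 1;
}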

It took around 2,000 years for mathematicians and logicians to decisively separate the extension from the intension of a category to create set theory. This was a titanic advance in mathematics and logic. Since then set theory has taken over, leaving intensionality mostly unstudied and even unremarked. Not to take away from the importance of extensional set theory, but I think this tendency to ignore intensionality altogether is unfortunate, because the intensional approach has a lot of interesting and useful characteristics.

Thursday, March 26, 2015

The Seed7 Programming Language

I just discovered an interesting programming language called Seed7. This language has two unusual features, shared with Unobtainabol, that I haven’t written about yet. First, it has first-class types. Since you can pass types as arguments to functions, this mostly supersedes features like C++-style templates or Java generics. Unobtainabol has first-class types but it actually has less use for them than C++ or Java (or Seed7) because Unobtainabol has fully dynamic types. You can, for example, easily define a generic list object in Unobtainabol just by giving the type of the element as Any. But there are applications where you want to ensure type consistency, and first-class types will help with that.

Second, Seed7 has extensible syntax. You can declare a new statement type using keywords and operator symbols. I still haven’t decided exactly how the perfect language would offer syntax extension, but Seed7 suggests a good start.

In Seed7 you can specify a syntax like this:

syntax expr: .loop.().until.().do.().end.loop   is -> 25;

The actual syntax being defined is between “syntax expr:” and “is -> 25”. The “is -> 25” indicates the precedence and associativity. The dots separate tokens. The “()” tokens represent argument positions. The other words, “loop”, “until”, etc. are the syntax being declared. This declaration is for a loop with an “until” in the middle. Sort of a poor-man’s super loop.

You need a second declaration to define the meaning:

const proc: loop
             (in proc: statements1)
           until (ref func boolean: condition) do
             (in proc: statements2)
           end loop is func

I don’t know why the syntax declaration is separate from the function header. Maybe it has to do with include files or something (an issue Unobtainabol doesn’t have to worry about).

Notice that to make this work, Seed7 must have call by name (which I prefer to call “call by thunk”). You can’t evaluate statements1 or statements2 before the call, but instead must evaluate them inside the function whenever they are used.
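
To illustrate the idea in C++ terms (not Seed7; the names here are mine), the loop above can be emulated by passing the statement bodies and the condition as thunks--lambdas that are evaluated only when the loop invokes them:

#include <functional>
#include <iostream>

// Emulation of "loop S1 until C do S2 end loop": run S1, test C,
// and if C is false run S2 and repeat.
void loop_until_do(std::function<void()> statements1,
                   std::function<bool()> condition,
                   std::function<void()> statements2) {
    statements1();
    while (!condition()) {
        statements2();
        statements1();
    }
}

int main() {
    int i = 0;
    loop_until_do([&] { std::cout << i << '\n'; },  // prints 0 1 2 3
                  [&] { return i >= 3; },
                  [&] { ++i; });
}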

I didn’t see a reference to call by name or call by thunk in the language manual. Rather than an independent parameter passing method, Seed7 treats call by thunk as just a matter of automatic type conversion. If the type of the formal parameter is “proc”, then the corresponding actual parameter is passed by thunk. I’ve thought of handling call by thunk this way in Unobtainabol, but then to be consistent, you should use the same mechanism to indicate pass by var. Seed7 is not consistent in this way. It hardly could be because Seed7 is a cyber materialist language (see below), which doesn’t readily view pass by reference as a matter of types.

One unusual feature of Seed7 that is not in Unobtainabol is its parameter passing method. It looks like an attempt to get the performance benefits of the C++ parameter passing methods without the horrible syntax and confusing details. In Seed7 you write

const func integer: f(in <type>: val)

This is roughly equivalent to one of two C++ declarations:

int f(<type> val)

or

int f(const <type>& val)

depending on whether <type> is “big enough” to be worth passing by const ref. Essentially, Seed7 takes the decision of whether to pass something by const reference or by value out of the hands of the function writer and puts it in the hands of the programmer who wrote the type definition. Arguably that’s a better place for the decision, but it hides the potential aliasing that is going on, and that’s potentially a bigger headache than having to write “const” and “&” in every function declaration. Seed7 does have other modes that you can use rather than just whatever “in” gives you, so it is possible for the user of a type to overrule the designer of the type.

Two of the additional modes are “inout” and “in var”. “Inout” gives actual pass by reference (not pass by value-result as you might expect from the name). “In var” is a bit unusual. It is like “in” except that you can modify the formal parameter inside the function (plain “in” without “var” prohibits this). But it appears that this feature is unlike pass by value in C++ or Java. In those languages, you can pass a mutable object by value and the called function can modify it, affecting the actual parameter’s value. Apparently in Seed7, the called function cannot modify the actual parameter at all unless it is passed by reference. If I’m reading this right, it would suggest that Seed7 is a cyber materialist language--the first I’ve seen, or at least the first I’ve noticed.
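
For contrast, here is what I mean about C++: a handle passed by value still lets the called function modify the caller’s object (a minimal sketch; the function name is made up):

#include <vector>

void addItem(std::vector<int>* v) {  // the pointer itself is passed by value
    v->push_back(42);                // but this mutates the caller's vector
}

int main() {
    std::vector<int> items;
    addItem(&items);                 // items now contains 42
}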

Monday, March 23, 2015

Memory management in Unobtainabol

It’s been a while since I posted about Unobtainabol, the perfect (and therefore non-existent) programming language, so here are some thoughts on perfect memory management.

There are two major categories of memory management: manual and automatic. Manual memory management is when the programmer does explicit allocation and freeing of memory as in C++. Automatic memory management includes garbage collection, reference counting, and other techniques where the computer does the memory management for you.

It is commonly believed that manual memory management is faster than automatic memory management, but that’s not true in general; some programs work better with manual memory management and some work better with automatic memory management. What is true is that there are certain use cases where a specially tailored manual memory management system can be much faster than a generic automatic system.

So should Unobtainabol have both manual and automatic memory management? Well, that’s a problem because Unobtainabol is a safe language--there is no undefined behavior--and I don’t know of any way to do safe manual memory management, except by adding a level of indirection and other overheads that would greatly reduce the benefits.

So instead, Unobtainabol has specially tailored automatic memory management that can be used to improve the performance of programs that do not do well with the generic memory management system.

Traditional Regions

There are two forms of automatic memory management that are used in practically every programming language and have been for decades because they are so simple, effective, and efficient. The first form is that all of the memory associated with a running process is reclaimed when the process exits. There is no need for the programmer to manually delete all of the objects allocated in the program; when the program exits, they will automatically be reclaimed by the operating system.

The second form is stack management. When a function is called, it pushes a frame onto the execution stack. It allocates local data on the frame, and when the function exits, everything in the frame is reclaimed at once. This is a very low overhead system because you can add as much local data as you want (within stack-size limits) and there is no additional cost per byte for the data.

In C++ you can allocate an object on the stack and then pass a reference to the object in a call to another function. However, this practice is not safe because there is no way for C++ to ensure that such a reference does not outlive the stack frame in which it was created. If the reference is a pointer, for example, the pointer could be stored in an object on the heap. That’s why safe languages like Java prevent references into the stack. In Java you cannot allocate objects on the stack, only local variables, and there is no way to get a reference to a local variable. But allocating objects on the stack can give a big performance win in some cases.
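
Here is a minimal C++ sketch of the hazard (details invented for illustration):

#include <vector>

std::vector<int*> saved;             // heap-side storage that outlives any frame

void caller() {
    int local = 42;                  // lives in caller's stack frame
    saved.push_back(&local);         // the reference escapes the frame
}                                    // frame reclaimed here; saved.back() dangles

int main() {
    caller();
    // Dereferencing saved.back() is now undefined behavior -- exactly what
    // safe languages like Java rule out by forbidding references into the stack.
}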

Generic Regions

The stack frame and the global process space are examples of regions. More generally, a region is an abstract place and time where objects live. What region an object lives in is determined by the way the object is created. A region does not have to be a fixed location in memory; it is an abstraction. It is a sort of biosphere for objects; the objects can only live as long as their biosphere exists.

Every object and region has a lifespan, from the moment it is created until the moment it is destroyed. We say that one lifespan L1 bounds another lifespan L2 if the beginning of L1 is before the beginning of L2 or at the same time, and the end of L1 is after the end of L2 or at the same time. If an object O lives in a region R, then the lifespan of R bounds the lifespan of O.

Regions can contain other regions. If region R1 contains a region R2 then this means two things: first, every object that lives in R2 also lives in R1; second, the lifespan of R2 is bounded by the lifespan of R1.

If r is a reference to an object O, and O lives in R, then we say that r is a reference into R. A reference into a region R cannot be stored anywhere that is not in region R. This includes ephemeral references such as register contents.

These rules have two important consequences for storage management. First, a region can be garbage collected independently from other regions. Since there can be no references into a region from outside, you can always garbage collect the objects in a region just by looking at the region. Second, when you are done with a region, you can just drop it, secure in the knowledge that there are no references outside the region that are still referencing objects inside the region. Of course you still have to run destructors if there are any.
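
As a rough analogy, here is a minimal bump-allocating arena in C++. This is just a sketch of the region idea, not a proposal for how Unobtainabol would implement it (a real arena would also handle alignment and allocation failure):

#include <cstddef>
#include <cstdlib>
#include <new>

class Region {
    std::byte* buf_;
    std::size_t size_;
    std::size_t used_ = 0;
public:
    explicit Region(std::size_t size)
        : buf_(static_cast<std::byte*>(std::malloc(size))), size_(size) {}
    ~Region() { std::free(buf_); }   // dropping the region frees every object in it

    void* allocate(std::size_t n) {  // allocation is just a pointer bump
        if (used_ + n > size_) throw std::bad_alloc();
        void* p = buf_ + used_;
        used_ += n;
        return p;
    }
};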

To make regions practical, we need a few more things: we need a way to create regions, we need a way to use regions, and we need a way to ensure the safety of regions. Those issues are covered in the following sections.

Executing in a Region

Code can execute in one or more regions. This means that references to the region are allowed in that code and that the region’s lifespan cannot end as long as the code is executing in the region. When code first begins to execute in a region, we say that it enters the region. When it stops executing within the context of a particular region we say that it exits the region.

Every block that introduces a reference also introduces a new, unnamed region with a lifespan that is bounded by the execution of the block. We call this the local region. A local variable lives in the local region. It can hold references into the local region and into all other regions that the code is executing in.

Every program executes entirely within global, the region associated with the global process space. Every thread executes within thread, a region associated just with that thread. Every call to a function (or to most other kinds of routine) executes within frame, a region dedicated just to that call.

We also treat the machine itself as another region, host, that contains all of the global regions for all processes, as well as persistent regions that can outlive any process. Unobtainabol has persistent objects and distributed network objects that look to the programmer just like regular objects--but that’s for another post.

A region can be explicitly entered:

enter <region-list> begin <block> end

The local region of <block> is contained in all of the regions in <region-list> as well as any regions that the enter statement itself executes in. A local variable in <block> lives in every region in <region-list>.

Creating a Region

A region is declared much like a function:

region <region-name> ( <region-attributes> ) ;

The scope of <region-name> is the same as any other name in the block. The lifespan of <region-name> is as long as any code can reference it. For example, if the block is a class body then there will be a different region for each object instantiated from the class, and the life of the region will be the same as the life of the object.

Every object automatically has a region associated with it named this.region. The object itself is not allocated in this.region, but this.region can be used to create internal objects. For example, if a class implements a set using a balanced tree structure, then this.region could be used to allocate all of the tree nodes. When the set is deleted, the lifespan of this.region ends too, and all of the tree nodes are automatically deleted along with it.
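
In C++ terms, this is like giving each set object its own arena. This sketch reuses the hypothetical Region class from above; destroying the set drops the arena and every node with it:

#include <new>

class IntSet {
    Region region_{1 << 16};         // the analogue of this.region
    struct Node { int value; Node* left; Node* right; };
    Node* root_ = nullptr;
public:
    void insert(int v) {
        // Every tree node is allocated in the per-object region.
        Node* n = new (region_.allocate(sizeof(Node))) Node{v, nullptr, nullptr};
        // ... link n into the balanced tree ...
    }
    // ~IntSet() runs ~Region(), reclaiming all of the nodes at once.
};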

Here are some possible region attributes:

  1. Incremental garbage collection so there are no long interrupts.
  2. Reference counting instead of garbage collection because you expect a small number of large objects with no cycles.
  3. No recycling (such as garbage collection or reference counting) because most objects, once they are created, will be live for the remaining lifespan of the region.
  4. A fixed list of type or block sizes to be allocated, so that we can use free lists. This makes storage management more efficient when we use a recycling method that does not copy or compact, because we avoid the overheads of general allocation strategies that must worry about memory fragmentation.

Enforcing Safety

The restriction that a reference to an object in region R can only be stored in R is enforced statically by a form of type checking. In Unobtainabol, statically-checked types are optional and regions are no exception. However, if you don’t use explicit regions, then you will not get the performance benefits.

When you declare an object, you can declare the region that it lives in:

local TreeNode var: t1;   // stack frame
global TreeNode var: t2;  // global process space

“TreeNode” is the sort, “local” and “global” are the names of regions, and “var” says that we are declaring a traditional programming language variable rather than, say, a constant or something else. In Unobtainabol there is less need for variables than in languages like C++ or Java so the constant is the default. You have to specify if you want an assignable variable.

Unobtainabol separates the concerns of variable lifespan and variable scope (what code can see the variable). “local” and “global” refer to regions and therefore to lifespans, and have nothing to do with variable scope. In Unobtainabol, you use “global” much like you use “static” on local variables in C++. Of course we do have to be careful not to allow the scope of a variable to include code that can execute outside of the variable’s lifespan.
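
For reference, the C++ analogy: a static local variable has process-long lifespan but function-local scope.

int nextId() {
    static int n = 0;   // lifespan: the whole process ("global" in region terms)
    return ++n;         // scope: only this function body can name n
}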

Declaring a variable as belonging to a certain region does not automatically let you pass that variable around and use it anywhere. In order for code to do anything with a reference into region R, it first has to enter R.

This comes up when you pass a reference to a function. If R is a region then you can declare a function using R references:

function f(R TreeNode var: t1) -> R TreeNode;

This says that f takes a single R parameter and returns an R result. The function f can only be called by code that is currently executing in R. Also, f automatically executes in every region needed to access its parameters and return value.

The local region is interesting because of the way that it interacts with function calls and therefore with parameters. Suppose you have a function

function f(local TreeNode t) -> local TreeNode;

The local region is a different region at the call site than it is in the function body. So which “local” are we referring to in the formal parameter list? Well, there isn’t much use in having “local” refer to the called function’s own local region, because there can’t be any references into that region before or after the call.

Instead, we say that “local” in a parameter list or return value refers to the local region at the call site. This rule actually does have a useful application. It lets you write a function such as

function biggest(local List of TreeNode var ls) -> local TreeNode;

which returns a value from an aggregate, and you can safely pass in a local aggregate because the declarations prevent the function from doing anything unsafe like store it in a longer-lived region.

However, what you can’t do with this is return a list of objects from the list that was passed in, because you have no way to make a list in the same region as the objects. To allow this, we introduce a special region caller, which is the same region as local in the parameter list. Any routine that has a return value or a returnable argument (such as call by var or call by value return) declared as “local” has access to the special region caller. The called function can allocate space in this region, and the caller is responsible for managing that space on return.

Thursday, March 19, 2015

Lexical Analysis in Classp

Is there a higher-level way to specify lexical analysis comparable to the Classp way of specifying parsing? I don't have a good answer for that yet, but here are some thoughts: first, Classp allows the specification of parsers without writing a grammar. I'm not sure it makes sense to want the corresponding feature for lexical analysis: specification of scanners without writing a grammar or regular expression.

Second, the class patterns of a Classp program specify both a parser and its inverse, a formatter. The lexical analyzer should also use some sort of invertible notation.

Third, the result of parsing in Classp is an abstract syntax tree. I think that the result of scanning should be a primitive type rather than an AST (even a leaf node).

Rascal

There is a very interesting domain-specific language named Rascal, specialized for the writing of parsers and other meta-programming tasks. Rascal has goals similar to those of Classp, but unlike Classp, Rascal is based on grammars. I have been arguing that grammars are a poor formalism for designing a parser around, but if you are going to start with grammars, Rascal seems like a good solution. They have added some very nice features to get around the problems that grammars cause.

In Rascal, they have several different kinds of syntax definition including: syntax, lexical, and layout. The layout syntax is where you specify white space, comments, and other items that should be ignored when reading syntax. The difference between syntax and lexical is that syntax skips over the layout and lexical does not. Lexical does simple character-by-character matching.

Morphology

I like the syntax/lexical distinction but I'm not crazy about the terminology because "syntax" is a noun but "lexical" is an adjective. "Morphology" is the noun that refers to the structure of lexical elements so I'll use that instead. Now we can define a string literal like this:

class StringLiteral {
  char chars[];
  morphology('"' (chars*) '"');
}

There are a couple of problems with the above definition. First, there are no escape characters allowed by the morphology. Second, a StringLiteral is an abstract syntax tree node, but we really just want it to be a string. Let's deal with the second first.

Rules

We use "class" to introduce a declaration of an AST node and the related pattern. If we want an arbitrary type, let's use the word "rule". A rule declaration looks somewhat like a class declaration except that it doesn't have attributes. Names declared in the body are more like local variables; they only exist while the rule is being evaluated. The value formatted or parsed by a rule is not an AST node but a general type. Let's modify the string literal class into a rule:

rule StringLiteral: string {
  char chars[] = explode(self);
  morphology('"' (chars*) '"');
}

The difference between classes and rules in Classp is much the same as the difference between classes and functions in C++. Just as a class syntax says how to format or parse the class, the rule morphology says how to format or parse a lexical element (although you could also have a class with morphology or a rule with syntax).

The above rule should be read as declaring a formatting function that takes a string as argument, “explodes” the string into an array of characters, and then prints out a double quote, the array of characters, and then another double quote.

At the same time, the rule declares a parsing function that is the inverse of the formatting function. To parse a string literal, recognize a double quote, then read an array of characters, then recognize another double quote. Finally, implode the array of characters into self as the result.
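
In C++ terms, the formatter and its inverse might look like this (my own rendering, not anything Classp generates; no escape characters yet):

#include <string>
#include <stdexcept>

// The formatter: wrap the string in double quotes.
std::string formatStringLiteral(const std::string& s) {
    return "\"" + s + "\"";
}

// Its inverse, the parser: recognize a quote, read characters, recognize a quote.
std::string parseStringLiteral(const std::string& in, std::size_t& pos) {
    if (in.at(pos) != '"') throw std::runtime_error("expected opening quote");
    std::size_t start = ++pos;
    while (in.at(pos) != '"') ++pos;
    return in.substr(start, pos++ - start);
}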

Escape Characters

We can handle escape characters by making a special rule to format characters in a string literal:

rule StringLiteralCharacter: char {
  morphology(self
    {'"' -> '\\"'
    |'\n' -> '\\n'
    |...});
}

The three dots indicate that we have to list every single character and its translation (or we could introduce a special syntax to say "every other character gets mapped to itself"). The rule is fairly clear: to write a double quote, you write a backslash followed by a double quote. To parse a double quote, you reverse the process.
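
The formatting direction of that rule, rendered in C++ (only the two escapes shown above; the name is mine):

#include <string>

// Format one character of a string literal, escaping as needed.
std::string formatStringLiteralCharacter(char c) {
    switch (c) {
        case '"':  return "\\\"";             // '"'  -> '\"'
        case '\n': return "\\n";              // '\n' -> '\n' spelled out
        default:   return std::string(1, c);  // every other character maps to itself
    }
}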

With this definition, we can now modify string literals to handle escape characters:

rule StringLiteral: string {
  StringLiteralCharacter chars[] = explode(self);
  morphology('"' (chars*) '"');
}

You may be wondering how the system knows how to explode a string into an array of StringLiteralCharacter. The answer is that it doesn't have to. A StringLiteralCharacter is just a char with formatting and parsing code attached, so Classp only has to know how to explode a string into an array of characters.

Numbers

rule Digit: unsigned {
  morphology(self{0->'0'|1->'1'|2->'2'...|9->'9'});
}

This rule represents both a formatting function and a parsing function for digits. As a formatting function, it takes an integer as an argument and outputs the string representation of the integer as specified by the case pattern. As a parsing function, it takes an input stream and looks for one of the 10 characters, returning the integer associated with the character (the reverse mapping).

Keep in mind that syntax declarations in Classp are viewed primarily as representing a formatter for the class rather than a parser. We get the parser by inverting the formatter. So let’s continue to use that same model with lexical rules as we try to define whole numbers (non-negative integers) in terms of digits.

rule WholeNumber: unsigned {
  Digit d = self % 10;
  optional WholeNumber prefix = self / 10;
  morphology({self >= 10 -> prefix} d);
}

Let's go through this: we have a whole number n. In the decimal representation for n, the least significant digit (the one on the right) will be n%10 and the rest of the number (that is, the number that you get if you strip off the right-most digit) will be n/10.

We've introduced a new conditional pattern here:

{condition1 -> pattern1 | condition2 -> pattern2 ...} 

outputs the pattern that goes with the first true condition in the list. If no condition is true, then it outputs nothing. (This kind of pattern is not currently in Classp, but probably will be soon.)

With this background we can read the above rule as follows: "If self >= 10 then write the portion of self other than the last digit. Write the last digit."
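
Rendered in C++ (again my own sketch, not Classp output), the rule’s formatting function and its inverse parser look like this:

#include <string>
#include <cctype>

// Formatter: "if self >= 10, write the prefix; then write the last digit."
std::string formatWhole(unsigned n) {
    std::string prefix = (n >= 10) ? formatWhole(n / 10) : "";
    return prefix + static_cast<char>('0' + n % 10);
}

// The inverse parser: read digits and rebuild the number.
unsigned parseWhole(const std::string& in, std::size_t& pos) {
    unsigned n = 0;
    while (pos < in.size() && std::isdigit(static_cast<unsigned char>(in[pos])))
        n = n * 10 + static_cast<unsigned>(in[pos++] - '0');
    return n;
}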

This technically works, but the whole reason Classp exists is to improve on grammar-based methods, and at first blush this doesn’t look to me like an improvement on them. We don’t want a new form of lexical analysis just to be different. If it’s not better than the current grammar-based methods, we should just include some sort of grammar-based lexical analysis in Classp.

So while I think the proposed morphology declaration is a good way to go, I'm not sure about rule declarations yet.