Sunday, September 8, 2013

Compilers, Interpreters and Portability

Once upon a time there were three different ways to implement a programming language, and typically a given language was implemented in only one of these ways. Compiled languages were translated into machine language. An assignment statement would pretty reliably get translated into a piece of code ending in a memory store or a register load. An addition expression would pretty reliably get translated into a piece of code ending in a machine ADD instruction.

But compilation was a lot harder to do for languages such as Lisp that didn’t have fixed declared types. In Lisp an addition expression requires some runtime decision making to decide exactly what it means at that point in the code on that particular execution. So we had interpreted languages which were translated into code for a virtual machine and then we had an interpreter which would execute this virtual machine code.

And finally there were the scripting languages which were not translated at all. Instead an interpreter would read the text representation of the program and interpret it directly.

Over time a lot of these distinctions have faded. A lot of scripting languages are now translated into a VM language and interpreted. A lot of languages that started out as interpreted languages now have compilers and those that still have interpreters sometimes also have just-in-time (JIT) compilers which compile the VM code into machine code. There are even interpreters that interpret compiled languages like C for use in debuggers and related tools.

The tradeoff between compilers and interpreters is that compiled code tends to be faster while interpreted code tends to be more portable. Interpreted code is more portable in the sense that you have to write just one portable program --the interpreter itself-- and then every machine that has an interpreter can execute the VM code, which is the same for all machines.

Interpreters typically have a execution loop which reads each VM instruction, figures out what the instruction means, and then does what the VM instruction says to do. It isn’t unusual for it to take dozens of machine instructions to execute one VM instruction. On the other hand, VM instructions tend to do more than machine instructions do so the interpreted language needn’t be dozens of times slower than compiled. As a rule of thumb, interpreted programs tend to be about an order of magnitude slower than compiled programs.

JIT is designed to give the best of both worlds --you build a compiler into the interpreter and then you can have portable compiled code. But taking a VM that was designed for an interpreter and compiling it is not an ideal solution because there are different requirements for a VM language that is to be interpreted vs. one that is to be compiled. A VM language that is to be interpreted should be compact so that the VM code itself does not take up so much of the data cache. The VM should be designed so that instructions can be analyzed quickly and efficiently so that the interpreter loop can be fast. Finally, the VM language should be high-level, letting you do a lot of work with a smaller set of instructions. This helps out with the data cache and also means that you have fewer instructions to interpret so there is less overhead from the interpreter loop. Also, a high-level interpreted language means that more work is done as part of the interpreter code which is compiled and therefore faster.

By contrast, a VM that is meant to be compiled has none of those requirements. A VM for compiling should be low-level to support low-level optimization and this is in conflict with the requirement for interpreted VMs to have high-level code. Also, the requirements for compact code and quick analysis of instructions suggests that you need a stack-based VM or a VM with a small fixed set of registers. This is not compatible with a compilable VM --at least not a portable one, which can be compiled for machined with greatly varying numbers of registers and greatly varying register characteristics.

So a VM designed to be portable and to be compiled rather than interpreted looks a lot different from a VM designed to be interpreted. I know of two different VM architectures designed to be portable and to be compilable. One project called Tendra was based on a format called ANDF --the Architecture-Neutral Distribution Format. Unfortunately, Tendra never gathered much of a following and it has largely died out.

A more recent and very active project is LLVM which is gaining steam very rapidly. In my view, this sort of technology should have been adopted a decade ago but there was just no stopping the juggernaut that was Java and the JVM. Still, things are looking up with LLVM and I have high hopes for more efficient portable programs in the near future.