Tock — A Compiler for Concurrent Languages, written in Haskell
Occam and Tock
The CSP calculus has inspired quite a few languages and frameworks; CHP is just one of them. Go is a recent example, but probably the oldest example is the occam programming language. Occam was developed for the Transputer processor, which has since died out. The efforts of research groups at Kent and Southampton made sure that occam survived the death of its original host platform. Today, a much-modified version of the original compiler, now named KRoC, will compile and run occam code on x86 Linux and various other platforms. Its basis, the original compiler, was written around twenty-five years ago. In C — eugh. Nowadays there are better languages to write a compiler in — such as Haskell.
My colleague Adam Sampson began playing with writing an occam compiler in Haskell using modern technologies like nanopass compilation (where the compiler consists of a large number of small and simple passes), Parsec (for, um, parsing) and generic programming (for performing automatic descents of the abstract syntax tree during the passes). At some point I joined in, and between us we wrote Tock, a compiler for occam and other concurrent languages, written in around twenty-thousand lines of Haskell.
Occam is a difficult language to compile, for several reasons. Its grammar has ambiguities that can only be resolved with semantic knowledge. For example, c ? x is either an input from channel c into variable x, or it is an input from channel c that must match the tag (think Haskell data-type constructor) x. It is not clear which it is until you have determined what x is, which typically happens after parsing. This is exactly the sort of problem that would occur if Haskell did not require data constructors to start upper-case and variables to start lower-case (without this restriction, what would x be in “let x = y”: variable or constructor?).
Occam is layout-based, using 2-space indents. This requires some pre-processing to insert indent/outdent markers. We use alex to tokenise and insert the markers, then use Parsec-2 to parse a stream of tokens into an abstract syntax tree (AST) after running a pre-processor stage on the token stream. To help resolve the ambiguities, we accumulate a symbol table in the parser, which is not nice — we would like to iron that out.
We started out using Scrap Your Boilerplate (SYB) as our generic programming method for implementing traversals of the AST in our nanopasses. The advantage of using generic programming for traversals is that you can write the code you’re interested in (e.g. apply this check to all parallel constructs in my AST) without writing the code you’re not interested in (if you find an if statement, here’s how to descend into it to find the parallel constructs). SYB turned out to be too slow for us and we ended up constructing our own generic programming framework, Alloy. It’s available on Hackage, and there is a paper (and a talk) if you’re interested.
Occam permits some powerful static safety checks. In every framework based on occam’s principles (including CHP) it is not possible to statically ensure that one-to-one channels are not going to be used wrongly in parallel (e.g. that the writing end will not be written to by two processes in parallel). The occam compiler does check this.
The algorithm is actually quite simple to implement in the AST. Wherever you find a PAR construct (which runs its branches in parallel), descend into all the branches and make a set of pairs of channel name and direction (input or output) that are used or passed as parameters to processes. Check all the sets against each other, and if any pair is present in more than one set, issue an error. With generic programming, making the set of pairs is just a generic query, and the whole thing is fairly straightforward. A similar check is performed to ensure that variables obey the Concurrent-Read Exclusive-Write (CREW) rule: that in parallel you either have no-one accessing a variable, only readers, or exactly one writer.
Another safety check involves arrays. Occam permits arrays of variables and arrays of channels, and performs safety checks on both. These checks are particularly tricky when combined with replicated parallel constructs. These are like parallel for loops; you can write PAR i = 0 FOR 10, and the code beneath will be replicated 10 times in parallel with indexes 0 through 9. The old occam compiler statically checked this by only allowing statically-known replication counts, and using brute force to expand out all the code and perform lots of known-constant checks on the array indices. Tock uses the Omega test (a form of Presburger arithmetic) to perform the same checks for dynamic replication counts. I spent a while pouring over the Omega Test paper to understand it (and produced some slides on it to help others understand it) and then implemented it in the compiler. I am still impressed by what it can do. Take this code, for example:
PROC p() INT a: PAR i = 0 FOR 10 SEQ a[i * 10] := 0 SEQ j = 1 FOR 10 a[(i*10) + j] := j :
This code declares an array of 100 INTs. The PAR i = 0 FOR 10 line runs the code below it 10 times in parallel, each with a different value of i (0 through 9, inclusive). This code is initialising an array of 100 items, by running these 10 parts in parallel, each of which initialises the beginning of a strip (indices 0, 10, 20, etc) to zero, and the rest of the strip (e.g. indices 1 through 9, 11 through 19) to be 1 to 9. In the above code, there is an error arising from incorrect indexing. The tock compiler spots this and gives the following error:
unsafe2.occ: 1: PROC p() 2: INT a: 3: PAR i = 0 FOR 10 Here-------^ 4: SEQ 5: a[i * 10] := 0 6: SEQ j = 1 FOR 10 7: a[(i*10) + j] := j 8: : tock: Error: unsafe2.occ:3:7 Indexes of array "a" ("i * 10 + j" and "i * 10") could overlap when: 1 <= i' <= 9 , j = 10 , 1 <= j' <= 10
That error message could perhaps do with being more helpful, but the important part is the most specific: when j is equal to 10, there can be an overlap between array indices being written to in parallel. The problem is that j shouldn’t be able to reach 10: the SEQ j = 1 FOR 10 line should be SEQ j = 1 FOR 9. If you are doing parallel imperative programming with arrays, this is a very useful error for the compiler to be able to spot. Once the line is corrected, tock can tell that it is safe, and will compile the file.
Tock is really a source-to-source compiler. Rather than producing native code, it produces C code (using the CCSP library) or C++ code (using the C++CSP library). This means that the code is as portable as the concurrency libraries themselves (almost every platform has a C/C++ compiler), and means that we get the C compiler to do all the optimisation of straight-line code for us, rather than writing that ourselves. I actually started work on a CHP backend for Tock, which I guess is fairly silly: compiling a low-level imperative language into a high-level functional language is not typically a good idea, but it was interesting. It involved lots of do blocks and let clauses to directly implement the imperative aspects in a sort of single-static assignment form crossed with a hint of continuation-passing. I’d like to finish it one day — but it’s a lot of work for no particular gain.
State of Play and Build System
Tock is just about feature complete. It lives in a public darcs repository if you are interested in it. The build system uses autoconf and automake, which is unusual for a Haskell project — most Haskell programs use cabal. There are several things that Tock needs that I don’t think cabal provides by itself:
- Two-stage compilation; due to our generics library, we must build one Haskell program and run it in order to generate more Haskell source that is then compiled alongside more source files to form Tock.
- C Tests; since we compile C using Tock, we must find out details about the C compiler (e.g. size of void* in C, or accepted flags) to give to Tock during the build process.
- Program availability; there are certain programs we require (e.g. gcc, alex), so we need to test for them during configuration.
That list is shorter than last time I looked at converting Tock to cabal, so perhaps it may now be possible to switch to cabal — I gather we might be able to keep our autoconf script for the tricky parts and use cabal for the rest. Any links on doing so would be appreciated! If we can get Tock cabalised, we can get it on Hackage, which would make it easier for people to download and use.
Almost all of our dependencies are in the Haskell Platform, so it would be nice to be able to say: install the Haskell Platform, install a concurrent run-time (e.g. CCSP), then issue “cabal install tock” and you’re done. What would also be good would be to get people working on the compiler. For example, Adam and I have wondered if we could add a Go frontend to our compiler, and thus bring the safety checks to bear on the Go language (and potentially to compile it down to our fast concurrency libraries). Unfortunately, neither Adam nor I have had time to work on Tock for a while now, but it would be a shame to let it stagnate.