Recently I have been optimising the core algorithms in CHP. Barrier synchronisations and channel communications in CHP use the same “event” algorithm that supports all of CHP’s features, including choice between events and conjunctions of events. Conjunction really makes the choice algorithms difficult. Here’s a quick example to illustrate. Some terminology: each process wanting to synchronise is “offering” a set of offers. Each offer-set is a choice (disjunction) of offers, only one of which should be completed. Each offer is a conjunction of several events (if you’re not using conjunction, this will be a single event). Here is a test (written in a testing EDSL) from CHP:
testD "Links 3" $ do
  [a, b, c, d, e] <- replicateM 5 $ evt 2
  p <- offer [b]
  q <- offer [b&c&d&e]
  r <- offer [c&a, d&a]
  s <- offer [e]
  t <- offer [a]
  always $ ??
So each of a–e is a two-party event, and there are five processes making offers (each list is an offer-set, each list item is a different offer, each offer is a single event or a conjunction of them). What should be the result? The event b can complete if p’s single offer and q’s single offer are taken, but for that to work, we’ll also need to be able to complete c, d, and e. If we look at process r, c can complete if a does (so we’ll need process t), but then d can’t complete as well, because only one of r’s offers may complete.
So the answer is that those offer-sets can’t complete yet, and all processes will have to wait. We want to be able to figure that out quickly, especially in the most common case where each process is offering just one offer, with just one event (i.e. they want to perform a single channel communication), and the other common case where a process is offering two or three single-event offers. So we want speed, but we also need correctness. Having a bug in such a core algorithm would make CHP very unstable. So far it seems that CHP has been correct in its algorithm, which is why I have been hesitant to optimise it before now. Ideally I would like to prove the algorithm correct but, for now, tests will have to do.
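The reasoning about the “Links 3” test can be checked with a brute-force search. This is only a minimal sketch and not CHP’s algorithm: the names (selections, completes, canComplete, links3) are made up for illustration, events are modelled as single characters, and I assume that an event completes when exactly the required number of chosen offers include it.

```haskell
import Data.List (nub)
import Data.Maybe (catMaybes)

type Event = Char
type Offer = [Event]     -- a conjunction of events
type OfferSet = [Offer]  -- a choice between offers

-- Every way of taking at most one offer from each process.
selections :: [OfferSet] -> [[Maybe Offer]]
selections = mapM (\offers -> Nothing : map Just offers)

-- A selection completes if at least one offer is taken and every event
-- involved has exactly the required number of participants.
completes :: Int -> [Maybe Offer] -> Bool
completes parties sel = not (null chosen) && all full events
  where
    chosen = catMaybes sel
    events = nub (concat chosen)
    full e = length (filter (e `elem`) chosen) == parties

-- Exhaustive (exponential) search: fine for a test, hopeless for real use.
canComplete :: Int -> [OfferSet] -> Bool
canComplete parties = any (completes parties) . selections

-- The offer-sets of p, q, r, s and t from the test above.
links3 :: [OfferSet]
links3 = [["b"], ["bcde"], ["ca", "da"], ["e"], ["a"]]
```

Here canComplete 2 links3 is False, matching the reasoning above. If r instead made the single offer c&d&a (that is, ["cda"]), all five processes could complete together.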
I knew from a recent exchange with a user of the library that CHP was horribly slow when offering choices between appreciable numbers of offers. Ten offers by one process was slow, and twenty would virtually halt the program, which wasn’t right at all. So I set up a simple set of micro-benchmarks: a single channel communication, and others with one “stressed” process offering to communicate with 2, 5, or 10 other processes (who were all eagerly waiting to communicate with the stressed process, and nothing else). None of these things should be slow. I developed my recently-released Progression library to help me see what difference the changes would make.
The old algorithm would look through all the connected offer-sets available (reading them from all the STM TVars) and then see if it could find a completion. I noticed that the offer-sets were being searched in an arbitrary order, rather than by the most likely to complete; a judicious sort at the right point provided surprising speed-up:
On the one hand, I’m pleased to have made such a gain so easily, and on the other I’m embarrassed that such a gain was so readily achievable, and hadn’t been done before. But I wasn’t done yet. The algorithm was needlessly expensive, and this resulted from the way I had decomposed the problem. I had one function that read in the value of all the offer-sets (in the STM monad), one that trimmed these offer-sets to iteratively remove offers that couldn’t complete (a pure function), one that searched for remaining possible completions (also pure) and one that wrote the results to the appropriate places (again, in the STM monad).
This approach previously seemed to break the problem down nicely (and separated the effectful parts from the non-effectful parts), but it’s horribly slow if you take a step back. All offer-sets are read in, even if the first is ready to complete. So you may touch far more TVars than you needed to, then spend time pruning the offers when the first was obviously ready to complete. So I rewrote the algorithm; I now have two (mutually recursive) functions in the STM monad that combine the reading in of offer-sets and the searching (without explicit iterative pruning). This yielded great benefits over the previous algorithm for larger choices; here is the difference between the three versions so far on a logarithmic scale:
(As a side note: what horrific time-complexity must my previous algorithm have had to display this curve on a log-scale?!) To my surprise, the details of the algorithm became shorter after I rewrote it to be faster. This was perhaps because I created a new monad to write my algorithm in (a caching, backtracking search monad) which pulled out some of the common logic. With a few extra tweaks, I was able to reduce the time even further. Here is the final optimisation I settled on, compared to that initial rewrite (back on a linear scale):
See how the performance is now roughly the same for each simpleChoice benchmark, no matter how many offers are being chosen between (because typically in this benchmark, the first one is always ready). I remain a little uneasy about releasing these changes. I have tested the algorithm a lot, but it’s still a rewrite of the core of CHP. I’ll release this soon as CHP 2.0.1 (there is no change in the API, hence the minor version increment). It should be generally faster, but please let me know if you have any odd behaviour that may indicate a bug. (I also plan to do some more optimisation in the future, but this batch will do for now.)
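The difference between reading every offer-set up front and searching while reading can be sketched in a few lines of STM. This is a simplified illustration, not CHP’s code: each offer-set is reduced to one TVar holding a value, the ready predicate stands in for the real completion check, and the names (countedRead, readAllThenSearch, searchWhileReading, touched) are hypothetical. A counter records how many TVars each approach touches.

```haskell
import Control.Concurrent.STM
import Data.List (find)

-- Read a TVar and bump a counter, so the number of TVars touched is visible.
countedRead :: TVar Int -> TVar a -> STM a
countedRead counter tv = modifyTVar' counter (+ 1) >> readTVar tv

-- Old shape: read everything first, then search the values purely.
readAllThenSearch :: TVar Int -> (a -> Bool) -> [TVar a] -> STM (Maybe a)
readAllThenSearch counter ready tvs =
  find ready <$> mapM (countedRead counter) tvs

-- New shape: interleave reading and searching, stopping at the first success.
searchWhileReading :: TVar Int -> (a -> Bool) -> [TVar a] -> STM (Maybe a)
searchWhileReading _ _ [] = return Nothing
searchWhileReading counter ready (tv : tvs) = do
  v <- countedRead counter tv
  if ready v then return (Just v) else searchWhileReading counter ready tvs

-- Run a search over n offer-sets that are all ready; report the TVars read.
touched :: (TVar Int -> (Bool -> Bool) -> [TVar Bool] -> STM (Maybe Bool))
        -> Int -> IO Int
touched search n = do
  counter <- newTVarIO 0
  tvs <- mapM (const (newTVarIO True)) [1 .. n]
  _ <- atomically (search counter id tvs)
  readTVarIO counter
```

With ten ready offer-sets, touched readAllThenSearch 10 gives 10 while touched searchWhileReading 10 gives 1, which is the shape of saving seen in the simpleChoice benchmarks when the first offer is ready.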
I don’t want to go into all the gory details of what I did to optimise the library, as that would involve explaining the full algorithm, which would be a much longer post than I have time for. The changes that brought the most gain were algorithmic. Inserting the sort to change the search order made a big difference, and rewriting the algorithm to exit the search as soon as possible (and to touch as few TVars as possible) made most of the remaining difference. After that, I was making changes such as adding a little caching of certain calculations to make sure that events and offers were never unnecessarily processed twice (by using a StateT monad on top of STM). My monad is a backtracking monad (well, sort of: I use an Applicative Alternative instance to explicitly denote searching of several possible paths, with something a lot like a MaybeT monad) but the caching is preserved across the backtracking as you would hope. Another change I made (but may revisit) was switching from using Map and Set from the standard libraries to using ordered lists for the same purpose. In my common cases these maps and sets will hold 1 to 3 items, which is why I was able to see some speed-up from a simpler structure. I originally put these ordered lists in a newtype, but it turned out that removing the newtype wrapper also gained me some speed-up.
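To show what “caching preserved across backtracking” means, here is a small pure sketch. It is not the monad used in CHP (which sits on top of STM): Search, checkEvent, tryOffer and firstOffer are hypothetical names, events are plain Ints, and even stands in for an expensive check. The key point is in the Alternative instance, where a failed branch still hands its updated state to the next branch.

```haskell
import Control.Applicative (Alternative (..))

-- A computation that may fail, but always returns the (possibly updated) state.
newtype Search s a = Search { runSearch :: s -> (Maybe a, s) }

instance Functor (Search s) where
  fmap f (Search g) = Search $ \s -> let (r, s') = g s in (fmap f r, s')

instance Applicative (Search s) where
  pure x = Search $ \s -> (Just x, s)
  Search f <*> Search g = Search $ \s ->
    case f s of
      (Nothing, s') -> (Nothing, s')
      (Just h, s') -> let (r, s'') = g s' in (fmap h r, s'')

instance Monad (Search s) where
  Search g >>= k = Search $ \s ->
    case g s of
      (Nothing, s') -> (Nothing, s')
      (Just x, s') -> runSearch (k x) s'

instance Alternative (Search s) where
  empty = Search $ \s -> (Nothing, s)
  Search f <|> Search g = Search $ \s ->
    case f s of
      (Nothing, s') -> g s'  -- backtrack, but keep what was cached
      success -> success

-- Cached results, plus a count of how often the real check actually ran.
type Cache = ([(Int, Bool)], Int)

-- Stand-in for an expensive per-event check; results are remembered.
checkEvent :: Int -> Search Cache Bool
checkEvent e = Search $ \(cache, n) ->
  case lookup e cache of
    Just r -> (Just r, (cache, n))
    Nothing -> let r = even e in (Just r, ((e, r) : cache, n + 1))

-- An offer (a conjunction) succeeds only if all of its events pass.
tryOffer :: [Int] -> Search Cache [Int]
tryOffer es = do
  oks <- mapM checkEvent es
  if and oks then return es else empty

-- Try each offer in turn, taking the first that succeeds.
firstOffer :: [[Int]] -> Search Cache [Int]
firstOffer = foldr (\o rest -> tryOffer o <|> rest) empty
```

Running runSearch (firstOffer [[2, 3], [2, 4]]) ([], 0) fails on the first offer and succeeds on the second with Just [2, 4], and the check counter ends at 3 rather than 4, because the result for event 2 survived the backtrack.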
It has not escaped my notice that supporting conjunction in CHP’s events algorithm is really an instance of the “other” CSP: a Constraint Satisfaction Problem. But it does have some interesting performance characteristics (e.g. reading from fewer events can help a lot) and some useful aspects that I can optimise on (e.g. if one process was offering event “e” anywhere in its offers, and you choose one without “e”, “e” can no longer complete; this is not as obvious as it sounds), which is why I have my own algorithm for resolving the choices.
I have recently been working on optimising CHP. In order to optimise a program, you need a set of benchmarks/tasks for which you want to improve performance. Then, you need to record how long the original version takes to complete the benchmark, record how long a changed version takes to complete the benchmark and compare the two to see if your change was sufficiently beneficial. This process is tedious, and staring at two columns of numbers is not always instructive, so I constructed a small Haskell library to help with optimisation. I’ve called it Progression, and it is now available on Hackage. Progression allows you to specify some benchmarks, run them, and graph their performance against each other; so what you get is a graph like this:
Each line is a different version of my CHP library, and each group of points is a different benchmark. (The versions were made in the order purple, blue, green, red; I ultimately stuck with the green version.)
Progression uses the excellent Criterion library for doing the benchmarking, so it is used in a similar fashion to Criterion. You construct a small wrapper program that defines/imports your benchmarks and passes them to the Progression library to be run. For example, here is the end of my file containing my CHP benchmarks:
runAll = defaultMain . bgroup "" . map (uncurry bench . second runCHP_)

main = runAll [("SimpleChannel", simpleChannel)
              ,("SimpleChoice 2", choice 2)
              ,("SimpleChoice 5", choice 5)
              ,("SimpleChoice 10", choice 10)
              ]
My runAll function turns the second part of each pair from a CHP () type into IO (), then I map the (Criterion) function bench over the list, use the (Criterion) function bgroup to join them all together, then pass them to the (Progression) function defaultMain. I compile this file into an executable.
When you run the program, you can either pass in settings via the command-line arguments, or if they are not present, you will be prompted with an interactive tab-completing prompt (thanks to the easy-to-use Haskeline library). There are three main settings:
- which benchmark groups you want to run (if you only have one group, you won’t be prompted),
- what to label the recording you are about to make, and
- which previous labels you want to compare against in the graph.
Once you’ve entered these two or three items, the benchmarks will all be run by Criterion and the results stored into a CSV file (using the handy txt-sushi library; this way you can easily inspect the data manually or process it further in a spreadsheet program), which is then combined with the previously stored CSV files and fed to gnuplot. (Unfortunately the gnuplot binding in Haskell is not in great shape at the moment, so I’m using a plain system call to run it. If you want graphs, make sure you have gnuplot installed on your system.) The program creates a graph like the one shown at the beginning of the post, by default in the file plot.png (but again, configurable by command-line option). If, later on, you want to just graph previous results against each other, you can do that by running the program with “-m graph”. On the graph you get points plotted at the means (staggered for each benchmark to aid readability and comparison) and error bars for the bounds that Criterion gives. By default these are the 95% confidence intervals, but that is configurable, too (by passing through an option to Criterion).
1. Make sure gnuplot is installed on your system and available in your path. Gnuplot is very likely to be in your package manager if you’re on Linux.
2. cabal update && cabal install progression
Note that by default, Criterion uses the Chart library for plotting (a feature of Criterion that Progression does not use), which in turn depends on gtk2hs, which can be problematic for some people to install. If you get messages about cairo not being installed and you don’t want to install gtk2hs, you can install Criterion without plotting support (and thus without the gtk2hs dependency). The command cabal install criterion -f-Chart should do this (the minus between f and Chart is crucial). Unfortunately, it seems that you must then install progression manually by downloading it and running Setup.lhs (I had hoped cabal install progression would work, but that seems to attempt to satisfy the Chart dependency even though criterion is by then installed).
I’ve now uploaded the progression repository onto patch-tag. You can get a copy using darcs get http://patch-tag.com/r/twistedsquare/progression if you want the very latest version or wish to contribute some patches.
Issues with the 0.1 release
I realise that the graph is technically invalid. I shouldn’t connect the points with a line because the X-axis is discrete, not continuous. However, without the line (i.e. with just points and error bars) it’s much less readable at a glance, and a bar chart with error bars didn’t seem too readable either when I tried it. The graph display still isn’t perfect though; it works best when you have benchmarks that take roughly similar times, and if you make one huge saving on one benchmark, as I did (a factor of about 100), this throws off the whole display. Normalising all the times, so that one of the versions has all its times normalised to one, would partially fix the problems. Also, if you run a lot of benchmarks, the CSV files do start to litter the directory; I wonder if I should store them in a subdirectory.
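The normalisation idea can be sketched as a small pure function. This is not part of Progression: normalise is a hypothetical name, and I assume the timings for one benchmark are held as a list of (version label, mean time) pairs.

```haskell
-- Scale one benchmark's timings so that the baseline version's time becomes 1.
-- Gives Nothing if the baseline is missing or has a non-positive time.
normalise :: String -> [(String, Double)] -> Maybe [(String, Double)]
normalise baseline row = do
  base <- lookup baseline row
  if base > 0
    then Just [(version, time / base) | (version, time) <- row]
    else Nothing
```

For example, normalise "old" [("old", 8.0), ("new", 2.0)] gives Just [("old", 1.0), ("new", 0.25)]. Applying this to each benchmark separately would put them all on a comparable scale, although a hundred-fold saving would still stretch the axis.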
That’s the summary of Progression. I hope that Progression helps people who are optimising Haskell programs (especially alongside profiling and tools such as ThreadScope, that can help pinpoint possible candidates for optimisation). But the work-flow is not perfect. Often, not all the benchmarks are written and complete when you start benchmarking; you both make changes and add new benchmarks as you work. Currently, when graphing, Progression ignores any benchmarks for which it does not have data for every recording being graphed (this should perhaps be fixed). Ideally, you would re-run the old version of your library/program (for example, the original version before you began optimising) with the benchmarks to get some data. For this, we need access to an old version. All developers store their projects in version control (and with Haskell, that’s typically darcs) so we could automatically pull out the old version from there rather than making the programmer do it themselves.
So perhaps what we would do is dig through the version history for the latest tag starting “OPT:” (which is the “original” for this current round of optimisation), then re-run that version with the latest benchmarks. In fact, why restrict it to just the original? We could look back to find the last tag with “OPT:”, then work forwards from there, looking for patches, marking out those starting “BENCH:” (the ones that introduce new benchmarks). We would then try and create versions of your program from the OPT tag forwards, with all the different versions since then, but always featuring all the BENCH patches. This would give us complete benchmark data for all old versions, and would also mean you wouldn’t have to tell Progression what the labels are, it could just use the patch names/dates. We could also try different combinations of patches (if you’ve recorded A, B, and C, where B depends on A, try A, A&B, A&B&C, A&C) to see how different independent changes combine together to find the fastest combination.
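Enumerating the patch combinations worth benchmarking could look something like the sketch below. It is purely illustrative and does not talk to darcs: patches are plain strings, dependencies are supplied by hand, and closed and combinations are hypothetical names. A combination is kept only if every patch in it has all of its dependencies present.

```haskell
import Data.List (subsequences)
import Data.Maybe (fromMaybe)

type Patch = String

-- True if every patch in the set has all of its dependencies in the set.
closed :: [(Patch, [Patch])] -> [Patch] -> Bool
closed deps ps = all depsPresent ps
  where
    depsPresent p = all (`elem` ps) (fromMaybe [] (lookup p deps))

-- All non-empty, dependency-respecting combinations of the given patches.
combinations :: [Patch] -> [(Patch, [Patch])] -> [[Patch]]
combinations patches deps =
  filter (closed deps) (filter (not . null) (subsequences patches))
```

If both B and C depend on A, combinations ["A", "B", "C"] [("B", ["A"]), ("C", ["A"])] gives A, A&B, A&C and A&B&C. If only B depends on A, then C on its own also appears as a candidate. The number of combinations grows exponentially with the number of independent patches, so a real tool would need to limit the search.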
I’m not sure how much work that would be, but it sounds like it might form quite a useful tool. Or, it might turn out to be too over-the-top and awkward — it might be best to just stick to Progression’s current straightforward design. Comments are welcome below, as are any questions or feedback about the current release of Progression.