I have recently been working on optimising CHP. In order to optimise a program, you need a set of benchmarks/tasks for which you want to improve performance. Then, you need to record how long the original version takes to complete the benchmark, record how long a changed version takes to complete the benchmark and compare the two to see if your change was sufficiently beneficial. This process is tedious, and staring at two columns of numbers is not always instructive, so I constructed a small Haskell library to help with optimisation. I’ve called it Progression, and it is now available on Hackage. Progression allows you to specify some benchmarks, run them, and graph their performance against each other; so what you get is a graph like this:
Each line is a different version of my CHP library, and each group of points is a different benchmark. (The versions were made in the order purple, blue, green, red; I ultimately stuck with the green version.)
Progression uses the excellent Criterion library for doing the benchmarking, so it is used in a similar fashion to Criterion. You construct a small wrapper program that defines/imports your benchmarks and passes them to the Progression library to be run. For example, here is the end of my file containing my CHP benchmarks:
runAll = defaultMain . bgroup "" . map (uncurry bench . second runCHP_) main = runAll [("SimpleChannel", simpleChannel) ,("SimpleChoice 2", choice 2) ,("SimpleChoice 5", choice 5) ,("SimpleChoice 10", choice 10) ]
My runAll function turns the second part of each pair from a CHP () type into IO (), then I map the (Criterion) function bench over the list, use the (Criterion) function bgroup to join them all together, then pass them to the (Progression) function defaultMain. I compile this file into an executable.
When you run the program, you can either pass in settings via the command-line arguments, or if they are not present, you will be prompted with an interactive tab-completing prompt (thanks to the easy-to-use Haskeline library). There are three main settings:
- which benchmark groups you want to run (if you only have one group, you won’t be prompted),
- what to label the recording you are about to make, and
- which previous labels you want to compare against in the graph.
Once you’ve entered these two or three items, the benchmarks will all be run by Criterion, the results stored into a CSV file (using the handy txt-sushi library; this way you can easily manually inspect the data or further process it in a spreadsheet program), which is then combined with the previous stored CSV files and fed to GNUplot. (Unfortunately the gnuplot binding in Haskell is not in great shape at the moment, so I’m using a plain system call to run it — if you want graphs, make sure you have GNUplot installed on your system.) The program creates a graph like the one shown at the beginning of the post — by default in the file plot.png (but again, configurable by command-line option). If, later on, you want to just graph previous results against each other, you can do that by running the program with “-m graph”. On the graph you get points plotted at the means (staggered for each benchmark to aid readability and comparison) and error bars for the bounds that Criterion gives — by default these are the 95% confidence intervals, but that is configurable, too (by passing through an option to Criterion).
1. Make sure gnuplot is installed on your system and available in your path. Gnuplot is very likely to be in your package manager if you’re on Linux.
2. cabal update && cabal install progression
Note that by default, Criterion uses the Chart library for plotting (a feature of Criterion that Progression does not use), which is in turn dependent on gtk2hs, which can be problematic for some people to install. If you get messages about cairo not being installed and you don’t want to install gtk2hs, you can install Criterion without this support for plotting (and thus without the gtk2hs dependency). The command cabal install criterion -f-Chart should do this (the minus between f and Chart is crucial), but unfortunately it seems after that, that you must install progression manually by downloading it and running Setup.lhs (I had hoped cabal install progression would work but that seems to attempt to satisfy the Chart dependency even though criterion is by then installed).
I’ve now uploaded the progression repository onto patch-tag. You can get a copy using darcs get http://patch-tag.com/r/twistedsquare/progression if you want the very latest version or wish to contribute some patches.
Issues with the 0.1 release
I realise that the graph is technically invalid. I shouldn’t connect the points with a line because the X-axis is discrete, not continuous. However, without the line (i.e. with just points and error bars) it’s much less readable at a glance, and a bar chart with error bars didn’t seem too readable either when I tried it. The graph display still isn’t perfect though; it works best when you have benchmarks that take roughly similar times, and if you make one huge saving on one benchmark, as I did (a factor of about 100), this throws off the whole display. Normalising all the times, so that one of the versions has all its times normalised to one, would partially fix the problems. Also, if you run a lot of benchmarks, the CSV files do start to litter the directory; I wonder if I should store them in a subdirectory.
That’s the summary of Progression. I hope that Progression helps people who are optimising Haskell programs (especially alongside profiling and tools such as ThreadScope, that can help pinpoint possible candidates for optimisation). But the work-flow is not perfect. Often, not all the benchmarks are written and complete when you start benchmarking; you both make changes and add new benchmarks as you work. Currently, when graphing, Progression ignores any benchmarks for which it does not have data for every recording being graphed (this should perhaps be fixed). Ideally, you would re-run the old version of your library/program (for example, the original version before you began optimising) with the benchmarks to get some data. For this, we need access to an old version. All developers store their projects in version control (and with Haskell, that’s typically darcs) so we could automatically pull out the old version from there rather than making the programmer do it themselves.
So perhaps what we would do is dig through the version history for the latest tag starting “OPT:” (which is the “original” for this current round of optimisation), then re-run that version with the latest benchmarks. In fact, why restrict it to just the original? We could look back to find the last tag with “OPT:”, then work forwards from there, looking for patches, marking out those starting “BENCH:” (the ones that introduce new benchmarks). We would then try and create versions of your program from the OPT tag forwards, with all the different versions since then, but always featuring all the BENCH patches. This would give us complete benchmark data for all old versions, and would also mean you wouldn’t have to tell Progression what the labels are, it could just use the patch names/dates. We could also try different combinations of patches (if you’ve recorded A, B, and C, where B depends on A, try A, A&B, A&B&C, A&C) to see how different independent changes combine together to find the fastest combination.
I’m not sure how much work that would be, but it sounds like it might form quite a useful tool. Or, it might turn out to be too over-the-top and awkward — it might be best to just stick to Progression’s current straightforward design. Comments are welcome below, as are any questions or feedback about the current release of Progression.
The CHP library is built on top of Software Transactional Memory (STM). Ever since I started this blog with the boids example, I have been looking to optimise CHP, part of which thus involves using STM as efficiently as possible. Looking at the code, I realised that my (long-running) transactions often write to a transactional variable early in a transaction, then later in the transaction they read back from the variable and write to it again. According to the STM paper (section 6.1), such behaviour should result in a local caching of the value, and ultimately there will only be one read from the variable to begin with, and one read-and-write to the variable during the commit. But presumably the lookup in this local cache will come at a cost (which I could perhaps avoid).
I constructed a toy benchmark to test this. One version of my program (named modFunc) reads from a transactional variable, applies two functions, and writes the value back. The other version (named modRW) reads, applies one function and writes back — then reads, applies the second function and writes this value back. This should hopefully illustrate the difference between reading the variable once, and writing+re-reading the variable.
To time this difference, I will use Bryan O’Sullivan’s new criterion benchmarking library, which performs some statistics and produces some nice blogable graphs. Here is the full code to benchmark modRW against modFunc:
import Control.Arrow (second) import Control.Concurrent (forkIO) import Control.Concurrent.STM import Control.Monad (replicateM_, when) import Control.Parallel.Strategies (NFData, rnf, (>|), using) import Criterion.Main (bench, defaultMain) data Foo = Foo Int Integer instance NFData Foo where rnf (Foo a b) = rnf a >| rnf b f, g :: Foo -> Foo f (Foo a b) = Foo (succ a) b `using` rnf g (Foo a b) = Foo a (pred b) `using` rnf modFunc, modRW :: TVar Foo -> STM () modFunc tv = readTVar tv >>= writeTVar tv . f . g modRW tv = do readTVar tv >>= writeTVar tv . f readTVar tv >>= writeTVar tv . g t :: Int -> (TVar Foo -> STM ()) -> IO () t n mf = do tv <- atomically $ newTVar $ Foo 0 0 tve <- atomically $ newTVar n let end = atomically $ readTVar tve >>= writeTVar tve . pred task = replicateM_ 1000000 (atomically $ mf tv) replicateM_ n $ forkIO $ task >> end atomically $ readTVar tve >>= flip when retry . (/= 0) main :: IO () main = defaultMain $ map (uncurry bench . second (uncurry t)) [ ("modFunc-1", (1, modFunc)) , ("modRW-1", (1, modRW)) ]
I ran the benchmarks on a 64-bit Debian machine with GHC 6.10.3 (criterion doesn’t yet work on 6.12 because its dependency uvector doesn’t). The results are below:
Note that the times are for a million iterations — so read that axis as nanoseconds, not milliseconds, for an individual call of modFunc/modRW. The version that reads and writes the variable twice is about 1.5x as slow as the version that reads and writes once. Further tests indicate that this factor increases towards 2x if you have multiple variables involved in the transaction. The transaction is still quite fast in absolute terms though — mainly this was a chance to try out the criterion benchmarking tool. I hope to use it more in future to benchmark parts of CHP’s core.