## Criterion’s Confidence Intervals: The Bootstrap

I’ve been doing some benchmarking with Criterion recently, and wanted to write up the results. However, I didn’t really understand how it was arriving at some of the statistics that it spits out, particularly the confidence intervals. One book on the subject later, and I think I’ve got it. I thought I’d write it down in case anyone else needed to know. (I’m not a statistics expert though, so please do send corrections if necessary.) The mean result from Criterion is straightforward; the benchmark is run (typically 100 times) and the mean is calculated in the standard way (total time divided by count). But Criterion also gives 95% confidence intervals for the mean, and these require some more detailed explanation.

#### Confidence Intervals

Confidence intervals are a notoriously tricky part of statistics. A confidence interval of 95% means that if you run many separate experiments, and calculate a confidence interval for each of them, 95% of those intervals will contain the true mean. The kicker being that you have no way of knowing whether the interval you have for your one experiment is one of the 95% that contain the mean, or one of the 5% that don’t.

One way to calculate a confidence interval is to make an assumption about the distribution of the data. It is important to realise that this is an assumption. For example, you might be tempted to assume the times were normally distributed — but do you really believe that your times come from a perfectly symmetric bell-curve? Run times are often skew, distributed more exponentially, or simply too noisy to pick a distribution that could have generated them. (You can perform a statistical test for the validity of the normality assumption in your sample, if you’re interested.) If you did pick a distribution and calculate the parameters — e.g. the variance of a normal distribution — you could say where 95% of the values are likely to fall because you know what shape the distribution is. But if you don’t know the shape of the real distribution (from which your noisy sample was drawn), you need to use a different technique; Criterion uses the bootstrap.

#### The Bootstrap

Imagine for a moment that rather than just one mean, you instead had 1000 means. A straightforward way to form a 95% confidence interval would be to arrange your sample times in order, and pick the 2.5% mark (roughly the 25th highest mean) and the 97.5% mark (roughly the 975th highest mean). This method makes no new assumptions, and would derive the interval just from the data. Of course, you don’t have the time to wait around and record enough samples to form 1000 means. (Some of my benchmarks took in the order of 20 minutes per benchmark to collect the 100 values for a single mean; for 1000 *means* I would have to wait a long time!)

The bootstrap technique generates a lot of means, by “faking” a lot more samples than you actually recorded and taking the mean of each of them. It takes your single sample of 100, and generates many more samples (in Criterion, 100000 by default) by randomly picking 100 values out. Obviously, if you picked 100 values from a set of 100 without replacement you’d always end up with the starting set; the bootstrap picks with replacement, so it is very likely to feature repeated original values in the new fake sample, and also to omit some of the original values in each new fake. The confidence interval can be formed from these fake samples as described in the previous paragraph.

That is the intuition of the bootstrap method of calculating confidence intervals. In fact, Criterion uses bootstrap with bias correction and acceleration. These two extra measures affect how the confidence intervals are calculated by attempting to adjust for skew and sensitivity in the data. Unfortunately the reference books I have available are not very illuminating on the intuition behind bias and acceleration — so you and I will have to remain in the dark, pending a better reference.

#### The Code

Often, programmers learn a system best by tinkering and seeing what happens. In that spirit, here is a little code that takes a sample and prints out the bootstrapped mean confidence intervals for it (as Criterion would derive them):

import Control.Monad.ST (runST) import Data.Array.Vector (toU) import Statistics.Sample (mean) import Statistics.Resampling (resample) import Statistics.Resampling.Bootstrap (bootstrapBCA) import System.Random.MWC (create) sample :: [Double] sample = [1..10]++[20] main :: IO () main = print.runST$do g <- create resamples <- resample g [mean] 100000 sampleU return$bootstrapBCA 0.95 sampleU [mean] resamples where sampleU = toU sample

You’ll need the Hackage packages statistics, uvector and mwc-random. For the above, the output (to 2 d.p.) is:

[Estimate {estPoint = 6.82, estLowerBound = 4.64, estUpperBound = 11.00, ..}]

You can see that the outlier 20 pulls the upper bound of the confidence interval beyond all the other values. Fiddle with `sample` yourself and see what happens.

## Accelerate!

I’ve just released chp-2.1.0 and chp-plus-1.1.0. The former features several rounds of optimisation, and the addition of priority for channels and barriers. The latter features a few new instances for Connectable that involve shared channels.

I’ve also now created a public darcs repository on patch-tag that holds the CHP code:

darcs get http://patch-tag.com/r/twistedsquare/chp

While I’m tidying things up: I wired up the blog to a twitter account a while ago and then never told anyone about it. If you want to follow the blog through twitter (I don’t yet use the account for much more than that), you can do so.

## Processing without Buffering

While giving a talk last week, someone asked about obeying the arrow laws with CHP processes. The long and the short of it is that the arrow laws mean that `arr id >>> arr id` should be equivalent to `arr id`. But with the instance I have in chp-plus, the former is two processes and the latter one — and each process introduces one place of buffering. If you put in the same input you’ll get the same output with both, but with different buffering behaviour (this is already pointed out in the documentation). The same issue came up again later in the day with process fusion (more on that another time).

#### Extended Inputs

The way to resolve the problem of how much buffering to introduce is quite simple in retrospect: don’t introduce any buffering, anywhere. Then there is no issue of adding or removing it with the laws. We can easily eliminate buffering in CHP using extended input. Extended input means that when you read from a channel, you keep the writer waiting longer than usual, while the reader performs an extra action. The API in CHP is quite straightforward:

extReadChannel :: ReadableChannel r => r a -> (a -> CHP b) -> CHP b

The function takes an incoming channel end, an action to perform with the value, and then performs an extended input using your given action. Note that if you need the read value after the action, you’ll have to return it yourself as part of the “`b`” type.

Here is the relevant extended identity process, compared to the identity process (minus poison handling):

id, extReadId :: Chanin a -> Chanout a -> CHP () id = forever (readChannel input >>= writeChannel output) extReadId = forever (extReadChannel input (writeChannel output))

They share the same type but have slightly different behaviour. The key thing is that two of these extended identity processes composed next to each other have the same behaviour as one of these extended identity processes: `extReadId <=> extReadId = extReadId`. So if we build our arrow instance on these, we can avoid the buffering problem and fully obey the arrow laws.

#### Proof

This claim of identical behaviour can actually be proved using the FDR model checker for CSP. FDR doesn’t know about extended inputs, but an extended input can be formalised using a channel communication followed by an acknowledgement; the extended action takes place before the acknowledgement is sent. So here is the extended identity process from above in FDR script, parameterised by two pairs of data and ack channels:

EXTID(left, leftack, right, rightack) = left?x -> right!x -> rightack?ack -> leftack!ack -> EXTID(left, leftack, right, rightack)

FDR has an awkwardness that all its channels must be declared at the top-level, hence we declare the extra “middle” channels used to compose two of these processes at the top level:

datatype ACK = ack channel midack2C : ACK channel mid2C EXTID2(left, leftack, right, rightack) = EXTID(left, leftack, mid2C, midack2C) [| {| mid2C, midack2C |} |] EXTID(mid2C, midack2C, right, rightack) \ {| mid2C, midack2C |}

The two composed processes synchronise together on `mid2C` and `midack2C` (but not the other channels). We then hide (backslash is hide) these events so that they are not visible outside. Without this hiding we would not have equivalence because some outside process could synchronise on `mid2C` and break everything; hiding makes sure these events are contained. All we then need are some more channel declarations and our assertions of equality:

channel leftackC, rightackC : ACK channel leftC, rightC assert EXTID(leftC, leftackC, rightC, rightackC) [FD= EXTID2(leftC, leftackC, rightC, rightackC) assert EXTID2(leftC, leftackC, rightC, rightackC) [FD= EXTID(leftC, leftackC, rightC, rightackC)

The last two lines express refinement; you can think of this as expressing that one process is as general in its behaviour as another. The refinement operator is anti-symmetric, so expressing it twice in different directions like this is asserting equality of the two processes — a strong relationship. Putting all the above together we can run FDR and get the vital output:

This FDR release is for academic teaching and research purposes only. For any other use, please contact Formal Systems (Europe) Ltd at enquiries@fsel.com to obtain a commercial licence. Checking EXTID(leftC,leftackC,rightC,rightackC) [FD= EXTID2(leftC,leftackC,rightC,rightackC) ...sniptrueChecking EXTID2(leftC,leftackC,rightC,rightackC) [FD= EXTID(leftC,leftackC,rightC,rightackC) ...sniptrue

Inductively, if composing two extended identity processes together is the same as one extended identity process, any chain of N extended identity processes is equivalent to a single one. And the map process (on which the arrow instance is based) has the same communication pattern, so the result applies there too if we construct an extended map process: `extMap f <=> extMap g = extMap (f.g)`

## Upcoming Talk

Next Monday morning I’m going to be in Portland (Oregon), and will be giving a talk at Galois on the CHP library. On the off-chance that there are any readers from that area: the talk is open to all, so feel free to come along. It will be an introduction to the library and some of its capabilities.

*Update: the talk seemed to go well; here are the Galois talk slides (with notes). I’ve added various notes amongst the slides to help it be understandable without me gesturing at them. The chats afterwards with some of the Galois guys were very useful and interesting, and have given me a lot to think (and blog) about in CHP.*

## Synchronous Channels using MVars

An MVar in Haskell is a shared variable that is either full, or empty. Trying to write to a full one, or read from an empty one, will cause you to block. It can be used as a one-place buffered asynchronous channel. Consider if you didn’t want choice, or conjunction or any of the fancy features of CHP, but you do want to build a synchronous channel using MVars. How would you do it?

#### MVars — The Obvious Way

There is a very straightforward way to turn asynchronous channels into synchronous channels: you form a pair of channels, and use one for the writer to send the reader the data, and the other for the reader to send the writer an acknowledgement:

data SimpleChannel a = SimpleChannel (MVar a) (MVar ()) newSimpleChannel :: IO (SimpleChannel a) newSimpleChannel = liftM2 SimpleChannel newEmptyMVar newEmptyMVar writeSimpleChannel :: SimpleChannel a -> a -> IO () writeSimpleChannel (SimpleChannel msg ack) x = putMVar msg x>>takeMVar ack readSimpleChannel :: SimpleChannel a -> IO a readSimpleChannel (SimpleChannel msg ack) = takeMVar msg<*putMVar ack ()

Let’s assume that context-switching is the major cost in these algorithms, and examine how many times a process must block in the above algorithm. We know that this must be at least one; whoever arrives first will have to block to wait for the second participant.

We’ll start with what happens if the writer arrives first. The writer arrives, puts the value into the data MVar, then blocks waiting for the ack. The reader arrives, takes the data and sends the ack, at which point the writer wakes up. So in this case: one block.

If the reader arrives first, it will block waiting for the data MVar. The writer will arrive, put the value into the data MVar, then block waiting for the ack. The reader will wake up, take the data and send the ack; then the writer will wake up. So here we had two blocks. The writer blocking is unnecessary; if the reader was already there waiting, there is no need for the writer to wait for an ack, it could just deposit the value and go — *if it knew the reader was there*.

#### MVars — The Faster Way

There are several ways to remove that second block. One way is to have an MVar as a sort of status MVar. When the reader or writer arrives, they try to put into this MVar. If they succeed, they are first and they wait on a second MVar. If they fail, they are second and act accordingly, emptying the status variable and waking up the first party:

data FastChannel a = FastChannel (MVar (Maybe a)) (MVar ()) (MVar a) newFastChannel :: IO (FastChannel a) newFastChannel = liftM3 FastChannel newEmptyMVar newEmptyMVar newEmptyMVar writeFastChannel :: FastChannel a -> a -> IO () writeFastChannel (FastChannel sts ack msg) x = do first <- tryPutMVar sts (Just x) if first then takeMVar ack-- will blockelse takeMVar sts>>putMVar msg x readFastChannel :: FastChannel a -> IO a readFastChannel (FastChannel sts ack msg) = do first <- tryPutMVar sts Nothing if first then takeMVar msg-- will blockelse (fromJust<$>takeMVar sts)<*putMVar ack ()

This version is, in my benchmarks, twice as fast as the first version, which suggests that context-switching really is the expensive part of these algorithms. In fact, I started out with the first version in thi spost, but CHP’s more featured and complex algorithms were coming out faster because I only ever block once. It was only when I improved the MVar version to the second one above that the results were as I expected.

## Choice over Events using STM

I’m currently writing a paper on CHP’s performance for conjunction, which I have been optimising recently. The problem with a new feature like conjunction is that there is nothing else to benchmark it against! But I can benchmark the effect that supporting conjunction has on performance for simple channel communications and other things that don’t feature conjunction.

Two of my comparisons are simple synchronous channels based on MVars and STM. These don’t support choice between events — you can’t choose between writing to two synchronous channels built on MVars or STM without more machinery on top. But they are fast. Another comparison is the CML package, which does support such choice between events — the performance of CML merits its own post some time (in short: fine normally, but terrible if you use its choice operator a lot — unless I have made a mistake in the benchmark).

I also wanted to benchmark an implementation that supported choice but not conjunction, based on STM. Version 1.0.0 of CHP fulfils this criteria, but was badly designed and totally unoptimised — and I know from my recent optimisations how bad the performance might be. So I constructed an optimised version of channels with choice but no conjunction. I was surprised at how short the algorithm was, and realised that it could be explained in a blog post. So here it is.

#### Implementing Event Choice with STM

Let’s be clear on the problem, first. I want an Event type such that I can say “wait for event A or event B, whichever happens first, but only engage in one of them”. Then I want someone else to be able to concurrently say “wait for event B or event C or event D, whichever happens first, but only engage in one of them” and have the algorithm resolve it all. STM doesn’t achieve this by itself; you can use `orElse` to choose between writing to two variables, but that doesn’t suffice for multiple people engaging in events with each other.

We begin with a helper function — one of those functions that is general enough that it might almost feature in a standard library. Here it is:

-- | Executes the actions until it finds one that returns True (at which point-- it will execute no further actions). Returns True if an action did, False-- if none of them did.anyM :: Monad m => [m Bool] -> m Bool anyM = foldM orM False where orM True _ = return True orM False m = m

Next we’ll declare our data-types. Our Event contains a constant enrollment count (the number of processes required to synchronise together), and a transactional variable holding a list of current offers, each with an associated STM action. An offer is in turn a list of events which uses a ThreadId as a unique identifier; think of an offer as saying: I offer to engage in exactly one of the events in the list, and I’m waiting until I do:

data Offer = Offer { offerThreadId :: ThreadId, offerEvents :: [Event] } instance Eq Offer where (==) = (==)`on`offerThreadId data Event = Event { enrollCount :: Int, offersTV :: TVar [(STM (), Offer)] }

Adding an offer to an event is as simple as adding it to the list of offers. My `modifyTVar'` function has type `(a -> a) -> TVar a -> STM ()` and applies the modification function to the contents of the TVar, but it adds a little strictness that helps performance:

recordOffer :: Offer -> (STM (), Event) -> STM () recordOffer o (act, e) = modifyTVar' ((act, o):) (offersTV e)

We also define a function for checking if an event is able to complete (when we are making an offer on it). This takes an event, and an action to perform if the event can complete. It then reads the current offers from the event’s transactional variable. If the enrollment count is equal to the number of current offers plus one (the offer now being made), it can complete. Completion involves performing all the associated STM actions, and then using a revoke function to remove the offers (which have now chosen this event, say: A) from all the events that they had offered on (e.g. event A, event B, event C):

checkComplete :: (STM (), Event) -> STM Bool checkComplete (act, e) = do offers <- readTVar (offersTV e) if enrollCount e/=length offers+1 then return False else do sequence_ (act : map fst offers) mapM_ (revoke.snd) offers return True revoke :: Offer -> STM () revoke off = mapM_ (modifyTVar' removeUs.offersTV) (offerEvents off) where removeUs = filter ((/=off).snd)

We only require one further function. This function, offerAll, handles the creation of an offer, checks if any of the events in the offer can complete immediately, and otherwise records the offers in the event then waits for one of them to be completed by a later participants. It must use two transactions for this; one to make the offers (this transaction needs to finish for it to be visible to the other participants) and one to wait for an event to be completed. A crucial part of the function is not just knowing *that* an offer completed, but also knowing *which one*. For this we construct a TVar of our own into which a result can be written. This starts off as Nothing, and we later wait for it to become a Just value. We augment the user-supplied action-on-completion to also write into this TVar. The design of the algorithm as a whole ensures that this variable will only ever be written to once. Here is offerAll:

offerAll :: [(STM (), Event, a)] -> IO a offerAll off = do tid <- myThreadId rtv <- atomically$checkAll tid atomically$readTVar rtv>>=maybe retry return where checkAll tid = do rtv <- newTVar Nothing let offer = [(act>>writeTVar rtv (Just x), e) | (act, e, x) <- off] complete <- anyM (map checkComplete offer) unless complete$mapM_ (recordOffer (Offer tid [e | (_, e, _) <- off])) offer return rtv

This is all that is needed for events with choice at both ends. You call offerAll with a list of offers and it gives you back the value you associated with that offer.

#### The Public API

To wrap this into a communication channel with a CML-like API, we wrap it up as follows. First we declare an `SEvent` type (named after CML, hence the re-use of the term event for another meaning) that represents a synchronisation action; this is a list (of choices), each containing an internal event, an action to perform during the completion of the offer, and one to perform afterwards that will yield a return value (which we can use for a `Functor` instance):

data SEvent a = SEvent { sEvent :: [((STM (), STM a), Event)] } instance Functor SEvent where fmap f (SEvent es) = SEvent [((dur, fmap f aft), e) | ((dur, aft), e) <- es] choose :: [SEvent a] -> SEvent a choose = SEvent.concatMap sEvent

You can see that the `choose` function simply joins lists of choices together. We define our synchronisation function using `offerAll`, which will return the corresponding afterwards-STM-action for the chosen event, which we then execute using atomically:

sync :: SEvent a -> IO a sync (SEvent es) = offerAll [(dur,e,aft) | ((dur,aft),e) <- es]>>=atomically

Finally we can define a type for a synchronous communication channel, `SChannel` that joins together an event (the internal kind) and a transactional variable for passing the value:

data SChannel a = SChannel Event (TVar a) send :: SChannel a -> a -> SEvent () send (SChannel e ctv) x = SEvent [((writeTVar ctv x, return ()), e)] recv :: SChannel a -> SEvent a recv (SChannel e ctv) = SEvent [((return (), readTVar ctv), e)]

The send function puts the value to send into the variable during the original event completion, and then afterwards the reader takes the value out of the variable at its leisure. (The implementation assumes that the same participants will use the channel each time; an extra level of indirection could be added to make the implementation more flexible in this regard.)

The code in this post provides nearly the same functionality as the CML library, but my tests indicate it is faster. **I have now uploaded this code (with some documentation) as the sync package on Hackage.** This provides a useful “lite” alternative to CHP that runs in the IO monad, and an alternative implementation of most of the features of the CML package.

## Optimising CHP

Recently I have been optimising the core algorithms in CHP. Barrier synchronisations and channel communications in CHP use the same “event” algorithm that supports all of CHP’s features, including choice between events and conjunctions of events. Conjunction really makes the choice algorithms difficult. Here’s a quick example to illustrate. Some terminology: each process wanting to synchronise is “offering” a set of offers — each offer-set is a choice (disjunction) of offers, only one of which should be completed. Each offer is a conjunction of a several events (if you’re not using conjunction, this will be a single event). Here is a test (written in a testing EDSL) from CHP:

testD "Links 3"$do [a, b, c, d, e] <- replicateM 5$evt 2 p <- offer [b] q <- offer [b&c&d&e] r <- offer [c&a, d&a] s <- offer [e] t <- offer [a] always$??

So each of a–e is a two-party event, and there five processes making offers (each list is an offer-set, each list item is a different offer, each offer is a single event or conjunction of them). What should be the result? The event b can complete if p’s single offer and q’s single offer are taken, but for that to work, we’ll also need to be able to complete c, d, and e, in that case. If we look at process r, c can complete if a does (so we’ll need process t), but then d can’t complete as well, because only one of r’s offers may complete.

So the answer is that those offer-sets can’t complete yet, and all processes will have to wait. We want to be able to figure that out quickly, especially in the most common-case where each process is offering just one offer, with just one event (i.e. they want to perform a single channel communication), and the other common case where a process is offering two or three single-event offers. So we want speed — but we also need correctness. Having a bug in such a core algorithm would make CHP very unstable. So far it seems that CHP has been correct in its algorithm, which is why I have been hesistant to optimise it before now. Ideally I would like to prove the algorithm correct but, for now, tests will have to do.

I knew from a recent exchange with a user of the library that CHP was horribly slow when offering choices between appreciable numbers of offers. Ten offers by one process was slow, and twenty would virtually halt the program, which wasn’t right at all. So I set up a simple set of micro-benchmarks; a single channel communication, and others with one “stressed” process offering to communicate with 2, 5, or 10 other processes (who were all eagerly waiting to communicate to the stressed process, and nothing else). None of these things should be slow. I developed my recently-released Progression library to help me see what difference the changes would make.

#### The Optimisation

The old algorithm would look through all the connected offer-sets available (reading them from all the STM TVars) and then see if it could find a completion. I noticed that the offer-sets were being searched in an arbitrary order, rather than by the most likely to complete; a judicious sort at the right point provided surprising speed-up:

On the one hand, I’m pleased to have made such a gain so easily, and on the other I’m embarrassed that such a gain was so readily achievable, and hadn’t been done before. But I wasn’t done yet. The algorithm was needlessly expensive, and this resulted from the way I had decomposed the problem. I had one function that read in the value of all the offer-sets (in the STM monad), one that trimmed these offer-sets to iteratively remove offers that couldn’t complete (a pure function), one that searched for remaining possible completions (also pure) and one that wrote the results to the appropriate places (again, in the STM monad).

This approach previously seemed to break the problem down nicely (and separated the effectful parts from the non-effectful parts), but it’s horribly slow if you take a step back. All offer-sets are read in, even if the first is ready to complete. So you may touch far more TVars than you needed to, then spend time pruning the offers when the first was obviously ready to complete. So I rewrote the algorithm; I now have two (mutually recursive) functions in the STM monad that combine the reading in of offer-sets and the searching (without explicit iterative pruning). This yielded great benefits over the previous algorithm for larger choices; here is the difference between the three versions so far on a *logarithmic scale*:

(As a side note: what horrific time-complexity must my previous algorithm have had to display this curve on a log-scale?!) To my surprise, the details of the algorithm became shorter after I rewrote it to be faster. This was perhaps because I created a new monad to write my algorithm in (a caching, backtracking search monad) which pulled out some of the common logic. With a few extra tweaks, I was able to reduce the speed even further. Here is the final optimisation I settled on, compare to that initial rewrite (back on a linear scale):

See how the performance is now roughly the same for each simpleChoice benchmark, no matter how many offers are being chosen between (because typically in this benchmark, the first one is always ready). I remain a little uneasy about releasing these changes. I have tested the algorithm a lot, but it’s still a rewrite of the core of CHP. I’ll release this soon as CHP 2.0.1 (there is no change in the API, hence the minor version increment). It should be generally faster, but please let me know if you have any odd behaviour that may indicate a bug. (I also plan to do some more optimisation in the future, but this batch will do for now.)

#### Optimisation Techniques

I don’t want to go into all the gory details of what I did to optimise the library, as that would involve explaining the full algorithm — which be a much longer post than I have time for. The changes that brought the most gain were algorithmic. Inserting the sort to change the search order made a big difference, and rewriting the algorithm to exit the search as soon as possible (and touching as few TVars as possible) made most of the remaining difference. After that, I was making changes such as adding a little caching of certain calculations to make sure that events and offers were never unnecessarily processed twice (by using a StateT monad on top of STM). My monad is a backtracking monad (well, sort of — I use an Applicative Alternative instance to explicitly denote searching of several possible paths, with something a lot like a MaybeT monad) but the caching is preserved across the backtracking as you would hope. Another change I made (but may revisit) was switching from using Map and Set from the standard libraries to using ordered lists for the same purpose. In my common cases these maps and sets will be 1-3 items, which is why I was able to see some speed-up from a simpler structure. I originally put these ordered lists in a newtype, but it turned out that removing the newtype wrapper also gained me some speed-up.

It has not escaped my notice that supporting conjunction in CHP’s events algorithm is really an instance of the “other” CSP: a Constraint Satisfaction Problem. But it does have some interesting performance characteristics (e.g. reading from less events can help a lot) and some useful aspects that I can optimise on (e.g. if one process was offering event “e” anywhere in its offers, and you choose one without “e”, “e” can no longer complete; this is not as obvious as it sounds), which is why I have my own algorithm for resolving the choices.