Archive for the ‘Guides’ Category

The octopus, the boids and GHC 6.12.1rc1

October 12, 2009 5 comments

I had been eagerly awaiting the release candidate of the latest GHC (6.12). Last night, the GHC team delivered. Straight away, I downloaded and installed it on octopus, the 8-core machine I use for benchmarking. In the last part of the boids guide I did some benchmarking on octopus to see what speed-up I got with the boids. This was the result (click on image for full-size, wordpress’s thumbnails are quite poor):


Since then, I have optimised the CHP library, and installed GHC 6.12.1rc1. The result is a better graph, that takes less time sequentially, and has a better speed-up profile (including taking out that nasty upswing at the end of the first graph — click for full-size):


Edit: the graph now includes the GHC 6.10 time (including CHP optimisations) and GHC 6.12.1-rc1 time (same CHP optimisations), so you can get an idea of the differences. Apologies for not overlaying it on one graph, but I have mislaid the source figures for the original graph. I should benchmark all this properly and put the error bars on, etc.

I’m pleased with the effect that the optimisations and the GHC upgrade have had. The thing I really want to play with next is Satnam Singh’s ThreadScope, which visualises the thread usage with GHC 6.12, and will hopefully allow me to play with profiling-directed optimisation of CHP programs.

Categories: Boids, Guides

Boids Simulation: Part 5

September 16, 2009 1 comment


After part 1, part 2, part 3 and part 4 of this guide, we now have fully functional boids. In this part of the guide I take the first look at performance.

Our boids are functionally complete. And they’re concurrent — we have a thread per boid, one for the cell and one for the display. But will this translate into a parallel speedup? To find out, I needed more than the netbook on which I’m writing this post. Thanks to Fred Barnes, I have a login on an eight-core x86 machine (“Intel(R) Xeon(R) E5310 @ 1.60GHz”) called octopus:


So I darcs-pushed my code from part 4 across to octopus. I stripped out the display aspect and changed the program to run 2000 frames before exiting, forcing the evaluation of the boids’ position each frame. That way I should be able to time how long the program takes to run on different numbers of cores. The appropriate GHC option is “+RTS -Nx”, where x is 1 through 8 for the number of cores we want to use. I then graphed the results — the red line represents perfect speed-up (based on the one-core version divided by the number of cores), the blue line is the actual time. Here’s my first result:


That huge spike on the right is obviously some sort of oddity in the run-time. I filed ticket #3518, but for now setting the heap with “-H400M” clears it up:


Anyway, the overall result of the graphs is bad. Our program is taking longer (in wall-clock time) the more cores we add. Rather than parallel speed-up, we have parallel slow-down. So what can we do about it?

Given that we are in Haskell, it is instructive to think about where and when the values in our program are actually being evaluated. Our channels are not strict, so values sent down channels can be unevaluated thunks.

Our intention in having a process per boid was to enable parallel speed-up. So a good step is to make sure that the boid evaluates its new velocity (rather than leaving it to the cell process). This is quite a simple matter — CHP provides a writeChannelStrict function that is just like writeChannel, but that uses the Control.Parallel.Strategies stuff to force evaluation. So we add some instances to enable that:

instance NFData BoidVel where
  rnf (BoidVel a b) = rnf a >| rnf b

Then in the boid code, we change writeChannel out cur to writeChannelStrict out cur. And that’s all that was needed to add the strictness. Now we can time it again:


That is a little better (if you compare the graphs closely), but we still have parallel slow-down rather than speed-up. But our boids should be able to get parallel speed-up by evaluating their new position in parallel. Let’s consider what is happening with the cell and the boids as a possible cause. Here’s the cell code again:

    cell' pos = do vels <- mapM readChannel inputs
                   let newPos = zipWith updateInfo pos vels
                   writeChannel outputPos newPos
                   zipWithM_ writeChannel outputs (nearby 0.05 newPos)
                   cell' newPos

The cell process reads in the boid velocities sequentially, then sends out the new positions (trivially calculated from the velocities and old positions) to the drawing process, then sends out the neighbour information sequentially (what may happen is that each boid is sent a thunk that will calculate neighbour information — so each boid will calculate neighbour information in parallel rather than the cell doing it and forming a bottleneck — which would be neat!) then recurses. Hmmm — there are a few too many mentions of “sequentially” in that sentence! We missed an opportunity for concurrency, so let’s rectify that:

    cell' pos = do vels <- runParallel $ map readChannel inputs
                   let newPos = zipWith updateInfo pos vels
                   writeChannel outputPos newPos
                   runParallel_ $ zipWith writeChannel outputs (nearby 0.05 newPos)
                   cell' newPos

We just change our uses of mapM and zipWithM into map and zipWith, then pass the results of these (lists of monadic actions) to runParallel. Let’s see if that made a difference:


Those two changes (the strict-send and the parallel communications) have finally delivered some parallel speed-up. It’s not as much as we might wish for, as it seems to tail off around 2.5x speed-up. I hope to investigate this further at some point, but I suspect that the ratio of communication to computation may be part of the problem.

Optimising for parallel performance is hard in any setting, and being in Haskell (which makes it hard enough to optimise for sequential execution) certainly makes life interesting. Perhaps I can wildly generalise this post to throw together some guidelines on optimising:

  1. If your performance numbers are very odd, look to the garbage collector. (Don Stewart also recommends keeping an eye on GC when trying to get parallel speed-up.)
  2. Try to work out where and when your values are actually being evaluated. In general, if the values are used to take a different monadic action, or if they are sent out of the program, they are forced. Otherwise they probably aren’t, and are being sent around as thunks. Find where you want values to be evaluated to get the most speed-up and try changing writeChannel to writeChannelStrict.
  3. Vary the amount of parallelism. In this example, my sequential communications needed to be made parallel. Sometimes the opposite is true. Optimisation is not straightforward (alas).
Categories: Boids, Guides

Boids Simulation: Part 4

September 14, 2009 3 comments


After part 1, part 2 and part 3 of this guide, we now have boids moving around and doing some very basic flocking. In this part of the guide I will implement the full flocking rules, finishing with a video.

Aside: After the previous parts of the guide, I decided that the architecture for the boids was not quite right. Each boid was doing its own clamping (which had a small bug — now fixed) and was dealing with its absolute position. But a boid doesn’t care about its absolute position, only about the relative position of the other boids. So I have refactored the design so that a boid keeps track of its own velocity, and is sent information about its neighbours’ relative positions (and absolute velocities). The cell keeps track of each boid’s absolute position, and is changed to send the boid the appropriate information. That change on its own can be seen with this command:

darcs get --tag p4a chp-boids-p4a

The remainder of this part of the guide implements the full flocking rules, and is more interesting for how the boids work than for the concurrency (more concurrency next time!). I have implemented the rules in pure Haskell for now to try to make them understandable. First, we will define some helper functions:

    (*+*) (a, b) (a', b') = (a + a', b + b')
    (*/) (a, b) c = (a / c, b / c)

    average xs
      | null xs = 0
      | otherwise = sum xs / fromIntegral (length xs)

    sq x = x * x

    within ps d = filter ((< sq d) . hyp2) ps
        hyp2 n = sq (relX n) + sq (relY n)

The *+* operator adds together two pairs, and the */ operator divides both elements of the left-hand pair by the right-hand side. The average function is as expected, and returns zero if the list is empty. sq is a short-hand for squaring something, and within filters down a list of (X,Y) pairs to all those within the given distance of the origin. For relative positions of boids, this will filter a list down to all the boids within a certain range. But enough boring helper functions — here is the first rule, where the boid tries to match its neighbours’ velocities:

    -- An acceleration to allow us to match the mean velocity (i.e. speed and direction)
    -- of the boids nearby
    meanVelocityAcc neighbours cur
      = (average (map (velX . neighbourVel) neighbours) - velX cur
        ,average (map (velY . neighbourVel) neighbours) - velY cur)

All the rules are given in terms of acceleration — i.e. a desired change in our velocity. This rule is very simple, and just says that we should accelerate to match the average of our neighbours’ velocity (by producing that average, and working out the change needed from our current velocity). The second rule tries to make sure that we don’t bump into our neighbours:

    -- An acceleration to stop us hitting nearby boids
    repulsionAcc neighbours
      = foldl (*+*) (0, 0) $ map ((negate . relX) &&& (negate . relY)) veryClose
        veryClose = neighbours `within` 0.02

The rule first filters the neighbours list to a small distance. Then it works out the velocity away from the neighbour (by negating their relation position) and adds up all these velocities to produce a total (not an average). The &&& is an Arrow function — a habit I picked up from Neil Mitchell (I think his HLint suggests it). Here it applies the two functions to a single input and produces a pair of the results. The third rule is to make us stay close to the other nearby boids:

    -- An acceleration to keep us quite close to nearby boids (stick with the flock!)
    keepCloseAcc neighbours
      = (average $ map relX neighbours
        ,average $ map relY neighbours)

Very straightforward — head towards the average centre of the nearby boids. Obviously there will be tension between the second rule (avoid hitting boids) and the third rule (stay near them). You will see boids jostling around a bit in the flock because of this. There is a fourth rule that I have taken from the occoids example — a slowly changing bias angle to make the flocks head roughly in the same direction. This involves some state-passing and thus I have incorporated it with the definition of the boid:

boid :: Chanout BoidVel -> Chanin [BoidNeighbour] -> BoidVel -> CHP ()
boid out input = boid' (0, 0)
    boid' (biasCount, biasAngle) cur
      = do writeChannel out cur
           possibleNeighbours <- readChannel input
           let neighbours = possibleNeighbours `within` 0.05
               -- Every 100 frames, tweak the bias:
               (biasCount', biasAngle')
                 | biasCount == 100 = (0, biasAngle + 0.5)
                 | otherwise = (succ biasCount, biasAngle)
               (idealAccX, idealAccY)
                 = (meanVelocityAcc neighbours cur */ 8 )
                     *+* (repulsionAcc neighbours */ 4)
                     *+* (keepCloseAcc neighbours */ 30)
                     *+* ((cos biasAngle, sin biasAngle) */ 10000)
               new = limit $ BoidVel {velX = velX cur + (idealAccX/5)
                                     ,velY = velY cur + (idealAccY/5)
           boid' (biasCount', biasAngle') new

The definition for idealAccX and idealAccY combines all the rules together. The seemingly magic numbers in the weighting (and throughout these rules) are tweaked parameters that I have taken directly from the occoids source. The limit function crudely limits the boid’s velocity:

    speedLimit = 0.007

    -- Stop the boids from speeding up crazily by limiting their maximum speed:
    limit v@(BoidVel vx vy)
      | sq vx + sq vy > speedLimit * speedLimit
        = let slowdown = (speedLimit * speedLimit) / (sq vx + sq vy)
          in BoidVel (slowdown * vx) (slowdown * vy)
      | otherwise = v

That’s all we need to implement our boids. You can get all this source with the command:

darcs get --tag p4c chp-boids-p4c

If you compile and run the program, you’ll be able to see the boids flocking together from random start positions. Close the window and re-run to see it happen again from different start positions. I quickly recorded a short video of one instance of the flocking:

Choose from WMV format (suitable for Windows) or MPEG format (suitable for Linux, etc).

David A. Wheeler’s sloccount says that we have written 142 lines of Haskell code to implement the boids concurrently, which is not too bad. It works — but you may wonder if it gets faster should you have multiple cores available. I aim to find out in the next part of the guide.

Categories: Boids, Guides

Boids Simulation: Part 3

September 12, 2009 Leave a comment

After setting up some infrastructure in part 1 of the guide, part 2 added a boid process and got it moving. Now we continue by adding some more sophisticated movement:

Our boids are currently moving around with a fixed velocity. The next step towards implementing proper boid movement is to provide a mechanism for the boids to be aware of their neighbours’ positions and velocities.

To do this we will add another process, called a space cell, that will keep track of all the boids’ positions+velocities and relay them to other boids. We will connect each boid to the cell with two channels — one for the boid to send its new position+velocity, and one for the cell to send the boid that same information about its neighbours. To get the new version explained in this post, use the command:

darcs get --tag p3b chp-boids-p3b

The ordering for these messages is fairly straightforward. Each frame of movement (borrowing an animation term) will proceed like this:

  1. The boids send their new position+velocity to the cell.
  2. The cell sends the drawing process the current boid positions+velocities.
  3. The cell sends neighbour positions+velocities to all the boids.
  4. The boids each calculate their new velocity and thus their new position.

This is illustrated below — the arrows are channels (dashed blue when they are being communicated on, black otherwise) connecting up our boids, the cell and the display:


Here is the new cell process:

cell :: [(Chanout [BoidInfo], Chanin BoidInfo)] -> Chanout [BoidInfo] -> CHP ()
cell chanPairs outputPos = forever cell'
    (outputs, inputs) = unzip chanPairs

    cell' = do pos <- mapM readChannel inputs
               writeChannel outputPos pos
               zipWithM_ writeChannel outputs (nearby 0.05 pos)

The first parameter to the cell is the list of channel pairs (one per boid); the second parameter is the outgoing channel to the drawing process. The main body of the process is quite simple. It reads boid information from all the channels, then sends it on to the drawing process. After that, it sends back to the boids all the information about boids near them. The nearby function has this type:

nearby :: Float -> [BoidInfo] -> [[BoidInfo]]

The first parameter is a threshold distance at which to filter boids. The second parameter is the list of boid positions+velocities. The return is a list such that the first entry is the list of all boids near to the first boid in the supplied list (except for the first boid itself), the second item in the return list is the list of all boids near to the second boid in the supplied list (but not the second boid) and so on. Thus we can zip the in-order list of boid channels with the in-order list of boid neighbours in the cell’ process.

At the moment, our boid is not implementing the full range of boid rules — only the rule that makes it adopt the mean velocity of its neighbours. That was done fairly simply:

    boid' cur
      = do writeChannel out cur
           neighbours <- readChannel input
           -- Use average velocity of nearby boids for now:
           let average xs = sum xs / fromIntegral (length xs)
               (vX, vY) | null neighbours = (defaultVelX, defaultVelY)
                        | otherwise = (average $ map velX neighbours
                                      ,average $ map velY neighbours)
           boid' $ BoidInfo (clamp $ posX cur + vX) (clamp $ posY cur + vY)
                            vX vY

The boid first sends out its position, then it reads in the information about its neighbours. The let block contains the pure calculation for adopting the mean velocity of its neighbours, and then the boid recurses.

If you run the current version of the simulation, you will see a coagulating effect occur — as the boids near each other, they slowly form into a static mass that then moves slowly along as one (see the example screenshot on the right). This is not full flocking behaviour, but it at least bears a resemblance! In the next part of the guide we will implement the full set of boid rules.

Categories: Boids, Guides

Boids Simulation: Part 2

September 10, 2009 2 comments


In part 1 of our boids series, I showed how to get the bare minimum up and running — a window that draws some stationary boids. In this second part I will show how to make the boids move.

Boids should choose their movement by reacting to the other boids around them, and following the flocking rules. We are building our simulation in stages, though — so we’ll start by just getting the boids moving with a fixed velocity, away from the centre located at (0.5, 0.5):

boid :: Chanout (Float, Float) -> (Float, Float) -> CHP ()
boid out initPos@(initPosX, initPosY) = boid' initPos
    fixedVelX = 0.01 * (initPosX - 0.5)
    fixedVelY = 0.01 * (initPosY - 0.5)

    boid' pos@(posX, posY)
      = do writeChannel out pos
           boid' (posX + fixedVelX, posY + fixedVelY)

darcs get --tag p2a chp-boids-p2a

We don’t need to touch any of the rest of the code to change the boid’s behaviour. The boid’ function alters its position every time it recurses, by a fixed amount. If you run this program, you will notice a strange effect — or rather a lack of effect. Nothing is moving! The reason for this is that our boid only moves each time it sends its position. It only sends its position when the other end asks for it. The other end is the OpenGL display callback, which only asks for the position when it redraws the screen, and the screen is only redrawn when necessary. Hence, no movement. (If you drag the window so that part of it disappears off the side of the screen and thus needs a redraw, you will get movement.)

Our architecture is actually that of information pull. The display function pulls the position from the boid, and the boid only updates its position once this has happened. We have a form of concurrent laziness in our program. There are several ways we could fix our program to get the regular movement that we intended. For now, we will add an OpenGL timer that redisplays every 50ms, giving us 20 updates per second. (This is not perfect either, as triggering additional redraws will speed up the boids, but we will return to this issue later in the tutorial.) I’ll not bother with the code here (it’s purely GLUT code), but you can see it by issuing:

darcs get --tag p2b chp-boids-p2b

CHP Boids Screenshot

CHP Boids Screenshot

Now if you compile and run our demonstration, you should see the boids gradually moving out from the centre. (In fact, since the boids all move away from the centre, it looks like a classic starfield effect.) The next problem is that the boids fly off the sides of the window; we can make the window wrap round by clamping the boids’ positions to the range (0,1). The code is quite small, but if you run the new version, you can see that with the latest versions we now have boids moving across the screen and wrapping around:

darcs get --tag p2c chp-boids-p2c

Our boids are on the move! That will do for this update; next time we will start to add the infrastructure necessary for the boids to know where their neighbours are and act upon it.

Categories: Boids, Guides

Boids Simulation: Part 1

September 8, 2009 8 comments


In the 1980s an animator named Craig Reynolds was interested in animating groups of animals, such as flocks of birds. He wanted to avoid having to individually determine each bird’s movement, and he suspected that their behaviour could be captured by an algorithm wherein each bird followed simple rules relating to its surroundings. His work showed that he was right, and he produced a now famous simulation called boids that demonstrated flocking behaviour based on three simple principles: stay close to other boids, head in the same direction as them, but avoid hitting them.



Boids is famous for several reasons: it emulates animal group behaviour quite simply, it demonstrates emergent behaviour, and so on. I’m going to use it as an example application for two reasons: it looks good, and it’s a nice, fairly small example of a physical simulation. So this is the first in a series of posts on implementing boids using the CHP library, based on an existing example in another message-passing concurrency language: occam. That version, occoids, can be seen in videos from the CoSMoS project.

We will start with the framework for visualisation, then build up the simulation. All the code in this guide is held in a chp-boids repository on patch-tag — each sample will have an accompanying darcs command to get the code associated with that point in the guide. Oh, and you will need the latest version of CHP to work with these examples (1.3.0, currently), which you can get by issuing “cabal update ; cabal install chp”. So without further ado, here’s the first piece of code for boids in CHP:

mainCHP :: CHP ()
mainCHP = do
  boidStartPos <- liftIO $ liftM2 zip (replicateM 150 randomIO) (replicateM 150 randomIO)

  let displayIO = do startFrame
                     mapM_ drawBoid boidStartPos

  liftIO $ do setup
              displayCallback $= glRunAs2D displayIO

main :: IO ()
main = runCHP_ mainCHP

darcs get --tag p1b chp-boids-p1b

I’ve omitted some irrelevant bits that can be seen in the full code. displayIO is the callback for the display — currently it’s starting and ending a frame, and drawing some stationary boids. You can see the general structure of a CHP program — the top-level command uses runCHP_ to run the main function (in the CHP monad). Currently the body of mainCHP just uses liftIO to do some OpenGL/GLUT bits, but that will change in due course. The program currently runs and displays some stationary boids until you close it. So far, so good — now it’s time to add some concurrency.

We are going to turn each of our boids into a concurrent process (recall that we aim to add as much concurrency as we can find). For now, our boid processes will simply continually offer to send their unchanging position on a channel. So here’s our starting boid:

boid :: Chanout (Float, Float) -> (Float, Float) -> CHP ()
boid out pos = forever $ writeChannel out pos

Our boid is fairly self-explanatory — it just repeatedly writes its position on the channel (we will come to the reading shortly). It is important to understand that the boid will not sit there filling up some internal buffer until it overflows:

CHP channels are synchronous, which means the writer only sends the data when the reader is ready to recieve it. There is no buffering in the channels.

So although our boid is continually offering to send its position on the channel, the communication will only occur when the reader asks for it. Now we can look at how the rest of our code has changed to use this boid process:

mainCHP :: CHP ()
mainCHP = do
  boidStartPos <- liftIO $ liftM2 zip (replicateM 150 randomIO) (replicateM 150 randomIO)
  -- Allocate channels for boids:
  boidChannels <- replicateM (length boidStartPos) oneToOneChannel

  let (*$*) = zipWith ($) -- apply a list of functions to a list of arguments
      boids = repeat boid *$* writers boidChannels *$* boidStartPos

  displayIO <- embedCHP_ $
                  do boidPos <- mapM readChannel (readers boidChannels)
                     liftIO $ do startFrame
                                 mapM_ drawBoid boidPos

  runParallel_ boids
    <||> liftIO (do setup
                    displayCallback $= glRunAs2D displayIO
  return ()

darcs get --tag p1c chp-boids-p1c

Early on, we now allocate a list of channels that the boids will use to send their position. There is then a little bit of Haskell that forms the boid processes from the list of channels and list of start positions. The displayIO callback has to be in the IO monad rather than the CHP monad so we use a helper function supplied by the CHP library (embedCHP) that pushes a CHP process back down into the IO monad. The displayIO callback can then use CHP functions — here it uses readChannel to collect positions from all the boids.

Finally, the boids are run in parallel with each other (using the runParallel_ function) and also in parallel (this time via the <||> operator) with the main loop of the program. What this gives us is a concurrent program with 151 processes: 150 boids, and one main loop.

The wiring and interface with OpenGL/GLUT is likely to be one of the ugliest parts of our simulation, so it should get better from here. In the next part of the guide I will show how to get the boids moving.

Categories: Boids, Guides

Get every new post delivered to your Inbox.