A little journey exploring job queue models and debunking some programming folklore around the effects of batching on latency.
Last week at work, as part of an ongoing effort to constantly improve performances and scalability of our platform, I started investigating one of our oldest pieces of software infrastructure, i.e. the job queue framework we use for work distribution in our main service tasks.
The basic structure is more or less this:
i.e. a Manager process that controls workers (their status and cardinality), fetches data in large chunks from upstream and feeds worker processes waiting for work with a smaller chunk of data each. A shared memory segment is used as internal buffer. Each worker can process data in parallel before sending results downstream and becoming free to accept more work again.
After briefly discussing this with a colleague (did I ever mention we work with really smart people?), we agreed that a much more efficient approach would be to let the workers connect to the data provider directly, leaving the manager with the task of controlling status and cardinality of the workers, but stripping it of the responsibility of work distribution. As it often happens when optimising software, the new design was simpler (fewer moving parts):
To me and a few other colleagues, the potential improvements were fairly obvious, but I was surprised to hear some strong arguments against this approach from other experienced programmers. The counter-arguments sounded solid, but weren't substantiated by actual numbers, highlighting how we often believe (and are willing to defend!) things without verifying our assumptions.
So let me debunk these assertions :-)
Accessing memory is indeed usually fast, but how memory is accessed is important too, and in this specific case the argument doesn't hold for several reasons:
"Now that many workers have direct and concurrent access to the source, to avoid hitting it too often, each worker must request a larger batch, but that means introducing a higher latency - it's better to increase the number of workers and let them process fewer items each in parallel".
This is a sensible objection, and yet why can it be wrong under many circumstances? Two reasons:
Applying Little's Law to an example, if you need to send 10 messages to a device with 100µs latency, a serial delivery would take 1ms, whilst, assuming it ships one item first, and then the other 9 in another batch, batching strategy #2 would have an average latency of at most 190µs (although on average 5 items will be available for each delivery, resulting in an average latency of only 150µs). Batching can decrease latency.
I do however concede that the original statement might be true if a) the task is NOT CPU-bound and b) after fetching a batch, the items are processed serially within a single worker, when another worker might be have been idle and able to process some items (on an idle CPU core).
The main idea behind the original job queue was to minimise the number of requests to the data provider, by fetching data in large chunks, and have each worker consume a smaller batch each, from a fast memory buffer. Now, ignoring the fact that modern data sources can happily sustain 100K+ connections per second (making the need for one single collector go away), this remark is based on the assumption that the internal queue is always half full, and the Manager process is actually reading large chunks all the time.
Reality is, queues are on average either completely full or empty. In fact, it's very difficult to have a perfect ideal balance between production and consumption rates. You'd rather hope consumption always outpaces production (or you step into the nasty business of unbounded queues).
So assuming we can always keep up with the producer, once the worker is done, it will either find very few new items available in the queue, or none at all. The Manager at this point would no longer fetch large chunks from upstream, but batches similar to the ones processed by each worker, effectively losing the economy of scale and becoming a pointless intermediary.
With each worker having direct access to the source, there's still the problem of stampedes when many of them are free; you can avoid trashing CPU and the producer by having blocking reads (preferred) or, lacking the first option, by backing off for a few ms when there's no work to do. In this situation, the number of requests to the producer shouldn't be very different from the one-reader scenario we started with.
There are other notes regarding queue theory (push vs pull) I'd like to add, but I'll leave them for another blog post.
To make a long story short, I spent a couple of hours simplifying the original job queue for a few tasks, and implementing the second strategy, with results that speak for themselves: much better use of all CPU cores, decreased load, decreased latency and, best of all, given it's no longer artificially throttled, a single node can now take 5 times more traffic, with fewer workers:
I wanted to share this story to explain why having a good understanding of CPU and memory models and a little queue theory is important, and to show an example of the sort of challenges we face daily at DataSift (if this is what motivates you and want to work with us, ping me).
The juxtaposition of two platform alerts just blew my mind, We measure platform traffic in tens of gigabits and our latency in µseconds.— Gareth Llewellyn (@NetworkString) November 9, 2012
My take-away lessons: be careful with assumptions (at least, re-evaluate them often!), measure everything, avoid complexity.