My team is currently working on a project where we use messaging to perform business functions in reaction to operations in an existing system. The specific operation giving us some concern is the valuation of a client's investment policy or entire portfolio. On the face of it this seems simple enough, and the mechanics are fairly trivial. However, these valuations can happen randomly, when advisers or clients choose to get the latest value, or in large batches, when product providers send us the latest values for all the policies relating to many thousands of clients. So we could end up with anything from 10k to 150k messages landing in the queue at any one time.
To test this we dumped 20k messages into a queue, and using one NSB endpoint (1 thread, MSMQ transport) we saw around 300-350 messages a second, completing in around a minute on a 4-core dev box with a mid-market SSD. This wasn't too bad in itself, but the receiving endpoint was doing no actual work at this point, so throughput was clearly going to get worse once it had something real to do.
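As a quick sanity check on those numbers, the observed rate and the observed duration line up:

```python
# Sanity check: 20k messages at the observed 300-350 msg/s should drain
# the queue in roughly a minute, which matches what we saw.
messages = 20_000
for rate in (300, 350):
    seconds = messages / rate
    print(f"{rate} msg/s -> {seconds:.0f} s ({seconds / 60:.1f} min)")
```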
So we wanted to see what the scale profile would look like if we threw more threads and machines at it. Increasing the thread count helped, but not massively. We then added the distributor into the mix to see how much more we could squeeze out of the setup. This would give us an idea of how we could scale the system in production, and charting the gains in throughput against the number of machines and threads would help us visualize what we were seeing.
We added the distributor and 4 worker nodes and reran the 20k message scenario. Strangely, throughput dropped to around 30-40 messages per second. Not promising! We also noticed heavy disk IO, as you would expect from MSMQ, so we moved the workers onto other machines to relieve the disk contention and free up CPU, hoping to get a better understanding of where the bottleneck was. Again, strangely, we saw no marked improvement over the previous run. We added more and more workers with no real effect. Watching the MSMQ performance counters we could clearly see that it was the distributor holding things up: it couldn't seem to process more than 30-40 messages a second. The distributor was acting as a bottleneck rather than a load balancer. Something was wrong!
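To put that drop in perspective, here is what 30-40 msg/s means for our test batch and for the worst-case production load mentioned earlier:

```python
# At the distributor's observed 30-40 msg/s, the 20k test batch (and the
# 150k worst case from production) would take far longer than the
# single-endpoint run ever did.
for backlog in (20_000, 150_000):
    for rate in (30, 40):
        minutes = backlog / rate / 60
        print(f"{backlog:>7} msgs at {rate} msg/s -> ~{minutes:.0f} min")
```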
After reading many articles on MSMQ performance (Ayende blogged about this here and here a while ago) it dawned on me that we were never going to get anywhere near where I wanted to be using MSMQ as a transport. It seems to come down to the cost of MSMQ transactions, which is unfortunate.
Some people, however, were reporting ridiculous throughput in the many thousands of messages a second, but where details were given the hardware used was immense and costly. I would love to know how to achieve better throughput using MSMQ without breaking the bank. Does anyone have any ideas?
We eventually moved over to SqlServer Service Broker as the transport (from NServiceBus Contrib) for this queue and achieved much better throughput (around 500-700 messages per second) than with MSMQ and the distributor. We then modified the SSB transport to use conversation groups and to batch messages at 1k, which is currently giving us many thousands of messages a second and looks promising. We are still experimenting with this to see where it takes us, so I will leave the details for a later post.
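A toy model shows why batching moves us from tens to thousands of messages a second: each transaction carries a fixed commit overhead, and receiving many messages per commit amortizes it. The cost figures below are illustrative assumptions, not measurements from our transport:

```python
# Toy model of batching: a fixed per-transaction overhead plus a small
# marginal cost per message. Both numbers are assumed for illustration.
commit_overhead_s = 0.025   # fixed cost per transaction (assumption)
per_message_s = 0.0002      # marginal cost per message (assumption)

def throughput(batch_size: int) -> float:
    """Effective messages/second when receiving batch_size msgs per commit."""
    batch_time = commit_overhead_s + batch_size * per_message_s
    return batch_size / batch_time

for batch in (1, 10, 100, 1000):
    print(f"batch={batch:>4}: ~{throughput(batch):,.0f} msg/s")
```

With these made-up costs, a batch size of 1 lands in the tens of messages per second while a batch of 1k lands in the thousands, which is the same order-of-magnitude jump we observed.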
Does anyone have any tips or experiences with NSB, MSMQ, or other transports they could share? I'm really interested in how other people have dealt with, or are dealing with, message throughput in their systems.