I've been working with one of our biggest customers and they have a high quality problem in that under extreme load, the grid (WebSphere eXtreme Scale) is holding up just fine. but there is a problem. The issue is the database the grid runs on top of. They are using write behind to asynchronously write changes made to the grid to the database. This basically will only write changes to the database every 5 minutes or 1000 entries depending on what happens first.
The issue is even with this batching occurring, the database cannot keep up with the write load. They could buy a bigger database but all they really use their database for is a disk based log for whats in the grid. They load the grid from the database and then all applications use the grid as the system of record for that data. The updates made to the data in the grid and written back to the database asynchronously.
If the database cannot keep up even with the batching then it starts to fall behind and more importantly, the buffered changes not yet flushed from the grid also build up in memory.
Buying a bigger database box is also kind of pointless as the whole point of the grid was to avoid having to buy a large SMP box to run the database. Scaling up is expensive and database boxes with their SANs that can keep up with extreme loads are not really what we want to purchase. We want the grid to act as a shock absorber so that a smaller database can be used. But, this is proving a problem at this customer just using normal write behind techniques.
The real problem here is databases are not designed to be used like this. WebSphere eXtreme Scale is basically using a SQL database as a write only disk store and frankly, databases are not very good at this.
So, what is designed as a write only disk based store? Believe it or not, a messaging system is. JMS queues are designed to implement two operations. PUT to the queue and GET from the queue and they can do this very efficiently. More efficiently than a database. This is all the grid really needs here. Flushing the writes from memory to disk through a database doesn't make a lot of sense.
The idea here is to put the JMS queue between the grid and the database. This way, the grid can flush the changes to 'disk' faster than before and then we can apply the changes from the queue at our leisure to the database. This decouples the database and the grid. The grid can now deal with flash loads or tsunamis of load when they occur and flush the changes to disk using the messaging system. The database can apply those changes at it's own slower rate because the JMS queue is buffering the changes in a durable way to disk. Thus, we don't need to spend a lot of money on an expensive vertical or horizontal scaling DBMS.
This is very like a seat belt in a car. We're trying to absorb a shock by stretching it out over a longer period of time. The JMS queue allows this to happen. While the flash load is happening, the queue is able to keep up with the grid and everything stays stable. The database is emptying the queue as fast as it can which is slower than the grid. Once the flash load is gone then the grid goes back to normal load while the DBMS can keep applying changes for another couple of hours if need be.
The implementation of this could be a Loader that writes the changes to the JMS queue. A JMS listener can then pick up the changes are apply them to the database. This is basically like the write behind built in to WXS but implemented instead with an external JMS queue.