Posted at 09:15 PM | Permalink | Comments (0)
|
I'd tried to use a distributed grid with Lucene before. I was at a customer with the competitors doing a face off type engagement. We (the competitors and myself) both implemented Directory implementations for Lucene on top of our respective caching products. The implementation I wrote is on github here. The customer had a very large lucene index in memory, abouit 30GB and they were seeing very large GC pauses (minutes) using the RAMDirectory supplied with Lucene. The plan was to use a grid based directory to move the indexes in to a shared remote grid.
This proved much slower than the RAMDirectory the customer was already using. Lucene was too 'chatty' with the Directory and this means lots of RPCs to the grid. The Directory abstraction is too low level. We tried solutions like near caches but depending on the query, the near cache could need to be very big thus making the solution not viable. This ultimately meant that the customer didn't go forward with either grid as a directory store.
However, I was reading about Solandra recently and they have what looks like a better approach. The problem we had was that just implementing a Directory is too low-level. There ends up being too many RPCs between the Lucene engine and the grid. Solandra implements a Cassandra store underneath Lucene. They did not implement a Cassandra directory which would have had the same issues as I just discussed. Instead, they implemented a higher level plugin. They moved up to implement IndexReader and IndexWriter and this allows them to push some of the search down in to Cassandra's native index/query support rather than just use it as a block store. This seems to reduce the number of RPCs between the store and the lucene engine considerably.
This looks like a much more performant approach and one that should work well with a data grid also. If I get time then I may try to implement this with a datagrid and post results.
Posted at 09:18 AM in Apache Lucene, WebSphere eXtreme Scale | Permalink | Comments (2)
Technorati Tags: coherence, datagrid, extreme scale, lucene, RAMDirectory, solandra
|
I've been working with one of our biggest customers and they have a high quality problem in that under extreme load, the grid (WebSphere eXtreme Scale) is holding up just fine. but there is a problem. The issue is the database the grid runs on top of. They are using write behind to asynchronously write changes made to the grid to the database. This basically will only write changes to the database every 5 minutes or 1000 entries depending on what happens first.
The issue is even with this batching occurring, the database cannot keep up with the write load. They could buy a bigger database but all they really use their database for is a disk based log for whats in the grid. They load the grid from the database and then all applications use the grid as the system of record for that data. The updates made to the data in the grid and written back to the database asynchronously.
If the database cannot keep up even with the batching then it starts to fall behind and more importantly, the buffered changes not yet flushed from the grid also build up in memory.
Buying a bigger database box is also kind of pointless as the whole point of the grid was to avoid having to buy a large SMP box to run the database. Scaling up is expensive and database boxes with their SANs that can keep up with extreme loads are not really what we want to purchase. We want the grid to act as a shock absorber so that a smaller database can be used. But, this is proving a problem at this customer just using normal write behind techniques.
The real problem here is databases are not designed to be used like this. WebSphere eXtreme Scale is basically using a SQL database as a write only disk store and frankly, databases are not very good at this.
So, what is designed as a write only disk based store? Believe it or not, a messaging system is. JMS queues are designed to implement two operations. PUT to the queue and GET from the queue and they can do this very efficiently. More efficiently than a database. This is all the grid really needs here. Flushing the writes from memory to disk through a database doesn't make a lot of sense.
The idea here is to put the JMS queue between the grid and the database. This way, the grid can flush the changes to 'disk' faster than before and then we can apply the changes from the queue at our leisure to the database. This decouples the database and the grid. The grid can now deal with flash loads or tsunamis of load when they occur and flush the changes to disk using the messaging system. The database can apply those changes at it's own slower rate because the JMS queue is buffering the changes in a durable way to disk. Thus, we don't need to spend a lot of money on an expensive vertical or horizontal scaling DBMS.
This is very like a seat belt in a car. We're trying to absorb a shock by stretching it out over a longer period of time. The JMS queue allows this to happen. While the flash load is happening, the queue is able to keep up with the grid and everything stays stable. The database is emptying the queue as fast as it can which is slower than the grid. Once the flash load is gone then the grid goes back to normal load while the DBMS can keep applying changes for another couple of hours if need be.
The implementation of this could be a Loader that writes the changes to the JMS queue. A JMS listener can then pick up the changes are apply them to the database. This is basically like the write behind built in to WXS but implemented instead with an external JMS queue.
Posted at 09:47 PM in High Availability, Persistence, WebSphere eXtreme Scale, XTP | Permalink | Comments (4)
|
I'm maintaining a live google document with the best that we're seeing. You can see the document at this link http://bit.ly/wxsbestpractice.
Posted at 01:22 PM in WebSphere eXtreme Scale, WebSphere Performance, XTP | Permalink | Comments (1)
|
Posted at 11:56 AM | Permalink | Comments (0)
Technorati Tags: 3g, 4g, att, performance, tether, verizon
|
We just found a customer issue and it's worth talking about. The customer was preloading data into WebSphere eXtreme Scale. They were multi-threading the preload to speed it up. They were using a default fixed size ExecutorService as the pool. They were fetching records from the database and then doing a submit call on the pool to hit the grid.
They were running out of memory on the preload client. The reason is as follows. The thread pool size was fixed. Once those threads are busy then queueing additional items just puts them in the queue. The queue was getting longer and as a result holding more items. Items in a preload case might be 10k records to push in to the grid. Items are big, they can take a lot of space. A lot of space if you are able to grab stuff from the database very quickly.
This means if the thread pool is undersized then the pool will queue up lots of items and when the items are big, your preload client can run out of memory pretty easily. There are a couple of things to do here. But, the first is to fix that thread pool so that this can't happen. If you initialize the pool like this then you'll get a better behavior:
// this means that once there are 3x numThreads jobs queued waiting for
// a thread then it will start running jobs on the submitting thread.
LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>(numThreads * 3);
// Once the queue reports that it is full, the CallerRunPolicy will run job
// on the submitter thread.
ExecutorService p = new ThreadPoolExecutor(numThreads, numThreads, 2L,
TimeUnit.MINUTES, queue,
new ThreadPoolExecutor.CallerRunsPolicy());
This shows us creating a LinkedBlockingQueue with a specific maximum capacity. I used 3 times the thread pool size but you need to figure out what works best. The major change here is the CallerRunsPolicy parameter. Now, that the thread pool uses a fixed size queue then if the queue fills then normally the pool will throw exceptions when this happens. We'd have N threads + M items on the queue at that point. CallerRunsPolicy means that instead of throwing an exception, the job is instead executed on the submitter thread. This will prevent the memory runaway that was happening in the case I described earlier and will throttle the preload operation itself rather than let it go crazy.
The other things to watch here would be to use larger batch sizes. If you are using WXSUtils.putAll_noLoader to bulk load the items in to the grid then make sure you have around 1000 key/value pairs PER partition. This is a good rule of thumb to start from when tuning things. If you have 200 partitions and a batch size for PutAll of 1000 items then thats only 5 per partition which isn't enough to really start to get the benefits of batching the puts. A number per partition in the hundreds is probably better.
If you have a grid with 200 partitions then a thread pool size of 32 isn't great either. You should at least have the same number of threads otherwise you'd overrun the pool instantly when doing any bulk operation. The default pool size for WXSUtils is indeed 32 and is too low for production. You can specify the size of the pool using a constructor parameter or by editing wxsutils.properties if thats how you are connecting to the grid.
I've already updated wxsutils to configure the threadpool as described above given recent experience. I have not changed the default thread pool size from 32 but thats something I'm considering also. Maybe, I'll default it to one or two times the number of partitions in the grid but bounded by some 'reasonable' number.
Anyway, it's worth pointing out the problem and how it's resolved just for every bodies education. An unbounded thread pool happens not just because of a growable thread pool for example but they can be memory unbounded pretty easily as in this case when the pool cannot service the pool queue fast enough.
Posted at 09:34 AM in WebSphere eXtreme Scale, WebSphere Performance, XTP | Permalink | Comments (2)
|
I'm seeing this a lot these days. A customer is using IBM WebSphere eXtreme Scale in front of a database. The DBAs are concerned about how many JDBC connections that the grid is requiring. If it's a larger grid then the problem is especially acute. For example, lets say we have a grid with 200 JVMs in it. All of the grids that I have sized recently are in this range. Lets look at different aspects of this problem
1) WXS as side cache
Here, there are no connections from the grid to the database. The application server JVMs do all the database interaction and hence, the grid adds no additional demands on the DBMS.
2) Preloading WXS with data from a DBMS
Here, preloading it can be interesting. The main issue is usually the time to load the grid with the DBMS data. We usually use parallelism when we do this but that means multiple JDBC connections per preload client. The main issue I see with DBMS preload is simply query complexity slowing down the preload process. Simplifying the query and moving join logic in to the grid side preload process seems to be the key to solve this problem. Once you can achieve a high preload speed per single preload thread then you don't need so many JDBC connections to achieve a specific preload speed.
3) Write through WXS to a DBMS
This is probably the worst offender here. Typically, a WXS Loader consists of a TransactionCallback (like this JDBXTxCallback example) and a Loader. The TransactionCallback makes sure a JDBC connection pool is available for the Loader and the TransactionCallback manages JDBC.begin and JDBC.commit and rollback operations. The Loaders share the connection provided by the TransactionCallback for a specific transaction.
Clearly, preloading the grid will enormously reduce the usage of JDBC connections for grid read operations. However, write operations will still consume one JDBC connection per concurrent WXS transaction. Given, the potential throughput performance of such a grid, clearly, the load is usually too much for most databases. A grid server with 20 JVMs on a modern blade can still achieve 20k transactions a second per blade. Lets start with a grid with a 100% write rate for some peaks. Thats 1000 write transactions per JVM. If each write transaction costs 40ms (remember, the DBMS is directly involved with writes here, this isn't the grid, it's the DBMS) then we could do 25 transactions per thread per JVM (1000 / 40ms). Now, how many threads is that? We'd need 40 (1000 / 25 ops per thread) threads per JVM to achieve this amount of throughput in a single JVM. So, the JVM needs a JDBC connection pool of 40 connections (one per thread) too.
If there are 200 JVMs then we need 8000 (200 x 40) JDBC connections for the grid! Now, we can see why the DBAs may be concerned. A write through grid thats write intensive is a great idea for a grid. But, only with write behind. Write through makes NO SENSE AT ALL for this situation. The DBMS will remain the bottleneck and the customer will likely question why use a grid at all if such a solution with write through was proposed. Remember a 10 blade grid with 20 JVMs per blade and a 100% write rate for some workloads would get you here.
So, write through can still use a lot of DBMS connections if the write transaction percentage is high. If it's lower then these numbers can all be revised downwards. For example, if it was 10% then lets do the math again. One box does 20k tx/sec. 10% is 2000/sec. There are 200 JVMs then thats 10/sec per JVM. One or two connections in the DataSource would likely be enough here depending on response time requirements. This gives us 200 JVMs x 2 connections or 400.
It's very important to share a DataSource within a JVM, do not give each JDBCTxCallback it's own DataSource. The example JDBCTxCallback provide in this blog entry (see the link above) accepts a DataSource reference when it's initialized. You can use Spring to wire it in or change the JDBCTxCallback to get one from JNDI when there is an application server hosting the grid containers. This way, even though there is one JDBCTxCallback PER primary shard in a JVM, at least they will all share the same DataSource instance. This would reduce the connection usage dramatically. This is usually only a problem when used a J2SE hosted WXS container. Customers using WXS inside WebSphere ND will likely be using WebSphere's DataSource implementation and this will naturally result in one DataSource per JVM.
4) Write behind WXS to a DBMS
This approach gives the best opportunity to cut JDBC connection usage. Assuming, we preloaded the grid then the connections for reads can likely be ignored. This leaves writes. Write behind decouples the writes to the database from the grid writes. This means that one WXS write does not mean one DBMS write. This was the major problem with write through.
WXS allow DBMS writes to be triggered every X 'dirty' entries in a Map or when 'dirty' entries in a Map are older than a certain time (5 minutes for example). This means we would see much fewer database writes than grid writes. If we have a grid JVM with 5 primary partitions in it (the normal amount when sized correctly) then there are only 5 potential concurrent writes per second in such a JVM. Each partition will send its changes to the DBMS using a single thread. Therefore, the maximum JDBC connections used for write behind is the number of primary partitions in the JVM which in this case would be 5 PER JVM.
Note, that when boxes fail then the # of primary partitions can be more. Thus, size everything with what you expect to be your minimum # WXS container JVMS to figure this out. It's more likely to be 8-10 primaries in this case. Thus, we'd expect 10 JDBC connections per container if the system was extremely write intensive. However, a very high write rate is unusual. Normally, the write rate is much lower (10-20% might even be high). But, a security system built with a grid that updates last login time for example might result in a WXS write per login so be careful sizing this.
We expect in the worse case scenario with write behind to need one JDBC connection per partition. However, this is only the case when every partition writes at the same time. The big factor here is how long does a write behind transaction take against the database or the 'duty cycle'. If it takes 1 second and we're doing one write behind every 5 minutes then we can likely use one shared JDBC connection in a DataSource per JVM, thats a duty cycle of 1 : 300 (300 seconds in 5 minutes). We are unlikely to get much contention for that one connection as there is a 1:300 change of writes from different partitions clashing within a single JVM. If each write behind tx takes 2 minutes then this clearly changes things given there are 5-10 primaries per JVM. 2 minutes would mean a duty cycle of 120:500 or about 25%. You'd need probably one per primary partition here or 10 per JVM.
But, regardless, it's clear that write behind can reduce the DBMS connection usage significantly when the write behind configuration is 'optimal'. Optimal depends on the situation but as a starting point, 5 minutes or 1000 records (T300;C1000) isn't a bad place to start. The size of the DataSource pool for each JVM depends on the duty cycle of the write behind within a single JVM.
The points on sharing a single DataSource within a WXS JVM equally apply here as to write through.
Summary
I hope this blog entry opens peoples eyes in terms of JDBC/WXS interactions. Most applications elect to use a grid because the DBMS is unable to provide adequate response times under peak loads. If the write rate is very low then there is no problem. But, when the write rate is high then write behind is usually the best scenario here. A high write rate application using write through is likely going to still have performance issues as the DBMS will prevent the grid from achieving linear scaling and good response times.
A quality DataSource implementation is a big deal here. The DataSource thats provided with WebSphere Single Server/Base and WebSphere ND is about as fully developed as exists out there. It is a significant improvement (in my humble opinion) over C3P0 or Apache dbcp. Thus, running WXS inside a single server WAS install is something you could consider instead of a J2SE based WXS implementation for this reason alone, the DataSource implementation.
Posted at 10:10 AM in J2EE, J2SE, nosql, Persistence, WebSphere eXtreme Scale, WebSphere Performance | Permalink | Comments (0)
Technorati Tags: database, datagrid, datasource, dbms, jdbc, sizing, websphere extreme scale
|
Jason McGee just started a blog (finally...) on his thoughts on cloud computing and IBM technologies in that space. Check it out. http://cloudjason.com/
Posted at 09:08 AM | Permalink | Comments (0)
|
I'm looking at the Spring cache APIs in 3.1 M1 with a view to implementing it in my wxsutils samples projects but I'm not comfortable with them at all. As a DataGrid vendor, they just make me uncomfortable for the following reasons:
1) Follows conventions of Java.Map
It's for a Cache, it's not a Map. The Map API was never a good one for datagrid products. There is so much more to this than just get/put that to continue to base stuff on Map ideas/conventions is crazy. Now, Springs Cache doesn't extend Map, whats Billy talking about? Has he finally given up drinking tea? Look at the generics signatures. The update/replace/put methods are type parameterized. The get and remove methods are not. This is not consistent with how a user of a cache of key/value pairs would expect. It may be consistent with the Map APIs in the JDK but it's not a map, it's cache, they are not the same! I'd prefer to see get/remove methods similarly parameterized.
2) The mutable methods return the old value
This costs nothing when the cache is local like ehCache but when it's remote then it's a major performance issue because you need to serialize the old value back to the client where in all likelyhood, the client will just discard. Thats a lot of cost for no value. I'd like to see those methods return voids.
3) No batch methods
DataGrids are remote. If I was to use the Spring Cache interface to store 1000 key/value pairs in a remote grid then I'd be doing 1000 RPCs. Lets use this to preload a 200GB cache with 100m entries, yep, it's going to take a REAL LONG TIME, put the kettle on. This is clearly inefficient but is the best a Spring programmer will be able to do with this interface. It should implement bulk operations so that they can be optimized when using a remote DataGrid. Now imagine bulk methods that return the old values, it's a horrible idea. Plus, the returned data could easily swamp a single client if you aren't careful.
4) Nothing has to be Serializable
So lots of applications will break when a distributed cache is plugged in OR the cache adapter will have to convert everything to some intermediate form (might be a good idea anyways).
To finish
I think it's good, nope, I think it's great that the Spring guys are doing this. It's a great way to make it easy to consume caching technology, being able to integrate it easily in to the many applications using Spring today. But, the interface isn't enough and has, for me, major issues when used with modern caching technologies. I don't believe anything here is specific to my own product, IBM WebSphere eXtreme Scale. It would apply to the others also.
BTW, I am emailing with Costin discussing this so I'm not just throwing rocks here :) I'm trying to get a better approach on the next revision.
This is a very common question. A customer wants to keep a copy of important data in a DataGrid, like IBM WebSphere eXtreme Scale. The data is sourced from a database typically. The data is changed by legacy applications in the source database only. The customer wants to know whats the best way to keep the DataGrid in sync as data changes in the database.
A great solution for this is the IBM Infostreams Change Data Capture product. This product installs on the same box as the database and it can watch the database for changes. It watches for changes usually by watching the transaction logs. This is very lightweight and typically has little or no impact on the database process. It can also watch using triggers but this is going to have an impact on the database.
Once CDC spots a problem then it can send a message to a JMS queue to record the fact that a record in a table that is being watched has been modified. The customer then writes an JMS message listener to listen to that queue and apply the change described by the message to the data in the DataGrid.
This is also a great way for buffering changes made to very large datasets when the grid is down for maintenance. We can stop the JMS listener. CDC will still buffer changes on the queue when the listener is disabled. Then we would write a disk snapshot of the grid using wxsutils. Do the maintenance. Restart the grid, reload the snapshot and then enable the JMS listener. The JMS listener will then bring the grid up to date with the changes buffered during the maintenance interval. This will be much faster than a full reload from the source database when the dataset is large or the source database is a smaller database (i.e. it's not that there is a lot of data, the database is just slow getting it all out quickly).
Anyway, this is something I'm going to start recommending every time moving forward. I think it's a great solution, not very invasive and is pretty simple to get working.
Posted at 03:52 PM in Event Processing, Persistence, WebSphere eXtreme Scale | Permalink | Comments (5)
Technorati Tags: cdc, change data capture, data grid, infostreams, objectgrid, stale data, websphere extreme scale
|