We get a lot of questions from customers about performance: the application is slow and they want IBM to help. Usually our first question is "what's slow?" and nobody knows, because the application was written with no built-in monitoring at all. The product monitoring can be useful, but application-specific metrics are often much more useful.
IBM WebSphere eXtreme Scale includes monitoring, but I met Tom Lubinski from SL.com at QCon and he said that all samples should create MBeans and thus be monitorable. Many customers simply copy samples, so this would mean more customer applications would have monitoring. So, here is the PureQuery Loader for WebSphere eXtreme Scale with monitoring.
I create a single MBean for each Map in a JVM. This MBean tracks statistics on the Loader for that Map. The PureQuery Loader uses classes from the purequery.jmx package, but you can pretty easily modify your own Loader to make the same calls and get the same monitoring. Typically you add three calls: one to capture the start time, one to log exceptions and one to log the time spent in the operation.
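To make the three calls concrete, here is a minimal sketch of what such instrumentation could look like. Every name in it (LoaderMonitoringSketch, LoaderStats, startOperation, logGetTime, logException, the "sample.loader" ObjectName domain) is made up for illustration; it is not the API of the purequery.jmx package, and it only tracks get operations.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.ObjectName;

// Hypothetical names throughout; this is not the purequery.jmx API.
class LoaderMonitoringSketch {

    // Standard MBean rule: the management interface is public and named <Impl>MBean.
    public interface LoaderStatsMBean {
        long getGetCount();
        long getGetTotalTimeNS();
        long getExceptionCount();
        String getLastException();
    }

    public static class LoaderStats implements LoaderStatsMBean {
        private final AtomicLong getCount = new AtomicLong();
        private final AtomicLong getTotalTimeNS = new AtomicLong();
        private final AtomicLong exceptionCount = new AtomicLong();
        private volatile String lastException = "";

        // Call 1: capture the start time of the operation.
        public long startOperation() {
            return System.nanoTime();
        }

        // Call 2: log the time spent once the operation has succeeded.
        public void logGetTime(long startNS) {
            getTotalTimeNS.addAndGet(System.nanoTime() - startNS);
            getCount.incrementAndGet();
        }

        // Call 3: log a failure and remember its string form.
        public void logException(Throwable t) {
            lastException = String.valueOf(t);
            exceptionCount.incrementAndGet();
        }

        @Override public long getGetCount() { return getCount.get(); }
        @Override public long getGetTotalTimeNS() { return getTotalTimeNS.get(); }
        @Override public long getExceptionCount() { return exceptionCount.get(); }
        @Override public String getLastException() { return lastException; }
    }

    // One MBean per Map; the ObjectName domain and keys are assumptions of this sketch.
    static LoaderStats register(String mapName) throws JMException {
        LoaderStats stats = new LoaderStats();
        ObjectName name = new ObjectName(
                "sample.loader:type=LoaderStats,map=" + ObjectName.quote(mapName));
        ManagementFactory.getPlatformMBeanServer().registerMBean(stats, name);
        return stats;
    }

    // Stand-in for a Loader get(); a real Loader would query the database here.
    static String instrumentedGet(LoaderStats stats, String key) {
        long start = stats.startOperation();
        try {
            if (key == null) {
                throw new IllegalArgumentException("null key");
            }
            String value = "value-for-" + key;
            stats.logGetTime(start);
            return value;
        } catch (RuntimeException e) {
            stats.logException(e);
            throw e;
        }
    }
}
```

The counters are plain atomics with no synchronized blocks, which is what keeps the overhead low. A real Loader would keep one such object per operation type and add min/avg/max tracking.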
I just use JConsole to connect to a container or test JVM. I can then see the beans for each Map, click on the attributes, and there are all the statistics. Here is an example of what it looks like in JConsole:
You can see how many of each type of operation happened, the min/avg/max time in milliseconds, how many exceptions occurred, the string of the last exception and so on. This kind of instrumentation in your application will save you a lot of time when performance isn't what you expected. It's easy to add and adds almost no overhead to the running application (when done with fast code, no sync blocks and so on). Just do it.
I'm now in the process of modifying the Chirp sample application to have MBeans for the client side/agent calls the application uses, as well as the Loader code above. It makes the application much more demo-able and makes it easier to find issues when they arise.
The source code for the instrumented PureQuery Loader, along with some framework classes, is attached here.
New version which is thread safe: Download JMXLoaderSampleV2
Old version which had thread safety issues, sorry I was lazy :) (remains to compare with updated one)
I'm a fan of JMX, but I haven't found an out-of-the-box way to automatically persist metrics over an extended period for long term trending, forecasting, etc.
Is there a missing piece of the puzzle out there that lets me capture selective (and aggregated) JMX stats into a persistent store (file, database, RPC, whatever)?
Posted by: Mike Burke | December 03, 2009 at 08:03 PM
You should really make the code threadsafe. There are some serious issues and some minor ones.
A Serious one:
public long getAvgTimeNS() {
    if (count.get() > 0) {
        return totalTimeNS.get() / count.get();
    }
    return 0; // assumed fall-through; the quoted snippet is cut off here
}
This will fail with a division by zero if the counter is reset between the check and the division, because the second count.get() would then return 0.
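A minimal sketch of one way to close that window, assuming the same two AtomicLong fields as the snippet above: read the counter once into a local and use that value for both the check and the division. The class name AvgTimer and the record and reset methods are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder class; only getAvgTimeNS mirrors the snippet under discussion.
class AvgTimer {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalTimeNS = new AtomicLong();

    void record(long durationNS) {
        totalTimeNS.addAndGet(durationNS);
        count.incrementAndGet();
    }

    void reset() {
        count.set(0);
        totalTimeNS.set(0);
    }

    public long getAvgTimeNS() {
        long n = count.get(); // single read: a concurrent reset cannot change n afterwards
        return n > 0 ? totalTimeNS.get() / n : 0;
    }
}
```

This only removes the division by zero. The total and the count are still read at slightly different moments, so a reading taken during a reset or an update can be off.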
A Minor one:
minTimeNS.set(Math.min(minTimeNS.get(), durationNS));
You should use compareAndSet in a for loop to make sure you don't miss any updates.
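A sketch of what that loop could look like, assuming minTimeNS is an AtomicLong that starts at Long.MAX_VALUE (the class name MinTracker is made up for illustration):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder class for the minimum-duration statistic.
class MinTracker {
    private final AtomicLong minTimeNS = new AtomicLong(Long.MAX_VALUE);

    void update(long durationNS) {
        for (;;) {
            long current = minTimeNS.get();
            if (durationNS >= current) {
                return; // the stored minimum is already as small or smaller
            }
            if (minTimeNS.compareAndSet(current, durationNS)) {
                return; // no other thread changed the value in between
            }
            // Another thread won the race; re-read and try again.
        }
    }

    long getMinTimeNS() {
        return minTimeNS.get();
    }
}
```

On Java 8 and later, minTimeNS.accumulateAndGet(durationNS, Math::min) performs the same retry loop internally.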
There are also some small inconsistency windows where you can get crazy readings because the reset functionality is not atomic.
Posted by: Kasper Nielsen | December 07, 2009 at 09:46 AM
I did those deliberately, except for the / 0 one. People complain if it's synchronized, given that there is sometimes only one MBean. I'll upload a new version.
Posted by: Billy Newport | December 08, 2009 at 03:08 PM
Kasper,
Fixed, thanks for keeping me honest...
Posted by: Billy Newport | December 08, 2009 at 03:22 PM
Billy please stick to data/compute grids and leave management/monitoring to others because my friend clearly you have not got a clue what is needed to really solve performance & scalability problems in production when you recommend a technology that is poorly conceived and implemented for the task at hand (especially in the coming years).
It does not even scale to the level of metric data collection required for anything more than a petshop application.
http://opencore.jinspired.com/?page_id=129
I have yet to see JMX solve anything but the lowest of hanging fruit and even then you would probably be quicker and more accurate asking a witch doctor for advice.
JMX was designed for legacy products from HP (OpenView) and IBM (Tivoli), and even there it fails to hit the mark, offering no efficient means of collection and no standardization of measurements that could be correlated in some intelligent manner.
William
Posted by: William Louth | December 28, 2009 at 01:53 PM
Kasper, reset functionality is a very bad practice in production. It's a developer feature. No one in operations should be allowed to effectively delete monitoring data. Because JMX is not transactional (and most of the time not even thread safe), it just corrupts the data even more than it already is due to latency across multiple calls.
Marking is what is required.
Posted by: William Louth | December 28, 2009 at 01:58 PM
Will,
You're awesome, brought a smile to my face :) Your stuff is very cool but it's not in the JDK. Most customers with performance problems have no instrumentation at all. The first thing they do is call support, because it's always a product problem until proven otherwise. Sometimes it is, but usually it's something else. Any increase in the amount of instrumentation people do is a good thing, and it would allow them and us to see what's going on. I'm just trying to encourage this using stuff that they already have. If they had your stuff and were using it, then they can happily ignore this blog post; they are already ahead of the game.
Posted by: Billy Newport | December 29, 2009 at 08:20 AM
OK I tried showing some degree of restraint but obviously it did not work.
- JMX is sufficient for simple control operations (stop/start/restart/gc/...) when these allow the user to make a series of transient state transitions across a number of services in a single operation. That said, I am very uncomfortable with attribute write operations because of the obvious lack of persistence (change management control) and transaction support (rarely are updates localized to one attribute).
- JMX serves as a basic (and somewhat primitive) read-only management interface for generic remote client consoles as the cost of remote calls dwarfs the huge overhead in making local access calls. But at the same time its design makes the management consoles pretty much anemic (low data collection) and not terribly scalable (high latency) as attributes have to be pulled one by one. There are ways around some of the issues. Many, many ways. Actually each release of JMX seems to bring a new approach (or MBean) in futile effort to correct (mask) the original sin buried within its design.
If you are doing health status monitoring or high level reporting in a well tested and pretty stable (in terms of workload & execution behavior) system, then it is sufficient to build those big RED & GREEN circle dashboards. Anything else (is there anything not changing these days?) and you seriously need to get back to the dormitory and take your medication.
William
Posted by: William Louth | December 29, 2009 at 02:08 PM
I forgot to follow up on this point you raised.
"Any increase in the amount of instrumentation people do is a good thing and it would allow them and us to see whats going on."
Actually this is the crux of the problem. Developers rarely have enough information to make this determination. It is best left to tools that do not guess hotspots but instead instrument and dynamically enable/disable resource metering based on accurate real-time resource usage profiles.
Granted, if you have not got this technology then anything is better than nothing, but let's strive to reach higher than such low-hanging fruit.
By the way, I did try hard to get Oracle, IBM and Sun behind my technology. The IBM team was just too busy doing CSI on dead JVMs.
Posted by: William Louth | December 29, 2009 at 04:10 PM