Virtualization takes multiple physical machines, virtualizes them and then runs them concurrently on a single server. The argument is that they normally use little CPU, so it's OK to run them all on a single box, which will then run at a much higher CPU utilization, say 70-80%. Most boxes targeted for consolidation are older servers with slow CPUs; 5% utilization on such a box might be 1% on a modern box. This means that many such older machines can be consolidated onto a newer, more powerful server. However, while CPU is usually plentiful and cheap, the real issue is memory. The new server needs enough memory to hold every consolidated machine, and that much memory can be expensive: DRAM cost goes through the roof when larger capacity DIMMs are required. As a result, virtualization vendors typically recommend overcommitting the server memory, which basically means they are recommending swapping at the hypervisor level. This dramatically cuts the amount of memory required and lowers the TCO of the solution correspondingly.
C programs or scripts tend to respond well to paging. They typically have small working sets (active memory pages), and the rest of the process or virtual machine can be paged out without dramatically impacting the performance of the virtual machine. They typically don't regularly sweep the contents of their address space. Java, however, is a completely different animal. A Java Virtual Machine effectively has only one working set size: the whole virtual machine heap. The objects used by a Java application are typically spread randomly all over the address space of the process. Any timer which fires can cause much of the JVM to be paged in by the hypervisor, and if a garbage collection is triggered then the whole JVM heap must be paged in from disk as the garbage collector sweeps ALL the memory, a very slow operation. This page-in causes other virtual machines to be paged out; when they garbage collect in turn, they page back in and we get a ping-pong effect. This behavior can cause major variances in response times for users of the applications hosted in the JVM. A garbage collection which normally completes in hundreds of milliseconds might take minutes to complete in such an environment, and the applications within that JVM are frozen during this time. Many Java applications use timers for JDBC connection pooling, timeouts, heartbeating for cluster management and so on. The paging activity wreaks havoc on these kinds of time-sensitive operations and will cause most Java applications to behave very erratically, if not fail completely. Any kind of clustering framework will struggle to survive a paging environment due to the timing-sensitive nature of heartbeating and similar code.
Java Virtual Machine technology needs to be improved so that it can cooperate with the host operating system or hypervisor to minimize its working set, which should allow it to be paged more efficiently, but garbage collection will likely remain a tough problem to beat in an environment that is actively paging. Any virtualization vendor will struggle to keep a customer happy on a virtualized system with overcommitted memory. These systems are frequently sold to customers with the assurance that overcommitting the memory is OK, but this usually results in the customer being unsatisfied and then complaining to the Java application vendor about poor performance or unstable behavior. The vendor's response to these complaints will likely be "Don't overcommit your memory!". The customer's reply is usually "But the virtualization vendor said it was OK, otherwise the system becomes very expensive…" and the final response from the application server vendor will be "Yep, that's right…"
Solid state disks for paging might help with the situation. But any stateful application server with timing-sensitive code will likely still encounter some issues in such an environment because of the jitter introduced by paging. WebSphere includes code to check for such jittery environments. It's a simple alarm which fires periodically and checks that it fired when it expected to fire. If the alarm discovers that it is several seconds later than expected, it logs a message indicating that CPU starvation is occurring. In the past this only happened on machines whose operating systems were paging OR on machines running hundreds of JVMs which suddenly all got busy and overwhelmed the CPU capacity of the server. Imagine 500 threads that are all CPU bound running on an 8-core box. If each thread runs for 200ms when it gains the CPU, then in one second the 8 cores can service about 40 threads. This means around 12 seconds will elapse for any given thread before it even RUNS again. This is unacceptable for any timeout-style operation with a resolution finer than that. We are now seeing customers report problems on virtualized servers where they have overcommitted the memory and, while the local operating system is not paging, the hypervisor is. The result is the same. Erratic behavior.
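To make the alarm idea and the thread arithmetic above concrete, here is a minimal Java sketch. It is not WebSphere's actual code; the class name StarvationCheck, the method names, the one-second period and the three-second threshold are all assumptions chosen for illustration. The alarm records when it next expects to run and logs a warning if it wakes up much later than that, and a small helper reproduces the 500 threads on 8 cores estimate.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: names and thresholds are hypothetical, not WebSphere's.
class StarvationCheck {
    static final long PERIOD_MS = 1_000;     // assumed: how often the alarm fires
    static final long THRESHOLD_MS = 3_000;  // assumed: lateness that triggers a log message

    /** Milliseconds past the expected time that the alarm fired (never negative). */
    public static long latenessMillis(long expectedMs, long actualMs) {
        return Math.max(0, actualMs - expectedMs);
    }

    /** Seconds for one full pass over CPU-bound threads that each run one time slice. */
    public static double roundRobinWaitSeconds(int threads, int cores, long sliceMs) {
        double threadsPerSecond = cores * (1000.0 / sliceMs);
        return threads / threadsPerSecond;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Single-element array so the lambda can update the expected fire time.
        long[] expected = { System.nanoTime() / 1_000_000 + PERIOD_MS };

        timer.scheduleWithFixedDelay(() -> {
            long now = System.nanoTime() / 1_000_000;
            long late = latenessMillis(expected[0], now);
            if (late > THRESHOLD_MS) {
                System.err.println("Possible CPU starvation or paging: alarm fired "
                        + late + " ms late");
            }
            expected[0] = now + PERIOD_MS; // next run is due one period from now
        }, PERIOD_MS, PERIOD_MS, TimeUnit.MILLISECONDS);

        // The example from the text: 500 CPU-bound threads, 8 cores, 200ms slices.
        System.out.println(roundRobinWaitSeconds(500, 8, 200)); // prints 12.5

        Thread.sleep(5_000); // let the alarm fire a few times, then stop
        timer.shutdownNow();
    }
}
```

On a healthy machine the alarm should stay silent. A check like this cannot tell CPU starvation apart from guest or hypervisor paging; it only observes that the thread did not get to run when it expected to, which is why the symptom looks the same in both cases.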
In summary, this kind of behavior is not new. It's been seen before on systems which page or are simply overloaded. In each case the system was out of CPU, out of memory, or BOTH. Customers running virtualized systems need to be very careful when overcommitting either resource if predictable application behavior is required.