November 21, 2006

Comments

Sethu

Hi Billy,
Great blog - I follow your blog regularly and find it informative.
A quick question on the multi-core thing: don't you think it would boost app server performance, given that an app server is inherently multi-threaded (app server services plus user threads)?

Kindly reply.
thanks,
Sethu

Billy

No
My point is that Java is very dependent on fast GC for good performance. GC has elements that are multi-threaded, but the main task remains single threaded, and this thread will be slower on a multi-core CPU and get slower still as clock speeds drop when more cores are added. The other issue is that more cores do help multi-threaded code, but only at the limit. If your JVMs don't run at high load (and let's face it, most don't), then it's not the multi-threading that gives performance, it's the speed of a single pipe, and that's dropping.

kirk

Hey Bill,

Is your experience here related to IBM or Sun GC? I ask because there are some not so insignificant differences between the memory models used by Sun and IBM. My experience with the Sun JVM is that GC speed has increased considerably due to multicore technologies.

Also one other point, my experience is that hyper-threaded cores do better than multi-core machines though these results are very use case sensitive. Leads me to believe that things are much more memory latency dependent than pure CPU.

nice blog though.

Cheers,
Kirk

Markus Kohler

Hi,
I do not fully agree with your statements.

1. There are several options, at least in the Sun VM, that allow the GC to be run on more than one processor.

2. app servers are multithreaded, and therefore should scale on a multicore machine

3. Typically the CPU time spent by the GC is around 5%. If it's above that, your application will not scale anyway. So even if this time went up to 10% on a multicore, you would only lose 5% performance.

4. The problems with multicores are problems for all programming languages that have a GC. Almost all app servers are based on programming languages that have a GC

I agree that for implementing caches the JVM might not have all the features it should have. Being able to give the GC a hint that objects should be GC'ed now, could be of value.
Regards,
Markus

BrianCal

Billy,

For high performance servers, can't there be a co-processor to help with fast GC ?

Second, a lot of collected objects are visible only to single threads for their entire lifetime - in such situations multi-cores should certainly speed up the process.

What do you think?

Thanks
Cal

jonathan

To mitigate slow multithreaded GC, why not just run multiple JVMs?

I often horizontally scale WAS on large Windows boxes.

Antonio

Hi Billy,

It's the other way round!!

- "JavaEE promotes a simple threading model for applications"

Each EJB method invocation may run in its own thread (or in its own server if clustered). Each servlet invocation may run in its own thread (or in its own server if clustered). What other threading mechanism would you suggest? That's scalable, isn't it?

- The architecture is basically lots of cores but low clock speed per core.

Why low speed per core? If I run an Intel Pentium Dual Core, am I getting lower clock speeds? I don't think so...

For T1 processors all cores run at around 1GHz (either in the 4/6/8 core models, http://www.sun.com/servers/family-comp.html#coolthreads). That's a pretty good speed for a sparc processor...

- "Applications may have to be written to exploit threading if they want very high performance."

Well, of course. That's what everybody has been saying for a long time (including C++ "Guru" Herb Sutter http://www.gotw.ca/publications/concurrency-ddj.htm).


- "JavaEE evolves to support common threading patterns so that it's easier for normal developers to leverage threading on these slower processors."

JavaEE containers are responsible for the lifecycle of managed objects. The Spring Framework is also responsible for bean lifecycle. How are you expecting people to leverage threading? It's frameworks that should handle threading, right?

- "Garbage collection remains heavily dependant on clock speed for small pauses."

Why? Garbage collection may be run without pauses (using a concurrent GC). Sun's VM uses parallel GC (so as to reduce pause times in multiprocessors). I'd say that space sizes and correct tuning are more important to small pauses than CPU speed.

Well, Billy, I think you should consider rewriting your entry.

Cheers,
Antonio

Billy

Antonio.
Concurrent GC is a great thing. The short term garbage is collected very efficiently. The problem is the long term garbage, such as data kept in a cache in the JVM. If that data doesn't change then you could size the long term heap to hold it and not see a big GC event, which is obviously cool. But if the cache holds data that changes (a write-through cache, for example) then that large heap will fill, and when it does there will be a lot of data in it and you'll take a big pause, even with the latest JVMs. Caches are now adding features such as indexing and this, of course, means the cache takes more memory as well.

GC algorithms all have an element of multi-threadedness about them but there are still portions that are single threaded.

Running multiple JVMs per box is the only way to scale GC right now as was suggested by a previous poster. Rather than run a single JVM with a 1.5GB heap, run 3 with 500MB heaps. Now, GC is fully multi-threaded but it takes multiple JVMs to do this.

The problem with that is that the second level caches get polluted by having 3 active large processes running on the box. This splits the CPU cache 3 ways. There are projects to 'share' the compiled bytecode on the machine which would help with this.

The other issue is that you have more threads running now than before. Each JVM probably has M + N threads in use, where N is the container threads and M is the fixed set of threads in the JVM implementation and the application server. 3 JVMs means 3M + 3N, whereas a single JVM is likely M + 3N.
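
To make the thread arithmetic above concrete, here is a minimal Java sketch. The class and method names, and the values M = 40 and N = 50, are made up for illustration; they are not measurements from any real JVM or application server.

```java
public class JvmThreadEstimate {

    // One large JVM serving the whole load: the fixed M threads are paid once,
    // plus N container threads for each unit of load it absorbs.
    static int singleJvmThreads(int m, int n, int units) {
        return m + units * n;
    }

    // The same load split across several JVMs: every JVM pays the fixed M again.
    static int multiJvmThreads(int m, int n, int jvms) {
        return jvms * (m + n);
    }

    public static void main(String[] args) {
        int m = 40;   // hypothetical fixed JVM + app server threads
        int n = 50;   // hypothetical container threads per 500MB JVM
        int jvms = 3;

        int single = singleJvmThreads(m, n, jvms);
        int multi = multiJvmThreads(m, n, jvms);

        System.out.println("Single JVM (M + 3N):  " + single);
        System.out.println("Three JVMs (3M + 3N): " + multi);
        System.out.println("Extra threads:        " + (multi - single));
    }
}
```

The extra cost of splitting is always (number of JVMs - 1) x M, so it only matters when the fixed thread count M is large relative to N.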

Framework provided threading models so far have been simplistic. If that's what your application needs then cool, and for most applications it is fine. But there is a reason we added commonj, and it's why this API gets a lot of use: people couldn't write efficient applications with that simplistic threading model. The thread model choice can be critical for the performance of some applications.
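
As a rough illustration of what application-managed threading looks like, here is a minimal sketch that splits one request's work across a thread pool. It deliberately uses the standard java.util.concurrent API as a stand-in and is not commonj itself; the class name ParallelSumSketch and the method parallelSum are hypothetical, and the sketch assumes workers is at least 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSumSketch {

    // Splits the array into roughly equal chunks, sums each chunk on a pool
    // thread, then combines the partial sums on the calling thread.
    static long parallelSum(final int[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (data.length + workers - 1) / workers; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(w * chunk, data.length);
                final int to = Math.min(from + chunk, data.length);
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += data[i];
                    }
                    return sum;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that chunk is done
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(parallelSum(data, 4));
    }
}
```

With a one-thread-per-request model the same loop would run on a single core no matter how many are idle, which is the limitation being described.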

Markus
Thanks for the comments. Intel right now does indeed have high clock speeds, but I think that's temporary and we'll soon see the clock dropping as the number of cores goes up. I always push the high clock as an advantage over the Sun CPUs but it may be just a temporary advantage.

Sam

I may be misinterpreting your opening premise "The architecture is basically lots of cores but low clock speed per core." Do Intel's recent Core 2 Duo or Core 2 Quad chips qualify? If so, then I disagree with your premise.

I just recently read some benchmarking data for Intel's Core 2 Duo. It only has two cores so it may not qualify as "lots of cores", but the performance per core is very good. See the benchmarks at http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2795.

Billy

The dual cores had a clock speed reduction. The quad cores are now at 2.66GHz. I expect this trend to continue in order to manage heat and power.

Nate Edel

Saying things like "the dual cores had a clock speed reduction" without qualification is misleading. What matters is single-core/single-thread performance, and clock speed is only one variable in that.

Apologies for the long comment/explanation that follows...

The clock speed reduction was not a matter of going to dual core (*) but a matter of going to a new processor architecture... and as anyone in this industry ought to know, clock speeds are NOT terribly useful measurements for comparing processors of different architectures.

Intel's Core 2 architecture (**) focused on increasing instructions per cycle rather than raw clock speed, which was the focus of Intel's earlier Netburst architecture (used in the P4 and its relatives).

IGNORING the fact that they are dual core, either core of the top of the line 2.93GHz Core 2 Duo (or 3.0GHz "Woodcrest" Xeon) qualifies as the fastest single-core Intel x86(***) processor ever made. There is no sacrifice in the lower clock speed; you can argue exactly what multiple of megahertz is comparable for Core 2 vs. Pentium 4 comparisons, but a VERY conservative low end of the argument would be about 40%, which puts the top Core 2 chips well beyond the fastest Pentium 4 chips (the 3.93GHz Extreme Editions)(****).

As for the quad-core chips, like the first-generation Intel dual core chips (Pentium D 8xx series), they are actually two dual-core dies on a single package, not a unique quad-core design. The clock speed reduction is notable, but more a matter of cost and power consumption than of design capability. Given what overclockers are achieving with the Core 2 Duo chips today, it's fair to assume that the architecture has quite a ways to go before hitting its actual clock speed limits.

(* except for the very peak 3.8GHz and 3.93GHz chips, clock speeds of the P4/Netburst based dual core Pentium-D series very rapidly achieved parity with the Pentium 4 series. A similar pattern applies for AMD's chips.)

(** and to an extent the Pentium-M/Core mobile architectures that preceded it, even if Intel claims that they're not that closely related.)

(*** in the sense of x86 made by Intel; I leave the Intel vs. AMD arguments for others.)

(**** most of us have been using a 1.5:1 ratio for estimating K8 vs Netburst performance, and most benchmarks are showing that the Core 2 Duos beat out AMD K8 at the same clock speed, so "40%" is VERY conservative.)

Dmitry Leskov

I recall my colleague telling me about a research paper by GC guru David Bacon which stated that under heavy load on a system with 8 or more CPUs (or cores) your enterprise Java app is going to wait for I/O completion often enough for the JVM to allocate one CPU exclusively for memory management tasks and use, guess what, reference counting for GC.

I could not find the paper, but you may find some slides if you google for "Java without the Coffee Breaks".

Kirk

Hi Billy,

"But if the cache holds data that changes (a write-through cache, for example) then that large heap will fill, and when it does there will be a lot of data in it and you'll take a big pause, even with the latest JVMs."

GC is a run-to-failure process. Failure for GC means the GC failed to collect the object in question. Consequently, caching will cause long GC pauses in old generation GC, whereas releasing objects will actually cause GC to run faster. I have some very nice GC pause time vs. bytes collected data that demonstrates this point.

Kind regards,
Kirk

Scott

It's an interesting hypothesis, but there are several existence proofs that negate it: several companies, including IBM using WebSphere, have produced excellent results on SPECjAppServer (a Java EE benchmark) on multi-core/multi-thread systems, including Sun's T2000 system, which runs 32 threads on a single chip at only 1.2GHz. Those results tend to have better cost/operation (both in purchase price as well as power consumption, etc.) than similar results that use lots of little boxes horizontally scaled.

Multi-core/multi-threaded systems may not be appropriate for all applications (and certainly not for apps where single-threaded performance is important). But Java isn't necessarily in that category (and Java EE specifically isn't): GC is highly multi-threaded, and (in most cases) GC simply isn't the performance bottleneck it once was.

Billy

Scott,
We have published good numbers on them, and throughput wise the new chips are an improvement over the old ones. But, as you point out, single thread performance is compromised. GC wise, if the applications do not have large caches or similar things in the heap then you're right also, but if apps do have large heaps of cached objects then, as multi-threaded as current GC is, there are still elements that are single threaded and those apps will suffer.

As an aside, I think the time has come when we need to bring back ways to allow applications to manage the lifecycle of certain object graphs. This would solve the GC issue for almost all apps. Single thread throughput wise, nothing changes though.

Michael

Billy, I have thought that commonj is a very nice idea for a while now. It is really too bad it is not used more widely.

Otherwise, I think that explicit deletion of objects (malloc and free) is a very good idea if kept out of the hands of the vast majority of Java programmers, who would not call free(). A perfect example is the way Derby tries to avoid running out of heap and to know when to fall back to disk. It has to test the memory left after allocating different objects to discover their sizes. What I'd really like to see is a modifier on local variables and parameters to say that the object will never be assigned to a field and will always be stack-bound, so it can be cleaned up when it goes out of scope. The other thing I'd like to see is a way to attach a hook to a ClassLoader so that an application can decide how much relative space to assign to different Classes of objects. Finer control is always better.

Stuart

Michael, a number of modern JVMs do simple 'escape-analysis' on bytecode to determine whether an object is stack-bound without the need for modifiers.
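
A minimal sketch of the kind of code this applies to: the Point below is never stored in a field, never returned and never passed to another method, so it does not escape sumOfSquares. Whether a particular JVM actually removes or stack-allocates the allocation is an assumption that depends on the JVM and its JIT settings; the class and method names are hypothetical.

```java
public class EscapeSketch {

    // Small value holder used only inside sumOfSquares.
    private static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Each Point lives and dies within one loop iteration: it is a
    // candidate for escape analysis, with no modifier needed in the source.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i + 1);
            total += (long) p.x * p.x + (long) p.y * p.y;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1_000_000));
    }
}
```

If the object were instead added to a list held in a field, it would escape and would have to be allocated on the heap as usual.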

IBM have also recently released a realtime version of WebSphere which uses a special JVM with a new GC technology called 'metronome'. This allows programmers to allocate from different sub-heaps depending on their time/space requirements.

The benefit of this approach is that you can cope with large malloc/free-style object caches without disturbing the pause time for realtime threads.

See the Metronome site at http://domino.research.ibm.com/comm/research_projects.nsf/pages/metronome.index.html for more details.

Christopher Smith

There are a number of problems with your thinking, and the best way to empirically demonstrate that your conclusion is wrong is to note that J2EE apps seem to get a more significant boost than most other types of apps with Niagara server.

So let's go down the list:

1) As others have pointed out, multi-core designs don't necessarily mean measurably slower execution inside a core. You'd be hard pressed to find a top of the line multicore CPU that was more than about 10% slower than its top of the line single core brethren when running single threaded apps. If you can speed up cross core memory access in exchange for the 10% CPU core penalty, you'll probably make JVM optimizers, and GC implementers in particular, very happy.

2) Your notion that Java likely has a longer path length than languages like C doesn't make much sense at all. Java's execution model actually reduces the need for sequential execution as compared to C, and the nature of HotSpot's dynamic runtime actually allows Java programs to take shorter optimistic paths and then fall back to longer paths only when a pessimistic case comes out (hey, maybe you could execute both concurrently on different cores ;-). Of course you can do this in C with a lot of hard work, but in Java the runtime can do it for you more easily.

3) Your notion that a simple thread model (specifically J2EE) somehow executes poorly as you have more and more parallelism in hardware is actually backwards. Complex thread models tend to have significant problems scaling. Simple is a huge advantage.

4) Garbage collection does *not* remain heavily dependent on clock speed for small pauses. Concurrent GC actually relies heavily on having excess CPU cores available and extraordinarily low latency inter-core MESI type manipulations in order to have small pauses. Assuming you are doing incremental GC with a write barrier, your biggest concerns are a) being able to find a CPU to remove the write barrier before someone needs to write to an object and b) being able to update all the other cores about changes to the write barrier.

5) Multi-core doesn't mean shared cache or memory bandwidth. The Pentium Ds had this feature, but the trend tends to be for each core to have independent cache and memory bandwidth.

6) Finally: JITs, GCs, etc. actually benefit more from overall CPU throughput than from improvements in single threaded execution. If you've got extra cores that an app otherwise isn't using, there is no harm in having the JIT do some extra profiling analysis or code generation, the GC prematurely reorganizing memory, etc. This is an advantage that you effectively don't need or take advantage of with more traditional execution models.

So, in fact, even in theory Java is likely to benefit from future multicore designs more than languages with more traditional execution models. Of course, functional languages will probably receive the biggest boost of all.

Prabuddha

I have a few objections to points you make

1) You say it is a J2EE problem that core speeds will be held down. But compared to other applications written in C, J2EE applications are much more multithreaded, and multiple cores should help them much more than, say, single threaded first person shooters.

2) A number of times you have mentioned that the second level cache gets polluted as it is shared across cores. While this is true of Intel chips, it's not true for AMD Opteron processors, where each core has its own L1 and L2.

Billy

Prabuddha,
Every system is a balance. You are not getting those extra cores without a cost. There is only so much room on the die. AMD currently have cache per core but trendwise, I don't think that will continue. What may happen is that chips will have groups of cores that share a cache per group.
