Cashing in on Parallelism
Recent articles on the Intel and AMD quad-core offerings have focused on the memory hierarchy, and on the cache structure in particular. This has sparked a discussion about who is “stealing” ideas from whom. The quad-core Intel Nehalem will have small per-core Level 1 (64 KB) and Level 2 (256 KB) caches connected to a big, shared Level 3 (8 MB) cache. This architecture will be used for devices with up to 8 cores. Some observers say that this follows what has been done in the AMD Phenom; others point out that a shared L3 cache is old Intel server technology anyway, as used in the K10m architecture. Does this really matter? Isn’t there a more important point underlying all these developments?
This week’s announcements about the Intel Larrabee have really spiced up this discussion. Here is a device which is ostensibly a GPU, but which is really a highly programmable general-purpose compute engine. It’s no accident that it uses an x86 architecture for its cores, nor that it has an awesome double-precision floating-point capability. This is going to be a powerful number cruncher with around 32 cores, each with a peak performance of 16 double-precision Gflops. It also has an impressive cache structure: each core has a Level 1 cache of 32 KB for data and 32 KB for instructions, plus a Level 2 cache of 256 KB. The status of the Level 3 cache doesn’t seem to have been announced. What is interesting is that the Level 2 caches communicate over a so-called ring bus, a very fast 1024-bit-wide data highway running at the Larrabee’s clock speed. That clock speed is predicted to be around 2 GHz, which at 128 bytes per cycle gives an overall bandwidth of 256 GBytes/s. Larrabee is expected to be available at the end of 2009 or the beginning of 2010.
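The bandwidth figure is simple arithmetic, and a few lines of Python make it easy to check. This is only a back-of-the-envelope sketch using the predicted figures quoted above. It assumes the ring bus moves one full-width transfer per clock cycle, and the function names are invented for illustration. The final ratio, bytes of ring-bus bandwidth per peak flop, is not a published figure, just a rough indication of how hard it may be to keep the cores fed.

```python
# Predicted figures quoted in the text; none of these are measurements.
RING_BUS_WIDTH_BITS = 1024
CLOCK_HZ = 2e9               # "around 2 GHz"
CORES = 32                   # "around 32 cores"
PEAK_GFLOPS_PER_CORE = 16    # double precision


def ring_bus_bandwidth_gbytes(width_bits, clock_hz):
    """Bandwidth in GBytes/s, assuming one full-width transfer per cycle."""
    bytes_per_cycle = width_bits / 8        # 1024 bits -> 128 bytes
    return bytes_per_cycle * clock_hz / 1e9


def aggregate_peak_gflops(cores, gflops_per_core):
    """Peak double-precision rate of the whole device, in Gflops."""
    return cores * gflops_per_core


bandwidth = ring_bus_bandwidth_gbytes(RING_BUS_WIDTH_BITS, CLOCK_HZ)
peak = aggregate_peak_gflops(CORES, PEAK_GFLOPS_PER_CORE)
bytes_per_flop = bandwidth / peak

print(f"Ring bus bandwidth: {bandwidth:.0f} GBytes/s")
print(f"Aggregate peak:     {peak} Gflops")
print(f"Ring-bus bytes per peak flop: {bytes_per_flop:.2f}")
```

On these numbers the ring bus can carry about half a byte for every peak floating-point operation, so most of the data a core works on will have to come from its own L1 and L2 caches.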
What’s actually happening is that each core in these multi-core devices is getting more and more local memory via the L1 and L2 caches, while the L3 cache (or the ring bus in the case of Larrabee) is progressively becoming a communication mechanism between the cores. Given that we now have devices from both Intel and AMD with 4 cores, and 32 cores predicted for 2009/10, it’s not too fanciful to predict that, within the next decade, there’ll be devices with several hundred, if not thousands, of cores. It’s not too difficult to imagine that these cores will each have local cache memory. The real question is whether shared memory will still work for such devices. Will a shared-memory architecture scale to a large number of cores? That is, will it be able to deliver data to the horde of cores quickly enough to keep them occupied? And what are the alternatives?
A good candidate has to be the message-passing model adopted by the high-performance computing world over the last two decades. Most compute clusters and high-end systems follow this model, which assumes a sea of processors, each with its own local memory, connected by some sort of communication mechanism. All notions of the processors sharing a common address space are abandoned. Well-established programming models exist for this type of architecture. The big advantage is scalability: it’s not uncommon for codes to run on 10,000 or more processors. The downside is the need to parallelise codes so that they work with message passing. For research centres with lots of experienced parallel programmers this isn’t an issue. For organisations used to the shared-memory model running on a handful of cores, the move to message passing would be a big step, and one which would highlight the shortage of parallel processing expertise outside the academic and research communities. However, doing nothing isn’t an option, because even with the shared-memory model there will still be a need to parallelise codes as the number of cores moves into double digits. Whichever way it goes, codes will need significant restructuring and issues of parallelism will have to be addressed. The next few years are going to be exciting!
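To give a flavour of what restructuring a code for message passing involves, here is a toy Python sketch. It is an illustration under stated assumptions, not how production HPC codes are written: those typically use a message-passing library such as MPI across separate processes or machines. Here threads stand in for processors and queues stand in for the communication mechanism, and the names `worker` and `sum_of_squares` are invented for the example. The discipline is what matters: each worker touches only the data it has been sent, and results travel back as messages.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """Receive a chunk of data, work on it locally, send back a partial result."""
    local_data = inbox.get()                    # blocking receive
    partial = sum(x * x for x in local_data)    # purely local computation
    outbox.put((rank, partial))                 # send the result as a message


def sum_of_squares(values, n_workers=4):
    """Scatter the data to the workers, then gather and combine partial sums."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(n_workers)
    ]
    for t in threads:
        t.start()

    # Scatter: each worker gets its own copy of every n_workers-th element.
    for rank in range(n_workers):
        inboxes[rank].put(values[rank::n_workers])

    # Gather: partial sums arrive in whatever order the workers finish.
    total = 0
    for _ in range(n_workers):
        _rank, partial = results.get()
        total += partial

    for t in threads:
        t.join()
    return total


total = sum_of_squares(list(range(1000)))
print(total)
```

Even in this tiny case the sequential one-liner has turned into explicit scatter, compute and gather phases, which is the kind of restructuring the move to message passing demands. Because the threads here run inside a single Python process, the sketch shows the programming style only and says nothing about performance.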
