Introduction
The Cell processor consists of a general-purpose POWERPC processor core connected to eight special-purpose DSP cores. These DSP cores, which IBM calls "synergistic processing elements" (SPE), but I'm going to call "SIMD processing elements" (SPE) because "synergy" is a dumb word, are really the heart of the entire Cell concept. IBM introduced the basic architecture of the SPE today, and they're going to introduce the overall architecture of the complete Cell system in a session tomorrow morning.
In this brief overview, I'm first going to talk in some general terms about the Cell approach — what it is, what it's like, what's behind it, etc. — before doing an information dump at the end of the article for more technical readers to chew on and debate. Once the conference is over and I get back to Chicago and get settled in, I'll do some more comprehensive coverage of the Cell.
Back to the future, or, what do IBM and Transmeta have in common?
It seems like aeons ago that I first covered Transmeta's unvieling of their VLIW Crusoe processor. The idea that David Ditzel and the other Transmeta cofounders had was to try and re-do the "RISC revolution" by simplifying processor microarchitecture and moving complexity into software. Ditzel thought that out-of-order execution, register renaming, speculation, branch prediction, and other techniques for latency hiding and for wringing more instruction-level parallelism out of the code stream had increased processors' microarchitectural complexity to the point where way too much die real-estate was being spent on control functions and too little was being spent on actual execution hardware. Transmeta wanted to move register renaming, instruction reordering and the like into software, thereby simplifying the hardware and making it run faster.
I have no doubt that Ditzel and Co. intended to produce a high-performance processor based on these principles. However, moving core processor functionality into software meant moving it into main memory, and this move put Transmeta's designs on the wrong side of the ever-widening latency gap between the execution units and RAM. TM was notoriously unable to deliver on the intitial performance expectations, but a look at IBM's CELL design shows that Ditzel had the right idea, even if TM's execution was off.