
The Pentium: An Architectural History of the World’s Most Famous Desktop Processor (Part II)

John "Hannibal" Stokes concludes his series on the growth and development of …

Jon Stokes

Introduction

Back when the P4 first came out, there was quite a bit of negativity toward the new design in the hardware enthusiast community. Initial benchmarks showed that its performance was clearly clock-for-clock worse than that of the P-III, which was to be expected given its much longer pipeline.

Poor benchmark performance aside, there were also quite a few technical criticisms of its radical new design, leveled with varying degrees of validity by everyone from programmers to technology pundits.

Perhaps the most common gripe about the Pentium 4's microarchitecture, called Netburst by Intel, was that its staggeringly-long pipeline was a gimmick: a poor design choice made for reasons of marketing and not performance and scalability. Intel knew that the public naively equated higher MHz numbers with higher performance, or so the argument went, so they designed the P4 to run at stratospheric clock speeds and in the process made design tradeoffs that would prove detrimental to real-world performance.

I was one of the original dissenters from this school of thought, and in my P4 vs. the G4e series I tried to make a plausible technical case for why the P4's designers had made some of the design decisions that they did. I ultimately managed to convince myself and not a few others that the P4's deeply pipelined design was, in fact, performance-driven and not marketing-driven.

That was then, and this is now. As it turns out, the P4 bashers were right. Revelations from former members of the P4's design team, as well as my own off-the-record conversations with Intel folks, all indicate that the P4's design was the result of a marketing-driven focus on clock speeds at the expense of actual performance and scalability.

It's my understanding that this fact is pretty widely known within Intel, even though it's not publicly acknowledged. Furthermore, the P4's focus on megahertz has made it especially vulnerable to the industry-wide problems that have accompanied the 90nm transition, with the result that the new P4 probably won't scale very well at all in terms of both clock speed and performance. But I'm not going to say any more about the 90nm P4 problems, because I've addressed those elsewhere.

We now know that during the course of the P4's design, the design team was getting pressure from the marketing folks to turn out a chip that would give Intel a massive MHz lead over its rivals. The reasoning apparently went that MHz is a single number that the general public understands, and they know that, just like with everything in the world except for golf scores, higher numbers are somehow better.

In the present article, which is the conclusion of my architectural history of the Pentium line, we'll take a look at the P4's Netburst architecture and at the sacrifices that Intel made at the altar of MHz. We'll then look at the relatively new Pentium M, before finishing off with a look at Prescott. If you didn't catch the previous article, be sure to read it first.

The Pentium 4

Pentium 4 summary table

Introduction date: April 23, 2001
Process: 0.18 micron
Transistor Count: 42 million
Clock speed at introduction: 1.7GHz
Cache sizes: L1: ~16K instruction, 8K data
Features: hyperthreading added in 2002

I'm not going to give a breakdown of the P4's massive 20-stage basic pipeline, because I've done that elsewhere, but I will make a few general remarks about the ways in which it differs from that of the P6 core. I'll also cover one of P4's most radical innovations: the trace cache.

The P4's basic approach

The Pentium 4's designers took the P6's 12-stage pipeline and sliced it up into finer increments. Each stage does much less work, but this allows the processor to run faster. In this way, the P4 translates clock speed directly into performance, which is one way to take advantage of Moore's Curves.

Actually, let me unpack the previous statement a bit to show you what I mean. The following scenario is a bit oversimplified, but it gets the basic point across.

Let's say that each stage of a 20-stage processor does half the amount of work per clock cycle as each stage of a 10-stage processor. So the 20-stage processor takes two clock cycles to do what the 10-stage processor does in one. This means that the 20-stage processor has to run twice as fast as the 10-stage processor if it wants to do the same amount of work in the same amount of time. Why would you do things this way? Well, if people want to buy processors with higher clock speeds, then why not? Besides, as transistors shrink you can switch them faster, which means that you can continue to scale the clock speed of the processor as your manufacturing process improves. So to adapt the familiar dot-com business plan parody, we might say that Intel's reasoning went something like:

  1. process improvements
  2. clock speed increases
  3. profit!!!!

This plan works pretty well until the clock speed increases start to run out of gas... but let's not get ahead of ourselves.
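To put rough numbers on that reasoning, here's a toy calculation in C. The stage counts and clock speeds are invented for illustration; they're not Intel's actual figures.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical figures for illustration only. */
        double shallow_stages = 10, shallow_ghz = 1.0;  /* P6-style pipeline */
        double deep_stages    = 20, deep_ghz    = 2.0;  /* P4-style pipeline */

        /* Each deep stage does half the work, so the deep design needs twice
           the clock just to match the shallow design's per-instruction latency. */
        printf("shallow: %.0f ns to traverse the pipe, %.1f billion instr/s peak\n",
               shallow_stages / shallow_ghz, shallow_ghz);
        printf("deep:    %.0f ns to traverse the pipe, %.1f billion instr/s peak\n",
               deep_stages / deep_ghz, deep_ghz);
        return 0;
    }

Both designs finish at most one instruction per clock, so the deep pipeline's throughput win comes entirely from the higher clock speed that the finer-grained stages make possible.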

If you read my article on Moore's Law (the principle that I now call "Moore's Curves") then you understand that there's more than one way to take advantage of shrinking transistor sizes and other types of process improvements. Increasing clock speeds a la the P4 is one of them, but adding functionality is another. Instead of translating process improvements into clock speed increases, the P4's competitors (e.g., AMD's Athlon) decided to turn them into performance-enhancing hardware. Adding functionality in the form of execution hardware, branch prediction hardware, cache, etc. is another way to turn process improvements into performance.

Actually, both the P4 and the Athlon do a little bit of both: they add hardware and they increase their clock speed. The difference between the two designs is a matter of emphasis, with Intel emphasizing clock speed increases and AMD emphasizing hardware increases.

Unfortunately for AMD, there's no single magic number that sums up "performance as derived from various and sundry hardware improvements." This didn't stop AMD from trying to invent such a number, though. AMD debuted its performance rating system to mixed reviews from the tech community, but the company has stuck with the system, and while its positive effects may be debatable it doesn't seem to have done them any real harm. In fact, Intel is now adopting an analogous system for similar reasons, but more on that later.

The trace cache

The previous article talked about the buffering effect of deeper pipelining on the P6, and how it allows the processor to smooth out gaps and hiccups in the code stream. The analogy I used was that of a reservoir, which can smooth out interruptions in the flow of water from a central source.

One of the innovations that made this reservoir approach effective was the decoupling of the execution core from the front end by means of the reservation station (RS). The RS is really the heart of the reservoir approach, a place where instructions can collect in a pool and then issue when their data become available. This instruction pool is what decouples the P6's fetch/decode bandwidth from its execution bandwidth by enabling the P6 to continue executing instructions during short periods when the front end gets hung up in either fetching or decoding the next instruction.

With the advent of the P4's much longer pipeline, the reservation station's decoupling just wasn't enough. The P4's performance plummets when the front end cannot keep feeding instructions to the execution core in extremely rapid succession. There's no extra time to wait for a complex instruction to decode or for a branch delay; the high-clock-speed execution core needs the instructions to flow quickly.

One route that Intel could have taken would be to increase the size of the code reservoir, and in doing so increase the size of the instruction window. Intel actually did do this (the P4 can track up to 126 instructions in various stages of execution), but that is not all they did. More drastic measures were required to keep the high-speed execution core from depleting the reservoir before the front end could fill it.

The answer that Intel settled on was to take the costly and time-consuming x86 decode stage out of the basic pipeline. They did this by the clever trick of converting the L1 cache (a structure that was already on the die and therefore already taking up transistors) into a cache for decoded uops.

As the previous article mentioned, modern x86 chips convert complex x86 instructions into a simple internal instruction format called a micro-operation (a.k.a., micro-op or uop). These micro-ops are more uniform, and thus it's easier for the processor to manage them dynamically. To return to the previous article's Tetris analogy, converting all of the x86 instructions into uops is kind of like converting all of the falling Tetris pieces into one or two types of simple piece, like the "T" and the block pieces. This makes everything easier to place, because there's less complexity to manage on-the-fly.

The P6 fetches x86 instructions from the L1 instruction cache and converts them into uops before passing them on to the reservation station to be scheduled for execution. The P4, in contrast, fetches groups of x86 instructions from the L2 cache, decodes them into strings of uops called traces, and then fits these traces into its modified L1 instruction cache (i.e., the trace cache). This way, the instructions are already decoded, so when it comes time to execute them they need only be fetched from the trace cache and passed directly into the execution core's buffers.

Intel will not say exactly how large the trace cache is, but they claim that it holds 12K uops and has a hit rate equivalent to that of a 16K L1 cache.

In sum, the trace cache is a reservoir for a reservoir; it builds up a large pool of already decoded uops which can be piped directly into the execution core's smaller instruction pool. This helps keep the high-speed execution core from draining that pool dry.
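For the curious, here's a minimal sketch in C of the caching idea. This is not Intel's implementation: the structure, the sizes, and the stand-in decoder are all invented for illustration, and a real trace follows the predicted path of execution rather than a single fetch address.

    #include <stdio.h>
    #include <string.h>

    #define TRACE_LINES 256            /* invented size, not the real 12K-uop capacity */
    #define MAX_UOPS_PER_TRACE 6

    typedef struct {
        unsigned long tag;             /* x86 fetch address this trace was built from */
        int           valid;
        int           n_uops;
        int           uops[MAX_UOPS_PER_TRACE];   /* decoded uops (opaque ints here) */
    } trace_line;

    static trace_line tcache[TRACE_LINES];

    /* Stand-in for the slow x86 decoders: turn an address into a string of uops. */
    static int decode_x86(unsigned long addr, int *uops) {
        int n = 1 + (int)(addr % 3);               /* pretend 1-3 uops per instruction */
        for (int i = 0; i < n; i++) uops[i] = (int)(addr + i);
        return n;
    }

    /* Front end: on a trace-cache hit the decode step is skipped entirely. */
    static int fetch_uops(unsigned long addr, int *uops) {
        trace_line *line = &tcache[addr % TRACE_LINES];
        if (line->valid && line->tag == addr) {    /* hit: already decoded */
            memcpy(uops, line->uops, sizeof(int) * line->n_uops);
            return line->n_uops;
        }
        int n = decode_x86(addr, uops);            /* miss: decode and fill the line */
        line->tag = addr; line->valid = 1; line->n_uops = n;
        memcpy(line->uops, uops, sizeof(int) * n);
        return n;
    }

    int main(void) {
        int uops[MAX_UOPS_PER_TRACE];
        fetch_uops(0x1000, uops);          /* first time: decoded the slow way      */
        int n = fetch_uops(0x1000, uops);  /* second time: served from the trace cache */
        printf("second fetch returned %d uops without re-decoding\n", n);
        return 0;
    }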

The P4's instruction window

Before I discuss the nature of the P4's instruction window, I should note that I'm using the terms instruction pool and instruction window somewhat interchangeably. These two terms represent two slightly different metaphors for thinking about the set of queues and buffers positioned between a processor's front end and its execution core. Instructions collect up in these queues and buffers, just like water collects in a pool or reservoir, before being drained away by the processor's execution core. Because the instruction pool represents a small segment of the code stream, which the processor can examine for dependencies and reorder for optimal execution, this pool can also be said to function as a window on the code stream. Now that that's clear, let's take a look at the P4's instruction window.

As I explained in the previous article, the P6 core's reservation station (RS) and reorder buffer (ROB) made up the heart of its instruction window. The P4 likewise has a ROB for tracking uops, and in fact its ROB is much larger than that of the P6. The functions of the P6's reservation station, however, have been divided among multiple structures. A glance at the master P4 diagram above will show you how these structures are configured.

Up to 3 uops per cycle can flow from the P4's trace cache into the ROB. From the ROB, uops flow into one of two structures, depending on what type of uop they are: the memory uop queue or the arithmetic-logic uop queue. This works the way the names suggest: memory uops flow into the memory queue, and everything else flows into the other queue.

These FIFO queues are themselves attached to a group of schedulers, which schedule the uops for execution when their operand data become available. A glance at the large P4 diagram above will show you the names of the schedulers and which execution units they send uops to.

This partitioning of the instruction window into memory and arithmetic portions has the effect of ensuring that both types of instructions will always have space in the window, and that an overabundance of one instruction type will not crowd the other type out of the window. And the multiple schedulers provide fine-grained control over the instruction flow, so that it's optimally reordered for the fast execution units.
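Here's a minimal sketch of that partitioning idea. The queue depth, the uop classification, and the instruction stream are all invented for illustration.

    #include <stdio.h>

    #define QDEPTH 8                   /* invented depth, not the real queue size */

    typedef enum { UOP_LOAD, UOP_STORE, UOP_ALU } uop_kind;

    typedef struct { int entries[QDEPTH]; int count; } fifo;

    static int push(fifo *q, int uop) {
        if (q->count == QDEPTH) return 0;          /* queue full: this uop stalls */
        q->entries[q->count++] = uop;
        return 1;
    }

    int main(void) {
        fifo mem_q = {0}, alu_q = {0};
        uop_kind stream[] = { UOP_LOAD, UOP_ALU, UOP_ALU, UOP_STORE,
                              UOP_ALU,  UOP_ALU, UOP_LOAD, UOP_ALU };

        /* Steer each uop into the queue that matches its type; because the two
           queues are separate, a run of ALU uops can't crowd out memory uops. */
        for (int i = 0; i < 8; i++) {
            fifo *target = (stream[i] == UOP_ALU) ? &alu_q : &mem_q;
            push(target, i);
        }
        printf("memory queue: %d uops, arithmetic queue: %d uops\n",
               mem_q.count, alu_q.count);
        return 0;
    }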

All of this deep buffering and scheduling and queuing and optimizing is essential for keeping the P4's high-speed execution core full. To return yet again to our Tetris analogy, imagine what would happen if someone were to double the speed at which the blocks fall; you'd hope that they would also double the size of the look-ahead window to compensate. The P4 greatly increases the size of the P6's instruction window as a way of compensating for the fact that the arrangement of instructions in its core is made so much more critical by the increased clock speed and pipeline depth.

The downside to all of this is that the schedulers and queues and the very large ROB all add complexity and cost to the P4's design. This complexity and cost are part of the price that the P4 pays for its deep pipeline and high clock speed.

The P4's execution core

The P6 core's reservation station sent instructions to the execution core via one of five dispatch ports. The P4 uses a similar scheme, but with four dispatch ports instead of five. There are two ports for memory instructions: the load port and the store port, for loads and stores, respectively. The remaining two ports are for all the other instructions: execution port 0 and execution port 1. The P4 can send a total of six uops per cycle through the four dispatch ports.

How can six uops per cycle move through four ports? The trick is that the P4's two execution ports are double-speed, meaning that they can dispatch instructions (integer only) on the rising and falling edges of the clock. But we'll talk more about this in a moment. For now, here's a breakdown of the two execution ports and which execution units are attached to them.

  • Execution port 0:
    • Fast Integer ALU 1: This unit performs integer addition, subtraction, and logical operations. It also evaluates branch conditionals and executes store-data uops, which store data into the outgoing store buffer. This is the first of two double-speed integer units, which operate at twice the core clock frequency.
    • Floating-point/SSE Move: This unit performs floating-point and SSE moves and stores. It also executes the FXCH instruction, which means that it's no longer "free" on the P4.
  • Execution port 1:
    • Fast Integer ALU 2: This very simple integer ALU performs only integer addition and subtraction. It's the second of the two double-speed integer ALUs.
    • Slow Integer ALU: This integer unit handles all of the more time-consuming integer operations, like shift and rotate, that can't be completed in half a clock cycle by the two fast ALUs.
    • Floating-point/SSE/MMX ALU: This unit handles floating-point and SSE addition, subtraction, multiplication, and division. It also handles all MMX instructions.

You can also glance at the P4 diagram above for another take on the same information.

The P4's execution core exhibits one major peculiarity that sets it apart from any other architecture (at least that I've ever seen): two of its integer execution units run at twice the core clock speed. This allows each double-speed unit effectively to act as two regular-speed units, because each unit can take in and spit out two instructions per clock cycle (one on the clock's rising edge and one on its falling edge). The P4 thus can reasonably be said to have four "logical ALUs", which when combined with the complex ALU give the processor a total of five ALUs. That's a lot of integer horsepower, and indeed the P4 does quite well in integer benchmarks, especially at higher clock speeds.

The double-pumped ALUs are yet another example of the P4's general approach to performance as described above: why do two things at once, when you could just do one thing twice as fast?
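To make the six-uops-through-four-ports arithmetic concrete, here's a trivial tally based on my reading of the port breakdown above.

    #include <stdio.h>

    int main(void) {
        /* Per core clock cycle, by dispatch port (as I read the port list above): */
        int load_port  = 1;   /* one load uop                                   */
        int store_port = 1;   /* one store uop                                  */
        int exec_port0 = 2;   /* double-pumped fast ALU: rising + falling edge  */
        int exec_port1 = 2;   /* double-pumped fast ALU: rising + falling edge  */

        printf("peak dispatch: %d uops per core cycle through 4 ports\n",
               load_port + store_port + exec_port0 + exec_port1);
        /* The slow ALU and the FP/SSE units also hang off execution ports 0 and 1,
           but they share those ports rather than adding to the per-cycle peak. */
        return 0;
    }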

The P4 has two functional blocks which handle both floating-point and SIMD instructions. Combining floating-point and SIMD into one set of units, as opposed to giving them dedicated units like the G4 (MPC74xx) and G5 (PPC 970) isn't quite as limiting as it seems, because floating-point and SIMD code are rarely mixed. The FP/SIMD units have relatively low latencies for a machine with such a lengthy pipeline, which means that their performance scales relatively well with clock speed.

If anything hurts the P4's floating-point capabilities, it's the fact that the FXCH instruction is no longer "free" on the P4. This hurts floating-point performance for regular x87 stack-based floating-point code.

With the introduction of the P4, Intel tried to encourage developers to quit using regular stack-based floating-point and start using SSE for all floating-point code, including scalar floating-point. This is because SSE has a flat register file, which improves performance by... well, by not being a bad idea like a stack-based register file.

Speaking of SIMD and SSE, with the P4 Intel introduced the SSE2 extensions. In fact, since I haven't done it elsewhere in this series, let's recap the history of Intel's SIMD implementations, starting with MMX. The following summary is adapted from pages 7 and 8 of the ITJ article "A Detailed Look Inside the Netburst Architecture of the Pentium 4 Processor".

MMX Technology

  • 64-bit MMX registers
  • Support for SIMD operations on packed byte, word, and doubleword integers.

Streaming SIMD Extensions (SSE)

  • 128-bit XMM registers
  • 128-bit data type with four packed single-precision floating-point operands.
  • Data prefetch instructions
  • Non-temporal store instructions and other cacheability and memory ordering instructions.
  • Improved 64-bit SIMD integer support

Streaming SIMD Extensions 2 (SSE2)

  • 128-bit data type with two packed double-precision floating-point operands
  • 128-bit data types for SIMD integer operations on sixteen packed bytes, eight words, four doublewords, or two quadwords.
  • Support for SIMD arithmetic on 64-bit integer operands
  • Instructions for converting between old and new datatypes
  • Extended support for data shuffling.
  • Extended support for cacheability and memory operations

With Prescott, Intel introduced SSE3 instructions, which I'll outline a bit later.
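To make the packed data types in the lists above a bit more concrete, here's a small example using the standard SSE and SSE2 intrinsics headers: one SSE operation adds four packed single-precision floats, and one SSE2 operation adds two packed doubles. The intrinsics shown are the ordinary compiler-provided ones, not anything P4-specific.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE:  __m128, four packed floats  */
    #include <emmintrin.h>   /* SSE2: __m128d, two packed doubles */

    int main(void) {
        /* SSE: one instruction adds four single-precision values at once. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        float  fr[4];
        _mm_storeu_ps(fr, _mm_add_ps(a, b));

        /* SSE2: one instruction adds two double-precision values at once. */
        __m128d c = _mm_set_pd(2.0, 1.0);
        __m128d d = _mm_set_pd(20.0, 10.0);
        double  dr[2];
        _mm_storeu_pd(dr, _mm_add_pd(c, d));

        printf("SSE  packed single: %.1f %.1f %.1f %.1f\n", fr[0], fr[1], fr[2], fr[3]);
        printf("SSE2 packed double: %.1f %.1f\n", dr[0], dr[1]);
        return 0;
    }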

Progressively-improved support for SIMD is an example of Intel's exploiting Moore's Curves by adding transistors and functionality to the core, as opposed to just increasing the clock speed. For some types of applications, improved SIMD support pays off, but the amount of payoff depends on how well the application is tuned to take advantage of the hardware's capabilities. This is in contrast to a hardware improvement like improved branch prediction hardware, which tends to pay off to one degree or another regardless of the software type.

Branch prediction on the P4

I talked above about how the P4's greatly-enlarged and more-complex instruction window is one place where the P4 pays a price for its deep pipeline and high clock speed. The other place where the P4 spends a ton of resources to cover the costs of deep pipelining is in its branch prediction unit. For reasons which will become even more clear in the section below, the P4's deep pipelines mean that a mispredicted branch will cost many more wasted cycles than it would on the P6.

Like the P6 before it, the P4 has two branch predictors, a dynamic predictor and a static predictor. The dynamic predictor consists of a relatively-large Branch History Table (BHT) and a Branch Target Buffer (BTB). I have explained the function of these two structures elsewhere, so I will not do so again here. What you should know about them is that the bigger they are, the more accurately they can be used to predict both the direction and the target of branches in the code stream. The P4's BHT and BTB are relatively sizable, and they're also very accurate. Intel won't release exact statistics for the P4's branch prediction accuracy, but it's estimated to be upwards of 98% on most common types of code.

Branches that can't be predicted dynamically (i.e., they have no entry in the BHT) are predicted statically. The P4's static predictor is simple, and it operates on the assumption that most branches occur as the terminating condition in a loop; therefore forward branches are always predicted to be not taken and backwards branches are predicted to be taken. If you don't completely understand the connection between loop conditions and this type of static branch prediction, don't worry; I'll explain it more clearly when we talk about Prescott.
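The rule itself fits in a line of code. Here's a minimal sketch, with raw addresses standing in for the branch instruction and its target:

    #include <stdio.h>

    /* Plain static prediction: backward branches (assumed to be loop-closing)
       are predicted taken, forward branches predicted not taken. */
    static int static_predict_taken(unsigned long branch_addr, unsigned long target_addr) {
        return target_addr < branch_addr;     /* backward branch -> predict taken */
    }

    int main(void) {
        printf("backward branch: %s\n",
               static_predict_taken(0x2040, 0x2000) ? "taken" : "not taken");
        printf("forward branch:  %s\n",
               static_predict_taken(0x2040, 0x2080) ? "taken" : "not taken");
        return 0;
    }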

The P4's pipeline vs. the P6 pipeline

The P4's basic pipeline was the first one that I'd ever seen that included drive stages. These drive stages are there solely to handle shuttling signals across the chip's wires. They keep signal propagation times from limiting the clock speed of the chip, and they're one of the factors in making the P4's clock speed so scalable.

They also do absolutely no "useful" work, in the sense that they don't perform a function that falls under any of the four classic pipeline stages, and in this respect the first drive stage represents one of the costs of high clock speeds. Why? Because that's another pipe stage, and hence another instruction that must be flushed in the case of a branch mispredict.

The P4's basic pipeline doesn't have the two decode stages that the P6's pipeline has; in fact, the P4's basic pipeline doesn't have any decode stages at all. This is because of the trace cache, covered above. While the P6 would spend 5 stages fetching and decoding, the P4 spends 5 stages fetching micro-ops from the trace cache and delivering them to the ROB, where resources are allocated to them over the course of the next three stages.

The Pentium 4 then spends a considerable number of stages, four to be exact, moving the uops through the system of queues and schedulers described above. Compare these four stages to the single stage that the P6 spends scheduling uops for dispatch. Similarly, dispatch on the P4 takes two stages, in contrast to the P6's single dispatch stage.

In between the dispatch stages and the execute stage, the Pentium 4 inserts two register file read (RF) stages. These two stages, absent in the P6, are needed for moving operands from the Pentium 4's register file to its execution units. Of course, when I say that the RF stages are "absent" in the P6, I mean that the P6 performs this same act of reading the register file within the traditional execute stage; so RF happens on the P6, but there are no separate stages devoted to it. In sum, the Pentium 4's two RF stages are very much like the two drive stages mentioned above; they are there solely to keep signal-delay times from limiting the clock speed of the Pentium 4.

After the RF stages comes the execute stage (EX). This is stage 17 on the Pentium 4, as opposed to stage 10 on the P6 and stage 3 on the Pentium. That's a lot of cycles spent on executing one instruction.

Following the execute stage on the P4 are the flags, branch check, and drive stages. In the flags stage, any flags that need to be set as a byproduct of the instruction's execution (e.g., a divide-by-zero flag or an overflow flag) are set in the x86's flags register. The branch check stage, stage 19, is where branch conditions are evaluated, which means that in the case of a mispredict you have to flush 19 stages worth of work out of the pipeline. That's a lot of wasted cycles. Finally, there's a last drive stage dedicated to sending the results of the execution back to the ROB, where they'll wait to be written back to the register file at retirement time.
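To get a feel for why that flush is so painful, here's a rough, illustrative calculation using the ~98% prediction accuracy estimate from above and the stage counts just discussed. Real refill penalties depend on where the needed instructions actually are, so treat the numbers as ballpark only.

    #include <stdio.h>

    int main(void) {
        /* Illustrative figures only: the accuracy estimate and stage counts
           come from the discussion above; real penalties vary. */
        double accuracy = 0.98;
        double p4_flush = 19;   /* stages of work lost on a P4 mispredict       */
        double p6_flush = 10;   /* rough equivalent for the shorter P6 pipeline */

        printf("expected cycles lost per branch, P4: %.2f\n", (1.0 - accuracy) * p4_flush);
        printf("expected cycles lost per branch, P6: %.2f\n", (1.0 - accuracy) * p6_flush);
        return 0;
    }

Same predictor accuracy, roughly double the per-mispredict bill, which is exactly why the P4 spends so many transistors on branch prediction.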

So the Pentium 4's pipeline does all of the things that the P6's pipeline does (with the exception of x86 instruction decoding), but in many cases it takes multiple stages to do what the P6 does in one.

The cost of x86 support on the Pentium 4

With a design as unique as the Pentium 4, it's tough to really talk about the "cost" in transistors of x86 compatibility. The trace cache complicates the discussion greatly, because it's both a symptom of x86 support and a cure for it. As an instruction cache, the trace cache has a direct counterpart in the on-die L1 caches of non-x86 designs. So in this respect the trace cache doesn't really go in the "x86 costs" column. But its special features, such as the logic required for building trace segments and such, add to it complexity that a regular L1 cache lacks. This complexity could reasonably be said to be a cost of x86 legacy support, but even that characterization wouldn't be completely accurate. Why?

Because the P4's trace-cache-related complexity has the effect of taking the decode stage entirely out of the basic pipeline, it significantly reduces the impact of legacy x86 support on the P4's performance while at the same time slightly increasing the relative cost of such support in transistors. Of course, the overall relative cost of x86 support still went down on the P4 because of the increase in cache sizes (as per the trend mentioned in the first article)...

As you can see, the x86 legacy support picture on the P4 is complicated. In the end, the P4 just doesn't line up closely enough, architecturally, with comparable RISC designs to make x86 legacy support costs isolatable and readily quantifiable. In fact, it was the difficulty inherent in isolating x86 legacy support costs in modern processor designs that prompted me to write this article on the future of the x86 ISA.

The Pentium 4 and hyperthreading

One of the ways that Intel has tried to squeeze more performance per Watt out of the Pentium 4 has been to introduce a technique called hyperthreading. For in-depth coverage of all things hyperthreading, I'll refer you to this article. The present very brief discussion will only be a general summary.

Hyperthreading, also called simultaneous multi-threading (SMT), enables a processor to execute instructions from multiple threads at once. Normally, a processor executes instructions from only one thread at a time, with the OS switching between multiple threads in rapid succession in order to give the illusion of simultaneous execution. With SMT, the processor is able to fetch, decode, schedule, and execute instructions from two threads concurrently.

You could say that a normal processor's instruction window allows it to look at only a single thread at any given moment, while a hyperthreaded processor's instruction window has been enlarged to allow it to view two threads at once. This ability to look at and draw instructions from two threads simultaneously makes it easier for the processor to fill the different reservoirs that keep its execution core fed.
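As a cartoon of the fetch side of this, here's a sketch in which the front end alternates between two threads' program counters and skips a stalled thread. Everything here (the thread structure, the alternation policy, the stall pattern) is invented purely for illustration.

    #include <stdio.h>

    typedef struct {
        const char *name;
        int pc;          /* next instruction to fetch                 */
        int stalled;     /* e.g. waiting on an instruction-cache miss */
    } thread;

    int main(void) {
        thread t[2] = { { "thread A", 0, 0 }, { "thread B", 100, 0 } };
        int pool = 0;                    /* shared instruction pool occupancy */

        for (int cycle = 0; cycle < 6; cycle++) {
            t[0].stalled = (cycle == 2 || cycle == 3);   /* pretend A stalls briefly */
            /* Alternate between threads, but skip a stalled one: the pool keeps
               filling from thread B while thread A waits. */
            int pick = cycle % 2;
            if (t[pick].stalled) pick ^= 1;
            if (!t[pick].stalled) {
                printf("cycle %d: fetch from %s (pc=%d)\n", cycle, t[pick].name, t[pick].pc++);
                pool++;
            }
        }
        printf("instructions in shared pool: %d\n", pool);
        return 0;
    }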

SMT support costs relatively little to add to the processor die, but it helps increase the average number of instructions that get executed on any given cycle, thereby boosting performance per Watt. So hyperthreading provides the kind of gain in net performance per watt that the Pentium 4 needs to keep it viable as its clock speed increases start to slow down.

The Pentium 4: Conclusions

When the first Pentium 4 chips came out, they didn't stack up very well at all against the P-III. But as I and others argued, the processor's performance scaled reasonably well with its clock speed, and its clock speed scaled rapidly. The P4 eventually became a solid enough integer and floating-point performer, but during its lifespan Intel and AMD traded the performance lead a number of times. AMD's K7 architecture was a serious contender, and the P4 never developed the kind of lead over the x86 competition that the Pentium Pro enjoyed.

As the P4's clock speed scaled ever higher, so did its power consumption and heat output. This fact made the architecture less and less suited for laptops with each generation of P4 chips. Furthermore, blade servers had begun to take off, and extrapolating the P4's MIPS/watt curves into the future showed that the architecture would only be less and less capable of filling that niche. Intel knew that they had to do something if they were going to stay competitive in the mobile space and become a force in the growing blade server space, so they commissioned a team of designers based in Israel to begin work on a new x86 processor designed from the ground up with low-power applications in mind. This processor, codenamed Banias, eventually became the Pentium M.

The Pentium M: Intel gets back to basics

Pentium M summary table

Introduction date: March 12, 2003
Process: 0.13 micron
Transistor Count: 77 million
Clock speed at introduction: 600MHz, 900MHz, 1.1GHz, 1.2GHz, 1.3GHz, 1.4GHz, 1.5GHz, and 1.6GHz, with a 400MHz frontside bus
Cache sizes: L1: 32K instruction, 32K data, 1MB L2

As I noted when I initially wrote on the Pentium M, Intel has not been as forthcoming with microarchitectural details for this processor as they have for other processors. So the details of the Pentium M's implementation are a bit sparse. Nonetheless, it's possible to surmise a few important things from the details that Intel has released.

Also, I should mention that in preparation for this article I read back over my original Pentium M piece. I know a bit more about the PM now, and I'm not entirely pleased with certain aspects of my previous coverage. In this article I'll be correcting and augmenting that coverage with more solid information and analysis.

Now that the introductory matters are out of the way, let's take a look at the Pentium M.

Judging by Intel's public statements and my own conversations with insiders, the P-M is essentially the latest and greatest version of the venerable P6 core described in the first article. Specifically, the P-M builds on the Pentium III core, and since I didn't go into a whole lot of detail in Part I on the PIII's alterations to the P6 core, I'll cover some of that stuff now.

As I noted in Part I, the Pentium II saw the addition of MMX to the P6 core, with MMX units added to ports 0 and 1. The P-III brought floating-point SIMD support in the form of SSE instructions to the P6 core, but Intel didn't want to keep adding execution units to multiple ports. As a result, they modified the P-III's floating-point unit (which is on port 0) to handle some of SSE's SIMD floating-point instructions, and they put all of the new SSE units for handling the remaining SSE instructions on port 1.

In the diagram below, I've lumped the three new SSE units added on port 1 into one box labelled "SSE" for clarity's sake. This layout should give a slightly better picture of how the work is divided among the various units than the diagram in the original Pentium M article.

The Pentium M made a few critical changes to the P6 core in order to make it look a little more like the P4, but without going to the extremes to which the P4 goes.

Enlarged instruction window

The most critical of these changes was that Intel greatly enlarged (some rumors say doubled) the P6 core's instruction window to enable the processor to track more instructions in flight than its predecessor. This enlarged instruction window means that the ROB and RS sizes have been increased, and in general the other buffers on the processor are deeper (i.e., the outgoing store buffer has probably been deepened to handle more stores, and the memory reorder buffer (MOB) has probably been deepened as well). Note that this business about increased ROB and RS sizes is the opposite of what I speculated in my previous article; the flaw in the previous article's reasoning is quite obvious to me now, and I have no idea why I did not see it originally.

At any rate, this deeper buffering is necessitated by the P-M's increased pipeline depth. The Pentium M adds a few stages to the P-III's pipeline in order to help keep clock speeds up in spite of a slight increase in decoding complexity (more on this below), and as I explained above, the more pipeline stages you have, the larger an instruction window you need. Unfortunately, Intel has not disclosed the number of added pipeline stages, but speculation puts it at around three or four new stages.

Improved branch prediction

In addition to deeper buffers that track more in-flight instructions, another change related to increased pipeline depth lies in the Pentium M's improved branch prediction.

I described the Pentium M's branch prediction scheme in some detail here, so I won't recap all of that. I'll only note that the PM adds improved prediction for indirect branches, and it improves dynamic prediction in loops with the addition of a special loop detector.

Note that this improved branch prediction isn't just there to help out the PM's deeper pipeline. Branch prediction is one place where you can get a big performance increase on all types of code by expending relatively few transistors. So if one's concern is increasing a processor's performance-per-watt ratio, then spending transistors on branch prediction is a good way to go.

Micro-ops fusion, and added pipeline stages

I've made some references throughout this article to x86 processors' technique of decomposing complex x86 instructions into smaller, RISC-like uops. The Pentium M takes this process a step further and fuses certain types of uops back together into small bundles called macro-ops. I can't really top a previous explanation of mine for brevity and clarity, so I'll just quote it here:

Because uops are what the P6's execution core dynamically executes, they're what the reorder buffer (ROB) and reservation station (RS) must keep track of so that they can be put back in program order after being executed out-of-order. In order to track a large number of in-flight uops the P6 core needs a large number of entries in its ROB and RS, entries that take up transistors and hence consume power. The PM cuts down on the number of ROB and RS entries needed by fusing together certain types of related uops and assigning them to a single ROB and RS entry. The fused uops still execute separately, as if they were normal, unfused uops, but they're tracked as a group.

Intel appears to have limited this fusion technique to two specific types of memory instructions, with the result that it's not very widely used on the PM. Nonetheless, Intel claims that micro-ops fusion yields a 5% performance improvement on integer code and a 9% improvement on floating-point code.
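A minimal sketch of the bookkeeping idea: one ROB entry tracks two related uops that still execute separately but retire together. The specific example of a store's address-generation and data uops as the fused pair is my own illustration, not a claim about Intel's documented list.

    #include <stdio.h>

    typedef struct {
        int uop_a_done;       /* e.g. the store-address uop has executed */
        int uop_b_done;       /* e.g. the store-data uop has executed    */
    } rob_entry;              /* one entry now tracks two related uops   */

    static int entry_complete(const rob_entry *e) {
        /* The fused uops still execute separately, but retire as one unit. */
        return e->uop_a_done && e->uop_b_done;
    }

    int main(void) {
        rob_entry store = { 0, 0 };
        store.uop_a_done = 1;                        /* address uop finishes first */
        printf("ready to retire? %s\n", entry_complete(&store) ? "yes" : "no");
        store.uop_b_done = 1;                        /* data uop finishes later    */
        printf("ready to retire? %s\n", entry_complete(&store) ? "yes" : "no");
        return 0;
    }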

For this technique to be worth implementing on a low-power architecture like the PM, the transistor cost for making it happen would have to be minimal compared to the aforementioned performance gains. My own estimation of its overall architectural impact is that at least one of the new pipeline stages is an extra decode stage that was added to take care of the uops fusion.

As for my speculation on the other added stages, it's possible that there was an extra ROB read stage added to the PM because of the increased (doubled?) ROB size. It also may be the case that micro-ops fusion complicated scheduling somewhat, so that there's an extra stage added to the scheduler.

The stack execution unit

The PM also introduced a few other changes to the P-III, one of which is a stack execution unit for handling updates to the ESP register. This innovation, which I cover in my Pentium M article, cuts down on the number of in-flight uops and improves overall execution efficiency.

The costs of x86 legacy support on the Pentium M

There's really not anything to say here that hasn't already been said in my discussion of the Pentium III. But before closing out the topic I should just note that the micro-ops fusion technique really doesn't have anything to do with x86 legacy support, since as I pointed out in my Pentium M article the PowerPC 970 uses a similar instruction fusing technique.

Pentium M conclusions

The Pentium M's deeper pipeline, improved branch prediction, and enlarged instruction window are reminiscent of the Pentium 4, but the PM's moderation in all of these elements reflects the fact that the PM lacks the P4's excessive focus on clock speed. All told, the Pentium M is a solid performer with a great performance/watt ratio. If the rumors are true (and they probably are), Intel realizes that the future of its consumer x86 line is in the Pentium M, and that means not just laptops and blade servers but desktops as well. Look for multicore Pentium M variants to start cropping up eventually, and for the eventual introduction of hyperthreading into the design.

There's a lot to like about this new architecture, and in fact Intel themselves like it so much that they've rolled some of the lessons learned during its design back into the latest P4 design, codenamed Prescott.

The 90nm P4 a.k.a. Prescott: the Pentium 4's last gasp

Prescott summary table

Introduction date: February 4, 2004
Process: 0.09 micron
Transistor Count: 125 million
Clock speed at introduction: 2.8GHz, 3.0GHz, 3.2GHz, 3.4GHz
Cache sizes: L1: ~16K instruction, 16K data, 1MB L2

The 90nm Prescott introduces some significant changes to the P4's Netburst architecture in an effort to squeeze more performance out of it and to make it a bit more efficient. As with the Pentium M, the details are scarce, but I'll present the main points that are publicly known. Most of these enhancements are along the same lines as the enhancements we have talked about earlier, i.e., lengthened buffers and queues, improved branch prediction, etc.

Let's first run down the list of miscellaneous improvements and tweaks that I do not have too much to say about, before moving on to more detailed stuff.

Pipeline, execution core, SIMD, other Prescott improvements

As was publicized at the time the processor was unveiled, Prescott's basic pipeline is a little longer than that of the P4; we're not sure how long, but rumors are that two extra stages have been added. I won't speculate on what those stages are, but Intel claims that they were added for ease of clock speed scaling. Yes, in case you were wondering, this does amount to beating a dead horse.

Intel has beefed up Prescott's core a bit with some changes to the integer ALUs. Specifically, they added a shifter/rotator to one of the fast ALUs, which means that more common forms of shift and rotate can now be executed twice as fast. They also added a dedicated integer multiplier, which I'm assuming is an addition to the complex integer ALU. Previous versions of the P4 used the floating-point hardware to do integer multiplies, which means extra latency because operands have to be moved over to the FPU. The addition of the dedicated multiplier to the complex integer ALU does away with this move, improving multiplication latency and performance.

Prescott's ROB is still 126 entries deep, but the system of queues and buffers that makes up its instruction window has been enlarged. Specifically, the floating-point/SIMD schedulers have been increased in size, as have the sizes of the queues that feed all of the schedulers. The former changes should yield some improvement on floating-point and SIMD code, and the latter may help slightly with other types of code as well.

One way of keeping a deeply-pipelined design full of code and data is to improve the size and hit rate of the processor's caches, and Intel does this by upping the L1 data cache size from 8K (4-way associative) to 16K (8-way associative). Also, the P4's unified L2 cache (256K on the low end versions, 512K on the high end) gets a boost in Prescott to 1MB.

Prescott also brings with it the latest extension to the x86 ISA: SSE3. SSE3 consists of 13 new instructions designed to speed multimedia code. I won't describe all of those here, but the list includes instructions designed to speed floating-point-to-integer conversion, complex arithmetic, video encoding, graphics, and thread synchronization.

Finally, there are a few miscellaneous improvements to the internals of the processor that will give better performance on hyperthreaded code. Prescott is a little bit smarter about how it shares microarchitectural resources between concurrently-running threads, and it also adds new instructions that coders can use to help with thread management.

Branch prediction on the Prescott

I said earlier that branch prediction is a great place to spend transistors because of its performance-enhancing potential on all types of code, and with that in mind Prescott has two new tricks up its sleeve for predicting branches.

The first of these two tricks is an improved static branch predictor. In the previous section on the Pentium 4 I briefly described static branch prediction, and I promised that I'd go into a bit more detail this time. Here's an explanation from a previous article on the P4 that sums up static branch prediction well enough:

There are two main types of branch prediction: static prediction and dynamic prediction. Static branch prediction is simple, and relies on the assumption that the majority of backwards-pointing branches occur in the context of repetitive loops, where a branch instruction is used to determine whether or not to repeat the loop again. Most of the time, a loop's conditional will evaluate to "taken," thereby instructing the machine to repeat the loop's code one more time. This being the case, static branch prediction merely assumes that all backwards branches are "taken." For a branch that points forward to a block of code that comes later in the program, the static predictor assumes that the branch is "not taken."

By studying loop behavior in actual code, Intel has discovered something about loops that has allowed it to improve the plain old static branch predictor a bit. Here's Intel's own description of their new method:

We can try to ascertain the difference between loop-ending branches and other backwards branches by looking at the distance of the branch and the condition on which the branch is dependent. Our studies showed that a threshold exists for the distance between a backwards branch and its target; if the distance of the branch is larger than this threshold, the branch is unlikely to be a loop-ending branch. If the BTB has no prediction for a backwards branch, the Intel Pentium 4 processor will then predict taken for the branch only if the branch distance is less than this threshold.

So in situations where the static predictor is used, Prescott's static predictor compares the distance of the branch to a hardwired threshold number, and if that distance is less than the number it assumes that the branch is a loop-ending branch and marks it taken.
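In code, the refined rule might look something like the following sketch; the threshold value is a stand-in, since Intel doesn't disclose the actual hardwired distance.

    #include <stdio.h>

    #define THRESHOLD 1024   /* stand-in for Intel's undisclosed hardwired distance */

    static int prescott_static_predict_taken(unsigned long branch_addr,
                                             unsigned long target_addr) {
        if (target_addr >= branch_addr)                  /* forward branch        */
            return 0;                                    /* predict not taken     */
        return (branch_addr - target_addr) < THRESHOLD;  /* short backward branch */
    }

    int main(void) {
        /* A tight loop-closing branch is predicted taken... */
        printf("short backward branch: %s\n",
               prescott_static_predict_taken(0x5040, 0x5000) ? "taken" : "not taken");
        /* ...but a long backward jump is probably not a loop, so: not taken. */
        printf("long backward branch:  %s\n",
               prescott_static_predict_taken(0x9000, 0x1000) ? "taken" : "not taken");
        return 0;
    }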

Intel also improved the Prescott's dynamic branch predictor by taking a page from the PM's playbook and adding an indirect branch predictor. The literature doesn't really say how it works, but they do credit the P-M folks with the motivation for the addition, so it's possible that it works similarly to what I've elsewhere described for the P-M.

Trace cache improvements

Prescott's trace cache was improved so that it now holds more types of uops than the P4's trace cache. Although I didn't mention the fact previously, the P4's trace cache doesn't hold the long uop sequences that correspond to really complex, multicycle legacy x86 instructions. When the P4's decoder comes across an x86 instruction that will decompose into a whole string of uops, it inserts into the trace cache a pointer to a place in the Microcode ROM that holds the proper uop sequence. When the time comes to execute this string of uops, the pointer is fetched from the trace cache and the front end is redirected to look in the microcode ROM for the proper instruction sequence.

This little jump into the microcode ROM takes time, so for a few of the less lengthy instructions Intel has decided that it would be better if Prescott decoded them the old-fashioned way and stored them in the trace cache. This saves time by allowing the instructions to be fetched more quickly, since they now come directly from the trace cache. The downside is that it pollutes the trace cache with these longer strings of uops that were previously stored in ROM, thus reducing the cache's effective size.
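Here's a rough sketch of that lookup arrangement. The entry layout is invented for illustration, but it captures the tradeoff: a ROM pointer keeps the trace cache compact at the cost of a slow detour, while storing the uops directly is fast but eats cache space.

    #include <stdio.h>

    /* Invented entry layout: a trace-cache entry holds either the decoded uops
       themselves or a pointer into the microcode ROM. */
    typedef struct {
        int is_rom_pointer;    /* 1: redirect the front end to the microcode ROM */
        int rom_address;       /* where the long uop sequence lives in the ROM   */
        int n_uops;            /* 0 if this entry is just a ROM pointer          */
        int uops[8];
    } tc_entry;

    static void fetch(const tc_entry *e) {
        if (e->is_rom_pointer)
            printf("redirect to microcode ROM at %d (extra latency)\n", e->rom_address);
        else
            printf("feed %d uops straight from the trace cache\n", e->n_uops);
    }

    int main(void) {
        tc_entry before = { 1, 42, 0, {0} };          /* older P4: ROM pointer     */
        tc_entry after  = { 0, 0, 4, {1, 2, 3, 4} };  /* Prescott: stored directly */
        fetch(&before);
        fetch(&after);
        return 0;
    }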

Prescott conclusions

I won't say much about Prescott here, because I've already said everything I want to say about it in my most recent Prescott article. In short, the initial benchmarks for Prescott are very disappointing, and its power requirements are through the roof. Prescott's days are numbered, and it represents Intel's last major version of the Pentium 4's Netburst architecture.

General conclusions

So we've now followed the Pentium name from its beginnings in the original Pentium to its current, schizophrenic role as the brand name of two radically different architectures which embody two quite different approaches to personal computing performance. On the one side is Prescott, the final incarnation of an ambitious, commercially successful, but ultimately flawed architecture that reflects both the heady days of the gigahertz race and a supreme confidence in the onward march of Moore's Curves. On the other side is the Pentium M, which, with its roots in the venerable P6 and its status as Intel's Next Big Thing assured, is the once and future king of Intel's consumer product line.

Back in 1993 I would have been the last person to think that the peculiar "Pentium" name would endure for over a decade after its introduction. But endure it has, just like the P6 core that made it a household name. And in fact, the not-so-distant future may very well see the day that a multicore Pentium M derivative sports x86-64 support and supplants Itanium as the focus of Intel's 64-bit server efforts, thus bringing the Pentium name and the P6 core back into every segment of today's highly segmented computer market.

Revision History

Date Version Changes
07/26/2004 1.0 Release
