Digging Into the Core Story
Technology for technology's sake is fine if you're a rabid early adopter with more spare cash than you can carry. But the rest of the world takes a more practical view. Technology advances don't open wallets; the worthwhile and valuable application of new technology does.
In the processor space, we've seen a lot of incremental advances that earned headlines but ultimately proved lackluster as a customer investment. Most of us can think of instances when a new product release boasting more modern, convoluted technology actually underperformed its predecessor. So we're skeptical of big talk about new product generations, and you should be, too.

With Intel's Core architecture, though, the changes over the preceding NetBurst generation are so substantial and so positive that the company has given over its entire desktop, mobile, and non-enterprise server/workstation fleet to the new design. What is Core about, and what advances does it offer that you can turn into real-world benefits for customers? Let's dig in.

There's Core...and Then Core

Don't be fooled by the name Core Duo. The Yonah mobile processor that debuted in January 2006 is a dual-core derivative of the Banias/Dothan Pentium M processor, which in turn owes much of its lineage to the "P6" architecture dating back to the Pentium Pro. The first processors sporting the "Core architecture" are the Conroe (desktop Core 2 Duo), Merom (mobile Core 2 Duo), and Woodcrest (Xeon 5100 series). Certainly, the Core architecture shows much similarity to Yonah, but there are some very large differences based around resource sharing and processing efficiencies.

Like Yonah, the initial Core chips feature two processor units on one die built with the 65 nm fab process. And like its predecessor, Core 2 Duo will integrate Enhanced Intel SpeedStep Technology (EIST) and Execute Disable Bit support, a must-have in the age of thwarting malware. Where the new generation leaps ahead is in support for 64-bit x86 extensions (EM64T in Intel parlance), Virtualization Technology, and the long-anticipated LaGrande security technology. LaGrande is a series of advances based around middleware, client applications, and, most importantly, a trusted platform module (TPM) mounted on the motherboard.
LaGrande's end-to-end approach to security will prove far hardier in safeguarding PCs than previous methods and should quickly become a staple in business environments where user identity authentication is essential.

Power Revamp

The Core architecture is a major overhaul, not some incremental bus speed hop. We could fill this entire magazine with flattering minutiae comparing Core to prior generations and its present competitors. In the big picture, though, what you need to know and convey to prospective customers boils down to a handful of points, all of which come under two headings: improvements to power conservation and speed increases.

To the outside world, NetBurst sure looked alive and well in early 2003. Northwood Pentium 4s had recently topped 3.0 GHz, and the 865 chipset was still fresh. When Centrino bowed that March, and the Banias Pentium M with it, the 27W mobile processor was widely admired for turning in solid benchmark scores in a thermal envelope small enough to enable quite compact thin-and-light designs. But nobody on the outside dreamed that little Banias, once refined and optimized, would prove so efficient that it would ultimately bury NetBurst and the Pentium brand along with it.
The following "Scalar Performance" chart relates power to performance (based on SpecInt2K benchmarking), taking the i486 as a baseline and removing the effects of shrinking fabrication processes. As you can see, the Cedar Mill Pentium 4 shows an 8X performance gain over the 486 but 38X the power consumption. In contrast, the Core Duo delivers 7.7X the performance at only 8X the power draw. By no small coincidence, you'll note that the Pentium M and Yonah power draws are approximately in line with the Pentium Pro, which, as mentioned above, provided the root source for today's Core architecture.

Compared to the NetBurst design, Core uses a substantially shorter pipeline. (In microprocessors, a pipeline is a sequence of execution stages through which each instruction travels, one stage per processor "cycle," sort of like a many-staged assembly line for CPU code.) Shorter pipelines that don't sacrifice execution performance are in part achieved through a major innovation in the Core architecture called macro-op fusion (see below). The upshot is that more work is getting done in fewer CPU cycles, or, said differently, Core delivers more performance per megahertz. Fewer cycles mean less power required to drive the chip. This is why a 65W Core 2 Duo (Conroe) chip can blow a high-end, 130W Pentium Extreme Edition chip out of the water.

But that's just for starters. Core goes substantially farther with power savings through several innovative means. Perhaps foremost among these is Fine-Grain Power Management (FGPM), which essentially lets some components go into a low-power idle state while others stay active. The norm up until recently has been either "sleep" or "active," as you might see in monitors or the "standby" mode Windows enters when all components are idle. ATI, for example, took this to a new level a few years ago in its mobility-slanted GPUs, allowing some parts of the graphics chip to stay active while others could fall into a sleep state for better power efficiency.
Intel now takes the concept to new highs with Dynamic Power Coordination, which introduces new logic within the CPU capable of monitoring power usage and thermal states at many "hot spots" within each processor core. The location with the highest reading is what gets reported to the outside system as the chip's overall temperature. Each of these processor sub-units is monitored and can be independently powered down and reactivated as needed with no negative impact on overall CPU performance.

Similarly, Core has the ability to monitor each processor unit and power down any that are not in use. In a dual-core chip, this could yield power savings of nearly 50% in a single-threaded application environment. As we progress into quad-core and higher chips, the savings could be even greater.

That said, note that Core power states aren't simply black or white, on or off. There are Halt, Stop Clock, Deep Sleep, and Enhanced Deeper Sleep modes each core can enter, moving into increasing levels of power conservation as conditions allow. This provides more granularity on top of the Enhanced Intel SpeedStep Technology (EIST) carried over from the Pentium M. By monitoring processor utilization, EIST throttles back core frequency and voltage levels at times when full horsepower is unnecessary. For instance, the 21W, 2.0 GHz Pentium M 755, when running in EIST's "battery optimized mode," dropped its voltage by nearly 25% to run at 600 MHz and 7.5W. The Core architecture's sleep states will prove even more dramatic.

While Core's advances in dropping power utilization within the CPU are impressive, an equally substantial line of advances is happening between the CPU/chipset and surrounding devices. This is part of Intel's Energy-Efficient System Architecture (EESA) initiative, of which Fine-Grain Power Management (FGPM) is perhaps the leading set of features.
Rather than requiring the entire system to be inactive before entering a low-power state, FGPM can render individual components idle after only a few seconds of inactivity. If there is no new data being sent to a display, for example, then the IGP is allowed to go into a low-voltage sleep state while the monitor refreshes from its own cache. (This assumes that the monitor supports self-refreshing, which none yet do as of this writing.) Similarly, self-refreshing audio buffers a large chunk of audio data so that the hard drive and other system components can sleep while the memory and audio circuitry provide uninterrupted playback. The wireless adapter isn't allowed to ping the chipset while the system is idle. These and similar measures may only come into play in the minutes before OS- and/or BIOS-level sleep commands take over, but the cumulative power savings can be substantial.

Why Core's Power Savings Matter

True, Core's predecessor, the Pentium M, was a mobile chip, and that environment was where its power savings shone brightest. Lower temperatures meant (and still mean) thinner, smaller, and lighter notebook form factors without sacrificing processing speed. Additionally, lower power consumption meant longer battery runtimes. In desktop and server settings, lower power consumption has nothing to do with battery life, but it still pays off. Small form factors remain important, both for multimedia-centric boxes, such as a Viiv set-top-style system, and for enabling higher density in 1U and 2U server racks. Lower thermals also beget less noise from system fans, which in turn means that less power is needed to dissipate system heat. Anyone who dislikes noise will want Core, especially for media center systems and corporate cubicle boxes. Environmentally minded users will laud some ultra-low voltage (ULV) Core SKUs for averaging under 2W during regular operation, and organizations with tens to hundreds of PCs in use will see the benefits on their power bills.
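The power figures quoted earlier can be checked with a few lines of arithmetic. The sketch below divides the chart's relative performance by its relative power draw for the two chips the article compares, then applies the textbook approximation that dynamic CMOS power scales with voltage squared times frequency to the Pentium M 755 example. That scaling rule is our assumption rather than anything Intel states here, and it ignores leakage and other fixed costs, so treat the result as a rough lower bound and not a measurement. The function and variable names are ours.

```python
# Figures quoted in the article, all normalized to the i486 (= 1.0).
CHART = {
    "Cedar Mill Pentium 4": {"performance": 8.0, "power": 38.0},
    "Core Duo": {"performance": 7.7, "power": 8.0},
}


def perf_per_power(performance: float, power: float) -> float:
    """Relative performance per unit of power; the i486 scores 1.0."""
    return performance / power


def dynamic_power_scale(voltage_ratio: float, frequency_ratio: float) -> float:
    """Assumed textbook model: dynamic power scales with V^2 * f.

    Leakage and other fixed power are ignored, so this is an
    approximation, not Intel's own power model.
    """
    return voltage_ratio ** 2 * frequency_ratio


efficiency = {name: perf_per_power(**figures) for name, figures in CHART.items()}
advantage = efficiency["Core Duo"] / efficiency["Cedar Mill Pentium 4"]

# Pentium M 755: 21W at 2.0 GHz, then 600 MHz with voltage down "nearly 25%".
eist_scale = dynamic_power_scale(voltage_ratio=0.75, frequency_ratio=600 / 2000)
eist_dynamic_watts = 21.0 * eist_scale

for name, value in efficiency.items():
    print(f"{name}: {value:.2f}x the i486's performance per unit of power")
print(f"Core Duo advantage over Cedar Mill: {advantage:.1f}x")
print(f"V^2*f estimate for the Pentium M 755 at 600 MHz: {eist_dynamic_watts:.1f}W")
```

On these numbers the Core Duo comes out roughly four and a half times as efficient as the Cedar Mill part. The V^2*f estimate lands near 3.5W, below the 7.5W the article quotes, which is what you would expect if a share of the chip's power does not scale with voltage and frequency.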
Performance Play

There is a surprising amount of overlap between Core architecture innovations that conserve power and those that increase system performance, a fact that might provide inspiration for many manufacturers regardless of their industry. The leading example of this with Core is the new macro-op fusion functionality. The similar yet different micro-op fusion debuted with the Pentium M, and the root of both features lies in how CPU instructions get executed. Many x86 commands (macro-ops) are decoded into microinstructions (micro-ops) at one end of the processing chain and then rejoined at the other end. Micro-op fusion ties two microinstructions from the same macro-op together in the proper order, allowing the CPU to handle them as a single command. This minimizes the delays caused by out-of-order executions. Core now steps in with macro-op fusion, which enables two common macro-ops to be fused into a single instruction. According to Intel, for every 10 macro-ops, two can be fused on average, so nine instructions do the work of ten, yielding a hypothetical 11% increase in execution efficiency (10/9 ≈ 1.11). These fusion steps are part of why Core can thrive with a shorter execution pipeline and save power in the process.

No less important is Core's revamp of communication across the memory bus, now dubbed Smart Memory Access. This is a critical point, as some industry experts maintain that, all other things considered, lack of an integrated memory controller in the CPU is Core's only remaining competitive disadvantage. What they may not understand is that Smart Memory Access offsets the latencies incurred by an external memory controller. Since the Pentium Pro, x86 CPUs have been able to handle instructions out of their proper order by placing a batch of instructions in a holding buffer for reordering. There are two types of memory instructions in this case: loads and stores. The more priority you can place on loads ahead of stores, the faster your performance.
As with most things, you want to start a job as early as possible, not right when it's needed, so getting a leg up on loads is one way to cover up memory latencies. The catch is that most CPUs won't place loads ahead of stores because the processor doesn't know whether an earlier store will touch the same memory location the load will try to address. Sometimes these conflicts apply, but often they don't. The trick is in discerning which you're dealing with so you know when it's OK to bump up a load in the queue. The cornerstone of Smart Memory Access is "memory disambiguation," which employs algorithms to predict when such load reorderings are safe. On paper, this approach yields up to a 40% improvement over prior designs. Real-world results will likely be less but still substantially better than competing alternatives.

Larger L2 cache sizes have been a path to higher performance for Intel processors for years, and Core ratchets this up yet again, at present intelligently sharing 4MB of cache between two cores. This is one of the new architecture's defining characteristics, as preceding dual-core designs (Presler, Dempsey, et al.) used separate caches for each core, which incurred a lot of extra latency as both L2s engaged in trying to figure out what data the other was holding. Core's intelligent management of a unified cache, called Intel Advanced Smart Cache, removes this problem. Each core can address up to 100% of the L2 cache, and if one core doesn't need much L2, the other core can utilize more than its fair share. Advanced Smart Cache has the additional benefit of allowing each core's L1 to be bi-directional, accessing the shared L2 or the other core's L1 as needed.

Moreover, Intel gave Core's L2 more bandwidth (now a 256-bit cache bus instead of 128-bit) and integrated "advanced prefetchers." Prefetching is nothing new, but the algorithms employed to predict what information will be needed from memory before it is actually called are what keep improving.
The more you can load instructions from the L1 and L2 caches sitting on the CPU die and not have to go out across the front-side bus into system memory, the better, so advanced prefetching analyzes the streams of data requested by the cores and speculates on further requests accordingly. There are two prefetchers in both the L1 and L2 caches that manage which fetches should go where for the most likely optimization of data placement. If you used your screwdriver four times in the last two minutes, odds are you'll want it sitting by your hand (L1), not nearby on the wall (L2), never mind all the way across the room (RAM).

Another Core improvement is Intel Advanced Digital Media Boost. This sounds more complex than it really is. As you may know, Streaming SIMD Extensions (SSE) instructions are integer and floating-point operations designed to help accelerate certain program tasks, especially those rich in graphics, audio, and video. We now have three generations of SSE instructions (SSE, SSE2, and SSE3) on top of the original MMX extensions designed for this purpose. Normally, a 128-bit SSE instruction takes two clock cycles to execute, running its first half in one cycle and its second half in the next. Intel Advanced Digital Media Boost simply doubles the execution rate, running all 128 bits of the instruction in a single cycle. The ramifications of this for Viiv PCs and similar entertainment systems are clear.
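The memory disambiguation idea described earlier in this section, predicting when a load can safely jump ahead of earlier stores, can be illustrated with a toy model. What follows is a minimal sketch under our own assumptions and not Intel's actual algorithm: the per-load 2-bit saturating counter, the threshold, the `Op` and `HoistPredictor` names, and the addresses are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    kind: str  # "load" or "store"
    pc: int    # hypothetical instruction address, used to index the predictor
    addr: int  # memory address the operation touches


@dataclass
class HoistPredictor:
    """Toy predictor: one saturating counter per load instruction.

    A counter at or above THRESHOLD means "this load has been
    conflict-free lately, so let it run ahead of earlier stores."
    """
    THRESHOLD = 2
    MAX = 3
    counters: dict = field(default_factory=dict)

    def predict_safe(self, pc: int) -> bool:
        return self.counters.get(pc, 0) >= self.THRESHOLD

    def update(self, pc: int, conflicted: bool) -> None:
        if conflicted:
            self.counters[pc] = 0  # a real conflict wipes out confidence
        else:
            self.counters[pc] = min(self.MAX, self.counters.get(pc, 0) + 1)


def run(windows, predictor):
    """Process reorder windows in order and tally what happens to each load."""
    stats = {"hoisted": 0, "waited": 0, "mispredicted": 0}
    for ops in windows:
        pending_stores = set()  # addresses of earlier stores in this window
        for op in ops:
            if op.kind == "store":
                pending_stores.add(op.addr)
                continue
            conflicted = op.addr in pending_stores
            if not predictor.predict_safe(op.pc):
                stats["waited"] += 1        # conservative: stay behind the stores
            elif conflicted:
                stats["mispredicted"] += 1  # hardware would have to redo the load
            else:
                stats["hoisted"] += 1       # load ran early and that was safe
            predictor.update(op.pc, conflicted)
    return stats


def window(i, alias=False):
    store_addr = 0x200 if alias else 0x100 + i
    return [
        Op("store", pc=0x00, addr=store_addr),
        Op("load", pc=0x10, addr=0x200),       # normally independent of the store
        Op("load", pc=0x20, addr=store_addr),  # always reads what was just stored
    ]


windows = [window(i) for i in range(6)] + [window(6, alias=True)]
stats = run(windows, HoistPredictor())
print(stats)
```

In this run the independent load waits twice while its counter warms up, is then hoisted four times, and is mispredicted once when the final window makes it collide with the store. The load that always reads the freshly stored address never earns enough confidence to be hoisted. Real hardware must also detect and repair the mispredicted case, which this sketch only counts.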
Of course, there are more predictable advances with Core, starting with some bus frequency boosts. Core 2 Duo (Conroe), for example, boosts the front-side bus to 1,066 MHz and the memory bus to 667 MHz, up from Pentium D's 800 MHz and 533 MHz, respectively. Mobile Core 2 Duo (Merom) keeps Yonah's 667 MHz FSB, but Woodcrest makes the leap all the way to 1,333 MHz, up from Paxville's 800 MHz.

Interestingly, Hyper-Threading is wholly absent from the Core line, although you can still find it on the NetBurst-based Dempsey Xeon 5000 series. The Web is abuzz with rumors about "reverse Hyper-Threading" and "Core Multiplexing Technology" that might replace HT, but the fact is that the initial Core parts all trounce their HT-enabled forebears, so customers shouldn't feel that they're losing anything by not having an HT bullet point on their spec list.

Why Core's Performance Gains Matter

The foundation of what constitutes performance is shifting right under our feet. The priority of absolute speed remains, but how that speed is achieved, and at what energy cost, is in flux. In 2003, a CPU frequency of 4.0 GHz seemed just around the corner, but a 3.0 GHz Core chip can run circles around what would have been 4.0 GHz with NetBurst. Beyond that, Core processors often sport an unprecedented amount of frequency headroom, which means that Intel platforms are the new haven for those who enjoy overclocking. Intel processors have always had stability appeal for corporations; now enthusiast consumers are learning that Core is the new "it" chip.

The one missing piece in this discussion is pricing. In years past, there was always a large price premium placed on Intel's flagship processor families. Now, Intel is launching top-performing technology at mainstream price levels. This means that budget-conscious consumers now have access to sterling media centers and killer gaming rigs. Laptops can at last achieve desktop-class performance with comfortable heat levels.
And corporations can buy workstations and servers with unprecedented efficiency and compelling ROI stats. (See our Bensley cover story for more on this point.) Any way you look at it, Core delivers more bang for the buck than any other processor technology available.
Copyright © 2006 RAM Magazine. All rights reserved.
Do not duplicate or redistribute in any form.