We all know the benefits of being a team player (assuming you get along with your team). There's only so much you can do by yourself, even if you're ultra-organized and eat your Wheaties. Take this article as a case in point. Chris Angelini and I each have our respective areas of expertise. I could have written the whole story by myself, but we were able to write a better article in less time by dividing up the work and staying in constant contact throughout the writing process.
Two heads can be and usually are better than one, and this design principle is at the, well, core of this year's hottest processor innovation: dual-core technology. In reality, though, dual-core processors are not new. IBM was busy racking up design kudos way back in early 2000 for its dual-core POWER4 server processor. The POWER4 was developed in a supercomputing context but seems nearly archaic in some regards by modern standards. The processor's clock frequency topped out at 1.9 GHz, and the I/O subsystem connected to the processor over a 32-bit GX Bus running at one-third of the processor's frequency. The two processor cores communicated with three separate 0.5MB L2 cache modules across a shared CIU crossbar. The technology could access 32MB of off-module L3 cache, and four processors could fit on a single multi-chip module, yielding eight processor cores on one POWER4 module, all communicating with one another at speeds up to 500 MHz. Meant for mid-level servers, the POWER4 supported a maximum of 32 processors and can still hold its own even against current server designs.
But the big news this year, of course, is dual-core offerings from Intel and AMD, which are just starting to ship. With processor frequencies now running into the brick wall of what the 90nm fabrication process will allow, packing more than one processor core into a CPU is the most economical way to make sure system performance continues to climb. (Interestingly, the POWER4 was made with IBM's 0.18 micron copper process, twice the feature size of today's 90nm, but was not restricted by adhering to an industry standard socket format.) Pundits were predicting that AMD would follow IBM's example even before the first "Hammer" designs became a market reality, but no one was sure how Intel would handle the transition.
Today, we have answers and products from both major CPU vendors. There has been a lot of hype around dual-core technology, though, and not all of it positive. This advance in processors has been called the biggest thing to happen to CPUs in a decade. Well, some people seem to say that about everything. Dual-core is a big deal, but it's been evolving for over half of a decade already, and it will likely be another year or two before it reaches its true potential for a broad audience.
But that doesn't mean dual-core is without real-world benefits to offer today. We didn't come here to advocate selling new technology for the technology's sake. That approach is a great way to destroy your credibility with customers. Rather, we spoke with a lot of people about the solutions dual-core is enabling. Our success in these talks was predictably hit and miss. Many software vendors have yet to give much thought to dual-core CPUs, even those in product areas that stand to benefit from them the most. But just because some vendors aren't approaching you touting how the marriage of their applications with this new hardware offers value for end-users doesn't mean those benefits and upgrade possibilities aren't there.
Let's dig into dual-core and see where this technology can best improve your product offerings.
All About the Thread Count
To really grasp the fundamental significance of dual-core CPUs and the multi-threaded processing they perform, we need to back up a couple of steps. There are plenty of ways to accelerate processor performance besides increasing the clock speed, including branch prediction, out-of-order execution, and caching. Modern super-scalar processors feature multiple parallel execution pipelines, allowing multiple instructions to be executed with each clock cycle, but those pipelines have to be fed from memory. Because fetching data from system memory incurs latency, cache helps keep instructions flowing to the processor core at a good clip.
Trouble sets in when things like "cache misses" and branch mispredictions happen. These necessitate data refetching and often result in the processor not executing instructions to its fullest potential or sitting idle altogether for one or more cycles. Additionally, with clock speeds now in the multiple gigahertz range, most applications simply don't demand enough processing power to strain a CPU's potential.
Conversely, some applications, particularly in the server space, are so compute-intensive that they swamp the capabilities of even the fastest processor and require two or more CPUs to handle the load. Symmetric MultiProcessing (SMP), where multiple processors in a computer share a common main memory, has been common with RISC machines all the way back into the '80s. The x86 world first saw SMP with a very few 486 designs, but the architecture didn't swing into its own until the Pentium Pro supported quad-CPU configurations for servers and workstations. Intel continued with dual-CPU designs for Pentium II and III, AMD dabbled with dual processors on its ill-fated Athlon MP effort, and the Intel Xeon MP line now accommodates up to eight CPUs.
In an SMP-based computer, the operating system (assuming it supports multiple processors) typically works to divide processing loads between the CPUs. This can be an extremely efficient process if the application running has been coded to utilize multiple processing "threads," or data processing streams. If a program is equipped to crunch data with two threads, one thread would be sent to each CPU in a dual-processor system. Software can also be written for four or even 64 parallel threads. If an application is not multi-threaded, it will run on only one CPU and leave the others idle.
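To make the idea concrete, here is a minimal Python sketch of one job split across two threads. The function names (`sum_range`, `threaded_sum`) and the workload are illustrative assumptions, not taken from any product discussed here. In the standard CPython interpreter, threads take turns on pure-Python computation, so read this as a picture of how work gets partitioned rather than as a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_range(bounds):
    """Worker: add up the integers in one slice of the overall job."""
    start, stop = bounds
    return sum(range(start, stop))


def threaded_sum(n, threads=2):
    """Split 0..n-1 into contiguous slices and hand one slice to each thread."""
    # Ceiling division so no numbers are dropped; at least 1 to keep range() valid.
    step = max(1, -(-n // threads))
    slices = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Each slice is processed by its own worker; partial sums are combined here.
        return sum(pool.map(sum_range, slices))


print(threaded_sum(1_000_000, threads=2))  # 499999500000
```

A program written this way gives the operating system two independent streams of work to schedule. A program that does everything in one loop gives it only one, no matter how many processors are installed.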
That said, if one has multiple single-thread applications running under a multi-processor-enabled operating system, each application can crunch on its own processor. (Actually, well-designed operating systems can run services and background operations on multiple processors, as well.) Intel seized on this idea and modified the early Xeon's Super-threading technology (each cycle in a processor crunches a different thread) into the Hyper-Threading (HT) found in Pentium 4 processors starting with the 2.40C GHz. Hyper-Threading takes certain areas of the processor and separates them into a second "logical" processor. These areas are typically going unused for any of several reasons. While there is only one physical processor, the operating system and, if applicable, applications see two logical processors. You can witness this at work in the Performance tab of Windows XP's Task Manager when running an HT Pentium 4. Hyper-Threading makes sense on the desktop not because consumer apps are multi-threaded (most are not) but because many people now multitask single-threaded apps.
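Software can ask the operating system the same question Task Manager answers. The short Python sketch below uses the standard library's `os.cpu_count()`, which reports logical processors, so a Hyper-Threading chip is counted by its logical processors rather than by its physical package. The helper name `logical_processors` is an assumption made for illustration.

```python
import os


def logical_processors():
    """Return the number of logical processors the OS reports (at least 1)."""
    # os.cpu_count() can return None when the count cannot be determined.
    return os.cpu_count() or 1


print(f"The OS reports {logical_processors()} logical processor(s)")
```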
"Hyper-Threading Technology works like an extra gear within the latest Intel Pentium 4 processor, enabling users and their PCs to do more things at once," Intel expounded in one Pentium 4 marketing glossy. "Examples of activities that benefit from Hyper-Threading Technology include burning a CD while editing home movies, playing a realistic PC game while burning a family photo album onto DVD, or preparing a Powerpoint sales presentation while running a virus scanning program."
Critics of Hyper-Threading often latch onto more extreme scenarios and say that in the real world no consumer ever tries to encode HDTV content while playing DOOM III. True enough. Even though you can do these things simultaneously, no prudent multimedia enthusiast or gamer would risk the hiccups associated with such a load on a single processor. But this ignores the types of mainstream multi-tasking Intel points out above. Those of us who remember how Norton Antivirus scanning in the background would cripple foreground applications can appreciate the help Hyper-Threading bestows today.
Hyper-Threading uses less than an extra five percent of the die area versus non-HT operation yet yields performance gains of around 15 to 30 percent, with the higher end of that range typically showing up in server applications running on Xeon chips. Hyper-Threading is far from perfect, though. Just because you have two logical processors doesn't mean you get double the performance. In fact, when it first arrived, Hyper-Threading received mixed reactions, and test results were not always flattering. However, Intel continued to refine the technology, and even the competition now admits its benefits.
"When Hyper-Threading came out, it didn't help things; it hurt things," says AMD spokesman and not bashful Intel critic Damon Muzny. "Through a lot of rigamarole and optimizations, though, Intel got support and Hyper-Threading actually helped them in the end."
Let's put it all in context, though. Muzny adds, "But if you just roll two processors out there, so long as the software is multi-threaded, there's nothing that needs to be done to take advantage of it. That's our approach. People say, 'Well, wouldn't your games be better with Hyper-Threading?' Yeah, but look how much better they are with another core."
This brings us back to SMP. One of the benefits of Hyper-Threading is that you get extra performance without creating increased strain on the memory bus, because you still are only operating on one physical processor. While it's true that this processor may spend periods waiting for more data to come in from the system memory, this problem is multiplied in SMP designs because each processor is trying to pull information from memory down a shared pipeline. If the memory is stalled, all processors sit around twiddling their thumbs.
Fortunately, there are other multi-processor designs, including Non-Uniform Memory Access, or NUMA. Rather than lump all memory banks into a central pool, NUMA assigns different memory banks to different processors. Memory accesses are performed independently and in parallel, and in many software situations this can greatly benefit system performance. Microsoft started adopting NUMA architecture for its server operating systems back in 2002.
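The difference between a shared memory path and per-processor memory banks can be pictured with a deliberately simplified model. The Python sketch below is an assumption-laden toy: it counts abstract "time slots," treats every memory request as costing one slot, assumes every NUMA access hits the processor's own local bank, and ignores caches entirely. The function names and the workload figures are hypothetical.

```python
def shared_bus_slots(processors, requests_each):
    """SMP toy model: every request crosses one shared path, one at a time."""
    return processors * requests_each


def numa_slots(processors, requests_each):
    """NUMA toy model: each processor drains its own local bank in parallel,
    so elapsed time equals a single processor's share of the requests."""
    return requests_each if processors > 0 else 0


# Hypothetical workload: 2 processors, 1,000 memory requests each.
print(shared_bus_slots(2, 1000))  # 2000
print(numa_slots(2, 1000))        # 1000
```

Real systems land somewhere between these two figures, since NUMA processors sometimes reach into another processor's memory and caches absorb many requests before they ever reach the bus.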
Also keep in mind the difference in how CPU designers can go about accessing system memory. This piece of the puzzle has emerged as one of the defining (and, in Intel's case, damning) separators between the primary CPU architectures.
"Putting the memory controller inside the CPU rather than out on the northbridge controller chip gives some performance advantages over technologies that don't do it that way," says Mark Tekunoff, senior technology manager for Kingston. "Because it doesn't require additional clock cycles to get the requests in and out of the CPU, out to the controller, out to the memory, and back. So there are certain latency advantages of having an on-board memory controller. That would be an advantage to any company. It just happens that AMD is the only one doing it right now."
With the new products from Intel and AMD, we have the latest evolution of multi-processor technology. Now, rather than having two processor cores on two dies, we have two processor cores on one die and thus on a single socket. In Intel's case, it's important to keep in mind Hyper-Threading operation because some (definitely not all) of the new dual-core processors incorporate HT technology. AMD appears more ready to ignore HT and focus on maximizing single-thread performance in each core.
With the basics of multi-threading now in mind, let's take a closer look at how both companies are implementing these concepts.
Divergent Dual-Core Strategies
Not so coincidentally, Intel and AMD are emerging with dual-core processors at roughly the same time, each claiming it got to the market first. They even launched with working product three days apart from each other. However, the two rivals are approaching dual-core from somewhat different angles.
AMD's perspective is that the server and workstation markets already run rampant with threaded software, all of which stands to benefit from the enhanced processing capabilities of a dual-core processor. The Opteron family is particularly receptive to such an evolutionary move because the existing single-core chips, powerful as they may be, only handle one thread at a time. Intel's Xeon, on the other hand, has its own tricks, including Hyper-Threading, for improving multi-tasked performance with a single processor, which AMD feels the need to counter. By going after high-end computing, Opteron stands to completely displace single-core solutions in the markets most interested in multi-processing.
On the other hand, Intel is looking toward the home and sees usage models whereby many applications are consuming processor resources simultaneously: the old "burn or crunch this while you play that" scenario. Consumers can use multiple cores to handle different tasks. In other words, even though most mainstream programs can't directly harness the power of two processing cores, several single-threaded applications running at the same time will still demonstrate an improvement. Unfortunately, it's particularly hard to quantify these gains to customers because existing benchmarks generally measure the time it takes to complete a single task rather than the time for several tasks executing concurrently. The best you can do in general is characterize the performance of a Hyper-Threaded or dual-core processor as "smoother" or "less likely to stutter when you open several applications simultaneously." At least until it enters the server and workstation markets, Intel is fighting an uphill battle to outshine single-core processors on the desktop.
That said, if you want to demonstrate the benefits of Intel dual-core to customers, I recommend setting up a Pentium Extreme Edition or Pentium D box against a similar frequency single-core Pentium 4 configuration. Set an antivirus program to conduct a full system scan after you've loaded the machines with enough files to take 10 or more minutes to examine. Then run Dr. DivX, Windows Media Encoder, or some similar encoder on a home video file that will take four or five minutes to convert. This is a common consumer scenario that will clearly illustrate the benefits of dual- over single-core.
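If you want a rough number to go with that demonstration, the same comparison can be scripted. The Python sketch below is a stand-in under stated assumptions, not a substitute for the real antivirus and encoder test: `busy_work` is a hypothetical CPU-bound loop playing the role of both jobs, and `time_foreground` is an illustrative helper that times one job while copies of it run in separate processes. The workload size is arbitrary.

```python
import multiprocessing as mp
import time


def busy_work(n):
    """CPU-bound stand-in for a scan or encode job: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_foreground(n, background_jobs=0):
    """Time one foreground job while `background_jobs` copies run in other processes."""
    workers = [mp.Process(target=busy_work, args=(n,)) for _ in range(background_jobs)]
    for w in workers:
        w.start()
    start = time.perf_counter()
    busy_work(n)  # the "foreground application"
    elapsed = time.perf_counter() - start
    for w in workers:
        w.join()
    return elapsed


if __name__ == "__main__":
    n = 2_000_000  # arbitrary workload size; raise it for a longer run
    alone = time_foreground(n)
    loaded = time_foreground(n, background_jobs=1)
    print(f"foreground alone: {alone:.2f}s, with background job: {loaded:.2f}s")
```

On a single-core machine the second figure should come out noticeably larger than the first, because the two jobs share one processor. On a dual-core machine the two figures should be much closer together.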