By William Van Winkle

No one argues against competition being good for the market. And lawsuits aside, few would say that we have anything but a duopoly in today’s processor market. Since the debut of Intel’s Core microarchitecture, though, AMD has had a tough run. For the first time since maybe 2002, Intel’s Xeon had a persuasive argument against AMD’s Opteron in the 1P/2P world, and AMD’s share gains started to slip. The purchase of ATI led to inevitable strains as the two companies learned to meld. The CPU price wars have been brutal to AMD’s bottom line. And now we have the company’s long-awaited shift to quad-core. Saying that AMD is betting the farm on this launch would probably be an overstatement...but not by much. Is AMD going to turn the tables again? And if it does, are you going to be along for the ride?

In mid-August, journalists received an odd postcard in their mailboxes from AMD. The little placard showed a fish-eye view of a lush theater, with twin phalanxes of empty seats and gold lighting everywhere illuminating rich, red curtains. The 3D text popping out of the image read: “THE MOST ANTICIPATED PREMIERE OF 2007”. The image is a good but risky one. In an era when satisfying sequels are hard to find (Star Wars: Episode I through III, the two Matrix follow-ons, and the third Shrek all spring to mind), building anticipation and setting buyers up for disappointment is dangerous.

AMD’s gala premiere is, of course, the launch of “Barcelona,” the company’s first fully unified quad-core processor. Barcelona, weighing in at 283mm² and 463 million transistors, will arrive under the Opteron brand, and a new, premium line of quad-core chips dubbed Phenom will hit the consumer sector next quarter.

No doubt, you already know the one-sentence pitch for Barcelona: The front-side bus (FSB) architecture is a dinosaur, so customers should buy the more efficient, scalable Opteron. In reality, there’s a lot of truth behind that statement. Most people who understand the inner workings of these chips would agree that AMD has the superior engineering. Intel, in turn, has been able to compete in part through brute-force clock speed and loads more on-die cache made possible through the company’s aggressive fabrication node (90nm, 65nm, etc.) schedule. So in a sense, AMD’s pitch boils down to this: Innovation and architecture beat brute manufacturing force.

If that sounds too simplistic, it is. As we’ve detailed here in the past, there’s much more to Xeon and the Core microarchitecture than simple brute force. But we’re not here to tell you which chip is better because the reality is that there is no “better.” Even in the dark days of 2003 and 2004, there were still reasons to buy Xeon, some of which had nothing to do with microprocessor architecture. However, at that time, the benefits of Opteron reached a wider audience, and the market share shift reflected that. Today, the scales of market appeal may again be tipping in AMD’s favor. The only way to know for sure, and thus to gauge how much of your own energy and resources you want to put behind the product line, is to roll up our sleeves and dig into the details of AMD’s quad-core overhaul.

A TWO-SIDED QUAD QUESTION

If you’ve followed the Intel story for the last year or so, you’re already familiar with the basic Core microarchitecture design: optimized processing pipeline; 65nm fab production; dual-core CPU dies; and a fat, 4MB pile of shared L2 cache for each core pair (in the E6000 series). Intel uses a multi-chip module (MCM) approach for the current line of Core 2 Quad and Core 2 Extreme chips. This means that two dual-core dies are placed side-by-side in the same package. They don’t share cache resources. Instead, when a core on Die 1 needs data and doesn’t find it in local L2 cache, for example, the request gets sent to the L2 in Die 2. Rather than just hop a few millimeters to the adjacent die, the request goes out through Die 1’s L2, into the front-side bus, down to the northbridge, into the memory controller, then back through the northbridge, up through the FSB, and into the L2 on Die 2...and then back again to complete the round-trip journey. Obviously, the latencies incurred in this process can be substantial not only in memory-intensive apps, where cache gets polled in advance of going out to system memory, but also in multi-processor settings, such as 4P or 8P systems.

In contrast, AMD embeds a memory controller in each modern processor, giving the chip a direct path to system memory, and uses its HyperTransport architecture, defined by AMD as a “high speed, lower power I/O bus,” for the processor’s other links. (AMD has used this integrated memory controller approach in every chip following the Athlon XP. Intel will finally adopt the design element with next year’s 45nm “Nehalem” microarchitecture.) Nothing radical here. In terms of linking core components, HyperTransport is a point-to-point connection, similar to the front-side bus in some ways, only with different mechanical specs. More importantly, in addition to linking various components on the motherboard, such as the northbridge and southbridge, HyperTransport can also connect CPUs directly to one another.
AMD brands this combination of integrated memory controllers and HyperTransport links between CPUs and I/O as Direct Connect Architecture (DCA), which is the foundation of Opteron’s five-year success story. With DCA, there’s no reason to run memory calls out to the chipset; processors send them directly to one another.

“Applications that are very cache-sensitive and clock-sensitive are probably going to perform better on an Intel platform because they’re going to have larger caches and higher clock speeds,” says John Fruehe, worldwide market development manager, server/workstation products at AMD. “Now, that being said, most of the applications that are memory- or I/O-intensive are going to run better on an AMD platform because of our integrated memory controllers, lower latency memory, and we have a more efficient system architecture. Memory tends to be the key for the vast majority of the applications.”
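
To make the difference between the two request paths concrete, here is a minimal Python sketch that simply counts link traversals. The component lists are taken from the descriptions above. Treating every hop as equal in cost is our simplifying assumption, and the names `FSB_PATH`, `DIRECT_PATH`, and `round_trip_hops` are hypothetical, so read this as a rough illustration rather than a latency measurement.

```python
# Hop-count comparison of the two request paths described in the article.
# Assumption: every hop costs the same. Real latencies differ per link.

# Path of a cache request between the two dies of an MCM quad-core over the FSB.
FSB_PATH = [
    "Die 1 L2",
    "front-side bus",
    "northbridge",
    "memory controller",
    "northbridge",
    "front-side bus",
    "Die 2 L2",
]

# With Direct Connect Architecture, processors talk to one another directly.
DIRECT_PATH = ["CPU 1", "CPU 2"]


def round_trip_hops(path):
    """Count link traversals for a request plus its reply along `path`."""
    one_way = len(path) - 1
    return 2 * one_way


fsb_hops = round_trip_hops(FSB_PATH)
direct_hops = round_trip_hops(DIRECT_PATH)
print(f"FSB round trip: {fsb_hops} hops, direct link: {direct_hops} hops")
```

The two paths are not a like-for-like comparison (one is die-to-die inside a package, the other CPU-to-CPU), but the count shows why the FSB detour draws so much of AMD’s fire.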
This is the heart of why Opteron decimated Xeon on performance from 2003 to 2005. In an era when single-core CPUs reigned and Intel’s NetBurst architecture had little going for it besides raw frequency, the latency savings of DCA in memory-intensive or multi-threaded settings won the day. Intel’s Core microarchitecture took some of the wind out of Opteron’s sails, leapfrogging its competitor primarily through a fab shrink, pipeline improvements, and the switch to a shared L2 cache for both cores in a 5100-series Xeon. The benefits of these improvements more than compensated for the FSB’s memory latencies.

Now we come to AMD’s quad-core architecture, the company’s first Opteron design produced on a 65nm process. AMD decided to forgo the MCM approach and unify all four cores into a single die, a design it calls “native quad-core.” The obvious advantage is that all four cores have equal ability to intercommunicate at full, on-die speeds. There’s no delay while one core pair signals out to core logic and back to chat with the other core pair.

Tied to this is how the L2 caches for each core pass data. Each core has a dedicated, 512KB L2 cache block. Under this (if you think in block diagram terms) is a 2MB L3 cache layer connecting all four L2 caches. “The L3 is holding all of the data that gets flushed from the L2 cache,” says Fruehe. “Each of the cores has an individual L2 cache, and then as information is flushed from the L2 caches, it’s deposited into the L3. Think of that as a holding tank for data that was recently used and might be needed again.”

So here we have an intriguing difference between AMD and Intel in their cache architecture decisions. Prior-generation, 90nm Opterons (including “Revision F”) used a 2x1MB L2 cache design, with 1MB of cache dedicated to each core. With Barcelona, AMD halves the size of each core’s L2, still yielding a total of 2MB on the processor, and adds a 2MB L3. (A 64KB L1 cache is also tied to each core.)
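
Fruehe’s “holding tank” description can be sketched as a toy simulation in Python. This is our own illustrative model, not AMD’s implementation: the class name `VictimHierarchy`, the least-recently-used eviction policy, the tiny capacities (counted in cache lines rather than the real 512KB and 2MB), and the choice to remove a line from L3 when a core pulls it back up are all assumptions.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy model: private per-core L2s that spill evicted lines into a shared L3."""

    def __init__(self, cores=4, l2_lines=2, l3_lines=4):
        # Capacities are tiny, illustrative line counts, not Barcelona's sizes.
        self.l2 = [OrderedDict() for _ in range(cores)]
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, core, addr):
        """Return where `addr` was found: 'L2', 'L3', or 'memory'."""
        l2 = self.l2[core]
        if addr in l2:
            l2.move_to_end(addr)  # refresh the line's LRU position
            return "L2"
        if addr in self.l3:
            del self.l3[addr]  # assumption: the line moves up to the requester's L2
            source = "L3"
        else:
            source = "memory"
        self._fill_l2(core, addr)
        return source

    def _fill_l2(self, core, addr):
        l2 = self.l2[core]
        l2[addr] = True
        if len(l2) > self.l2_lines:
            victim, _ = l2.popitem(last=False)  # flush the least recently used line
            self.l3[victim] = True              # and deposit it in the shared L3
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)     # L3 is full: oldest line is dropped


h = VictimHierarchy()
trace = [h.access(0, a) for a in ("A", "B", "C")]  # "A" gets flushed from core 0's L2
from_other_core = h.access(1, "A")                 # core 1 then looks for "A"
print(trace, from_other_core)
```

In this run, core 0 touches three lines with room for only two, so line "A" is flushed into the shared pool, and core 1 later finds it there instead of going out to memory.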
AMD believes that a shared L3 topped by smaller, dedicated L2 blocks is a better choice for long-term scalability, no doubt with an eye toward the eight-core designs of tomorrow. According to AMD, L3 cache allows you to share multi-tasked data across multiple processor cores from a single pool instead of having to go back and ping each of the individual cores.
The odd thing about this concept is that Intel’s L2 cache also acts as a “shared pool” under Core microarchitecture. The closer the cache is to the processor core, the faster the performance. Thus L1 is faster than L2, which is faster than L3. Intel seeks to overcome the burden of its FSB latencies through the use of 4MB of L2 on each die, making 8MB of L2 in each Core 2 Quad or Xeon 5300 processor.

If L3 as implemented in Barcelona is so great, why didn’t AMD use its 65nm fab shrink to make more of it? Good question. Some of it may have to do with manufacturing yield rates. Some of it may have to do with AMD’s desire to fit Barcelona into the same power/thermal envelope as Rev F. We do know that AMD’s current road map shows the 2008 quad-core Opteron, code-named Shanghai, featuring 6MB of L3 while still keeping 512KB of L2 per core.

No matter what the rationale, it’s clear that many buyers prefer AMD’s architecture choices. “Last October, I was talking to a system partner at an event,” says Fruehe. “He pulled me aside and said, ‘I’m having this interesting problem. I’ve got a Clovertown, a Woodcrest, and a Rev F. When I go to load my application, it loads in about 15 minutes on the Opteron Rev F. It loads in about 20 minutes on a higher clock speed Woodcrest. And I can’t get it to load on the Clovertown at all.’ The problem is that this application is very thread-sensitive, and it understands that there are a lot of cores in the platform, each of which has cache. And it understands that the quickest way to get data is not to pull it off the hard drive but to get it out of cache from these cores. With a front-side bus with multiple cores and MCM packaging, every time data in core #1 needed information from the other core, the information request had to do that whole front-side bus/memory controller round trip.
So all of the cache synchronization was actually happening over the FSB, and that was dragging the system down to the point that he couldn’t even load the application. He said the easiest work-around was to turn off some of the performance features in the platform so that the application turned stupid and didn’t realize there were other cores and stopped snooping for other caches.”

This is not the first time during Barcelona’s pre-launch that we’ve heard this story from AMD, which gives the tale something of an odd ring. For its part, Intel has now sold over one million of its MCM-based, quad-core processors and, according to Intel’s Shannon Poulin, has never heard a complaint of this sort.

“If the front-side bus is causing so much congestion, then why do Woodcrest and Clovertown outperform Opteron on all meaningful server benchmarks?” Poulin asks, turning a blind eye to the occasional independent benchmark tests Intel didn’t win, even against Rev F. “This appears to be a very convoluted way for AMD to argue implementation choices rather than focus on product availability, performance, and reliability. AMD has chosen HyperTransport; we have chosen a front-side bus. These are implementation details. No real customer is concerned with how a processor is implemented; rather they care about availability, performance, and reliability.”

“What our competitors fail to realize,” Poulin adds, “is that the best way to improve performance is to not have to go to system memory. Intel provides a large cache on all of our MP processors to alleviate this bottleneck. We have also used our expertise and ability to manufacture large cache to put a snoop filter on the chipset itself, which also improves performance and reduces the times you need to go out to system memory. The other ways to improve performance, using the analogy above, is to provide more memory lanes (channels) or increase the speed limit (higher frequency memory).”
And so we return to the eternal grousing match between the two competitors, both of whom have their solid and flawed arguments. But skepticism over Barcelona doesn’t start and end at Intel. A lot of buzz around the server chip held that yields would be low (at least at higher clock speeds), that a unified quad design would be more costly, and so on. The news from AMD now is that Barcelona yields are on par with dual-core yields, although, as of this writing in mid-August, launch models and pricing have yet to be announced.

Intel probably won’t make its move to native quad-core until late 2008 with Nehalem, which gives AMD time to show off the benefits of the architecture it fought so long and hard to develop ahead of the competition. If all proceeds throughout 2007 according to AMD’s plan, Opteron will have a sizable performance and power advantage. In an effort to gain market share while simultaneously working to reverse the difficult balance sheet conditions of recent quarters, the company will have an interesting set of decisions to make on pricing and margins. If the ATI Radeon HD 2000 launch is any harbinger, then we might expect mid- to upper-mid-level Opteron parts aiming to undercut Intel “equivalents,” with higher-end SKUs to follow later at higher margin levels.

Another intriguing possibility is that, once AMD has higher frequency Barcelonas in ready supply, the company may opt to counter Intel’s 45nm fab shrink, expected in late 2007 or early 2008, with its own multi-chip module implementation of lower-clocked Opterons. Intel would have faster frequencies and more cache, but AMD would beat Intel at its own game by featuring eight cores per processor. “We’ve never said that multi-chip is bad,” notes Fruehe. “We’ve said that multi-chip with a front-side bus or multi-chip where the two dies don’t talk to each other within the processor ultimately leads to less efficiency and scalability.
So with HyperTransport facilitating that die-to-die communication, you may see us pursue that in the future. It depends on where the market is going.”

An octo-core MCM design would strain AMD’s power envelope messaging, but with Barcelona’s worthy power savings enhancements, the communication capabilities of Direct Connect Architecture, and lower clock speeds, such a product just might have a market in mega-threading niches and moderate-performance virtualized servers.

BEYOND NATIVE

You wouldn’t know it from the headlines, but there’s more to Barcelona’s architecture revamp than just integrating four cores. A lot of this never gets airplay because it’s more technical than “four cores on one die!” but much of it is still worth noting. Just as Intel made several enhancements in how data gets retrieved and processed, AMD has made numerous updates to improve Barcelona’s performance.

While each core only gets half as much L2 cache as the Rev F models, the bandwidth of that cache has been doubled from 2x64-bit loads per cycle to 2x128-bit loads per cycle. This doubling of the shovel size per data fetch, if you will, shows up in other places too. SSE execution width jumps from 64 to 128 bits, which can help considerably with multimedia tasks. Instruction fetching doubles from 16 to 32 bytes per cycle. The floating point scheduler likewise widens from 36x64-bit operations to 36x128-bit operations. In a nutshell, all of this means more bits are moving in each clock cycle.

Note, however, that 128-bit floating point operations are not supported by most software to date. In a shift very reminiscent of when AMD debuted 64-bit extension support in an era of 32-bit applications and operating systems, it will fall to developers to recompile and update their code for 128-bit FP support before this feature yields any real-world benefits.
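
The “shovel size” arithmetic for the cache loads is easy to check. The short Python sketch below converts the quoted load widths into peak bytes per clock cycle. The helper name `bytes_per_cycle` is our own, and the result is a per-cycle peak only. It says nothing about sustained throughput, which also depends on clock speeds that AMD had not announced at the time of writing.

```python
def bytes_per_cycle(loads_per_cycle, bits_per_load):
    """Peak bytes delivered per clock for a given number and width of loads."""
    return loads_per_cycle * bits_per_load // 8


rev_f = bytes_per_cycle(2, 64)       # Rev F: 2 x 64-bit loads per cycle
barcelona = bytes_per_cycle(2, 128)  # Barcelona: 2 x 128-bit loads per cycle
ratio = barcelona / rev_f

print(f"Rev F: {rev_f} bytes/cycle, Barcelona: {barcelona} bytes/cycle ({ratio:.0f}x)")
```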
As with Core microarchitecture, Barcelona improves not only the core’s out-of-order load execution but also its branch prediction algorithms, allocating more physical resources for better prediction results. The memory controller now integrates a prefetcher for buffering reads, and prefetched data is stored in L1 rather than in L2, as it was under the previous “K8” core architecture. Memory addressing widens to 48 bits, which allows a maximum of 256TB of system memory. (Yes, terabytes.)
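
The 256TB figure follows directly from the address width, as this small Python check shows. It assumes byte-addressable memory and binary units (1TB taken as 2^40 bytes), and the function name `addressable_tb` is ours.

```python
def addressable_tb(address_bits):
    """Terabytes (binary, 2**40 bytes each) reachable with `address_bits` bits."""
    return 2 ** address_bits // 2 ** 40


max_memory_tb = addressable_tb(48)  # 2**48 bytes / 2**40 bytes per TB = 2**8
print(f"48-bit addressing covers {max_memory_tb}TB")
```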
Copyright © 2007 RAM Magazine. All rights reserved.
Do not duplicate or redistribute in any form.