|
|||||||||||||||||||||||||||||
| COVER STORY | |||||||||||||||||||||||||||||||||||||||||||||
|
![]() |
|
|||||||||||||||||||||||||||||||||||||||||||
Shanghai Noon |
|||||||||||||||||||||||||||||||||||||||||||||
We all know the great Western showdown clichés. Tumbleweeds. Sweat. People fleeing from the dusty streets as two gun-packing opponents face off under a merciless sun. Odds are good that one of them will be bruised and bleeding, barely able to stand. Somewhere, a soundtrack whistles a forlorn, warbling tune. Fingers twitch. No one breathes. It’s Shanghai time. |
|||||||||||||||||||||||||||||||||||||||||||||
Without question, 2008 was one of the most challenging years in AMD’s four decades. The long-anticipated launch of K10 architecture, better known as Barcelona, struggled from the outset for several reasons, many of which had more to do with manufacturing issues than the product’s fundamental architecture. Barcelona was and remains an excellent desktop and server processor. However, competition from Intel soon exposed the performance holes in AMD’s release. Saddled with a 65nm fab process, Barcelona struggled to scale in frequency and was limited in cache size, even though—like a little Western town waiting for gold to be found in the hills—the underlying design was capable of much more. At last, the promise of Barcelona is reaching fruition. AMD took its native quad-core flagship, applied a 45nm fab shrink to it, made several important additions, and code-named the results Shanghai. We know that Barcelona left a bad taste in many mouths. From early availability problems to trouble with frequency scaling, Barcelona was often the microarchitecture that could’ve, should’ve, and didn’t. (Of course, Barcelona priced right in the low- and mid-range markets can still be a screaming good value.) Shanghai now fulfills the Barcelona promise. If you and your customers blew off Barcelona, try shifting to Shanghai. You’ll find it’s well worth the time. Barcelona Benefits For better or worse, though, all other things weren’t equal. Intel has almost always had the benefits of faster clock speeds and much larger L2 caches to help mitigate its front-side bus latencies. So ultimately, which architecture was better for the customer often boiled down to which resources the application emphasized. If the app relied heavily on frequency speeds and cache, Intel usually won. If the app leaned more on memory, the needle often shifted to Barcelona because the CPU had a direct connection to RAM through a memory controller embedded in the processor. Historically, Intel always required more memory delays because RAM could only be accessed through the shared front-side bus and the northbridge’s memory controller. This is one reason why AMD has had so much success with heavily virtualized servers, which require very fast memory performance.
AMD’s more flexible alternative to the front-side bus is called HyperTransport, and it’s been in use since the Athlon XP. Whereas the front-side bus only connects CPUs to the northbridge, HyperTransport interlinks CPUs, core logic components, and memory. Taken together, AMD calls this web ofHyperTransport links Direct Connect Architecture (DCA), and it’s the foundation of Opteron’s rocket ride to success in the first half of this decade. The pre-Barcelona, dual-core Opteron Revision F used a 64KB L1 cache in each core backed by a 2x1MB L2, meaning 1MB of L2 cache for each core. You probably know that the purpose of cache is to keep recently or soon-to-be-needed (as guessed by algorithmic prediction) memory data as close to the processor cores as possible so as to avoid lengthy seeks to RAM. The smaller the cache, the faster it can be searched, which is why CPU designers use multiple cache tiers. When the CPU has a memory request, it first searches the L1. If there is no hit for the data in L1 (better known as a cache “miss”), the search proceeds to L2, and so on, all the way out to system memory if necessary.
We want to review these points for two reasons. First, to lay some of the groundwork for the features that all carry forward into Shanghai. Second, to remind you that Barcelona remains a very strong offering. The chip may not have lived up to everyone’s wildest expectations, but that doesn’t mean that Barcelona still doesn’t have a lot of value to offer. As AMD was planning its move from 65nm to 45nm, it had to weigh several considerations. The change was not only about getting to 45nm, but also selecting an approach that could apply well beyond 45nm and recycle manufacturing assets. When it costs anywhere from $1 billion to $4 billion to build a fab plant for a new process technology, you want to get the most bang for your buck. For example, one Intel presentation from September 2007 (“Intel Silicon & Manufacturing Update”) notes than the theoretical percentage of fab equipment that could carry forward from 90nm to 65nm was about 90 percent; the move from 65nm to 45nm stepped up to roughly 95 percent. Contributing to the economy of Intel’s 45nm shift was the fact that the company persisted in using 193nm dry lithography for its critical layers. Migrating to immersion lithography, a technique Intel is already using on its early 32nm SRAM chips, would have mushroomed critical layer lithography costs by over 25 percent.
However, AMD has a different set of goals and needs at this point, and the decision was made to bite the bullet and adopt immersion lithography now. This entails injecting water between the projection lens and the wafer’s dye layer. This focuses the projection by about 40%, effectively narrowing an already ultra-narrow light beam and enabling a smaller fabrication node. Despite Intel’s presentation, AMD insists that immersion lithography allows it to accomplish in a single pass what normally takes Intel two passes with 45nm fabrication, so it’s ultimately a more cost-effective approach. The shrink to a smaller fab node usually gives manufacturers two possible benefits. They can either scale to faster frequencies or improve power efficiency. With Shanghai, AMD opted for a middle path. At system idle, Shanghai draws about 35% less power than Barcelona. At a system level, that means about 8% less power. In large deployment applications, such as storage farms or cloud computing, where users typically have massive clusters of systems that may have high activity during the day but have limited activity overnight, this kind of power conservation can be very compelling. Meanwhile, at a time when Shanghai had been widely expected to debut at 2.4 GHz, AMD is announcing launch SKUs up to 2.7 GHz and expects to scale higher quickly. “The technology and manufacturing team have done incredible work in bringing up 45nm silicon,” writes AMD’s Randy Allen, senior vice president, Computing Solutions, in a memo allegedly leaked through Polish site PCLab.pl. “In fact, the silicon was so healthy and the process so mature that this is the fastest AMD Opteron processor that has gone from first wafer to production parts. Our leading-edge immersion lithography technology helps enable us to deliver dramatic performance and performance-per-watt gains, second only for an AMD processor to the initial transition to AMD dual-core. Our original plan of record for ‘Shanghai’ was to launch at 2.4 GHz in the 75-watt ACP thermal band. We have been able to significantly exceed that frequency target, and the parts are drawing much less power at both full load and idle than we originally expected. We believe many industry watchers will be pleasantly surprised with what they see from AMD at launch.”
About Those Thermals Now, recall that Barcelona had a 2MB shared L3 and Nehalem (in the Core i7 version) has an 8MB L3. Intel opted to make its L3 inclusive, meaning that the contents of every L1 (32KB) and L2 (256KB) gets replicated in the L3. The purpose of this is to save on snoop traffic. When core #1 wants a piece of memory data, it first polls its own L1, then L2. If a miss occurs at L2, a search of an inclusive L3 will reveal if any core’s cache holds the data; if not, the request goes out to system memory. With an exclusive L3, as Barcelona uses, core #1 must “snoop” the caches of all other cores after an L3 miss before proceeding out to system memory.
The HyperTransport link gets a big boost under Shanghai. Today’s Barcelona-based Opterons use 8.0 GB/sec HyperTransport. The debut Shanghai models will carry this forward, using multiple links to get the necessary bandwidth. But in the second quarter of 2009, look for Shanghai to slip into HyperTransport 3.0, doubling the link bandwidth to a maximum of 17.6 GB/sec. To return for a moment to Shanghai’s cache, AMD also introduces a new data integrity feature fetchingly called L3 Cache Index Disable, also due to come online in 2009 once select operating systems support the feature. The idea is that with more physical cache comes the potential for more physical cache errors. If the OS is continually doing ECC corrections on a certain section of L3 cache, Shanghai can automatically shut that area down. The L3 is divided into 16 sections, and AMD will allow up to two of those to be turned off. There will be a slight performance hit because the chip is losing part of its cache, but this will be counterbalanced by a reduction in error corrections. If the feature is implemented properly, the user should be able to detect the fault and have the CPU swapped when convenient. Meanwhile, L3 Cache Index Disable provides a slightly higher level of reliability over Barcelona. “When you combine all of this with the aggressive pricing Shanghai will offer,” notes AMD’s Allen in his leaked memo, “we like how we are positioned to go after the high-volume 2P server market in addition to further reinforcing our leadership position in the 4P and 8P market. Shanghai is planned to deliver enterprises unparalleled price/performance and set new performance records for the most critical and demanding server workloads. And just like AMD set the standard for power-efficiency in the datacenter starting in 2003 with the launch of the original AMD Opteron processor, we should do the same with what is driving much of the server growth today—virtualization.” In virtualization, a large part of the performance picture centers on how quickly a system can switch between virtual machines, or “worlds,” and how memory gets accessed is critical in this process. When an application tries to access a memory address, a memory management unit in the CPU monitors the process. If the page isn’t where it’s supposed to be, the memory management unit generates a page fault interrupt. At best, the operating system then tries to resolve the problem; at worst, the program crashes. On top of all this, Shanghai breaks new ground with something called tagged, or guest, TLB. The translation lookaside buffer is a component of the CPU’s memory manager designed to assist and accelerate virtual address translation. Tagged TLB functions much like an L2 cache for the primary TLB. Both RVI and the tagged TLB help to accelerate world switching. With Barcelona, though, AMD didn’t have enough space to cache all of the virtual-to-physical memory translations generated by RVI. Tagged TLB under Shanghai provides extra room to cache more of that translation string, so there are fewer memory lookups needed for virtual machine-based applications. All in all, AMD states that world switching times improve under Shanghai by up to 25% versus Barcelona.
“People are always cautious with their money, but in the current fiscal environment, they’re even more cautious,” says AMD’s Demski. “We view Shanghai as being a really low-risk upgrade from something people are already familiar with, whether it’s the dual-core or Barcelona. But they have a system they’re using. They can either go the Nehalem path, which is all brand new—new chip, new chipset, new memory, new platform, new software. Everything is brand new, and in some instances it may achieve some impressive performance gains. But you also have to rip up the floor plan of your IT infrastructure and redesign around Nehalem, whereas Shanghai is a very evolutionary approach. The software you use today is going to work just as well and better on Shanghai. The systems you have today, you don’t have to worry about your cooling or your power delivery. It’s all going to just plug in and work. Shanghai will provide very good gains over Barcelona but also provide those gains without causing any rift to the business.” |
|||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||
|
Copyright © 2008 RAM Magazine. All rights reserved.
Do not duplicate or redistribute in any form. |
|||||||||||||||||||||||||||||||||||||||||||||