   

 

 

Shanghai Noon
By DiaNna Rao

   
 
We all know the great Western showdown clichés. Tumbleweeds. Sweat. People fleeing from the dusty streets as two gun-packing opponents face off under a merciless sun. Odds are good that one of them will be bruised and bleeding, barely able to stand. Somewhere, a soundtrack whistles a forlorn, warbling tune. Fingers twitch. No one breathes. It’s Shanghai time.
   
 
Without question, 2008 was one of the most challenging years in AMD’s four decades. The long-anticipated launch of K10 architecture, better known as Barcelona, struggled from the outset for several reasons, many of which had more to do with manufacturing issues than the product’s fundamental architecture. Barcelona was and remains an excellent desktop and server processor. However, competition from Intel soon exposed the performance holes in AMD’s release. Saddled with a 65nm fab process, Barcelona struggled to scale in frequency and was limited in cache size, even though—like a little Western town waiting for gold to be found in the hills—the underlying design was capable of much more.

At last, the promise of Barcelona is reaching fruition. AMD took its native quad-core flagship, applied a 45nm fab shrink to it, made several important additions, and code-named the results Shanghai. We know that Barcelona left a bad taste in many mouths. From early availability problems to trouble with frequency scaling, Barcelona was often the microarchitecture that could’ve, should’ve, and didn’t. (Of course, Barcelona priced right in the low- and mid-range markets can still be a screaming good value.) Shanghai now fulfills the Barcelona promise. If you and your customers blew off Barcelona, try shifting to Shanghai. You’ll find it’s well worth the time.

Barcelona Benefits
At the end of a punishing year—OK, two years—for AMD, it’s easy to lose sight of all the things that made Barcelona an excellent product. For starters, there’s the unified quad-core design. AMD was never shy about criticizing Intel’s use of two dual-core dies in one CPU package rather than having a single, unified die with four cores on it. AMD long maintained that requiring core 1 in the left die to communicate with core 3 or 4 in the right die via a front-side bus connection that wended all the way to the northbridge and back was inefficient at best . . . which is true. Nobody disputes that, all other things being equal, a unified design is better.

For better or worse, though, all other things weren’t equal. Intel has almost always had the benefits of faster clock speeds and much larger L2 caches to help mitigate its front-side bus latencies. So ultimately, which architecture was better for the customer often boiled down to which resources the application emphasized. If the app relied heavily on clock speed and cache, Intel usually won. If the app leaned more on memory, the needle often shifted to Barcelona because the CPU had a direct connection to RAM through a memory controller embedded in the processor. Historically, Intel platforms incurred longer memory delays because RAM could only be accessed through the shared front-side bus and the northbridge’s memory controller. This is one reason why AMD has had so much success with heavily virtualized servers, which require very fast memory performance.

   
 
 

How AMD-V Stacks Up.
The collection of CPU optimizations known as AMD-V works with the hypervisor to accelerate the switching time between virtual machines.

AMD’s more flexible alternative to the front-side bus is called HyperTransport, and it’s been in use since the Athlon XP. Whereas the front-side bus only connects CPUs to the northbridge, HyperTransport interlinks CPUs, core logic components, and memory. Taken together, AMD calls this web of HyperTransport links Direct Connect Architecture (DCA), and it’s the foundation of Opteron’s rocket ride to success in the first half of this decade.

The pre-Barcelona, dual-core Opteron Revision F used a 64KB L1 cache in each core backed by a 2x1MB L2, meaning 1MB of L2 cache for each core. You probably know that the purpose of cache is to keep recently or soon-to-be-needed (as guessed by algorithmic prediction) memory data as close to the processor cores as possible so as to avoid lengthy seeks to RAM. The smaller the cache, the faster it can be searched, which is why CPU designers use multiple cache tiers. When the CPU has a memory request, it first searches the L1. If there is no hit for the data in L1 (better known as a cache “miss”), the search proceeds to L2, and so on, all the way out to system memory if necessary.
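The tiered search described above can be sketched in a few lines of Python. This is a toy model, not AMD’s implementation: the `lookup` function, the dictionary-based caches, and the sample addresses are all assumptions made for illustration.

```python
# Toy model of a tiered cache lookup (illustrative only; real caches use
# sets, ways, and hardware replacement policies, none of which appear here).

def lookup(address, l1, l2, ram):
    """Return (value, where_found), searching L1, then L2, then RAM."""
    if address in l1:                 # L1 hit: the fastest case
        return l1[address], "L1"
    if address in l2:                 # L1 miss, L2 hit
        value = l2[address]
        l1[address] = value           # keep a copy close to the core
        return value, "L2"
    value = ram[address]              # miss in both tiers: go to system memory
    l2[address] = value
    l1[address] = value
    return value, "RAM"


ram = {0x10: "a", 0x20: "b"}          # hypothetical addresses and contents
l1, l2 = {}, {0x20: "b"}

print(lookup(0x20, l1, l2, ram))      # found in L2 on first access
print(lookup(0x20, l1, l2, ram))      # now served from L1
print(lookup(0x10, l1, l2, ram))      # falls through to RAM
```

Each miss costs another, slower search, which is why keeping hot data in the smallest tier matters so much.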

With quad-core Barcelona, AMD kept 2MB of total L2 on the chip, dividing it into 512KB dedicated to each core. But Barcelona then added a shared 2MB L3. AMD believes that a shared L3 topped by smaller, dedicated L2 blocks makes for a more scalable, ultimately higher-performing design. AMD notes that L3 cache helps applications share multi-tasked data across multiple processor cores rather than circle back to ping each individual core. Apparently, this is another instance of AMD engineers beating Intel to the punch, because Intel’s Nehalem made a move very similar to Barcelona, downsizing the L2 and introducing a larger L3, bringing it to market a year and a half after AMD.

Beyond fundamental architectural changes, Barcelona introduced several new energy-saving features. First among these is split power planes, or Dual Dynamic Power Management, which established separate power feeds to the CPU cores and memory. Segregating these allowed CPU cores to enter lower power states while leaving full voltage and frequency to the memory and vice versa. On top of pre-existing PowerNow! Technology, Independent Dynamic Core Technology automatically adjusts the frequency of each core to lower the power draw on underutilized cores. Getting even more granular, AMD CoolCore technology could dynamically and quickly (within one clock cycle) turn off sections of each CPU die in order not to waste power on logic blocks that didn’t need to be active.

   
 

 

Minds of Their Own.
Each Barcelona (and Shanghai) core has the ability to modify its own frequency according to system demand. This results in far less power consumption under normal conditions.

We want to review these points for two reasons. First, to lay some of the groundwork for the features that all carry forward into Shanghai. Second, to remind you that Barcelona remains a very strong offering. The chip may not have lived up to everyone’s wildest expectations, but Barcelona still has a lot of value to offer.

A Shanghai Surprise?
It’s probably fair to call the first “Multi-Core” Opterons (meaning the original dual-core designs introduced in 2005) a revolution. Going from decades of single-core designs to an integrated dual-core is a pretty massive step. From there, going to a unified quad-core chip, Barcelona, is impressive if not quite as groundbreaking. So to say that Shanghai is anything more than an impressive evolution would be misleading. By and large, Shanghai is the same as Barcelona, only better. There are no major surprises, but the iterative improvements made across the board add up to a product with significantly more value. Let’s explore how and why Shanghai is better in order for you to illustrate the new processor’s merits to customers.

The 45nm Move
A fab shrink is a costly, critical affair. Get it wrong, and the resulting yield rates will plummet. In general, it becomes more difficult to achieve each successive node shrink because coming up with the optics required to perform optical lithography on masks with feature sizes near to or less than the size of the light wavelength can be extremely challenging. It’s like trying to use your fingertip to draw a line smaller than the width of your finger, with each line successively thinner than the last.

As AMD was planning its move from 65nm to 45nm, it had to weigh several considerations. The change was not only about getting to 45nm, but also selecting an approach that could apply well beyond 45nm and recycle manufacturing assets. When it costs anywhere from $1 billion to $4 billion to build a fab plant for a new process technology, you want to get the most bang for your buck. For example, one Intel presentation from September 2007 (“Intel Silicon & Manufacturing Update”) notes that the theoretical percentage of fab equipment that could carry forward from 90nm to 65nm was about 90 percent; the move from 65nm to 45nm stepped up to roughly 95 percent. Contributing to the economy of Intel’s 45nm shift was the fact that the company persisted in using 193nm dry lithography for its critical layers. Migrating to immersion lithography, a technique Intel is already using on its early 32nm SRAM chips, would have mushroomed critical layer lithography costs by over 25 percent.

   
 

Who Says Water and Electronics Don’t Mix?
By using a water layer to further focus a light beam, immersion lithography helped AMD migrate from 65nm down to 45nm fabrication.

 
   

However, AMD has a different set of goals and needs at this point, and the decision was made to bite the bullet and adopt immersion lithography now. This entails injecting water between the projection lens and the wafer’s dye layer. This focuses the projection by about 40%, effectively narrowing an already ultra-narrow light beam and enabling a smaller fabrication node. Despite Intel’s presentation, AMD insists that immersion lithography allows it to accomplish in a single pass what normally takes Intel two passes with 45nm fabrication, so it’s ultimately a more cost-effective approach.

“This is something we’ve been working on in conjunction with IBM for a number of years now,” says AMD’s Steve Demski, product manager, Server and Workstation Division. “But AMD will be the first company to use immersion lithography in mass production. You don’t need immersion lithography to do 45nm, but it is absolutely a physical requirement to use it to get to 32nm. So in one sense, we have a lead over Intel on that next-generation development.”

The shrink to a smaller fab node usually gives manufacturers two possible benefits. They can either scale to faster frequencies or improve power efficiency. With Shanghai, AMD opted for a middle path. At idle, a Shanghai processor draws about 35% less power than Barcelona. Measured across the whole system, that works out to about 8% less power. In large deployment applications, such as storage farms or cloud computing, where users typically have massive clusters of systems that may have high activity during the day but limited activity overnight, this kind of power conservation can be very compelling. Meanwhile, at a time when Shanghai had been widely expected to debut at 2.4 GHz, AMD is announcing launch SKUs up to 2.7 GHz and expects to scale higher quickly.
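The two idle figures above also hint at how much of a system’s idle power the processors account for. The arithmetic below is a back-of-the-envelope sketch that assumes the rest of the system’s idle draw stays the same; the function name and the derived share are illustrative, not AMD figures.

```python
def system_savings(cpu_share, cpu_savings):
    """Fraction of total system power saved when only the CPU share shrinks."""
    return cpu_share * cpu_savings

cpu_savings = 0.35       # Shanghai vs. Barcelona at idle (from the article)
system_saving = 0.08     # system-level figure (from the article)

# Implied share of idle system power drawn by the CPUs (derived, not quoted).
implied_cpu_share = system_saving / cpu_savings
print(f"Implied CPU share of idle system power: {implied_cpu_share:.0%}")
```

Under that assumption the processors come out at a little under a quarter of idle system power, so the bigger a customer’s idle fleet, the more the 8% adds up.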


“The technology and manufacturing team have done incredible work in bringing up 45nm silicon,” writes AMD’s Randy Allen, senior vice president, Computing Solutions, in a memo allegedly leaked through Polish site PCLab.pl. “In fact, the silicon was so healthy and the process so mature that this is the fastest AMD Opteron processor that has gone from first wafer to production parts. Our leading-edge immersion lithography technology helps enable us to deliver dramatic performance and performance-per-watt gains, second only for an AMD processor to the initial transition to AMD dual-core. Our original plan of record for ‘Shanghai’ was to launch at 2.4 GHz in the 75-watt ACP thermal band. We have been able to significantly exceed that frequency target, and the parts are drawing much less power at both full load and idle than we originally expected. We believe many industry watchers will be pleasantly surprised with what they see from AMD at launch.”

   
 

 

   

About Those Thermals
You may have noticed AMD starting to use a new power metric over the past year called Average CPU Power, or ACP. The conventional Thermal Design Power (TDP) metric measures the maximum amount of power a computer’s cooling system can be required to dissipate when running real-world applications. According to AMD, ACP measures “processor power draw on all CPU power rails while running accurate and relevant commercially useful high utilization workloads.” The ACP spans power draw from the cores, memory controller, and HyperTransport links. To illustrate power draw situations, AMD states that ACP includes workloads such as TPC-C, SPECcpu2006, SPECjbb2005, and STREAM.

“The results across the suite of workloads are used to derive the ACP number,” noted AMD’s Brent Kirby, author of the company’s ACP white papers, to DailyTech last December. “The ACP value for each processor power band is representative of the geometric mean for the entire suite of benchmark applications plus a margin based on AMD historical manufacturing experience.”
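Kirby’s description amounts to a geometric mean plus a safety margin. The sketch below shows that calculation in Python with made-up numbers: the per-workload wattages and the 10% margin are hypothetical placeholders, since AMD’s actual measurements and margin are not given here.

```python
import math

def acp_estimate(workload_watts, margin=0.10):
    """Geometric mean of per-workload power draw, padded by a margin.

    `margin` is a hypothetical value; the source does not state AMD's.
    """
    watts = list(workload_watts)
    log_sum = sum(math.log(w) for w in watts)
    geo_mean = math.exp(log_sum / len(watts))
    return geo_mean * (1 + margin)

# Hypothetical measurements in watts, one per benchmark in the suite.
samples = {"TPC-C": 62.0, "SPECcpu2006": 70.0, "SPECjbb2005": 66.0, "STREAM": 58.0}
print(round(acp_estimate(samples.values()), 1))
```

A geometric mean is less swayed by one unusually hungry workload than a plain average, which may be why it suits a figure meant to describe typical heavy use.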

Citing workloads from synthetic benchmarks seems a bit odd for a real-world metric, but the need for a consistent load in order to obtain replicable results makes sense. AMD says that ACP is a superior metric for gauging high-load work environments like datacenters.

On the other hand, we expect to see more apps like GridIron’s Nucleo Pro 2 reach the market. This background video rendering utility gains its fame from being able to approach 100% utilization across all cores without impairing foreground application performance. Given that, we would caution resellers not to ignore TDP specs, which will be more applicable in top-utilization scenarios. AMD has indicated that it will provide both numbers on its CPU products.

The TDP profiles for Shanghai remain unchanged from Barcelona, with the increase in cache circuitry and frequency more or less counterbalancing the energy savings from the fab shrink. This similarity lets us come up with a quick cheat sheet for translating ACP and TDP: 68W TDP equals 55W ACP, 95W TDP equals 75W ACP, and 120W TDP equals 105W ACP. For the sake of honesty and accuracy, be sure to keep these two metrics straight when comparing Shanghai against alternatives from both AMD and Intel.
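If you keep spec sheets in a script or spreadsheet, the cheat sheet above reduces to a small lookup table. The Python below is one way to hold it; the function names are our own, and only the three pairs listed in the paragraph are included.

```python
# TDP-to-ACP pairs as listed in the paragraph above (watts).
TDP_TO_ACP = {68: 55, 95: 75, 120: 105}
ACP_TO_TDP = {acp: tdp for tdp, acp in TDP_TO_ACP.items()}

def to_acp(tdp_watts):
    """Look up the ACP band for a listed TDP; raises KeyError otherwise."""
    return TDP_TO_ACP[tdp_watts]

def to_tdp(acp_watts):
    """Look up the TDP band for a listed ACP; raises KeyError otherwise."""
    return ACP_TO_TDP[acp_watts]

print(to_acp(95))    # 75
print(to_tdp(105))   # 120
```

Raising an error on an unlisted value is deliberate: it is safer than interpolating between bands that AMD never published.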

That all said, Shanghai improves on Barcelona’s power savings story in two key ways. First, there’s the benefit gained from the move to 45nm—a 35% savings at idle, as stated above. But AMD also saves up to 21% in comparison to Barcelona through another new feature called AMD Smart Fetch.

“Even as a CPU is doing work, there will be periods where it’s waiting for data or instructions to arrive from the system,” explains AMD’s Demski. “Basically, you can set this up in the BIOS. If the CPU is waiting for data—I think the default is 16 clock cycles or 16ns—it’ll flush the contents of the L1 and L2 cache out to the L3, then it’ll shut down the core along with its L1 and L2.”

   
 

Finding Balance.
AMD’s Smart Fetch blends performance with power savings by flushing the L1 and L2 caches into L3 so that their contents stay available but the CPU core tied to them can power down when not needed.

 

Now, recall that Barcelona had a 2MB shared L3 and Nehalem (in the Core i7 version) has an 8MB L3. Intel opted to make its L3 inclusive, meaning that the contents of every L1 (32KB) and L2 (256KB) get replicated in the L3. The purpose of this is to save on snoop traffic. When core #1 wants a piece of memory data, it first polls its own L1, then L2. If a miss occurs at L2, a search of an inclusive L3 will reveal if any core’s cache holds the data; if not, the request goes out to system memory. With an exclusive L3, as Barcelona uses, core #1 must “snoop” the caches of all other cores after an L3 miss before proceeding out to system memory.

This is one of those small but important differences that define which platform is better suited to a given application, depending on how it utilizes cache resources. The advantage of AMD’s exclusive L3 is that you get more cache to work with, and keep in mind that Barcelona and Shanghai both feature L2 caches twice the size of Nehalem’s. Shanghai delivers 8MB of total cache, and the system gets to use all of it, whereas Nehalem shaves off over 1MB of L3 (double this for eight-core designs) simply for data duplication. Moreover, Demski maintains that “snoop traffic is almost on the noise level in a two-socket system. It does get more appreciable as you go to four- and eight-socket systems.”
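The cache figures above are easy to check with a few lines of Python. The sketch uses the per-core Nehalem sizes quoted in this article; the 6MB Shanghai L3 is inferred from the 8MB total and the 512KB-per-core L2, and L1 is left out of the Shanghai total, so treat the result as illustrative.

```python
KB = 1024
MB = 1024 * KB
CORES = 4

# Nehalem (Core i7) figures as quoted in the article; inclusive L3.
nehalem_l1, nehalem_l2, nehalem_l3 = 32 * KB, 256 * KB, 8 * MB
duplicated = CORES * (nehalem_l1 + nehalem_l2)   # L1/L2 copies held in L3
l3_left_over = nehalem_l3 - duplicated           # L3 not spent on copies

# Shanghai: 512KB L2 per core; the 6MB L3 is inferred from the 8MB total.
shanghai_l2, shanghai_l3 = 512 * KB, 6 * MB
shanghai_total = CORES * shanghai_l2 + shanghai_l3   # exclusive: L2 + L3 add up

print(duplicated / MB)       # 1.125
print(l3_left_over / MB)     # 6.875
print(shanghai_total / MB)   # 8.0
```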

In any case, Smart Fetch represents a sort of hybrid between inclusive and exclusive L3. L1 and L2 data from any given core doesn’t get replicated into L3, but it can be migrated to L3 in order to shut down unneeded cores and not incur the power penalty of waking them up when their caches need to be snooped.
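As a rough picture of the Smart Fetch sequence, here is a toy Python model: a core counts idle cycles, and once it reaches a threshold its private cache contents move to the shared L3 and the core powers down. The `Core` class, `tick_idle` function, and sample cache lines are hypothetical; the threshold of 16 follows Demski’s recollection and is described as BIOS-configurable.

```python
class Core:
    def __init__(self, name):
        self.name = name
        self.l1_l2 = {}        # combined private cache contents (toy model)
        self.powered = True
        self.idle_cycles = 0

def tick_idle(core, shared_l3, threshold=16):
    """Advance one idle cycle; flush and power down once the threshold is hit."""
    if not core.powered:
        return
    core.idle_cycles += 1
    if core.idle_cycles >= threshold:
        shared_l3.update(core.l1_l2)   # migrate private lines into L3
        core.l1_l2.clear()
        core.powered = False           # core and its L1/L2 can now sleep

l3 = {}
core = Core("core0")
core.l1_l2 = {0x40: "x", 0x80: "y"}    # hypothetical cached lines
for _ in range(16):
    tick_idle(core, l3)
print(core.powered, sorted(l3))        # False [64, 128]
```

Because the lines now live in L3, another core that needs them can find them there without waking the sleeping core.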


Other Innovations
Some aspects of Shanghai are predictable steps up the standards ladder. For example, the integrated DDR2 controller, backed by Barcelona’s Memory Optimizer Technology (sub-division of memory channels, larger memory buffers, optimized paging algorithms, etc.), now hops from DDR2-667 to DDR2-800 support. The core prefetchers that shuffle data directly into L1 in order to decrease latency have been improved again in Shanghai, and the core probe bandwidth has doubled.

   
 

 

Four Cores Up Close.
This is Shanghai’s die shot. Note the increase of real estate devoted to cache memory.

The HyperTransport link gets a big boost under Shanghai. Today’s Barcelona-based Opterons use 8.0 GB/sec HyperTransport. The debut Shanghai models will carry this forward, using multiple links to get the necessary bandwidth. But in the second quarter of 2009, look for Shanghai to slip into HyperTransport 3.0, more than doubling the link bandwidth to a maximum of 17.6 GB/sec.

To return for a moment to Shanghai’s cache, AMD also introduces a new data integrity feature fetchingly called L3 Cache Index Disable, also due to come online in 2009 once select operating systems support the feature. The idea is that with more physical cache comes the potential for more physical cache errors. If the OS is continually doing ECC corrections on a certain section of L3 cache, Shanghai can automatically shut that area down. The L3 is divided into 16 sections, and AMD will allow up to two of those to be turned off. There will be a slight performance hit because the chip is losing part of its cache, but this will be counterbalanced by a reduction in error corrections. If the feature is implemented properly, the user should be able to detect the fault and have the CPU swapped when convenient. Meanwhile, L3 Cache Index Disable provides a slightly higher level of reliability over Barcelona.
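The selection logic behind a feature like this might look something like the Python sketch below. The 16 sections and two-section limit come from the description above; the ECC threshold, the function name, and the sample counts are hypothetical, and how the real hardware and operating system decide is not specified here.

```python
L3_SECTIONS = 16      # from the article
MAX_DISABLED = 2      # from the article
ECC_THRESHOLD = 100   # hypothetical: corrections before a section is retired

def sections_to_disable(ecc_counts):
    """Pick up to MAX_DISABLED sections whose ECC corrections exceed the threshold."""
    noisy = [i for i, n in enumerate(ecc_counts) if n > ECC_THRESHOLD]
    noisy.sort(key=lambda i: ecc_counts[i], reverse=True)   # worst first
    return sorted(noisy[:MAX_DISABLED])

counts = [0] * L3_SECTIONS
counts[3], counts[9], counts[12] = 500, 150, 900   # hypothetical counts
disabled = sections_to_disable(counts)
remaining = (L3_SECTIONS - len(disabled)) / L3_SECTIONS
print(disabled, remaining)   # [3, 12] 0.875
```

With two sections off, the chip keeps seven-eighths of its L3, which gives a sense of the size of the performance hit being traded for fewer corrections.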

“When you combine all of this with the aggressive pricing Shanghai will offer,” notes AMD’s Allen in his leaked memo, “we like how we are positioned to go after the high-volume 2P server market in addition to further reinforcing our leadership position in the 4P and 8P market. Shanghai is planned to deliver enterprises unparalleled price/performance and set new performance records for the most critical and demanding server workloads. And just like AMD set the standard for power-efficiency in the datacenter starting in 2003 with the launch of the original AMD Opteron processor, we should do the same with what is driving much of the server growth today—virtualization.”

In virtualization, a large part of the performance picture centers on how quickly a system can switch between virtual machines, or “worlds,” and how memory gets accessed is critical in this process. When an application tries to access a memory address, a memory management unit in the CPU monitors the process. If the page isn’t where it’s supposed to be, the memory management unit generates a page fault interrupt. At best, the operating system then tries to resolve the problem; at worst, the program crashes.

To deal with this issue, virtualization typically uses shadow page tables, meaning one page table visible to hardware and maintained by the hypervisor and one invisible to hardware used by the guest operating system. However, because of the extra processing involved in running the hypervisor, shadow page table faults can consume up to 75% of the hypervisor’s time. The optimizations AMD bakes into its modern processors (AMD-V), including Shanghai, virtualize the memory management unit and cache the mappings between the guest OS and physical hardware in order to drop virtualization overhead.

Another performance killer in virtualization is swapping. Swapping, or “world switching,” is the process of each virtual machine taking focus of the hardware for the split-second that it runs its instructions. It has to take control of the memory, flush out the data, load its memory, process commands, and then make way for the next virtual machine. Barcelona’s Rapid Virtualization Indexing (RVI), which carries forward into Shanghai, dedicates a given area in RAM solely to one virtual machine. This way a sort of tunnel can be punched straight through the hypervisor, allowing the virtual machine to directly address the physical resources and avoiding the latencies of caching and swapping.

   
 

Weighing Performance and Price.
AMD’s internal benchmarks on integer performance (traditionally an Intel strength) show Shanghai now at parity with its Harpertown competitor, and AMD has a notable price advantage.

 
   

On top of all this, Shanghai breaks new ground with something called tagged, or guest, TLB. The translation lookaside buffer is a component of the CPU’s memory manager designed to assist and accelerate virtual address translation. Tagged TLB functions much like an L2 cache for the primary TLB. Both RVI and the tagged TLB help to accelerate world switching. With Barcelona, though, AMD didn’t have enough space to cache all of the virtual-to-physical memory translations generated by RVI. Tagged TLB under Shanghai provides extra room to cache more of that translation string, so there are fewer memory lookups needed for virtual machine-based applications. All in all, AMD states that world switching times improve under Shanghai by up to 25% versus Barcelona.
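To see why extra room for cached translations cuts memory lookups, consider this small Python simulation. It is a generic sketch, not a model of Shanghai’s hardware: the entry counts, the least-recently-used eviction policy, and the access pattern are all assumptions chosen to make the effect visible.

```python
from collections import OrderedDict

def count_page_walks(pages, tlb_entries):
    """Count translations that miss an LRU-managed TLB of the given size."""
    tlb = OrderedDict()
    walks = 0
    for page in pages:
        if page in tlb:
            tlb.move_to_end(page)          # refresh recency on a hit
        else:
            walks += 1                     # miss: walk the page tables in memory
            tlb[page] = True
            if len(tlb) > tlb_entries:
                tlb.popitem(last=False)    # evict the least recently used entry
    return walks

# Hypothetical access pattern: a guest cycling through six pages, ten times.
pattern = list(range(6)) * 10
print(count_page_walks(pattern, tlb_entries=4))   # 60: every access misses
print(count_page_walks(pattern, tlb_entries=8))   # 6: only the first pass misses
```

Once the working set of translations fits, nearly every lookup is served from the buffer instead of memory, which is the effect AMD is after with the added capacity.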


Also keep an eye on Shanghai’s virtualization performance from a power standpoint. AMD’s internally run tests on the Barcelona 8360 SE versus the Intel X7350 found that, given a constant workload across the same number of virtual machines, Barcelona used roughly 20% less wattage. Shanghai will only improve on Barcelona’s performance-per-watt equation.

Shanghai...Draw!
In a gun fight, you can’t afford to miss. The Barcelona launch hurt AMD; there’s no getting around that. But the company has struggled and planned, taken a deep breath, drawn itself up to face its adversary, and is now ready to pull the trigger. AMD knows it can’t miss this time, and it shows. Shanghai was supposed to launch in the first quarter of next year; instead, we’re getting product shipping in November.


AMD has released a handful of benchmark results already, but this isn’t the place to go over them. Of course, they look good. So do Intel’s. They always do. We’re firm believers in testing both platforms and seeing which works best for all of your customers’ needs. A simple benchmark graph is not going to tell you that. There are always other pressing concerns to round out a value equation, not least of which is the client’s existing and future infrastructure.

   
 

 

Stay Tuned.
You can look forward to Shanghai’s successor, Istanbul, in 2009. The processor will feature six cores and carry forward all of Shanghai’s improvements.
   

“People are always cautious with their money, but in the current fiscal environment, they’re even more cautious,” says AMD’s Demski. “We view Shanghai as being a really low-risk upgrade from something people are already familiar with, whether it’s the dual-core or Barcelona. But they have a system they’re using. They can either go the Nehalem path, which is all brand new—new chip, new chipset, new memory, new platform, new software. Everything is brand new, and in some instances it may achieve some impressive performance gains. But you also have to rip up the floor plan of your IT infrastructure and redesign around Nehalem, whereas Shanghai is a very evolutionary approach. The software you use today is going to work just as well and better on Shanghai. The systems you have today, you don’t have to worry about your cooling or your power delivery. It’s all going to just plug in and work. Shanghai will provide very good gains over Barcelona but also provide those gains without causing any rift to the business.”

With Shanghai, AMD is back in the game, and it’s time to take notice. With Opteron SKUs available today and Phenom parts sure to follow shortly, there are many applications and customer segments waiting to take advantage of Shanghai’s new benefits. Try the new processor out in your back rooms, learn its many strengths, and impress your customers.


   
     
 
   
Copyright © 2008 RAM Magazine. All rights reserved.
Do not duplicate or redistribute in any form.