Larra-What?
Intel’s Dark Horse Goes for Graphics Gold
By John Martinez

 
 
If you’re a long-time reader of Reseller Advocate Magazine, you know we tend to cover hardware technology once it’s available—once it has been transformed into product that you can turn around and talk about with your customers. But today is special. Intel, the largest graphics vendor in the world, ironically shut out of the add-in card market for years, is making a bold push toward its own high-performance solution that it hopes will give AMD and NVIDIA a run for their money. We’re taking an early look at the architecture in question now that the company is starting to talk details.

 
 

Intel has been in this position before. Ten years ago, the company saw big potential in the AGP interface, enough that it was willing to build a graphics architecture specifically to take advantage of texturing over the bus. Unfortunately, the performance penalty associated with moving data over AGP versus onboard memory was high enough to render that card, the Intel 740, uncompetitive against rival designs. Ever since, Intel’s admittedly overwhelming success in graphics has come from its integrated logic: the low-end stuff that’s not much for gaming, yet manages surprising stability in day-to-day use.

Apparently, though, Intel is not satisfied with its spectator’s seat in the performance graphics market. More than a year ago, Intel quietly confirmed the existence of a project called Larrabee and hinted at a few of its attributes. Larrabee would be a “many-core” architecture with a number of in-order cores on a single die. It would incorporate a large L2 cache, too. But back then, nobody could really answer the million-dollar question: “What exactly is Larrabee being built to do?” At the time, we could only speculate. Now, Intel is giving us a more concrete idea of what it has in mind.

 



CPUs, Meet Graphics
In a move that really just confirms what we’ve expected all along, Intel is now talking about its Larrabee architecture, making it clearer than ever that the design is going to make a splash in the graphics world, regardless of whether it turns out to be faster than the best from AMD or NVIDIA. Of course, the biggest surprise of all is that Larrabee, the company’s cutting-edge foray into graphics, actually has its roots in ancient Intel history. The design employs an array of cores loosely based on the P54C, pre-MMX Pentium processor.

The ramifications of Intel’s decision will be widely felt if its architecture successfully makes the transition to product and then garners enough support from the software developer community. After all, Larrabee is based on an x86 architecture, unlike AMD’s Radeon or NVIDIA’s GeForce GPUs. But while the technology diverges from familiar graphics products, it’s also dissimilar from today’s most popular CPUs.

     
   
 

Round the Bus
In this early mock-up of a Larrabee configuration, a number of x86 cores with 256KB L2 caches communicate with each other over a 1,024-bit ring bus. Fixed-function logic is also included in the ring.

 
     

To begin, Larrabee’s x86 cores employ in-order execution, which means instructions are issued and completed strictly in program order. In contrast, Intel’s Core 2 Duo employs out-of-order execution. Whereas in-order designs are prone to stalls when the next instruction is not yet ready to be dispatched, the out-of-order architecture fills those gaps with later instructions that are ready. The trade-off is one of complexity. Because Larrabee is in-order, its cores can be made substantially smaller, allowing more of them on a die and raising aggregate throughput on parallel workloads.
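To make the stall behavior concrete, here is a toy sketch in Python. It is not a model of Larrabee or Core 2 Duo: it assumes a hypothetical machine that issues at most one instruction per cycle, and the instruction names and latencies are invented purely for illustration.

```python
# Toy single-issue machine: NOT a model of any real Intel core.
# Each instruction is (name, names_it_depends_on, latency_in_cycles).
# The program and its latencies are made up for illustration.

def simulate(program, out_of_order):
    """Return the cycle on which the last result becomes available."""
    done_at = {}             # instruction name -> cycle its result is ready
    pending = list(program)  # not-yet-issued instructions, in program order
    cycle = 0
    while pending:
        # In-order: only the oldest pending instruction may issue.
        # Out-of-order: any pending instruction whose inputs are ready may issue.
        candidates = pending if out_of_order else pending[:1]
        for instr in candidates:
            name, deps, latency = instr
            if all(d in done_at and done_at[d] <= cycle for d in deps):
                done_at[name] = cycle + latency
                pending.remove(instr)
                break  # single issue: at most one instruction per cycle
        cycle += 1     # if nothing issued, this cycle was a stall
    return max(done_at.values())

program = [
    ("load_a", [], 4),           # slow load
    ("use_a", ["load_a"], 1),    # must wait for load_a
    ("load_b", [], 4),           # independent of the first two
    ("use_b", ["load_b"], 1),    # must wait for load_b
]

in_order_cycles = simulate(program, out_of_order=False)
out_of_order_cycles = simulate(program, out_of_order=True)
print("in-order:    ", in_order_cycles, "cycles")      # 10
print("out-of-order:", out_of_order_cycles, "cycles")  # 6
```

In this sketch the in-order machine sits idle while the first load completes, whereas the out-of-order machine starts the second load in the meantime. The extra bookkeeping needed to find that second load is the kind of complexity an in-order core gives up in exchange for its smaller size.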

Intel thus embarked on a design experiment. How many Larrabee cores could it fit on a die with a size and power budget similar to the 45nm Core 2 Duo? The answer turned out to be 10. The chip, armed with a 4MB L2 cache and a vector processing unit in each core able to handle 16 32-bit operations per clock, could theoretically achieve 160 vector operations per clock (10 cores times 16 lanes) versus Core 2 Duo’s eight. Why is vector throughput so important? That’s what gives Larrabee so much floating-point muscle versus Intel’s own desktop processors. So, while Larrabee is in many ways CPU-like, it manages to cram 20 times more operations per clock into a comparable die.
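The arithmetic behind those figures is simple enough to check by hand, but here it is as a short Python sketch. The inputs are the numbers quoted above; the variable names are ours.

```python
# Back-of-the-envelope check of the per-clock figures quoted in the article.

LARRABEE_CORES = 10           # cores in Intel's design experiment
LANES_PER_CORE = 16           # 32-bit operations per clock, per vector unit
CORE2_DUO_OPS_PER_CLOCK = 8   # figure quoted for the whole Core 2 Duo

larrabee_ops_per_clock = LARRABEE_CORES * LANES_PER_CORE
advantage = larrabee_ops_per_clock / CORE2_DUO_OPS_PER_CLOCK

print(f"Larrabee: {larrabee_ops_per_clock} operations per clock")  # 160
print(f"Advantage over Core 2 Duo: {advantage:.0f}x")               # 20x
```

Keep in mind that these are per-clock numbers. Delivered performance would also depend on clock speed, which Intel has not disclosed, and on how well software keeps all 16 lanes busy.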

 

Larrabee In Numbers
While Intel’s anticipated graphics architecture draws from the Pentium design, it’s no facsimile. Back in 1994 when the P54C emerged, 600 nanometer manufacturing limited the number of transistors that would fit. Today, a 45nm process lets Intel employ 32KB L1 data and instruction caches (versus Pentium’s 8KB repositories) and a 256KB L2 cache (Pentium employed an external L2).

   
 

Larrabee Up Close
An individual core, attached to Larrabee’s ring bus, is in many ways similar to an old Pentium processor, with enhancements like multi-threading added to maximize resource utilization.

 
   

Additionally, Intel wraps Simultaneous Multi-Threading into Larrabee (think Hyper-Threading, as included in the Atom and new Nehalem processors). A single Larrabee core can work on four threads, whereas an entire Core 2 Duo handles just two, one per core. The new architecture is updated to include 64-bit extensions as well, something the Pentium never had at its disposal.

Move out beyond the individual core level and you get a better idea of how Larrabee processors will be arranged, even if Intel isn’t yet talking specifics when it comes to product configurations. Cores and memory will communicate over a 1,024-bit (that’s 512 bits in each direction) ring bus. Now, think back to AMD’s R520 GPU (it was ATI back in those days), which powered the Radeon X1800-series cards. That graphics processor employed a 512-bit ring bus to deliver lots of memory bandwidth with low latencies across the chip. With the RV670 GPU, AMD shrank the ring bus to 256 bits. And when it launched RV770, the ring bus had been completely replaced by a 256-bit hub approach.

The problem with the ring bus was its complexity, namely the number of transistors it consumed. But Intel’s decision to adopt a ring bus suggests the need for plenty of fast access to memory. Indeed, maintaining cache coherency and giving the cores access to blocks of fixed-function logic will likely put that bandwidth to good use. That said, there’s actually very little fixed-function silicon in the architecture. Intel is advocating a highly programmable model that can be handled almost exclusively by the x86 cores.

Texturing is the exception; that’s addressed by fixed-function logic able to perform standard operations like decompression and anisotropic filtering. The texture sampler is attached to the ring bus and communicates with the cores through L2 cache. Why go fixed-function when everything else is programmable? Without the sampler, Intel says filtering operations would take 12 times longer, and decompression would take 40 times longer.
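For a rough feel of what a bidirectional ring means in practice, the Python sketch below works out two things: how many bytes 512 bits per direction moves each clock, and how many hops separate two stops on the ring. The number of stops is hypothetical, since Intel has not disclosed product configurations, and the sketch assumes a message always travels in the shorter direction.

```python
# Rough ring-bus arithmetic. Only the 512-bit width comes from the article.
# STOPS and the shortest-direction routing rule are assumptions for illustration.

RING_WIDTH_BITS_PER_DIRECTION = 512  # 1,024 bits total, split across two directions

def bytes_per_clock_per_direction(width_bits=RING_WIDTH_BITS_PER_DIRECTION):
    """Bytes that one direction of the ring can carry each clock."""
    return width_bits // 8

def ring_hops(src, dst, stops):
    """Fewest hops between two stops on a bidirectional ring."""
    clockwise = (dst - src) % stops
    counterclockwise = (src - dst) % stops
    return min(clockwise, counterclockwise)

STOPS = 12  # hypothetical count of cores plus fixed-function blocks on the ring

print(bytes_per_clock_per_direction())                    # 64
print(ring_hops(0, 3, STOPS))                             # 3
print(ring_hops(0, 9, STOPS))                             # 3 (shorter the other way)
print(max(ring_hops(0, d, STOPS) for d in range(STOPS)))  # 6, worst case
```

Running traffic both ways halves the worst-case distance compared with a one-way ring, which is part of why a wide bidirectional bus can keep latency manageable as more cores are added.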

 




The Software Side of Larrabee: A Whole New World

Here’s where things get interesting. As a reseller, you can’t sell your customers hardware unless there is a compelling reason to own it. Remember the hardware PhysX cards AGEIA launched that’d purportedly open new worlds of realism in games? Those $200 add-in boards sounded exciting, but with a short list of apps actually supporting them, they were useless most of the time. And now NVIDIA enables the same functionality for free using the power of its GPUs.

     
   
 

The Beauty of Programmability
In this series of frame captures from the popular horror title F.E.A.R., we see that the graphics workload is constantly changing. Larrabee’s programmability helps overcome the limitations experienced by fixed-function architectures.

 
     

Well, Larrabee faces a similarly uphill battle. Fortunately, the war is being waged by Intel rather than AGEIA. Nevertheless, before hardware based on Larrabee is able to succeed, it needs to work on existing games (which means supporting DirectX and OpenGL) before developers start getting fancy by writing to the hardware directly. We won’t bore you with the specifics of how Intel will achieve compatibility with today’s rasterized 3D apps other than to say DirectX and OpenGL instructions are to be handled by a software renderer—a potential detractor from performance. Of course, the silver lining is that if anyone can develop the software tools needed to make Larrabee perform well with a run-time compiler, it would be Intel.

Certainly more exciting is the potential of Larrabee when ISVs start writing to the hardware itself using C. General-purpose GPU, physics processing, and HPC applications will all be possible as a result of the architecture’s massive floating-point horsepower.

Earlier this year we attended an event at NVIDIA’s headquarters to introduce its latest Tesla computing solutions. One of the company’s big messages at that event was how much better suited many-core processors are to the HPC world than multi-core processors. The example given was a 100 teraflop datacenter. According to NVIDIA, it’d take 1,429 servers armed with quad-core CPUs to achieve such a performance benchmark at a total cost of nearly $6 million. Using 1U Tesla configurations, each equipped with four of its add-in cards, it would only take 25 servers to achieve the same compute power for less than $400,000.
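NVIDIA’s example is easier to judge when broken down per server. The Python sketch below does that with the vendor’s own figures as quoted above. Because the costs were given only as “nearly $6 million” and “less than $400,000,” it treats those round numbers as approximations, so the cost ratio is a ballpark at best.

```python
# Per-server breakdown of NVIDIA's 100-teraflop example.
# All inputs are the vendor's claims as quoted in the article; the two
# cost figures are rounded approximations ("nearly" / "less than").

TARGET_TFLOPS = 100
CPU_SERVERS = 1_429
TESLA_SERVERS = 25
CARDS_PER_TESLA_SERVER = 4
CPU_COST_USD = 6_000_000     # "nearly $6 million"
TESLA_COST_USD = 400_000     # "less than $400,000"

cpu_gflops_per_server = TARGET_TFLOPS / CPU_SERVERS * 1_000
tesla_tflops_per_server = TARGET_TFLOPS / TESLA_SERVERS
tesla_tflops_per_card = tesla_tflops_per_server / CARDS_PER_TESLA_SERVER
server_ratio = CPU_SERVERS / TESLA_SERVERS
cost_ratio = CPU_COST_USD / TESLA_COST_USD

print(f"Quad-core server: about {cpu_gflops_per_server:.0f} gigaflops")
print(f"Tesla 1U server:  {tesla_tflops_per_server:.0f} teraflops "
      f"({tesla_tflops_per_card:.0f} per card)")
print(f"Servers needed:   about {server_ratio:.0f}x fewer")
print(f"Approximate cost: about {cost_ratio:.0f}x lower")
```

By these numbers, each Tesla server is credited with roughly 57 times the throughput of a quad-core CPU server. These are peak figures from a vendor presentation, not measured application results.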

Ironically, now it’s NVIDIA’s competition emerging with a plan to go many-core. Just as the GT200 and its 240 shader processors help power through software compiled with CUDA, so too will Larrabee offer even greater flexibility through what Intel is calling the Larrabee native interface.

 

 

Intel Making It Worth Your While
Before you see software developers go out of their way to program specifically for Intel’s native interface, the company is going to have to prove that Larrabee is not only here, but here to stay, just as NVIDIA is trying to do with CUDA. To that end, Intel says it is working closely with top ISVs to help hash out what they need Larrabee to do. For developers uninterested in optimizing for Larrabee, games will treat the hardware like any other DirectX or OpenGL graphics card. Those who do take a step further have full access to the core’s guts and can bend the architecture in any way they want. But it’ll take a concerted effort from Intel’s developer relations team to get the big names in entertainment behind Larrabee.

     
   
 

Larrabee’s Software Stack
Expect the Larrabee micro-architecture to be heavily dependent on software to achieve its best performance. Developers can use the design like any standard DirectX or OpenGL card or write directly to the hardware for even better flexibility.

 
     

Fortunately, the flexibility of Larrabee should help ease its entry into the highly competitive and fast-moving graphics market dominated today by AMD and NVIDIA. What sets the architecture apart is that, because Larrabee is built from fully programmable x86 cores, it can be adapted easily through software. For instance, adding support for the next version of DirectX is expected to be as straightforward as updating a driver. Of course, that raises another concern: Will Intel’s driver team be able to support the work its architects are doing now? If the company expects to compete against whatever powerful GPUs are around at the beginning of 2009, it’d better make sure Larrabee gets better software support than some of its integrated graphics cores have seen.


The Competition Mobilizes
Perhaps the most fervent critic of Larrabee is NVIDIA, which not at all quietly questions Intel’s hardware design principles, ability to rally software development, and strategy, given the potential for a processor with lots of floating-point horsepower to hurt its own CPU business. The two companies’ paths are coming closer than ever to crossing: NVIDIA leveraging its graphics architecture to challenge the clustered datacenter with Tesla, and Intel stepping away from integrated graphics to tackle the high-performance discrete market. No wonder NVIDIA is spooked.

In a recent email from the GPU vendor, NVIDIA sought to point out how Intel’s work on Larrabee validates its own many-core parallel processors, of which it says there will be 150 million by the time Intel is able to start shipping hardware.

AMD is also interested in having its graphics products handle general-purpose GPU workloads. The company’s FireStream boards are packaged just like NVIDIA’s Tesla—without display outputs. And AMD has its own software development kit, including Brook+ (an open-source variant of C), a Core Math Library, and a Performance Library optimized for video transcoding. AMD is also supporting OpenCL, a computing language developed by Apple.

It might be another year before we see an add-in card centering on Larrabee. By then, AMD and NVIDIA will no doubt have graphics architectures that put today’s GT200 and RV770 cores to shame in games. They may or may not have made headway into the still-young stream computing market. But either way, both graphics giants are feeling the pressure of what Larrabee could mean if it’s successful.

No doubt Intel faces significant challenges as it attempts to create a hardware architecture that can compete for graphics gold and the software infrastructure needed for Larrabee to realize its full potential. Nevertheless, this is a major project for Intel—one that the channel will want to keep an eye on as shipping product edges closer and closer. After all, this is something that’ll interest everyone, from gamers to SMBs to the enterprise folks in datacenters.



 
       
         
Copyright © 2008 RAM Magazine. All rights reserved.
Do not duplicate or redistribute in any form.