A long time ago, in a galaxy far, far away…
While AMD/ATI drew quite a lot of attention, we didn't get much more than periodic "Fermi" news from NVIDIA, at least on the computer gaming graphics front. NVIDIA promised a lot for Fermi, but boards were nowhere to be seen. We enter the Fermi architecture with the GeForce GTX 480 review you're reading today. A few highlights from this review:
- Some details of Fermi architecture
- Details of GF100 chip, based on Fermi
- GeForce GTX 480
- Performance comparisons
- DirectX 11 and CUDA/PhysX
- Meeting "Fehmi, the Mysterious Man"
You all remember the launch of the G80 boards of the GeForce family. NVIDIA says that Fermi is a leap forward like G80 was in its time. The new architecture is about more than adding muscle through a higher processing core count; at least, that is what NVIDIA highlights in its documents. By the way, we'd like to acknowledge NVIDIA's success in supplying material; I've been digging through numerous documents for days, and this article would be much less detailed if they hadn't shared all this information.
In NVIDIA's documentation, Fermi is described as "NVIDIA's Next Generation CUDA Compute Architecture". We usually mention the term CUDA in connection with its software side, but CUDA is not just a software framework; it is a complete architecture with both software and hardware. So Fermi's basic processing cores are called "CUDA cores".
Fermi's name comes from the Italian physicist Enrico Fermi, who is known for his work on the first nuclear reactor as well as on nuclear and particle physics and quantum theory. This isn't the first time NVIDIA has used the name of a great scientist; remember the previous architecture, Tesla, named after Nikola Tesla. We highly suggest you read about both scientists.
NVIDIA's Fermi is about general compute and programmability as well as a new generation of gaming graphics. In this respect, the Streaming Multiprocessors, which in turn consist of CUDA cores, have become more powerful and also more efficient at current graphics problems like tessellation. General programmability is increased to the point of full C++ support as well.
Let's see a Fermi processor diagram:
This is a full Fermi chip. The green boxes represent CUDA cores, the basic processing units. A full Fermi chip contains 512 of these, though the first Fermi-based GeForce chips do not contain as many yet. CUDA cores are grouped in banks of 32 and there are 16 of these banks: 32 x 16 = 512 cores. These banks of 32 CUDA cores are called Streaming Multiprocessors (SMs). We'll get into some detail about them shortly. The GPU contains six memory (controller) partitions and the memory interface is 384 bits wide, which means that each memory partition has a 64-bit interface. The block labeled "Host Interface" connects to the PCIe interface of your computer. The GigaThread block schedules thread blocks to the SMs or, to be more accurate, to the SM thread schedulers.
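The arithmetic behind these figures is easy to check with a few lines of Python. This is only a sketch of the numbers quoted above; the variable names are ours, not NVIDIA's.

```python
# Figures quoted in the text for a full Fermi chip.
CORES_PER_SM = 32        # CUDA cores in one Streaming Multiprocessor
SM_COUNT = 16            # Streaming Multiprocessors on a full chip
MEMORY_BUS_BITS = 384    # total memory interface width
MEMORY_PARTITIONS = 6    # memory (controller) partitions

# Total core count and the share of the memory bus each partition gets.
total_cores = CORES_PER_SM * SM_COUNT
bits_per_partition = MEMORY_BUS_BITS // MEMORY_PARTITIONS

print(f"CUDA cores on a full chip: {total_cores}")
print(f"Interface width per memory partition: {bits_per_partition} bits")
```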
There is a common (L2) cache in the middle. The blue parts under each SM represent the register file and L1 cache. The orange boxes are the SM thread scheduling and dispatch units.
All of this is implemented with 3.2 billion transistors, which lets us write the sentence "NVIDIA built the largest gaming graphics core ever" once again.
Let's inspect these SMs.
The boxes labeled "Core" are CUDA cores. They are fed by two warp scheduling and dispatch units. A warp is a group of threads in the CUDA architecture: an SM takes in parallel threads in groups of 32, and each such group of 32 threads is called a "warp". Having two of these units per SM is a two-fold increase over the previous architecture.
The CUDA core count per SM is 32. There were eight CUDA cores per SM in the previous architecture, which means the SMs of the new architecture have four times as many. At the bottom there is a 64 KB shared memory/L1 cache block. There's a reason for this terminology: this part can be configured in two different ways depending on the requirements of the running code:
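To make the warp grouping concrete, here is a minimal Python sketch of how a batch of threads falls into warps of 32. The function names are ours, and the sketch assumes threads are numbered consecutively from 0 and that a partly filled last warp still counts as a whole warp.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def warps_needed(thread_count: int) -> int:
    """Number of warps required to hold thread_count threads (rounded up)."""
    if thread_count < 0:
        raise ValueError("thread_count must not be negative")
    return (thread_count + WARP_SIZE - 1) // WARP_SIZE


def warp_of(thread_index: int) -> int:
    """Index of the warp that a given thread belongs to."""
    if thread_index < 0:
        raise ValueError("thread_index must not be negative")
    return thread_index // WARP_SIZE


print(warps_needed(256))  # a 256-thread block splits into full warps
print(warps_needed(33))   # one thread over 32 spills into a second warp
print(warp_of(40))        # thread 40 sits in the second warp (index 1)
```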
- 16 KB of L1 cache and 48 KB of shared memory
- 16 KB of shared memory and 48 KB of L1 cache
This ensures that code sensitive to L1 cache and code sensitive to shared memory both run with higher performance. By the way, every SM gets its own shared memory/L1 cache block. The GT200 architecture had 16 KB of shared memory per SM.
The second generation Parallel Thread eXecution ISA (PTX 2.0) builds programmability on top of these performance-oriented concepts. Previously there were three separate memory address spaces, which made C programming difficult because you couldn't determine at compile time which space a pointer would refer to. The unified address space brought by Fermi and PTX 2.0 enables full C++ support. Developers just use the unified address space, and Fermi's hardware address translation unit maps pointer references onto the correct address space.
Together with these come other features, such as ECC memory and IEEE 754-2008 double precision floating point calculations, which underline that Fermi targets more than gaming graphics: general compute, including scientific applications.
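The two splits can be written down as a small Python model. This is a sketch only: the configuration names and the selection rule in `pick_config` are hypothetical and are not NVIDIA's API; only the 64 KB total and the two 16/48 KB splits come from the text.

```python
SM_ONCHIP_KB = 64  # shared memory + L1 cache per SM, as stated in the text

# The two splits listed above, as (shared memory KB, L1 cache KB).
# The dictionary keys are made-up labels for this sketch.
CONFIGS = {
    "prefer_shared": (48, 16),
    "prefer_l1": (16, 48),
}


def pick_config(shared_kb_needed: int) -> str:
    """Pick a split that fits the requested shared memory.

    Hypothetical policy: give the code as much L1 cache as possible
    while still satisfying its shared memory requirement.
    """
    for name in ("prefer_l1", "prefer_shared"):
        shared_kb, _l1_kb = CONFIGS[name]
        if shared_kb_needed <= shared_kb:
            return name
    raise ValueError("no split offers that much shared memory")


print(pick_config(8))   # fits in 16 KB of shared memory, so L1 gets 48 KB
print(pick_config(32))  # needs the 48 KB shared memory split
```

Code that leans on shared memory ends up with the 48 KB shared split, while code with small shared memory needs leaves the larger part to the L1 cache.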
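The idea of a unified address space can be illustrated with a toy Python model. Everything concrete here is an assumption for illustration: the window base addresses and sizes are invented, the function name is ours, and the space names (local, shared, global) simply follow CUDA's usual terminology. The sketch only shows the principle that one pointer value is enough, because a translation step decides which memory space it falls into.

```python
# Hypothetical layout: these window addresses and sizes are invented for
# this sketch and do not describe Fermi's real address map.
SHARED_WINDOW = range(0x1000_0000, 0x1000_0000 + 48 * 1024)
LOCAL_WINDOW = range(0x2000_0000, 0x2000_0000 + 16 * 1024)


def resolve(address: int) -> tuple[str, int]:
    """Map a unified address to (memory space, offset within that space)."""
    if address in SHARED_WINDOW:
        return ("shared", address - SHARED_WINDOW.start)
    if address in LOCAL_WINDOW:
        return ("local", address - LOCAL_WINDOW.start)
    # Anything outside the two windows is treated as global memory.
    return ("global", address)


# The caller uses plain addresses and never names the space itself.
for pointer in (0x1000_0010, 0x2000_0000, 0x3000):
    space, offset = resolve(pointer)
    print(f"{pointer:#x} -> {space} (offset {offset:#x})")
```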