With AMD’s first-wave of Zen 4 CPUs now within the books with the Ryzen 7000 collection, the buyer arm of AMD is now shifting its consideration over to its graphics enterprise. In a presentation that ended moments in the past dubbed “collectively we advance_gaming”, Dr. Lisa Su and different AMD leaders laid out the way forward for AMD’s graphics merchandise. And that future is the RDNA 3 structure, which would be the foundation of the brand new Radeon RX 7900 XTX and Radeon RX 7900 XT video playing cards.
The 2 playing cards, set to be launched on December 13th, would be the first merchandise launched utilizing the RDNA 3 structure. In response to AMD, the brand new flagship 7900 XTX will ship as much as 70% extra efficiency at 4K than their earlier flagship, the 6950 XT. This efficiency enhance comes curtesy of a number of architectural enhancements in RDNA that cumulatively supply 54% larger efficiency per watt than RDNA 2, in addition to larger clockspeeds courtesy of TSMC’s 5nm (and 6nm) processes, and better general energy consumption.
The total-fledged RX 7900 XTX can be hitting the streets at $999. In the meantime the second-tier RX 7900 XT will run for $899.
AMD Radeon RX 7000 Collection Specification Comparability | ||||||
AMD Radeon RX 7900 XTX | AMD Radeon RX 7900 XT | AMD Radeon RX 6950 XT | AMD Radeon RX 6900 XT | |||
Stream Processors | 12288 (96 CUs) |
10752 (84 CUs) |
5120 (80 CUs) |
5120 (80 CUs) |
||
ROPs | ? | ? | 128 | 128 | ||
Recreation Clock | 2.3GHz | 2.0GHz | 2100MHz | 2015MHz | ||
Enhance Clock | ~2.5GHz | ? | 2310MHz | 2250MHz | ||
Throughput (FP32) | 56.5 TFLOPS | 43 TFLOPS | 21.5 TFLOPS | 20.6 TFLOPS | ||
Reminiscence Clock | 20 Gbps GDDR6 | 20 Gbps GDDR6 | 18 Gbps GDDR6 | 16 Gbps GDDR6 | ||
Reminiscence Bus Width | 384-bit | 320-bit | 256-bit | 256-bit | ||
VRAM | 24GB | 20GB | 16GB | 16GB | ||
Infinity Cache | 96MB | 80MB | 128MB | 128MB | ||
Complete Board Energy | 355W | 300W | 335W | 300W | ||
Manufacturing Course of | GCD: TSMC 5nm MCD: TSMC 6nm |
GCD: TSMC 5nm MCD: TSMC 6nm |
TSMC 7nm | TSMC 7nm | ||
Transistor Depend | 58B | 58B – (1 MCD) | 26.8B | 26.8B | ||
Structure | RDNA3 | RDNA3 | RDNA2 | RDNA2 | ||
GPU | Large Navi 3x | Large Navi 3x | Navi 21 | Navi 21 | ||
Launch Date | 12/13/2022 | 12/13/2022 | 05/10/2022 | 12/08/2020 | ||
Launch Worth | $999 | $899 | $1099 | $999 |
AMD’s eagerly anticipated replace to their GPU structure comes as the corporate has been firing on all cylinders for the previous few years. On the CPU aspect of issues the Zen 3 and Zen 4 architectures specifically have confirmed very performant, and in the meantime AMD has been in a position to claw its method again from its graphics stoop with the RDNA household of GPU architectures. RDNA 2, the idea of the Radeon RX 6000 collection, exceeded expectations and proved to be a really sturdy competitor, and now AMD is seemingly setup to exceed expectations as soon as once more, with RDNA 3’s 54% performance-per-watt coming in forward of AMD’s earliest guarantees of a 50% acquire.
AMD Goes Chiplets For GPUs
Whereas at this time’s reveal from AMD was a extra intently guarded occasion than the Ryzen 7000 unveil a few months in the past, AMD has nonetheless given us a fairly a little bit of element on the RDNA 3 structure and the playing cards – greater than we’ve got time to cowl right here – so let’s get began from the highest, with the development of the primary RDNA 3 GPU.
The Navi 3x GPU (AMD just isn’t confirming the particular GPU identify presently) breaks new floor for AMD not solely on the efficiency entrance, however by way of its development. For the primary time from any of the massive 3 GPU producers, AMD is using chiplets within the development of the GPU.
Chiplets are in some respects the holy grail of GPU development, as a result of they offer GPU designers choices for breaking up complicated monolithic GPU designs into a number of smaller elements – permitting for brand spanking new choices for scaling, in addition to mixing and matching the method node utilized in manufacturing. That mentioned, it’s additionally a holy grail as a result of the immense quantity of information that should be handed between totally different elements of a GPU (on the order of terabytes per second) may be very exhausting to do – and really essential to do if you need a multi-chip GPU to have the ability to current itself as a single gadget.
For his or her massive Navi 3x chip, AMD has assembled two sorts of chiplets, basically breaking off the reminiscence features from a basic GPU into their very own chiplets. Which means the core features of the GPU are housed in what AMD is looking the Graphics Compute Die (GCD), which homes all the ALU/compute {hardware}, the graphics {hardware}, in addition to ancillary blocks just like the show and media engines.
As a result of the GCD homes the performance-critical facets of the general GPU, it’s being constructed on TSMC’s 5nm course of. This offers AMD the very best density, energy consumption, and clockspeeds for these elements, although clearly at a better manufacturing value. The GCD die measurement measures 300mm2.
In the meantime the brand new Reminiscence Cache Die (MCD) homes AMD’s infinity cache (L3 cache), in addition to a 64-bit (technically 2×32-bit) GDDR6 reminiscence controllers. The MCD is likely one of the scalable facets of the chiplet design, as Large Navi 3x GPU SKUs could be configured by paring them with extra or fewer MCDs. A full configuration on this case is 6 energetic MCDs, which is what we see within the 7900 XTX. In the meantime the 7900 XT can have 5 energetic MCDs, with a 6th faulty/spacer MCD current for salvaging functions and bodily bundle stability.
A person MCD is 37mm2 in die measurement, and is constructed on TSMC’s 6nm course of. That is an instance of AMD’s course of node flexibility, placing the much less vital GDDR6 reminiscence controllers and Infinity Cache on a less expensive course of node. GDDR6 controllers are a kind of basic examples of a expertise that doesn’t scale very effectively with smaller course of geometries (like most types of I/O), so it’s simple to see why AMD would wish to keep away from constructing it on 5nm for minimal advantages.
Within the full 6 MCD configuration (7900 XTX), Large Navi 3x gives a 384-bit GDDR6 reminiscence bus, together with 96MB of L3 cache. In the meantime a 5 MCD (7900 XT) gives a 320-bit GDDR6 reminiscence bus and 80MB of L3 cache.
For the needs of at this time’s announcement, AMD has not gone into nice depth on how they managed to make a chiplet-based GPU work, however they’ve confirmed just a few essential particulars. At the beginning, so as to supply the die-to-die bandwidth wanted have the reminiscence subsystem positioned off-chip, AMD is utilizing their Elevated Fanout Bridge (EFB) packaging expertise, which AMD first used for his or her Intuition MI200 collection accelerators (CDNA2). On these accelerator elements it was used to hook up the monolithic GPUs to one another, in addition to HBM2e reminiscence. On RDNA 3, it’s getting used to hook up the MCDs to the GCD.
Notably, Elevated Fanout Bridge is a non-organic packaging expertise, which is to say it’s complicated. That AMD is ready to get 5.3TB/second of die-to-die bandwidth through it underscores its utility, however it additionally implies that AMD is undoubtedly paying a superb deal extra for packaging on Large Navi 3x than they had been on Navi 21 (or Ryzen 7000).
Internally, AMD is looking this memory-to-graphics hyperlink Infinity Hyperlink. Which, because the identify implies, is liable for (transparently) routing AMD’s Infinity Cloth between dies.
As talked about earlier than, the cumulative bandwidth right here between the MCDs and GCD is 5.3TB/second. It’s unclear if the constraining issue is the bandwidth of the Infinity Hyperlink, or that the mixed Infinity Cache + GDDR6 reminiscence controllers can’t transfer sufficient knowledge to completely saturate the hyperlink. However regardless, it means there’s basically simply shy of 900GB/second of bandwidth between a person MCD and GCD – greater than all the mixed off-die reminiscence bandwidth of the last-generation Radeon RX 6950 XT (and a couple of.7x greater than Navi 21’s on-die bandwidth).
Whereas we’re with reference to AMD’s L3 Infinity Cache, it’s notable right here that it’s truly a bit smaller on Large Navi 3x than it was on Navi 21, with a most capability of 96MB versus 128MB on the previous. In response to AMD they’ve made additional enhancements to enhance knowledge reuse on the Infinity Cache to offset this drop in capability. At this level it’s not clear if the change is a perform of software program algorithms, or in the event that they’ve made extra basic {hardware} modifications.
Lastly, whereas AMD is quoting die sizes for the GCD and MCD, they aren’t quoting particular person transistor counts. So whereas we all know {that a} full 6 MCD Large Navi 3x configuration is comprised of 58 billion transistors (2.16x greater than Navi 21), we don’t understand how a lot of that’s the GCD versus the MCDs.
AMD RDNA 3 Compute & Graphics Structure: Bringing Again ILP & Bettering RT
Diving down a stage, let’s check out the precise graphics and compute structure backing RDNA 3 and Large Navi 3x.
Whereas nonetheless clearly sharing most of the core design parts of AMD’s overarching RDNA structure, RDNA 3 is in some respects a a lot greater shift in architectural design than RDNA 2 was. Whereas RDNA 2’s compute core was basically unchanged from RDNA (1)’s, RDNA 3 makes just a few massive modifications.
The most important impression is how AMD is organizing their ALUs. In brief, AMD has doubled the variety of ALUs (Stream Processors) inside a CU, going from 64 ALUs in a single Twin Compute Unit to 128 inside the identical unit. AMD is conducting this not by doubling up on the Twin Compute Models, however as an alternative by giving the Twin Compute Models the flexibility to dual-issue directions. In brief, every SIMD lane can now execute as much as two directions per cycle.
However, as with all dual-issue configurations, there’s a trade-off concerned. The SIMDs can solely problem a second instruction when AMD’s {hardware} and software program can extract a second instruction from the present wavefront. Which means RDNA 3 is now explicitly reliant on extracting Instruction Stage Parallelism (ILP) from wavefronts so as to hit most utilization. If the subsequent instruction in a wavefront can’t be executed in parallel with the present instruction, then these further ALUs will go unfilled.
This can be a notable change as a result of AMD developed RDNA (1) partially to get away from a reliance on ILP, which was recognized as a weak spot of GCN – which was why AMD’s real-world throughput was not as quick as their on-paper FLOPS numbers would indicated. So AMD has, in some respects, walked backwards on that change by re-introducing an ILP dependence.
We’re nonetheless ready on extra data from AMD outlining why they made this alteration. However dual-issue is often an inexpensive method so as to add extra throughput to a processor design (you don’t need to do all of the instruction monitoring required for a completely separate Twin Compute Unit), and it may be worthwhile tradeoff if you happen to can make sure you’ll have the ability to dual-issue more often than not. However it implies that AMD’s real-world ALU utilization charge is probably going decrease on RDNA 3 than RDNA 2, as a result of bubbles from not having the ability to dual-issue.
Which to deliver issues again to gaming and the merchandise at hand, it implies that the FLOPS numbers between RDNA 3 and RDNA 2 elements should not going to be completely comparable. 7900 XTX could push 2.6x as many FP32 FLOPs as 6950 XTX on paper, however the actual world benefit on something lower than ideally suited code goes to be much less. Which is likely one of the explanation why AMD is just selling a real-world efficiency uplift of 1.7x for the 7900 XTX.
In any case, SIMDs aren’t the one modifications to the core compute structure of RDNA 3. Feeding the beast, AMD has made the Vector Basic Objective Register (VGPR) financial institution 50% bigger than on RDNA 2.
Extra vital than that’s that AMD is lastly integrating devoted silicon for AI processing on their shopper GPUs. That is an space the place each of AMD’s rivals (NVIDIA and Intel) have already made the funding on their shopper elements, and as using GPU inference in workloads continues to develop, it’s not one thing AMD can ignore any longer.
Given the gaming-centric focus of at this time’s presentation, AMD didn’t spend a lot time speaking in regards to the new AI items. Every RDNA 3 CU can have 2 of those items, and they’ll help new AI directions (some sort of INT8 tensor operation looks like a given). All instructed, AMD is saying that the brand new AI items give the Radeon RX 7900 XTX 2.7x the AI efficiency, which AMD is measuring as bfloat16 efficiency versus the RX 6950 XT.
General, the significance of this to players is one thing that is still to be seen. AMD isn’t at present utilizing AI items for FSR 2 (not like NVIDIA’s DLSS 2). However that might change for future tasks. In any other case, for extra skilled customers (or anybody who likes to mess with Secure Diffusion), that is an addition that’s excellent news.
Transferring on, AMD has additionally up to date their raytracing {hardware} for RDNA 3. The second-generation RT accelerator, as AMD calls it, can deal with 1.5x extra rays in flight. There are additionally new {hardware} field sorting and traversal options that weren’t current in RDNA 2’s preliminary RT performance. AMD’s presentation gave the technical particulars a light-weight remedy, however it definitely appears to be like like AMD is transferring to doing a much bigger a part of the ray tracing course of in devoted {hardware}. Which in flip would assist enhance their efficiency, and hold efficiency steadier by not stealing fairly so many assets from the remainder of the CU.
AMD’s personal efficiency slides tout anyplace between a 47% and 84% improve in RT efficiency. Although it must be famous that AMD’s numbers are with FSR enabled; so we can’t divorce these features from any modifications that enhance FSR efficiency on the 7900 XTX.
Final, however not least, AMD has made an attention-grabbing resolution with clockspeeds on the RDNA 3. In brief, AMD has decoupled their clocks; somewhat than working the complete GCD on the identical clockspeed, AMD can be working the shaders and front-end at totally different clockspeeds. Within the case of the 7900 XTX, it will see the shaders working at 2.3GHz (the marketed sport clock pace), whereas the front-end will run at a barely speedier 2.5GHz (about 9% quicker).
AMD didn’t go into nice element on why they’ve made this alteration, however at a excessive stage it’s all about balancing efficiency versus energy consumption. The shaders might run at 2.5GHz as effectively (certainly, the 7900 XTX’s rated enhance clock is 2.5GHz), however as we’ve seen time and time once more, these last clocks are the most costly by way of energy as you go up the v/f curve. So AMD has made the selection to surrender a little bit of potential efficiency to save lots of quite a bit on energy, as 96 CUs/12288 ALUs is lots of silicon to gentle up. Conversely, the front-end is comparatively small, and with AMD having beefed up their CUs by a lot, spending a bit extra energy on the front-end is presumably value it to maintain them from bottlenecking the remainder of the GPU.
RDNA 3 Show & Media Engines: The Newest and the Biggest
AMD’s core compute/graphics structure was not the one a part of the RDNA 3 structure to get an replace on this era. AMD has additionally used the chance to improve their show and media engines to help new options and new codecs.
On the show engine entrance, AMD’s show engine, which they’re now calling the “AMD Radiance Show Engine” has been upgraded to help DisplayPort 2.1. Particularly, AMD has added help for the DisplayPort 2.x characteristic suite in addition to the UHBR 10 and UHBR 13.5 knowledge charges. Which means RDNA 3 playing cards can supply 2x the DisplayPort bandwidth of their DisplayPort 1.4-enabled predecessors, which in flip permits for larger resolutions and better refresh charges. Notably, this ever so barely exceeds HDMI 2.1’s bandwidth as effectively, placing DisplayPort again into the lead, no less than on AMD playing cards.
Unsurprisingly, AMD is utilizing this performance to push forthcoming larger decision and better refresh charge gaming displays, together with a Samsung ultrawide show set to launch in 2023 with a horizonal 8K decision. So it’s not only for displaying off specs, and AMD and its companions are intending to place it to good use.
AMD has not mentioned something in regards to the complete variety of supported shows. So at this level I anticipate it’s nonetheless a most of 4 shows.
In the meantime on the media engine entrance, AMD has given RDNA 3 help for the most recent and biggest video codecs. Together with the same old H.264 and H.265 help, RDNA 3’s media engines additionally add full AV1 encode and decode help, making this the most recent GPU household to roll out help for the next-generation open format codec. RDNA 3 will have the ability to encode and decode AV1 at as much as 8Kp60.
The general efficiency of the media engine has been elevated considerably. In response to AMD the media engine runs 80% quicker than it did on RDNA 2 elements, permitting for simultaneous encoding (or decoding) of as much as two H.264/H.265 streams. Although it’s unclear if that additionally applies to AV1.
Lastly with reference to AMD’s GPU uncore, whereas not explicitly known as out in AMD’s presentation, it’s value noting that AMD has not up to date their PCIe controller. So RDNA 3 nonetheless maxes out at PCIe 4.0 speeds, with Large Navi 3x providing the same old 16 lanes. Which means regardless that AMD’s newest Ryzen platform helps PCIe 5.0 for graphics (and different PCIe playing cards), their video playing cards gained’t be reciprocating on this era. In reality, which means nobody can have a PCIe 5.0 shopper video card.
Radeon RX 7900 XTX & Radeon RX 7900 XT: Launching December 13th
Bringing at this time’s reveal full circle, let’s flip again to the playing cards themselves, the Radeon RX 7900 XTX and RX 7900 XT.
AMD’s flagship card would be the Radeon RX 7900 XTX. Whereas we’re nonetheless ready on affirmation of this, this might appear to be a fully-enabled Large Navi 3x half, with all the blocks in each the GCDs and the person MCDs themselves enabled. As talked about beforehand, AMD is touting a broad efficiency uplift of as much as 70% versus the previous-generation flagship, the RX 6950 XT.
Internally, this implies 96CUs and 96MB of L3 Infinity Cache can be out there on the cardboard. The sport clock (common clockspeed) can be 2.3GHz, whereas primarily based on different AMD figures, we will infer that the enhance (most) clockspeed can be 2.5GHz. The sport clock specifically is a ~10% enchancment over the 6950 XT, so AMD is having fun with a modest frequency uplift generation-over-generation, however nothing too huge. A lot of the heavy lifting will come courtesy of the structure and reminiscence modifications.
Talking of reminiscence, the RX 7900 XTX can be paired with 24GB of GDDR6 reminiscence working at (no less than) 20Gbps. Apparently, AMD’s companions have the headroom to go even larger than this with manufacturing unit overclocking, however the flooring worth for the half can be 20Gbps general. This can be a modest improve in reminiscence clockspeeds versus the 6950 XT (11%). As an alternative, the majority of the VRAM bandwidth features will come from the 50% bigger reminiscence bus, with the 7900 XTX transferring to a large 384-bit bus. In complete, this implies the 7900 XTX can have 960GB/sec of reminiscence bandwidth, 66% greater than its predecessor. AMD obtained their “free” reminiscence subsystem efficiency enhance within the final era with Infinity Cache, so for this time round, they’re again to needing so as to add extra bodily reminiscence bandwidth to maintain the ever-growing beast correctly fed.
In the meantime, the 7900 XT can be a chip off the block, with fewer CUs, much less VRAM, and decrease clockspeeds. All instructed we’re 84CUs paired with 20GB of 20Gbps GDDR6, and backed by an 80MB infinity cache. The cardboard’s sport clockspeed ranking is 2.0GHz, and we would not have any data on the enhance clockspeed. The mix of a 13% drop in clockspeeds and 13% drop in CUs provides up to what’s, on paper, a 24% deficit in compute/shading efficiency. That mentioned, AMD’s pricing signifies that the real-world efficiency hole shouldn’t be this excessive, and we’re nonetheless lacking some essential particulars equivalent to ROP counts. So for higher or worse, we don’t have a superb body of reference fright now for the way the 7900 XT will carry out relative to anything, current-generation or final.
Unsurprisingly, energy consumption on the excessive finish can be going up. The 7900 XTX can be a 355W card, up 20W from the 335W 6950 XT (and 55W from the 300W 6900 XT). This can be a extra modest energy requirement than on NVIDIA’s high-end RTX 4090 Ti (450W), however we’re nonetheless speaking a few card effectively north of 300W. For players with a barely smaller urge for food for giant energy payments, the 7900 XT can be holding the road at 300W. Each playing cards would require 2 8-pin PCIe energy connectors.
AMD has additionally despatched over photos of each the reference 7900 XTX and 7900 XT. Of explicit be aware, each playing cards will characteristic, for the primary time for an AMD shopper card, a USB-C port for show outputs. Nonetheless not like NVIDIA’s earlier enterprise on this area, this isn’t for VR use, however as an alternative for extra immediately hooking up displays that depend on DisplayPort Alt Mode over USB-C. Rounding out the gathering can be a paid of DisplayPorts (2.1) and an HDMI 2.1 port.
Each playing cards are utilizing a brand new triple fan blower design from AMD. We’re nonetheless ready on additional particulars right here, however AMD has instructed us that the 7900 XTX measures 287mm lengthy, and is 2.5 slots vast.
Wrapping issues up, each playing cards can be launching on December 13th, with AMD planning on having each reference and AIB companions’ playing cards on the shelf for launch day. The 7900 XTX will begin at $999, in the meantime the 7900 XT can be proper behind it at $899. AMD isn’t providing any efficiency comparisons versus NVIDIA playing cards, however at this juncture it looks like the wildcard is the soon-to-launch GeForce RTX 4080 16GB. By the point AMD launches in December, we must always have a a lot better concept of the place AMD and NVIDIA’s dueling lineups stand compared to one another.