How AMD Is Advancing the 30x25 Energy Efficiency Goal in High-Performance Computing and AI
By: Sam Naffziger, Senior Vice President, Corporate Fellow, and Product Technology Architect
In today’s world, we have an insatiable appetite for data. As the adoption of artificial intelligence (AI) accelerates and compute-intensive industries seek higher performance, energy demands continue to grow. New data centers are emerging and expanding to support the increasing demand for cloud infrastructure, high-performance AI applications, supercomputing and more. As we continue to stretch the boundaries of what’s possible in computing, future improvements are becoming increasingly energy constrained. In fact, we’ve collectively been on a trajectory to consume more energy than the market can support within the next two decades.[i] The need for innovative energy solutions has never been more urgent.
AMD is setting the pace of innovation for more energy-efficient computing performance. We are deeply focused on energy efficiency in our product development, holistically optimizing our designs for power across architecture, connectivity, packaging and software. With this focus on energy efficiency, we aim to reduce costs, preserve natural resources, and mitigate climate impacts.
In 2021, we announced our vision to deliver a 30x energy efficiency improvement by 2025, from a 2020 baseline, for accelerated data center compute nodes. We call this our 30x25 goal. Built with AMD EPYC™ CPUs and AMD Instinct™ accelerators, these nodes are designed for some of the world’s fastest-growing computing needs in AI training and high-performance computing (HPC) applications. While these applications require increasing amounts of computing power and energy, they’re vital to solving some of the most essential – and challenging – problems humanity has ever grappled with. Supercomputers like the AMD-powered Frontier system at Oak Ridge National Laboratory are leveraging accelerated compute to advance scientific research, genomics, new drug discoveries and climate predictions, as well as key advancements in AI, which will be the defining technology shaping the next generation of computing. In fact, AMD now powers eight of the top 10 most energy-efficient supercomputers on the latest Green500 list, including the two most powerful supercomputers in that top 10, Frontier and Adastra.
2023 Progress Update
So, how are we doing so far? With today’s launch of the AMD Instinct MI300 series accelerators, we’re thrilled to be making significant new strides toward our goal, building on the progress we shared last year, and we remain optimistic that we can meet our 30x25 goal in 2025. With the latest performance data, we have achieved a 13.5x improvement over the 2020 baseline using a configuration of four AMD Instinct MI300A APUs (GPU with integrated 4th Gen EPYC™ “Genoa” CPU).[ii] Our goal uses a measurement methodology validated by renowned compute energy efficiency researcher and author Dr. Jonathan Koomey.
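To make the methodology described in footnote [ii] concrete, here is a minimal sketch of the calculation: node energy efficiency is delivered performance divided by the rated power of the whole compute node, and progress toward 30x25 is the ratio of a current node’s efficiency to the 2020 baseline. The performance and power figures below are hypothetical placeholders, not AMD’s published measurements.

```python
# Minimal sketch of the 30x25 measurement methodology from footnote [ii]:
# efficiency = measured FLOPS / rated power of the full accelerated node
# (CPU host + memory + 4 GPU accelerators). All numbers are hypothetical
# placeholders, not AMD's published measurements.

def node_efficiency(tflops: float, node_power_watts: float) -> float:
    """Energy efficiency in TFLOPs per watt for one compute node."""
    return tflops / node_power_watts

# Hypothetical 2020 baseline node vs. a hypothetical current node.
baseline = node_efficiency(tflops=100.0, node_power_watts=2500.0)
current = node_efficiency(tflops=3000.0, node_power_watts=5500.0)

# The goal metric is the ratio of current to baseline efficiency;
# 30x25 targets a ratio of at least 30 by 2025.
improvement = current / baseline
print(f"Efficiency improvement: {improvement:.1f}x")  # -> 13.6x
```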
One of the innovations we’re leveraging in our AMD Instinct MI300 Series accelerators is AMD Smart Shift technology, which adaptively shares power between the CPU and GPU depending on where a specific application – generative AI, for example – needs it the most. This is a fantastic example of the power of our APU architecture, which brings together CPU and GPU capabilities in a single package, as well as the innovation we can drive through ambitious energy efficiency goals. The foundation of Smart Shift technology is rooted in our work to achieve our 25x20 energy efficiency goal, which ultimately delivered a 31.7x improvement in the energy efficiency of our mobile processors compared to the 2014 baseline.
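For readers who want a feel for how this kind of power sharing works, here is an illustrative sketch of a controller that moves watts within a shared package budget toward whichever engine the workload is stressing. This is not AMD’s Smart Shift algorithm; the budget, floor, and step size are invented for illustration.

```python
# Illustrative sketch of power shifting between a CPU and GPU that share
# one package power budget. This is NOT AMD's Smart Shift implementation;
# the budget, 100W floor, and step size below are invented values.

TOTAL_BUDGET_W = 760.0  # hypothetical shared package power budget
STEP_W = 10.0           # watts moved per control interval
FLOOR_W = 100.0         # hypothetical minimum allocation per engine

def rebalance(cpu_w: float, gpu_w: float,
              cpu_util: float, gpu_util: float) -> tuple[float, float]:
    """Shift power toward whichever engine the workload is stressing."""
    if gpu_util > cpu_util and cpu_w - STEP_W >= FLOOR_W:
        cpu_w -= STEP_W  # GPU-bound phase (e.g. generative AI):
        gpu_w += STEP_W  # give the GPU more headroom
    elif cpu_util > gpu_util and gpu_w - STEP_W >= FLOOR_W:
        gpu_w -= STEP_W  # CPU-bound phase: shift power back
        cpu_w += STEP_W
    # The total package budget is conserved either way.
    assert abs((cpu_w + gpu_w) - TOTAL_BUDGET_W) < 1e-6
    return cpu_w, gpu_w

# One control step during a GPU-heavy phase of a hypothetical workload.
cpu_w, gpu_w = rebalance(cpu_w=260.0, gpu_w=500.0,
                         cpu_util=0.2, gpu_util=0.95)
print(cpu_w, gpu_w)  # -> 250.0 510.0
```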
Design choices like these are pushing MI300A to deliver a >2x performance-per-watt advantage over comparable competitor chips.[iii] This brings a host of benefits, including significant savings in electricity use, greenhouse gas (GHG) emissions and total cost of ownership at the solution level.
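The arithmetic behind that claim is straightforward. Using the peak theoretical figures and rated power published in footnote [iii]:

```python
# The ">2x performance-per-watt" claim in footnote [iii] reduces to
# simple arithmetic on the published peak theoretical figures: TFLOPs
# divided by rated power, then the ratio between the two chips. These
# are peak theoretical numbers from footnote [iii], not measured workloads.

MI300A_FP64_MATRIX_TFLOPS = 122.6  # AMD Instinct MI300A, FP64 Matrix
MI300A_POWER_W = 760.0

GH200_FP64_TENSOR_TFLOPS = 67.0    # Nvidia GH200, FP64 Tensor
GH200_POWER_W = 1000.0

mi300a_eff = MI300A_FP64_MATRIX_TFLOPS / MI300A_POWER_W  # ~0.161 TFLOPs/W
gh200_eff = GH200_FP64_TENSOR_TFLOPS / GH200_POWER_W     # 0.067 TFLOPs/W

print(f"MI300A advantage: {mi300a_eff / gh200_eff:.1f}x")  # -> ~2.4x
```

The same calculation at plain FP64 gives 61.3 / 760 ≈ 0.081 TFLOPs per watt for MI300A versus 34 / 1000 = 0.034 for GH200, again better than 2x.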
This is fantastic progress, but there’s still more to be done.
Where do we go from here?
As Moore’s law continues to show diminishing returns and costs per transistor rise, we’re at an inflection point in technology improvement trends: both density and energy-per-operation improvements are slowing markedly, particularly as we scale past 5nm nodes into 3nm, 2nm and beyond.
In other words, the industry can’t rely solely on smaller transistors to drive significant performance and efficiency increases in future generations of processors. To overcome the slowing of Moore’s law, and to continue delivering processor designs with significant gains in both performance and energy efficiency, we’re addressing the problem from a holistic design perspective, reimagining our approach to packaging, architecture, memory, software and more.
While there is still more to be done to reach our 30x25 goal, I continue to be pleased by the work of our engineers and encouraged by the results so far. The MI300 series marks a significant step forward, and I hope you’ll continue to check in with us as we report annually on our progress.
Footnotes
[i] As reported in the SRC Decadal Plan for Semiconductors, 2020. https://www.src.org/about/decadal-plan/
[ii] Includes AMD high-performance CPU and GPU accelerators used for AI training and high-performance computing in a 4-accelerator, CPU-hosted configuration. Goal calculations are based on performance scores as measured by standard performance metrics (HPC: Linpack DGEMM kernel FLOPS with 4k matrix size; AI training: lower-precision training-focused floating-point math GEMM kernels, such as FP16 or BF16 FLOPS, operating on 4k matrices) divided by the rated power consumption of a representative accelerated compute node, including the CPU host + memory and 4 GPU accelerators.
[iii] Measurements conducted by AMD Performance Labs as of December 4, 2023, on the AMD Instinct™ MI300A (760W) APU designed with AMD CDNA™ 3 5nm | 6nm FinFET process technology at a 2,100 MHz peak boost engine clock resulted in:
• 122.6 TFLOPs peak theoretical double precision Matrix (FP64 Matrix),
• 61.3 TFLOPs peak theoretical double precision (FP64),
• 122.6 TFLOPs peak theoretical single precision Matrix (FP32 Matrix),
• 122.6 TFLOPs peak theoretical single precision (FP32) floating-point performance.
Published results on Nvidia GH200 1000W GPU:
• 67 TFLOPs peak theoretical double precision tensor (FP64 Tensor),
• 34 TFLOPs peak theoretical double precision (FP64),
• FP32 Tensor: N/A – Nvidia GH200 GPUs don’t support FP32 Tensor, so the regular FP32 number is used as a proxy.
• 67 TFLOPs peak theoretical single precision (FP32) floating-point performance.
Nvidia GH200 source: https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip
GH200 TFLOPs per Watt Calculations (peak wattage of 1000W used):
• FP64 Matrix: 67 TFLOPs / 1000W = 0.067 TFLOPs per Watt
• FP64: 34 TFLOPs / 1000W = 0.034 TFLOPs per Watt
• FP32: 67 TFLOPs / 1000W = 0.067 TFLOPs per Watt
Actual performance and performance per watt may vary on production systems.