Power efficiency: throttling for sustainability
December 17th, 2024

The rise of AI applications has brought soaring power demands and strain on the environment. Google reported a 13% emissions increase attributable to AI, its overall footprint is huge, some power grids are struggling, new data centers are about to consume enormous amounts of energy, and the reporting on all of this is sometimes lacking as well (by the way: if you care about the environment, try a traditional search before reaching for a chatbot).
So the challenge is: how can we make AI and HPC/HTC applications more energy-efficient without sacrificing too much performance?
The good news is that AI, HPC, and HTC workloads are very flexible: they can be shifted in time and space to align with, for example, the availability of green energy. HPC systems using job schedulers like Grid Engine, LSF, and SLURM can already do this. But what if we could add another dimension to the energy-saving toolkit? Enter CPU and GPU throttling.
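As a toy illustration of shifting in time, a scheduler hook (or a wrapper feeding something like SLURM's `sbatch --begin=`) could pick the start hour with the lowest forecast carbon intensity. A minimal sketch; the hourly forecast values below are made up for illustration:

```python
# Sketch: pick the greenest start hour for a deferrable job, given a
# (hypothetical) hourly carbon-intensity forecast in gCO2/kWh.
def best_start_hour(forecast, job_hours):
    """Return the start hour minimizing total carbon intensity over the job."""
    windows = {
        start: sum(forecast[start:start + job_hours])
        for start in range(len(forecast) - job_hours + 1)
    }
    return min(windows, key=windows.get)

# Example: a 24-hour forecast with a midday dip thanks to solar generation.
forecast = [420, 410, 400, 390, 380, 350, 300, 250,
            180, 120, 90, 80, 75, 85, 110, 160,
            230, 300, 360, 400, 420, 430, 435, 440]
print(best_start_hour(forecast, job_hours=4))  # 10: the 10:00-14:00 window
```

A real deployment would pull the forecast from a grid operator's API rather than a hard-coded list, but the selection logic stays this simple.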
Throttling CPU and GPU frequencies to lower power draw is not a new idea. Research from Microsoft and others has shown that modest performance reductions can yield substantial energy savings. The underlying principle is dynamic voltage and frequency scaling (DVFS). To illustrate the point, let's consider a specific example: here I'm running a Phi 3.5 LLM on an Arc GPU for inference. Performance is measured through token creation time, and we observe its P99. Just a small adjustment in performance can lead to significant power savings: by lowering the GPU frequency and sacrificing only about 10% of performance, we observed up to 30% power savings.
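The DVFS intuition behind these numbers can be sketched numerically. Dynamic power follows roughly P ≈ C·V²·f, and because voltage can usually be lowered together with frequency, power falls close to cubically while compute throughput falls at most linearly. A toy model under first-order assumptions, not a device measurement:

```python
# Sketch of the DVFS intuition: dynamic power scales as P ~ C * V^2 * f,
# and voltage tracks frequency within the DVFS range, so power shrinks
# roughly cubically while (compute-bound) performance shrinks only linearly.
def relative_dynamic_power(freq_scale, voltage_scale=None):
    """Dynamic power at a scaled operating point, relative to nominal (1.0)."""
    if voltage_scale is None:
        voltage_scale = freq_scale  # first-order assumption: V tracks f
    return voltage_scale ** 2 * freq_scale

# Dropping frequency by 10% cuts dynamic power by ~27% under this model,
# the same ballpark as the ~30% savings measured in the experiment.
print(round(1 - relative_dynamic_power(0.9), 3))  # 0.271
```

Real hardware deviates from the cube law (static leakage, fixed voltage steps, memory-bound phases), which is exactly why measuring, as done here, beats modeling.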
By the way, since we're testing an AI workload, performance in this example depends heavily on memory bandwidth. With a lower GPU frequency, memory bandwidth usage drops as well, showing that frequency scaling is an effective way to control memory bandwidth. This explains the P99 compute latency, which improves as the frequency is increased: the application was generally memory-bound. From around 1500 MHz onward, memory bandwidth stabilizes and performance becomes more consistent.
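For reference, the P99 metric used above can be computed from per-token timings with a simple nearest-rank percentile; a minimal sketch with synthetic latency samples:

```python
import math

def p99(samples):
    """99th-percentile latency by the nearest-rank method:
    the smallest sample that is >= 99% of all samples."""
    s = sorted(samples)
    return s[math.ceil(0.99 * len(s)) - 1]

# With per-token creation times in milliseconds, a handful of slow
# tokens dominates the P99 even when the typical token looks fast.
latencies_ms = [25.0] * 97 + [60.0, 75.0, 90.0]
print(p99(latencies_ms))  # 75.0: the tail, not the 25 ms typical case
```

This is why P99 is the right lens for throttling experiments: it exposes the stragglers that averages hide.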
While the results shown here use a low-end GPU, the principles hold true for larger systems. For example, a server with 8 GPUs, each with a TDP of ~1000W, could save hundreds of watts per server. Imagine the compounded impact across racks of servers in a data center. By incorporating techniques like throttling, we can make meaningful strides toward reducing the carbon footprint of AI.
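The back-of-envelope arithmetic for that server example, with the per-GPU draw and savings fraction as assumptions for illustration rather than measurements:

```python
# Illustrative arithmetic only: assumed per-GPU draw and a conservative
# savings fraction (the Arc inference experiment saw up to 30%).
gpus_per_server = 8
watts_per_gpu = 1000            # assumed draw near the stated ~1000 W TDP
power_saving_fraction = 0.10    # even a modest 10% reduction per GPU

saved_per_server_w = gpus_per_server * watts_per_gpu * power_saving_fraction
print(saved_per_server_w)  # 800.0 W per server, before any rack-level compounding
```

Multiply that by dozens of servers per rack and many racks per hall, and the data-center-level savings add up quickly.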
Leveraging the flexibility of workloads—such as the AI examples shown here—can greatly improve system effectiveness: run faster when green energy is available and slower when sustainability goals take priority. Together, these strategies pave the way for a more sustainable compute continuum.