December 17th, 2024 • Power Efficiency: Throttling for Sustainability
The rise of AI applications has brought soaring power demands and new strains on the environment. Google has reported a 13% increase in emissions driven by AI, its overall footprint is huge, some power grids are struggling, and new data centers are about to consume enormous amounts of energy. The reporting on all of this isn’t always great either. (By the way: if you care about the environment, try a traditional search before reaching for a chat bot.)
So the challenge is: how can we make AI, HPC, and HTC applications more energy-efficient without sacrificing too much performance?
The good news is that AI, HPC, and HTC workloads are very flexible: they can be shifted in time and space, for example to align with the availability of green energy. HPC systems, with job schedulers like Grid Engine, LSF, and SLURM, can already be used for this. But what if we could add another dimension to this energy-saving toolkit? Enter CPU and GPU throttling.
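As a toy illustration of such time-shifting, here is a minimal Python sketch that picks the start hour with the lowest forecast grid carbon intensity for a deferrable job. The forecast numbers are invented, and the hand-off to a scheduler (e.g. via SLURM’s `sbatch --begin=...`) is left as a comment — this is an illustrative sketch, not part of the experiment below.

```python
# Hypothetical grid carbon-intensity forecast (gCO2/kWh per hour of day).
# A real setup would pull this from a grid-data API instead.
forecast = {9: 420, 10: 380, 11: 310, 12: 180, 13: 150, 14: 200}

def greenest_hour(forecast: dict[int, float]) -> int:
    """Return the hour with the lowest forecast carbon intensity."""
    return min(forecast, key=forecast.get)

hour = greenest_hour(forecast)
print(hour)  # -> 13
# A scheduler could then defer the job, e.g.:
#   sbatch --begin=13:00 train.sh
```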
Throttling CPU and GPU frequencies to lower power draw is not a new idea. Research from Microsoft and others has shown that modest performance reductions can yield substantial energy savings. The underlying principle is dynamic voltage and frequency scaling (DVFS). To illustrate, let’s consider a specific example: here I’m running a Phi 3.5 LLM on an Arc GPU for inference. Performance is measured through token creation time, and we observe its P-99. Just a small adjustment in performance can lead to significant power savings. For instance, by adjusting the GPU frequency we can observe up to 30% power savings while sacrificing only 10% in performance.
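As a rough sanity check on such numbers, there is a well-known first-order DVFS model: dynamic power scales with C·V²·f, and since voltage tends to scale roughly linearly with frequency, power scales roughly with f³. The sketch below applies this simplified model; it ignores static power, voltage floors, and discrete P-states, so it is an assumption-laden approximation, not a measurement.

```python
def relative_power(freq_scale: float) -> float:
    """Dynamic power relative to baseline, assuming P ~ f^3 under DVFS
    (voltage assumed to scale linearly with frequency)."""
    return freq_scale ** 3

# A 10% frequency reduction under this simplified model:
print(f"{relative_power(0.9):.2f}x baseline power")  # -> 0.73x baseline power
```

At face value the cube law predicts roughly 27% dynamic-power savings for a 10% frequency reduction — the same ballpark as the ~30% observed in the experiment above.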
By the way, as we’re testing an AI workload, performance in this example depends heavily on memory bandwidth. At lower GPU frequencies, memory bandwidth usage is lower as well — showing once more that frequency scaling is an effective way to control memory bandwidth. This also explains the low P-99 compute latency, which improves as the frequency is increased: the application was, in general, memory-bottlenecked. From around 1500 MHz onward, memory bandwidth stabilizes and performance becomes more consistent.
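For completeness: the P-99 used throughout is simply the 99th percentile of the per-token latency samples. A minimal nearest-rank implementation, with invented sample values, looks like this:

```python
import math

def p99(latencies_ms: list[float]) -> float:
    """99th percentile of latency samples via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 98 fast tokens plus two slow outliers (made-up numbers):
samples = [12.0] * 98 + [40.0, 95.0]
print(p99(samples))  # -> 40.0
```

The point of tracking P-99 rather than the mean is that throttling must not blow up tail latency, even if the average stays acceptable.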
While the results shown here use a low-end GPU, the principles hold for larger systems. For example, a server with 8 GPUs, each with a TDP of ~1000 W, could save hundreds of watts per server. Imagine the compounded impact across racks of servers in a data center. By incorporating techniques like throttling, we can make meaningful strides toward reducing the carbon footprint of AI.
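To make the server example concrete, here is a back-of-the-envelope calculation. The saving fraction is an assumption — a conservative 5% from a mild frequency cap, well below the 30% seen on the small GPU — and the servers-per-rack count is invented:

```python
gpus_per_server = 8
tdp_w = 1000            # per GPU, as in the example above
saving_fraction = 0.05  # assumed conservative saving from a mild cap
servers_per_rack = 4    # invented rack layout

per_server_w = gpus_per_server * tdp_w * saving_fraction
per_rack_w = per_server_w * servers_per_rack
print(per_server_w, per_rack_w)  # -> 400.0 1600.0
```

Even this cautious estimate gives hundreds of watts per server, before any compounding across racks.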
Leveraging the flexibility of workloads—such as the AI examples shown here—can greatly improve system effectiveness: run faster when green energy is available and slower when sustainability goals take priority. Together, these strategies pave the way for a more sustainable compute continuum.
Categories: Personal • Tags: Artificial Intelligence, Cloud, data center, Edge, system effectiveness, system performance
September 9th, 2024 • System Effectiveness
In the world of distributed systems, performance has traditionally been the primary measure of success. Engineers have focused on optimizing for latency, throughput, reliability and scalability, pushing systems to be faster and handle more work. While these performance metrics are critical, they tell only part of the story. A more holistic and sustainable concept is system effectiveness, which balances performance with broader considerations like cost efficiency, resource utilization, and environmental impact. Here is a first shot at defining system effectiveness:
System effectiveness measures how efficiently a system performs relative to its resource consumption, financial cost, and environmental impact, while adapting dynamically to context over time to balance performance with sustainability.
In an era of increasing environmental awareness, this broader focus on system effectiveness — not just raw performance — should take center stage. Hence my view that we should shift our attention to it sooner rather than later.
System effectiveness incorporates the Total Cost of Ownership (TCO), Return on Investment (ROI), and the balance between Capital Expenditures (CAPEX) and Operating Expenses (OPEX). It considers not just how fast and scalable a system is, but also the economic and environmental costs of running it. For instance, someone might purchase a server and generate thousands of euros in revenue from it. While the initial CAPEX seems justified, ongoing OPEX — such as for power and cooling — may appear negligible at first. However, as the system scales, energy usage adds up. These “small” costs accumulate quickly and can have a significant impact on profitability, especially if systems are not optimized for efficiency.
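To put rough numbers on that CAPEX-vs-OPEX dynamic, here is a small sketch. Every figure in it — purchase price, average draw including cooling, energy price — is invented purely for illustration:

```python
capex_eur = 10_000   # hypothetical server purchase price
avg_draw_kw = 0.8    # hypothetical average draw incl. cooling overhead
eur_per_kwh = 0.25   # hypothetical energy price

def cumulative_opex_eur(years: float) -> float:
    """Cumulative energy cost after `years` of continuous operation."""
    return avg_draw_kw * 24 * 365 * years * eur_per_kwh

for y in (1, 3, 5):
    print(y, round(cumulative_opex_eur(y)))  # 1752 EUR per year at these rates
```

At these made-up rates, the “negligible” energy bill reaches roughly 8,760 EUR after five years — close to the purchase price itself.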
In the context of system effectiveness, it’s important to note that simply turning off servers isn’t always the best solution. The revenue generated by a server running efficiently can often outweigh the costs, even including its energy consumption. However, a well-managed server can be optimized in other ways, such as throttling its performance so that it operates on green energy whenever possible. By aligning energy consumption with renewable sources, businesses can reduce their environmental impact while maintaining operational efficiency and profitability.
But this raises a question: why do we still prioritize system performance over effectiveness, when sustainability and efficiency are becoming just as critical? Looking at it purely from an OPEX perspective might make it seem like a bad business decision — energy is simply still too cheap. But as governance shifts and policies tighten, businesses will need to account for more than just technical performance. Companies must start asking themselves whether they want to wait until regulations force them to change — or whether they want to take the lead in defining what responsible, effective computing looks like for the future. The choice to focus on system effectiveness is not only a reflection of good business sense but an opportunity to align with a more sustainable and resilient future.
System effectiveness encourages optimizing resource usage, which keeps ongoing operational costs low while maximizing revenue. Companies that embrace this aren’t just reducing their environmental footprint — they’re positioning themselves for long-term competitiveness. Efficient systems not only lower power consumption and reduce environmental impact but also improve long-term ROI by cutting unnecessary expenses. As consumer preferences shift toward eco-conscious brands and governments enforce stricter environmental regulations, businesses that focus on system effectiveness will see enhanced customer loyalty and avoid costly penalties. For example, in the European Union, certain companies will be required to report their footprints, and similar regulations are spreading globally. Organizations that adopt system effectiveness today will ensure that those reports look favorable, maintaining a competitive edge while demonstrating responsible stewardship of resources.
Categories: Personal • Tags: system effectiveness, system performance