CapMaestro: Exploiting Power Redundancy, Data Center-Wide Priorities, and Stranded Power for Boosting Data Center Performance

Power infrastructure is a critical component of cloud and HPC data centers, and costs as much as tens of millions of US dollars for a large data center. The infrastructure must be highly reliable and must tolerate load variation, which traditionally requires significant redundancy and overprovisioning. This redundant and overprovisioned capacity is significantly underutilized during normal operation (typical load, non-failure mode). Power capping reduces underutilization by adding more servers to the existing power infrastructure, and throttling power consumption in the infrequent cases where demand exceeds the provisioned capacity. However, state-of-the-art power capping solutions are (1) not practical for the properties of real-world redundant power infrastructure in highly available data centers, and (2) oblivious to the differing priorities of workloads across the entire data center when power consumption needs to be throttled. As a result, these solutions are inefficient and can even be unsafe. In this work, we present CapMaestro, a new power management architecture for cloud and HPC data centers. CapMaestro makes three major new contributions. First, CapMaestro is designed to work with multiple power feeds, and exploits server power capping to independently cap the load on each feed of a server. It exploits the underutilized redundant power infrastructure commonly employed in data centers to safely accommodate a much greater number of servers. Second, CapMaestro uses a scalable, distributed, multi-level power capping approach, which accounts for power capacity at each level of the power distribution hierarchy. It is global priority-aware, ensuring that no high-priority server anywhere in the data center is throttled before all lower-priority servers in the data center are throttled, as long as this can be achieved safely. Third, CapMaestro exploits stranded power (i.e., power budgets that are not utilized) in redundant power infrastructure to boost the performance of applications running in the data center. We deploy CapMaestro in our cloud data center control plane to demonstrate its effectiveness on real-world machines. We then simulate a data center with thousands of servers using published load distribution data, and demonstrate that CapMaestro safely increases the number of servers under the existing power infrastructure by 50%.
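The priority guarantee described in the abstract can be illustrated with a small sketch. This is not CapMaestro's actual algorithm: it is a simplified, single-level illustration that ignores multiple power feeds and the distribution hierarchy. The `Server` fields, the `allocate_budgets` function, and the convention that a larger `priority` value means a more important server are all assumptions made for this example. Each server first receives its minimum power, and the remaining budget is handed out from the highest priority level downward, so a higher-priority server is throttled only after every lower-priority server has been reduced to its minimum.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Server:
    name: str
    priority: int      # larger value = more important (assumed convention)
    demand_w: float    # power the server would draw if uncapped
    min_w: float       # lowest power the server can be throttled to


def allocate_budgets(servers, limit_w):
    """Return {server name: power cap in watts} under a shared limit.

    Hypothetical helper for illustration; assumes demand_w >= min_w.
    """
    floor_total = sum(s.min_w for s in servers)
    if floor_total > limit_w:
        raise ValueError("limit cannot be met even with every server fully throttled")

    # Every server first receives its minimum power.
    caps = {s.name: s.min_w for s in servers}
    remaining = limit_w - floor_total

    # Hand out what is left, highest priority level first.
    ordered = sorted(servers, key=lambda s: s.priority, reverse=True)
    for _, group in groupby(ordered, key=lambda s: s.priority):
        group = list(group)
        wanted = sum(s.demand_w - s.min_w for s in group)
        if wanted <= 0:
            continue
        # Fraction of this level's extra demand that can still be satisfied.
        share = min(1.0, remaining / wanted)
        for s in group:
            caps[s.name] += share * (s.demand_w - s.min_w)
        remaining -= share * wanted
    return caps


fleet = [
    Server("A", priority=2, demand_w=400, min_w=200),
    Server("B", priority=1, demand_w=400, min_w=200),
    Server("C", priority=1, demand_w=400, min_w=200),
]
caps = allocate_budgets(fleet, limit_w=1000)
print(caps)  # {'A': 400.0, 'B': 300.0, 'C': 300.0}
```

In this example the three servers together demand 1200 W against a 1000 W limit. The high-priority server A keeps its full 400 W, and the shortfall is split evenly between the two lower-priority servers.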

By: Yang Li, Charles Lefurgy, Karthick Rajamani, Malcolm Allen-Ware, Guillermo J. Silva, Daniel D. Heimsoth, Saugata Ghose, Onur Mutlu

Published in: RC25680 in 2018

LIMITED DISTRIBUTION NOTICE:

This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.

RC25680.pdf

Questions about this service can be mailed to reports@us.ibm.com.