GPUWattch: Enabling Energy Optimizations in GPGPUs

# GPUWattch: Enabling Energy Optimizations in GPGPUs

###### tags: `GPUs` `Energy` `CUDA` `GPU architecture` `Power estimation`

###### paper origin: ISCA-2013

###### papers: [link](http://www.ece.ubc.ca/~aamodt/papers/gpuwattch.isca2013.pdf)

###### slides and video: [link](http://www.gpgpu-sim.org/micro2012-tutorial/)

# Introduction

**Motivation**

* Investigating and optimizing GPU energy efficiency has been difficult owing to the lack of a suitable power modeling infrastructure.
* To avoid penalizing performance per watt and to develop energy-efficient GPU architectures, we require a robust power model.
* Three requirements for a robust power model:
    1. configurable
    2. cycle level
    3. strongly validated against existing processor architectures using a rigorous methodology

**Paper works**

* Introducing GPUWattch, a new power model that addresses all of the aforementioned requirements.
* Using a bottom-up methodology to build the initial model.
* Comparing the simulated power with the measured hardware power to identify any model inaccuracies.
* Resolving these inaccuracies using a special suite of 80 microbenchmarks designed to create a system of linear equations that corresponds to the total power consumption.
* Eliminating the inaccuracies by solving for the unknowns in the system.
* Validating the simulated power model's average and runtime estimates against measured results using a comprehensive set of 25 real-world kernels that were not used to develop the power model.

**Contributions**

1. We propose a power model based on the bottom-up methodology for GPU architecture research that can enable performance-per-watt energy-efficiency studies.
2. We describe a systematic and rigorous methodology to develop and validate the power model using a large variety of microbenchmarks that stress test the microarchitecture.
3. We demonstrate the opportunities for improving the energy efficiency of GPUs using both traditional techniques (DVFS) and new techniques (lane gating).

# Modeling

## GPU Power Modeling

* We model a GPU architecture similar to NVIDIA's GPUs.
* ![](https://i.imgur.com/gBM9oB2.png)
* Equation (1) captures, at a very high level, all aspects of GPU power that we model: leakage power, idle SM power, and the dynamic power of all components (αi: activity factor, Pmax,i: peak power).

### Infrastructure

* We rely on McPAT to model most microarchitectural blocks.
* We stay consistent with the process of abstracting the microarchitectural parameters of each component's circuit implementation in order to ensure configurability.
* Many components of McPAT are either not present in GPUs or are considerably different from their general-purpose CPU counterparts.
* We added or adapted several important blocks in McPAT to more accurately represent the underlying GPU microarchitecture.
* We model SRAM array structures using CACTI.

![](https://i.imgur.com/0cnGnlw.png)

## Microbenchmarking the uncertainties

### LSE Problem Formation

* A major source of inaccuracy in the initial power model is uncertainty due to undocumented design decisions in the target architecture.
* We use an iterative process to continuously refine the power model based on the discrepancies that we observe between the power model and the actual hardware power measurements.
* ![](https://i.imgur.com/E64QxMg.png)
* In equation (3), we model the dynamic power consumption as a linear combination of the access rate αi of each microarchitectural component multiplied by its peak power Pmax,i.
* In equation (4), we consider the modeling inaccuracy for a particular microarchitectural component i as an unknown variable xi.
* Thus, if there are N access rates for the different microarchitectural components, each microbenchmark will yield a microarchitectural component access-rate vector that constitutes one linear equation.
  With an arbitrary number of microbenchmarks, say M, this results in an M × N linear estimation problem, as shown in Equation (5).
* With a sufficiently large number of equations and hardware power measurements, we can solve for the modeling inaccuracies using least-squares estimation (LSE).

# Implementation

## Power Model Validation

### Experimental Setup

* For validation, we select two NVIDIA GPU cards with different architectures to demonstrate the power model's configurability.
* ![](https://i.imgur.com/ANyGFRH.png)

#### 1. Power Measurement Setup

* ![](https://i.imgur.com/UDqgu8f.jpg)
* ![](https://i.imgur.com/GFsks4X.png)
* The GPU card is connected to the PCIe slot through a PCIe riser card and to an ATX power supply. Both the PCIe riser card and the ATX power supply have power pins that deliver power to the GPU.
* For each power supply source, we measure the instantaneous current and voltage to compute power. We sense the current draw by measuring the voltage drop across a current-sensing resistor.
* We use an NI DAQ to sample the voltage drop at a rate of 2 million samples per second.
* ![](https://i.imgur.com/NRPB0fo.png)
* Figure 7 shows the peripheral components (GPU processor, DRAM modules, voltage regulator module) that we measure as part of the total GPU power.
* ![](https://i.imgur.com/QFeYWlL.png)

#### 2. Simulator Setup

* The software power model builds on McPAT 0.8 and is integrated with GPGPU-Sim version 3.2.1.
* We use GPGPU-Sim's PTXPlus mode to simulate the native instruction set on the Quadro GPU.
* We also use the NVIDIA compute profiler to ensure that the microbenchmarks' performance matches the target hardware.

### Constant Power Component

* In equation (6), Pconst includes processor leakage power, main memory leakage power, VRM power, and the power of all peripheral circuits.
* Pconst is independent of the processor/memory frequency, whereas Pdyn scales linearly with the processor/memory frequency. Therefore, we can rewrite equation (6) as equation (7).
* ![](https://i.imgur.com/q5REWqu.png)
* Using Equation (7) with varying frequencies, we can determine the constant power component.
* ![](https://i.imgur.com/p2A3tZD.png)
* With the simplified linear model in the equation, we performed a linear fit to obtain the constant power component.
* Subtracting the constant power from the measured total power gives us the *dynamic* power component.

# Result

* We validate the power model using microbenchmarks and real programs from public benchmark suites.
* We compare the average power and the dynamic power behavior of several kernels against the measured hardware results.
* ![](https://i.imgur.com/aYL6Q7W.png)
* Figure 9 shows the comparison between our power model and the hardware for microbenchmarks running on the GTX 480.
* The average absolute error for the microbenchmarks on the GTX 480 card is 15%.
* ![](https://i.imgur.com/Ga6Bntb.png)
* ![](https://i.imgur.com/VvbVpyJ.png)
* The simulated trace typically appears "noisier" than the measured trace because the hardware RLC circuit forms a low-pass filter that smooths out the measured power trace.
* ![](https://i.imgur.com/IKbnEwD.png)
* ![](https://i.imgur.com/kZoi5Lq.png)
* First, we assume a fast-responding on-chip regulator that can make P-state transitions within 500 cycles.
* Second, we consider a conventional off-chip regulator with a coarse-grained transition time of 10,000 cycles, or roughly 10 µs.
* ![](https://i.imgur.com/o4A2ruL.png)
* Benchmark HRTWL suffers from load imbalance.
* ![](https://i.imgur.com/ay3cEXB.png)

## Conclusion

* We demonstrate a configurable, cycle-level, and validated power model for GPGPUs that can be used for architecture and software energy-efficiency studies.
* The power model achieves an average absolute error within 9.9% for the GTX 480 card and 13.4% for the Quadro FX5600.
* Using GPUWattch, we show that DVFS is useful for reducing dynamic power consumption in GPGPU workloads.
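
The two fitting steps in these notes (the linear fit that isolates the constant power component, and the least-squares solve from LSE Problem Formation) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names and all numbers are hypothetical. It also assumes one reading of equations (3) to (5), in which each unknown xi is a multiplicative correction on the term αi · Pmax,i, and one reading of equation (7), in which total power is a constant plus a term proportional to frequency.

```python
import numpy as np


def fit_constant_power(freqs_mhz, total_power_w):
    """Fit P_total(f) = P_const + slope * f; return (P_const, slope).

    Assumes dynamic power scales linearly with frequency, so the
    intercept of the fitted line is the constant power component.
    """
    f = np.asarray(freqs_mhz, dtype=float)
    p = np.asarray(total_power_w, dtype=float)
    slope, intercept = np.polyfit(f, p, 1)  # degree-1 fit: [slope, intercept]
    return float(intercept), float(slope)


def solve_scaling_factors(activity, peak_power_w, measured_dyn_w):
    """Least-squares solve for the per-component correction factors x.

    activity:       M x N matrix, row m holds the access rates of
                    microbenchmark m for the N components.
    peak_power_w:   length-N vector of modeled peak powers.
    measured_dyn_w: length-M vector of measured dynamic power.
    Assumed model:  measured_dyn[m] ~= sum_i activity[m, i] * peak[i] * x[i]
    """
    a = np.asarray(activity, dtype=float) * np.asarray(peak_power_w, dtype=float)
    b = np.asarray(measured_dyn_w, dtype=float)
    x, *_ = np.linalg.lstsq(a, b, rcond=None)
    return x


if __name__ == "__main__":
    # Hypothetical frequency sweep of one kernel (MHz, watts).
    p_const, _ = fit_constant_power([400, 500, 600, 700],
                                    [100.0, 110.0, 120.0, 130.0])

    # Hypothetical activity of 4 microbenchmarks over 3 components.
    activity = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0],
                         [0.5, 0.5, 0.2]])
    peak = np.array([20.0, 35.0, 10.0])
    total = np.array([84.0, 88.0, 75.0, 89.0])  # measured total power

    # Subtract the constant component, then solve for the corrections.
    x = solve_scaling_factors(activity, peak, total - p_const)
    print("constant power:", p_const)
    print("correction factors:", x)
```

In practice M should be well above N (the notes mention 80 microbenchmarks), and the microbenchmarks should stress different components so that the M × N matrix has full column rank. Otherwise the least-squares solution is not unique.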