Keynote Speeches

Near Threshold Computing: Reclaiming Moore's Law

Ron Dreslinski

Towards Atomic-Spin Based Computation: Realizing Logic Operations Atom by Atom

Jens Wiebe

Nano-spintronics pursues the goal of using the spin of electrons bound to atoms or molecules to design components of future information technology. The concept promises fast and energy-efficient atomic-scale devices that are compatible with non-volatile storage technology. I will show how we combined bottom-up atomic fabrication with spin-resolved scanning tunneling microscopy to construct and read out atomic-scale model systems that perform logic operations such as NOT and OR.

Energy-aware Software Development for Massive Scale Systems

Torsten Hoefler

The power consumption of HPC systems is a growing concern as large-scale systems increase in size while voltage scaling slows down and leakage currents increase. It is unreasonable to expect that a single system can consume more than 20 MW, which makes the road to larger scales harder. While we may be able to meet the challenge of building an Exascale machine within this power budget, it is unclear whether practical algorithms and implementations can operate at the required power efficiency. In this talk, we show examples of power-aware algorithm analysis and modeling. We also demonstrate that the network quickly becomes the key concern with regard to performance and power. We show hardware and software techniques to limit power consumption and describe how network-centric programming can potentially further mitigate those concerns. We then describe several techniques for power-aware parallel programming and power modeling as directions for future research in this area.

Invited Talks

Efficiencies of Climate Computing in Australia

Tim Pugh

The presentation describes present and future computing work in climate and earth system science and its effects on Australia's current terascale computing facilities and on the petascale facilities arriving in late 2012 and beyond. The efficiency with which these systems and facilities support climate computing will determine the level of investment needed, or the scope of climate and earth system science that can be afforded.

Scientific Sessions

Flexible Workload Generation for HPC Cluster Efficiency Benchmarking

Daniel Molka, Daniel Hackenberg, Robert Schöne, Timo Minartz, Wolfgang E. Nagel

The High Performance Computing (HPC) community is well accustomed to the general idea of benchmarking. In particular, the TOP500 ranking and its foundation, the Linpack benchmark, have shaped the field since the early 1990s. Other benchmarks with a larger workload variety, such as SPEC MPI2007, are also well accepted and often used to compare and rate a system's computational capability. However, in a petascale and soon-to-be exascale computing environment, the power consumption of HPC systems, and consequently their energy efficiency, has been and continues to be of growing importance, often outweighing aspects that focus narrowly on raw compute performance. The Green500 list is the first major attempt to rank the energy efficiency of HPC systems. However, its main weakness is again the focus on a single, highly compute-bound algorithm. Moreover, its method of extrapolating a system's power consumption from a single node is inherently error-prone. So far, no benchmark is available that has been developed from the ground up with an explicit focus on measuring the energy efficiency of HPC clusters. We therefore introduce such a benchmark, which includes transparent energy measurements with professional power analyzers. Our efforts are based on well-established standards (C, POSIX-IO and MPI) to ensure broad applicability. Our well-defined and comprehensible workloads can be used, for example, to compare the efficiency of HPC systems or to track the effects of power-saving mechanisms that can hardly be understood by running regular applications, due to their overwhelming complexity.
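
As a rough illustration of how such workloads can be built on the named standards, the following minimal sketch (not the authors' benchmark; the workload sizes and phase names are assumptions) times a compute phase and an MPI collective so that the timestamps can later be correlated with the log of an external power analyzer.

/* Minimal sketch, not the authors' benchmark: a timed MPI workload phase
   whose start/end timestamps can be correlated with the log of an external
   power analyzer. Workload sizes and phase names are assumptions. */
#include <mpi.h>
#include <stdio.h>

#define N (1 << 22)

static double a[N];                          /* compute-phase working set */

static double compute_phase(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)              /* simple compute-bound kernel */
        a[i] = (double)i * 0.5;
    for (int i = 0; i < N; i++)
        sum += a[i];
    return sum;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);             /* align phase start on all ranks */
    double t0 = MPI_Wtime();
    double local = compute_phase();          /* compute-bound part */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);  /* communication-bound part */
    double t1 = MPI_Wtime();

    if (rank == 0)                           /* timestamps for power correlation */
        printf("phase compute+allreduce: %.6f s (checksum %.3e)\n",
               t1 - t0, global);
    MPI_Finalize();
    return 0;
}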

Determine Energy-Saving Potential in Wait-States of Large-Scale Parallel Programs

Michael Knobloch, Bernd Mohr, Timo Minartz

Energy consumption has become one of the major topics in high performance computing (HPC) in recent years. However, little effort is put into energy analysis by developers of HPC applications. We present our approach to combined performance and energy analysis using the performance analysis tool-set Scalasca. Scalasca's parallel wait-state analysis is extended by a calculation of the energy-saving potential when a lower power state can be used.
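
A first-order estimate of this potential, written here as a back-of-the-envelope assumption rather than the formula used in the extended Scalasca analysis, is simply the waiting time multiplied by the power difference between the current and the lower power state:

/* Back-of-the-envelope sketch, assumed rather than taken from Scalasca:
   energy-saving potential of a wait state = waiting time times the power
   difference between the current and the lower power state. */
double energy_saving_potential_joules(double wait_time_s,
                                      double p_current_w,
                                      double p_lower_w)
{
    double delta_w = p_current_w - p_lower_w;          /* power saved while waiting */
    return wait_time_s * (delta_w > 0.0 ? delta_w : 0.0);
}

For example, 30 s of waiting at a 25 W power difference corresponds to a saving potential of 750 J per core.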

DVFS-Control Techniques for Dense Linear Algebra Operations on Multi-Core Processors

Pedro Alonso, Manuel F. Dolz, Francisco Igual, Rafael Mayo, Enrique S. Quintana-Ortí

This paper analyzes the impact on power consumption of two DVFS-control strategies when applied to the execution of dense linear algebra operations on multi-core processors. The strategies considered here, prototyped as the Slack Reduction Algorithm (SRA) and the Race-to-Idle Algorithm (RIA), adjust the operating frequency of the cores during the execution of a collection of tasks (into which many dense linear algebra algorithms can be decomposed), following very different approaches to saving energy. A power-aware simulator, in charge of scheduling the execution of tasks on processor cores, is employed to evaluate the performance benefits of these power-control policies for two reference algorithms for the LU factorization, a key operation for the solution of linear systems of equations.
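
To make the race-to-idle idea concrete, the following sketch (an illustration, not the authors' RIA prototype; the frequencies and the policy trigger are assumptions) raises a core to its highest frequency while tasks are pending and drops it to the lowest frequency once its queue is empty, using the Linux cpufreq sysfs interface, which requires root privileges and the "userspace" governor.

/* Illustrative race-to-idle policy step, not the authors' RIA prototype:
   run at the highest frequency while tasks are pending, drop to the lowest
   frequency once the core's queue is empty. The frequencies below are
   placeholders. */
#include <stdio.h>

static int set_core_khz(int core, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", core);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%ld\n", khz);
    fclose(f);
    return 0;
}

void race_to_idle_step(int core, int tasks_pending)
{
    if (tasks_pending)
        set_core_khz(core, 2400000);   /* race: highest available frequency */
    else
        set_core_khz(core, 800000);    /* idle: lowest available frequency  */
}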

Energy-aware job scheduler for high-performance computing

Olli Mämmelä, Mikko Majanen, Robert Basmadjian, Hermann De Meer, André Giesler, Willi Homberg

In recent years energy-aware computing has become a major topic, not only for wireless and mobile devices but also for devices using wired technology. The ICT industry is consuming an increasing amount of energy, and a large part of this consumption is generated by large-scale data centres. In High-Performance Computing (HPC) data centres, higher performance equals higher energy consumption. This has created incentives for exploring several alternatives to reduce the energy consumption of the system, such as energy-efficient hardware or the Dynamic Voltage and Frequency Scaling (DVFS) technique. This work presents an energy-aware scheduler that can be applied to an HPC data centre without any changes in hardware. The scheduler is evaluated with a simulation model and a real-world HPC testbed. Our experiments indicate that the scheduler is able to reduce energy consumption by 6-16%, depending on the job workload. More importantly, there is no significant slowdown in the turnaround time or increase in the wait time of the jobs. The results show that our approach can be beneficial for HPC data centre operators without a large penalty on service level agreements.
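
One common building block of such schedulers, sketched below purely for illustration (it is not the authors' algorithm; the data structures and the threshold are assumptions), is to put nodes that have been idle for longer than a threshold into a low-power state and to wake them again when jobs are queued.

/* Simplified sketch of one energy-aware scheduling idea, not the authors'
   algorithm: sleep nodes that have been idle longer than a threshold and
   wake them again when the job queue grows. */
#include <stddef.h>

enum node_state { NODE_BUSY, NODE_IDLE, NODE_SLEEPING };

struct node {
    enum node_state state;
    double idle_since;   /* timestamp (s) when the node became idle */
};

void apply_power_policy(struct node *nodes, size_t n_nodes,
                        size_t queued_jobs, double now,
                        double idle_threshold_s)
{
    for (size_t i = 0; i < n_nodes; i++) {
        if (nodes[i].state == NODE_IDLE && queued_jobs == 0 &&
            now - nodes[i].idle_since > idle_threshold_s) {
            nodes[i].state = NODE_SLEEPING;  /* e.g. suspend the node */
        } else if (nodes[i].state == NODE_SLEEPING && queued_jobs > 0) {
            nodes[i].state = NODE_IDLE;      /* wake up, ready for dispatch */
        }
    }
}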

Power-Aware Predictive Models of Hybrid (MPI/OpenMP) Scientific Applications

Charles Lively, Xingfu Wu, Valerie Taylor, Shirley Moore, Hung-Ching Chang, Chun-Yi Su, Kirk Cameron

Predictive models enable a better understanding of the performance characteristics of applications on multicore systems. Previous work has utilized performance counters in a system-centered approach to model power consumption for the system, CPU, and memory components. Often, these approaches use the same group of counters across different applications. In contrast, we develop application-centric models (based upon performance counters) for the runtime and power consumption of the system, CPU, and memory components. Our work analyzes four hybrid (MPI/OpenMP) applications: the NAS Parallel Multizone Benchmarks (BT-MZ, SP-MZ, LU-MZ) and the Gyrokinetic Toroidal Code, GTC. Our models show that cache utilization (L1/L2), branch instructions, TLB data misses, and system resource stalls affect the performance of each application and each performance component differently. We show that the L2 total cache hits counter affects performance across all applications. The models are validated against system and component power measurements with an error rate of less than 3%.
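
The general shape of such counter-based models, sketched here with assumed coefficients and counter choices rather than the authors' fitted parameters, is a weighted sum of per-interval counter rates on top of an idle baseline:

/* Sketch of the general shape of a counter-based power model; the weights
   and the choice of counters are assumptions, not the authors' fitted model. */
double predict_power_watts(const double *counter_rates,   /* events per second */
                           const double *weights,         /* fitted offline    */
                           int n_counters,
                           double idle_power_watts)       /* measured baseline */
{
    double p = idle_power_watts;
    for (int i = 0; i < n_counters; i++)
        p += weights[i] * counter_rates[i];
    return p;
}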

Profiling High Performance Dense Linear Algebra Algorithms on Multicore Architectures for Power and Energy Efficiency

Hatem Ltaief, Piotr Luszczek, Jack Dongarra

This paper presents the power profiles of two high performance dense linear algebra libraries, LAPACK and PLASMA. The former is based on block algorithms that use the fork-join paradigm to achieve parallel performance. The latter uses fine-grained task parallelism that recasts the computation to operate on submatrices called tiles; in this way, tile algorithms are formed. We show results from the power profiling of the most common routines, which permits us to clearly identify the different phases of the computations. This allows us to isolate the bottlenecks in terms of energy efficiency.

Towards an Energy-Aware Scientific I/O Interface -- Stretching the ADIOS Interface to Foster Performance Analysis and Energy Awareness

Julian Kunkel, Timo Minartz, Michael Kuhn, Thomas Ludwig

Intelligently switching the energy-saving modes of CPUs, NICs and disks is mandatory to reduce energy consumption. The hardware and the operating system have only a limited perspective on future performance demands, so automatic control is suboptimal. However, it is tedious for developers to control the hardware themselves. In this paper we propose an extension of an existing I/O interface which, on the one hand, is easy to use and, on the other hand, could steer energy-saving modes more efficiently. Furthermore, the proposed modifications are beneficial for performance analysis and provide additional information to the I/O library to improve performance. When annotating the program with the proposed interface, the developer labels I/O, communication and computation phases. Run-time behavior is then characterized for each phase, and this knowledge can be exploited by the new library.
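
The following sketch illustrates the annotation idea from the application side; the functions shown are hypothetical stand-ins, not the ADIOS API and not necessarily the interface proposed here, and the stub bodies only print markers where a real library would record timestamps and switch device power states.

/* Hypothetical annotation sketch: these functions are NOT the ADIOS API and
   not necessarily the interface proposed in the paper. The stubs only print
   markers where a real library would record timestamps and switch the power
   states of CPUs, NICs and disks per phase. */
#include <stdio.h>

enum phase_kind { PHASE_COMPUTE, PHASE_COMMUNICATION, PHASE_IO };

static const char *phase_name[] = { "compute", "communication", "io" };

static void phase_begin(enum phase_kind k) { printf("begin %s\n", phase_name[k]); }
static void phase_end(enum phase_kind k)   { printf("end %s\n", phase_name[k]); }

int main(void)
{
    phase_begin(PHASE_COMPUTE);
    /* ... numerical kernel: NIC and disks could enter low-power modes ... */
    phase_end(PHASE_COMPUTE);

    phase_begin(PHASE_IO);
    /* ... checkpoint write: disks must be active, CPUs may be throttled ... */
    phase_end(PHASE_IO);
    return 0;
}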

Brainware for Green HPC

Christian Bischof, Dieter An Mey, Christian Iwainsky

The reduction of the infrastructural costs of HPC, in particular power consumption, is currently driven mainly by architectural advances in hardware. Recently, in the quest for the EFlop/s, hardware-software co-design has been advocated, owing to the realization that without some software support only heroic programmers could use high-end HPC machines. However, in the topically diverse world of universities, the EFlop/s is still very far off for most users, and yet their computational demands will shape the HPC landscape for the foreseeable future. Based on experiences gained at RWTH Aachen University and in the context of the distributed Computational Science and Engineering support of the UK HECToR program, we claim, on economic grounds, that HPC hardware and software installations need to be complemented by a "brainware" component, i.e., trained HPC specialists supporting the performance optimization of users' codes. This statement itself is not new, and the establishment of simulation labs at HPC centers echoes this fact. However, based on our experiences, we quantify the savings resulting from brainware, thus providing an economic argument that sufficient brainware must be an integral part of any "green" HPC installation. It follows that the current HPC funding regimes, which favor iron over staff, are fundamentally flawed, and that long-term efficient HPC deployment must emphasize brainware development to a much greater extent.

Design Space Exploration Towards a Realtime and Energy-Aware GPGPU-based Analysis of Biosensor Data

Constantin Timm, Frank Weichert, Peter Marwedel, Heinrich Müller

In this paper, novel objectives for the design space exploration of GPGPU applications are presented. The design space exploration takes the combination of energy efficiency and realtime requirements into account. This is completely different from the most common high performance computing objective, which is to accelerate an application as much as possible. As a proof of concept, a GPGPU-based image processing and virus detection pipeline for a newly developed biosensor, called PAMONO, is presented. The importance of realtime-capable and portable biosensors increases with the rising number of virus infections spreading worldwide. The local availability of biosensors, e.g. at airports, to detect viruses in situ demands that cost and energy be taken into account in the development of GPGPU-based biosensors. Considering energy is especially important with respect to green computing. The results of the conducted design space exploration show that during the design process of a GPGPU-based application the platform must also be evaluated to obtain the most energy-aware solution. In particular, it was shown that an increasing number of parallel cores does not necessarily decrease the energy consumption.

Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Graphics Processors

Hartwig Anzt, Vincent Heuveline, Maribel Castillo, Juan C. Fernández, Rafael Mayo, Enrique S. Quintana-Ortí, Francisco D. Igual

In this paper, we analyze the power consumption of different GPU-accelerated iterative solver implementations enhanced with energy-saving techniques. Specifically, while conducting kernel calls on the graphics accelerator, we manually set the host system to a power-efficient idle-wait status so as to leverage dynamic voltage and frequency control. While the use of iterative refinement combined with mixed precision arithmetic often improves the execution time of an iterative solver on a graphics processor, this may not necessarily be true for the power consumption as well. To analyze the trade-off between computation time and power consumption, we compare a plain GMRES solver and its preconditioned variant with the mixed-precision iterative refinement implementations based on the respective solvers. Benchmark experiments conclusively reveal how the use of idle-wait during GPU kernel calls effectively leverages the power-saving capabilities provided by the hardware and improves the energy performance of the algorithm.
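
On the host side, one way such an idle-wait can be realized, shown here only as a sketch that assumes the CUDA runtime API and is not the authors' exact code, is to select blocking synchronization so the CPU thread sleeps rather than busy-waits while kernels execute:

/* Host-side sketch of the idle-wait idea, assuming the CUDA runtime API;
   not the authors' exact code. Blocking synchronization lets the host thread
   sleep instead of busy-waiting while GPU kernels execute, so the CPU can
   enter a low-power state. Compile with nvcc. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    /* Must be set before the CUDA context is created on this host thread. */
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* ... launch the GMRES / iterative refinement kernels here ... */

    cudaDeviceSynchronize();   /* host blocks (sleeps) until the GPU is done,
                                  allowing CPU idle states / DVFS to kick in */
    return 0;
}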

Analysis and Optimization of the Power Efficiency of GPU Processing Element for SIMD Computing

Daqi Ren, Reiji Suda

Estimating and analyzing the power-consumption characteristics of a program on a hardware platform is important for High Performance Computing (HPC) program optimization. A reasonable evaluation can help to handle the critical design constraints at the software level, choosing a preferable algorithm in order to reach the best power performance. In this paper we illustrate a simple experimental method to examine SIMD computing on GPU and multicore computers. By measuring the power of each component and analyzing the execution speed, power parameters are captured and the power-consumption characteristics are analyzed and summarized. Thereafter, the power efficiency of this SIMD computation at any scale on the platform can be evaluated simply on the basis of these characteristics. The precision of the above approximation is examined and a detailed error analysis is provided. The power consumption prediction has been validated by comparative analysis on real systems.
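
Once per-component power figures and the kernel execution time are measured, the evaluation reduces to elementary arithmetic, sketched below as an assumption about the general form rather than the authors' exact model:

/* Elementary arithmetic behind the evaluation, stated here as an assumption
   about the general form rather than the authors' exact model. */
double energy_joules(double p_cpu_w, double p_gpu_w, double p_mem_w,
                     double runtime_s)
{
    return (p_cpu_w + p_gpu_w + p_mem_w) * runtime_s;   /* E = P_total * t */
}

double flops_per_joule(double total_flops, double energy_j)
{
    return total_flops / energy_j;                      /* power efficiency */
}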

Measuring power consumption on IBM Blue Gene/P

Michael Hennecke, Wolfgang Frings, Willi Homberg, Anke Zitz, Michael Knobloch, Hans Böttiger

Energy efficiency is a key design principle of the IBM Blue Gene series of supercomputers, and Blue Gene systems have consistently gained top GFlops/Watt rankings on the Green500 list. The Blue Gene hardware and management software provide built-in features to monitor power consumption at all levels of the machine's power distribution network. This paper presents the Blue Gene/P power measurement infrastructure and discusses the operational aspects of using this infrastructure on Petascale machines. We also describe the integration of Blue Gene power monitoring capabilities into system-level tools like LLview, and highlight some results of analyzing the production workload at Research Center Jülich (FZJ).

Technical Sessions

Energy efficient computing with NVIDIA GPUs

Axel Koehler, NVIDIA

Highly Efficient Datacenters: Status and Trends

Frank Baetke, HP

Energy Efficiency Metrics and Cray XE6 Application Performance

Wilfried Oed, Cray

The leverage of the right choice of DRAM in improving performance and reducing power consumption of HPC systems

Gerd Schauss, Samsung

SuperMUC at LRZ

Klaus Gottschalk, IBM

HPC systems energy efficiency optimization through hardware-software co-design based on Intel technologies

Andrey Semin, Intel

The successful design of energy-efficient HPC solutions has to take into account the complex interactions between a wide range of hardware and software components, beyond the traditional focus on the power efficiency of individual CPUs, memory, cooling infrastructure, and other components. Since many aspects of HPC parallel clusters extend beyond the CPU and the design of a single server, the role of software in the overall energy consumption is becoming critical. Thus, hardware features should be developed in the context of the software being executed, and the software may need to be adapted to make the best use of the datacenter design and the available resources. In this presentation we outline the main challenges and provide an overview of Intel's work on processors, platforms, datacenter designs, and software techniques for optimizing the energy utilization of HPC systems.

New AMD Opteron Power Management Overview

André Heidekrüger, AMD

Energy-Efficient Data-Intensive Supercomputing

Ernst M. Mutke, Convey/HMK

Energy Efficient HPC Storage - A possibility or a pipe dream

Torben Kling Petersen, Xyratex