Keynote

Exploring Emerging Technologies in the HPC Extreme Scale Co-Design Space

Jeffrey S. Vetter, Oak Ridge National Laboratory and Georgia Institute of Technology

Abstract:

Concerns about energy efficiency and reliability have forced our community to reexamine the full spectrum of architectures, software, and algorithms that constitute our ecosystem. While architectures and programming models have remained relatively stable for almost two decades, new architectural features, such as heterogeneous processing, nonvolatile memory, and optical interconnection networks, will demand that software systems and applications be redesigned so that they expose massive amounts of hierarchical parallelism, carefully orchestrate data movement, and balance concerns over performance, power, resiliency, and productivity. In what DOE has termed 'co-design', teams of architects, software designers, and application scientists are working collectively to realize an integrated solution to these challenges. To tackle the challenge of power consumption, we are investigating the design of future memory hierarchies that include nonvolatile memory. In this talk, I will sample these emerging memory technologies and discuss how we are preparing applications and software for these upcoming systems with radically different memory hierarchies.

CV:

Jeffrey Vetter, Ph.D., holds a joint appointment between Oak Ridge National Laboratory (ORNL) and the Georgia Institute of Technology (GT). At ORNL, Vetter is a Distinguished R&D Staff Member and the founding group leader of the Future Technologies Group in the Computer Science and Mathematics Division. At GT, Vetter is a Joint Professor in the Computational Science and Engineering School, the Principal Investigator for the NSF-funded Keeneland Project that brings large-scale GPU resources to NSF users through XSEDE, and the Director of the NVIDIA CUDA Center of Excellence. His papers have won awards at the International Parallel and Distributed Processing Symposium and Euro-Par; he was awarded the ACM Gordon Bell Prize in 2010. His recent book “Contemporary High Performance Computing” surveys the international landscape of HPC. See his website for more information: http://ft.ornl.gov/~vetter/.

Paper Presentations

On the Potential of Significance-Driven Execution for Energy-Aware HPC

Philipp Gschwandtner, Charalampos Chalios, Dimitrios Nikolopoulos, Hans Vandierendonck, Thomas Fahringer

Dynamic Voltage and Frequency Scaling (DVFS) exhibits fundamental limitations as a method to reduce energy consumption in computing systems. In the HPC domain, where performance is of highest priority and codes are heavily optimized to minimize idle time, DVFS has limited opportunity to achieve substantial energy savings. This paper explores whether operating processors Near the transistor Threshold Voltage (NTV) is a better alternative to DVFS for breaking the power wall in HPC. NTV presents challenges, since it compromises both performance and reliability to reduce power consumption. We present a first-of-its-kind study of a significance-driven execution paradigm that selectively uses NTV and algorithmic error tolerance to reduce energy consumption in performance-constrained HPC environments. Using an iterative algorithm as a use case, we present an adaptive execution scheme that switches between near-threshold execution on many cores and above-threshold execution on one core, as the computational significance of iterations in the algorithm evolves over time. Using this scheme on state-of-the-art hardware, we demonstrate energy savings ranging from 35% to 67%, while compromising neither correctness nor performance.

Exploring Energy-Performance-Quality Tradeoffs for Scientific Workflows With In-situ Data Analyses

Georgiana Haldeman, Ivan Rodero, Manish Parashar, Sabela Ramos, Eddy Z. Zhang, Ulrich Kremer

Power and energy are critical concerns for high performance computing systems from multiple perspectives, including cost, reliability/resilience and sustainability. At the same time, data locality and the cost of data movement have become dominating concerns in scientific workflows. One potential solution for reducing data movement costs is to use a data analysis pipeline based on in-situ data analysis. However, the impact of current optimizations and their overheads on energy-performance-quality tradeoffs can be very hard to assess and understand at the application level. In this paper, we focus on exploring the performance and power/energy tradeoffs of different data movement strategies and how to balance these tradeoffs with quality of solution and data speculation. Our experimental evaluation covers different system and application configurations and gives insights into the energy-performance-quality tradeoff space for in-situ data-intensive application workflows. The key contribution of this work is a better understanding of the interactions between different computation, data movement, energy, and quality-of-result optimizations from a power-performance perspective, and a basis for modeling and exploiting these interactions.

Reducing the Cost of Power Monitoring with DC Wattmeters

M. Asunción Castaño, Sandra Catalán, Rafael Mayo Gual, Enrique S. Quintana-Ortí

The use of internal DC wattmeters, connected to the ATX lines that distribute power from the supply unit to the computer components, is an appealing method to profile power in server configurations due to the accurate and complete information provided by this approach. In this paper we enhance the appeal of this type of power meter by addressing one of its main drawbacks, namely, its high cost per node for cluster facilities. In particular, we provide a practical demonstration that it is possible to obtain accurate information for the total instantaneous power dissipation of a platform (and, therefore, the total energy consumption) by composing the information obtained from a few ATX lines into a reduced model. Additionally, we formulate a systematic methodology to build this model, based on a small number of calibration runs involving three standard benchmarks, that makes it possible i) to detect the minimum number of lines to profile; ii) to identify/select the most appropriate lines; and iii) to assign weights in order to build the reduced model. Our hypothesis is tested and experimentally validated using the complete collection of multithreaded codes in PARSEC, on two low-cost servers equipped with Intel© and AMD© multicore technology.
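The weighting step of such a reduced model can be illustrated with a minimal sketch. The abstract does not say how the weights are obtained; ordinary least squares without an intercept is assumed here purely for illustration, and the function names and sample values are hypothetical.

```python
def fit_line_weights(line_samples, total_samples):
    """Least-squares weights w such that sum_j w[j] * line_samples[i][j]
    approximates total_samples[i]. Assumption: plain least squares, no intercept."""
    n = len(line_samples[0])
    # Build the normal equations (A^T A) w = A^T b as an augmented matrix.
    aug = []
    for r in range(n):
        row = [sum(s[r] * s[c] for s in line_samples) for c in range(n)]
        row.append(sum(s[r] * t for s, t in zip(line_samples, total_samples)))
        aug.append(row)
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    weights = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * weights[c] for c in range(r + 1, n))
        weights[r] = (aug[r][n] - known) / aug[r][r]
    return weights


def estimate_total_power(weights, line_reading):
    """Estimated total power (W) from the readings of the profiled lines."""
    return sum(w * p for w, p in zip(weights, line_reading))


# Invented calibration data: power (W) on two profiled lines and the
# corresponding total measured with all lines instrumented.
calibration_lines = [(10.0, 20.0), (15.0, 22.0), (30.0, 25.0), (12.0, 40.0)]
calibration_total = [2.0 * a + 1.5 * b for a, b in calibration_lines]
weights = fit_line_weights(calibration_lines, calibration_total)
```

Once the weights are fitted from the calibration runs, only the selected lines need to be instrumented on the remaining nodes, and `estimate_total_power` is applied to each new sample.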

Towards a Generic Power Estimator

Leandro Cupertino, Georges Da Costa, Jean-Marc Pierson

Data centers play an important role in worldwide electrical energy consumption. Understanding their power dissipation is a key aspect of achieving energy efficiency. Some application-specific models have been proposed, while other, generic ones lack accuracy. The contributions of this paper are threefold. First, we expose the importance of modelling alternating-to-direct-current conversion losses. Second, a weakness of CPU-proportional models is demonstrated. Finally, a methodology to estimate the power consumed by applications with machine learning techniques is proposed. Since the results of such techniques are deeply data dependent, a study of devices' power profiles was carried out to generate a small set of synthetic benchmarks able to emulate generic applications' behaviour. Our approach is then compared with two other models, showing that the percentage error of energy estimation of an application can be less than 1%.

Benchmarking for power consumption monitoring

Michele Weiland, Nick Johnson

This paper presents a set of benchmarks that are designed to measure power consumption in parallel systems. The benchmarks range from low-level, single instructions or operations, to small kernels. In addition to describing the motivation behind developing the benchmarks and the design principles that were followed, the paper also introduces a metric to quantify the power-performance of a parallel system. Initial results are presented and help to illustrate the contribution of the paper.

Wakeup Latencies for Processor Idle States on Current x86 Processors

Robert Schöne, Daniel Molka, Michael Werner

Over the last decades, various low-power states have been implemented in processors. They can be used by the operating system to reduce power consumption. The applied power saving mechanisms include load-dependent frequency and voltage scaling as well as the temporary deactivation of unused components. These techniques reduce power consumption and thereby enable energy efficiency improvements if the system is not used to full capacity. However, inappropriate use of low-power states can significantly degrade performance, because the time required to re-establish full performance can be significant. Therefore, deep idle states are occasionally disabled, especially if applications have real-time requirements. In this paper, we describe how low-power states are implemented in current x86 processors. We then measure the wake-up latencies of various low-power states that occur when a processor core is reactivated. Finally, we compare our results to the vendor's specifications that are exposed to the operating system.

Monitoring Energy Consumption With SIOX -- Autonomous Monitoring Triggered by Abnormal Energy Consumption

Julian Kunkel, Alvaro Aguilera, Marc Christopher Wiedemann, Nathanael Hübbe, Michaela Zimmer

In the face of the growing complexity of HPC systems, their growing energy costs, and the increasing difficulty of running applications efficiently, a number of monitoring tools have been developed in recent years. SIOX is one such endeavor, with a uniquely holistic approach: it aims not merely to record a certain kind of data, but to make all relevant data available for analysis and optimization. Among other sources, this encompasses data from hardware energy counters and trace data from different hardware/software layers. However, not all data that can be recorded should be recorded. As such, SIOX needs good heuristics to determine when and what data needs to be collected, and energy consumption can provide an important signal about when the system is in a state that deserves closer attention. In this paper, we show that SIOX can use Likwid to collect and report the energy consumption of applications, and present how this data can be visualized using SIOX's web interface. Furthermore, we outline how SIOX can use this information to intelligently adjust the amount of data it collects, allowing it to reduce the monitoring overhead while still providing complete information about critical situations.

Measuring energy consumption using EML (Energy Measurement Library)

Alberto Cabrera, Francisco Almeida, Javier Arteaga, Vicente Blanco

Energy consumption and energy efficiency are major issues for High Performance Computing systems on the path to exascale computing. Researchers in the field are focusing their efforts on reducing the former and increasing the latter, yet there is currently no standard for energy measurement. Current energy measurement tools are specific and architecture-dependent, and this has to be addressed. A standard tool makes experiments independent of the hardware, so researchers can focus their effort on energy while maximizing the portability of experimental code across the many architectures available today. We present EML (Energy Measurement Library), a software library that eases access to energy measurement tools and can be easily extended to add new measurement systems. Using EML, it is feasible to obtain the architectural and algorithmic parameters that affect energy consumption and efficiency. The use of this library is tested in the field of analytic modeling of the energy consumed by parallel programs.

A power measurement environment for PCIe accelerators. Application to the Intel Xeon Phi

Francisco D. Igual, Luis M. Jara, J. Ignacio Gómez, Luis Piñuel, Manuel Prieto

We describe and validate a complete hardware/software environment for power consumption analysis of PCIe-based accelerators, using the Intel Xeon Phi co-processor as the target platform. Our environment is flexible and affordable (it is based on commodity instrumentation), and provides both accuracy and transparency for the user, which enables easy instrumentation of existing codes from the power consumption perspective. We present empirical power traces for two well-known scientific codes (LINPACK and libflame) that give insights not only into the benefits of the presented environment, but also into the power profile of the Intel Xeon Phi co-processor under different workloads.

Performance and Power Consumption Evaluation of Concurrent Queue Implementations in Embedded Systems

Lazaros Papadopoulos, Ivan Walulya, Paul Renaud-Goud, Philippas Tsigas, Dimitrios Soudris, Brendan Barry

Embedded and HPC systems face many common challenges. One of them is the synchronization of memory accesses to shared data. Concurrent queues have been extensively studied in the HPC domain and they are used in a wide variety of HPC applications. In this work, we evaluate a set of concurrent queue implementations on an embedded platform, in terms of execution time and power consumption. Our results show that by taking advantage of the embedded platform specifications, we achieve up to 28.2% lower execution time and 6.8% less power dissipation in comparison with the conventional lock-based queue implementation. We show that HPC applications utilizing concurrent queues can be efficiently implemented in embedded systems and that synchronization algorithms from the HPC domain can lead to optimal resource utilization of embedded platforms.

Evaluating the Performance and Energy Efficiency of the COSMO-ART Model System

Joseph Charles, William Sawyer, Manuel F. Dolz, Sandra Catalán

In this paper we investigate the energy footprint and performance profile of COSMO-ART on various HPC platforms. This model is an extension of the operational weather forecast model of the German weather service (DWD), developed for the evaluation of the interactions of reactive gases and aerosol particles with the state of the atmosphere at the regional scale. Different measurement devices and energy-aware techniques are described to evaluate both time and energy to solution of the considered application and to gain detailed insights into power and performance requirements. Our motivation is to improve corresponding code sections to sustain performance while minimizing energy-to-solution. This preliminary work sets the basis for subsequent studies to tackle challenges related to energy-efficient high performance computing in the framework of the Exa2Green project.

Are our Dense Linear Algebra Libraries Energy-Friendly? Time-Power-Energy Trade-Offs in BLAS and LAPACK

Jose I. Aliaga, Maria Barreda, Manuel F. Dolz, Enrique S. Quintana-Ortí

In this paper we conduct a detailed analysis of the sources of power dissipation and energy consumption during the execution of current dense linear algebra kernels on multicore processors, binding these two metrics together with performance to the arithmetic intensity of the operations. In particular, by leveraging the RAPL interface of an Intel E5 ("Sandy Bridge") six-core CPU, we decompose the power-energy duo into its core (mainly due to floating-point units and cache), RAM (off-chip accesses), and uncore components, performing a series of illustrative experiments for a range of memory-bound to CPU-bound high performance kernels. Additionally, we investigate the energy proportionality of these three architecture components for the execution of linear algebra routines on the Intel E5.
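As a rough illustration of how RAPL energy counters turn into the power figures discussed above, the following sketch converts two counter readings into energy and average power. It assumes readings in microjoules that wrap around at a known maximum, as exposed for example by the Linux powercap files `energy_uj` and `max_energy_range_uj`; the abstract does not state how the authors accessed RAPL. Deriving the uncore share as package minus core is a common approximation and likewise an assumption, and all function names are hypothetical.

```python
def energy_delta_uj(before_uj, after_uj, max_range_uj):
    """Energy (microjoules) consumed between two counter readings,
    tolerating at most one counter wraparound."""
    if after_uj >= before_uj:
        return after_uj - before_uj
    # The counter wrapped: add the full range before subtracting.
    return after_uj + max_range_uj - before_uj


def average_power_w(before_uj, after_uj, max_range_uj, elapsed_s):
    """Average power (watts) over the measurement interval."""
    return energy_delta_uj(before_uj, after_uj, max_range_uj) / 1e6 / elapsed_s


def decompose_energy_j(package_j, core_j, dram_j):
    """Split energy into core, RAM and uncore parts.
    Assumption: uncore is approximated as package minus core."""
    return {"core": core_j, "ram": dram_j, "uncore": package_j - core_j}
```

Sampling the counters before and after a kernel and dividing by the elapsed time gives the average power of each domain for that kernel. The interval must be short enough that the counter wraps at most once.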

Panel Discussion

Imminent Topics in Energy Efficiency Research

Thomas Ludwig

Research on energy efficiency has become very diverse over the years and, as a consequence, seems to have lost some momentum. We do not see much dedicated funding for energy efficiency in HPC. For the discussion with funding agencies, what would be our top research priorities? Could we set up a roadmap for HPC energy efficiency research? We would love to discuss this with the audience of EnA-HPC.

Vendor Talks

Beyond Power Efficiency: the Path to True Datacenter Sustainability

Nicolas Dubé, HP

The last decade saw the emergence of the widely known PUE metric. But now that datacenter operators are arguing over PUEs under 1.1, are we really driving the right behaviour in a broader eco-responsible sense? Are power and energy efficiency necessarily equivalent? Although evaporative cooling works superbly compared to compressor-based chillers, can we consume water that freely? Could the carbon footprint of electricity drive site selection of datacenters? Ultimately, what would be the vision for a true net-zero datacenter...

Intel technologies, tools and techniques for power and energy efficiency analysis

Andrey Semin, Intel

HPC system power and energy efficiency should be optimized using a holistic approach that takes into account the complex interactions between the wide range of hardware and software components of a system and the datacentre infrastructure. New metrics such as TUE and ITUE allow architects to gauge energy usage effectiveness for specific application workloads. Intel's HPC platforms have a great set of integrated technologies for power monitoring, analysis and optimization. In the presentation we'll outline Intel's techniques and show examples of tools for ITUE analysis and optimization for several HPC application workloads.
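The abstract does not define these metrics. As they are commonly defined, PUE is total facility energy divided by the energy delivered to IT equipment, ITUE is the energy delivered to IT equipment divided by the energy reaching the compute components, and TUE is the product of the two. A minimal sketch under those definitions, with hypothetical function names and invented figures in the usage note:

```python
def pue(facility_kwh, it_kwh):
    """Power usage effectiveness: facility energy per unit of IT energy."""
    return facility_kwh / it_kwh


def itue(it_kwh, compute_kwh):
    """IT-power usage effectiveness: IT energy per unit of energy that
    reaches the compute components (excluding fans, power supplies, etc.)."""
    return it_kwh / compute_kwh


def tue(facility_kwh, it_kwh, compute_kwh):
    """Total-power usage effectiveness, the product ITUE * PUE."""
    return itue(it_kwh, compute_kwh) * pue(facility_kwh, it_kwh)
```

For example, a facility drawing 1200 kWh to deliver 1000 kWh to IT equipment, of which 800 kWh reaches the compute components, has a PUE of 1.2, an ITUE of 1.25 and a TUE of 1.5.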

Monitoring and Controlling Power Usage on Cray XC30

Wilfried Oed, Cray

The Cray XC30 system provides comprehensive monitoring, logging where power is consumed and by which users. Power consumption is sampled periodically and reported via the Hardware Supervisory System. As each job completes, the monitoring subsystem logs both the energy consumed by the job and the CPU time. This infrastructure enables Cray XC30 sites to account for the total cost of each job. Energy efficiency measures such as the Green500 report performance on the Linpack benchmark in megaflops per watt, an interesting measure, but not one that necessarily relates to sustained performance. The Cray XC30 allows users, administrators, and funding agencies to measure and account for the energy efficiency of their systems on their production workload.

Power consumption is monitored at the blade level with each blade reporting usage by the nodes and the network. Data is aggregated by the cabinet controllers and reported out-of-band to the system management workstation where it is logged for subsequent processing. Information on the allocation of nodes to jobs is logged at the same time, enabling energy consumption to be added to the data collected for each job.
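The combination of power samples and node allocations described above can be sketched as follows. This is not the Cray implementation: the record layouts, the fixed sampling interval and the function name are all assumptions made for illustration.

```python
from collections import defaultdict


def energy_per_job(samples, allocations, interval_s):
    """Attribute sampled node power to jobs.

    samples:     iterable of (timestamp_s, node_id, power_w) tuples
    allocations: dict mapping job id to (set of node ids, start_s, end_s)
    interval_s:  assumed fixed sampling interval; each sample is taken to
                 represent that many seconds at the sampled power
    Returns a dict mapping job id to energy in joules.
    """
    energy_j = defaultdict(float)
    for timestamp_s, node_id, power_w in samples:
        for job_id, (nodes, start_s, end_s) in allocations.items():
            # A sample counts toward a job if the node was allocated to it
            # and the sample falls inside the job's run time.
            if node_id in nodes and start_s <= timestamp_s < end_s:
                energy_j[job_id] += power_w * interval_s
    return dict(energy_j)
```

Samples taken outside any job's run time, or on unallocated nodes, are simply not attributed to a job in this sketch. A real accounting system would also need a policy for idle power and for shared components such as the network.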

Enhanced Power Monitoring with Megware SlideSX

Thomas Blum, Megware

In recent years, measuring power consumption on the AC side within server and HPC environments has become an important factor in fostering energy-efficient systems in general. With the MEGWARE SlideSX® computing platform we introduce a way to measure the power on the DC side for every single compute node with a very fine-grained resolution. This talk will introduce the computing platform, including the measurement and monitoring facilities that come built into the system, and give an overview of how this information can be accessed and further used and which functionalities are under development.

High Definition Energy Efficiency Monitoring

Marc Simon, Bull

As the cost of energy becomes the limiting factor in the growth of HPC cluster capacity, it becomes more and more important to have a precise measurement of the power consumption of each component of the cluster. The need for more precision implies a higher sampling rate on the sensor values. In addition, a better time resolution allows users to match power consumption to the internals of their code. The higher speed and the volume of data generated make it necessary to move a part of the measuring work closer to the components on the managed board. A first step is to move the work of calibrating the sensors and aggregating data to the BMC level. In a second step, a dedicated FPGA keeps track of power consumption at high speed and makes it available to the BMC. While aggregated data may still be read through the administration network for accounting purposes, a high-frequency recording of current variations is made available to the host through a PCIe link. After a prototype based on water-cooled B710 blades for HPC and IvyBridge processors, this solution for high-precision energy measurement will be available on the next generation of B720 blades with Haswell processors.