Extreme Scale Systems for Machine Learning
This project focuses on the greatest challenges in utilizing HPC, especially the upcoming Aurora exascale supercomputer at Argonne National Laboratory, for machine learning. It seeks to optimize methods for exploiting data parallelism, model parallelism, ensembles, and parameter search.
In recent years, the models and data available for machine learning applications have grown dramatically. Extreme-scale HPC systems offer the opportunity to further accelerate performance and deepen understanding of large data sets through machine learning. However, the current literature and public implementations focus mostly on cloud-based or small-scale GPU environments. For example, these implementations do not make the best use of low-latency inter-node communication in HPC environments (e.g., RDMA), one of the biggest advantages of a supercomputer. To leverage extreme-scale systems for ML applications, serious advances are required in both algorithms and their scalable, parallel implementations. Examples include training large models on large scientific data, facilitating very large ensembles, and addressing the inherent problems (e.g., noisy or missing data, scalable ingest) associated with large datasets.
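The data-parallelism pattern mentioned above can be illustrated with a minimal sketch: each worker computes a gradient on its own shard of the data, the gradients are averaged (the role an MPI/RDMA allreduce plays on a real HPC system), and every worker applies the same update. The linear model, function names, and parameter values below are illustrative assumptions, not part of the project's actual codebase.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of mean squared error for a linear model on one data shard."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

def data_parallel_sgd(shards, w, lr=0.1, steps=100):
    for _ in range(steps):
        # Each shard's gradient would be computed on its own node in practice.
        grads = [local_gradient(w, X, y) for X, y in shards]
        g = np.mean(grads, axis=0)   # emulated allreduce: average across workers
        w = w - lr * g               # identical update applied on every worker
    return w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
y = X @ w_true
shards = [(X[i::4], y[i::4]) for i in range(4)]  # split data across 4 "workers"
w = data_parallel_sgd(shards, np.zeros(2))
```

Because every worker sees the same averaged gradient, the replicas stay in lockstep; at scale, the cost of that averaging step is exactly where low-latency interconnects matter.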
RAMSES: Robust Analytic Models for Science at Extreme Scales
At the Mathematics and Computer Science Division of Argonne National Laboratory, I develop end-to-end analytical performance models to transform our understanding of the behavior of science workflows in extreme-scale science environments. These models are developed to predict the behavior of a science workflow before it is implemented, to explain why performance does not meet design goals, and to architect science environments to meet workflow needs.
I focus on:
- Modeling and simulating end-to-end data transfers over wide-area networks;
- Analyzing www.globus.org transfer logs to explain wide-area network data transfer performance;
- Building a modeling and simulation program that can effectively and efficiently explain the behavior of scientific workflows over a distributed infrastructure.
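To make the first bullet concrete, here is a deliberately simple analytic model of an end-to-end transfer: wall time is a fixed startup cost plus bytes over an effective rate, where concurrency improves link utilization with diminishing returns. The functional form and every parameter value are illustrative assumptions, not measured Globus data or the project's actual model.

```python
def transfer_time(size_bytes, rtt_s, bandwidth_bps, concurrency, startup_s=1.0):
    """Predict transfer wall time under a toy analytic model.

    utilization follows a made-up diminishing-returns curve in the number
    of concurrent streams; a real model would be fit to transfer logs.
    """
    utilization = concurrency / (concurrency + 1.0)
    effective_rate = bandwidth_bps * utilization   # bytes per second actually achieved
    return startup_s + rtt_s + size_bytes / effective_rate

# Example: 10 GB over a 10 Gb/s link (1.25e9 B/s), 50 ms RTT, 4 streams.
t = transfer_time(10e9, 0.05, 10e9 / 8, concurrency=4)
```

Even a toy model like this can answer "what if" questions (more streams, bigger files, longer RTT) before a workflow is deployed, which is the point of the end-to-end modeling work.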
AMASE: Architecture and Management for Autonomic Science Ecosystems
Scientific computing systems are becoming significantly more complex, and have reached a critical limit in manageability using current human-in-the-loop techniques. The current state-of-the-art for managing HPC infrastructures does not leverage the remarkable advances in machine learning to more accurately predict, diagnose, and improve computational resources in response to user computation. The DOE science complex consists of thousands of interconnected systems that are geographically distributed. As distributed teams and complex workflows now span resources from telescopes and light sources to fast networks and smart IoT sensor systems, it is clear that a single, centralized, administrative team and software stack cannot coordinate and manage all of the resources. Instead, resources must begin to respond autonomically, managing and tuning their behavior in response to scientific workflows. This research proposal outlines a plan to explore the architecture, methods, and algorithms needed to support future scientific computing systems that self-tune and self-manage. We propose to make the science ecosystem smart by incorporating the functions of sensing, intelligence, and control. Our aim is threefold:
- Design a scalable architecture for smart science ecosystems.
- Embed intelligence in relevant sub-systems via light-weight machine learning.
- Explore methods for distributed and autonomous management of the systems.
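The sensing / intelligence / control functions above can be sketched as a closed loop over one tunable resource. Everything below is a hypothetical stand-in: the tunable (transfer stream count), the threshold policy, and the metric names are invented for illustration, and the "intelligence" step would in practice be a lightweight learned model embedded in the subsystem rather than a fixed rule.

```python
def sense(metrics_history):
    """Sensing: take the most recent observation of the subsystem."""
    return metrics_history[-1]

def decide(observation, target_gbps):
    """Intelligence (toy rule): add a stream when below target,
    drop one when well above it, otherwise hold steady."""
    if observation["throughput_gbps"] < target_gbps:
        return +1
    if observation["throughput_gbps"] > target_gbps * 1.2:
        return -1
    return 0

def act(config, delta):
    """Control: apply the decision to the subsystem's configuration."""
    config["streams"] = max(1, config["streams"] + delta)
    return config

config = {"streams": 2}
history = [{"throughput_gbps": 3.0}]          # observed: below the 8 Gb/s target
config = act(config, decide(sense(history), target_gbps=8.0))
```

Running such a loop locally on each subsystem, rather than in one central controller, is what lets resources "respond autonomically" as the proposal describes.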
We believe the outcome of this research, the design and prototype of a smart distributed science ecosystem, has many benefits:
- Scientists using DOE computing infrastructure will be able to run workflows on automatically selected resources that are dynamically configured and tuned for their application.
- Facility and network operators will have the ability to predict and diagnose problems before they cause downtime.
High Performance Computing and Simulation
At the Computational Sciences and Engineering Division of Oak Ridge National Laboratory, I worked on:
- A framework for efficient simulation on multi-GPU and multi-core clusters, designed to accommodate the hierarchical organization as well as the heterogeneity of state-of-the-art parallel computing platforms. We designed it to support agent-based simulation and 3D finite-difference simulation (stencil computation);
- Interactive, GPU-based evaluation (faster than real time) of evacuation scenarios at the state scale; [code & demo ]
- An earthquake wave propagation model implemented on multiple GPUs using CUDA and the framework described above. [code & demo ]
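For readers unfamiliar with stencil computation, here is a minimal serial reference in NumPy: a 7-point 3D finite-difference sweep in which each interior cell is replaced by the average of its six neighbors. This is only an illustration of the computational pattern the framework parallelizes, not the CUDA implementation itself; the grid size and update rule are chosen for brevity.

```python
import numpy as np

def stencil_step(u):
    """One Jacobi-style 7-point stencil sweep over the interior of a 3D grid.
    Boundary cells are left unchanged (a fixed boundary condition)."""
    out = u.copy()
    out[1:-1, 1:-1, 1:-1] = (
        u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1] +   # +/- x neighbors
        u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1] +   # +/- y neighbors
        u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2]     # +/- z neighbors
    ) / 6.0
    return out

u = np.zeros((8, 8, 8))
u[4, 4, 4] = 1.0          # point source at the center
u = stencil_step(u)       # the source spreads to its six neighbors
```

On a GPU cluster the grid is decomposed into subdomains, and each sweep requires exchanging one-cell-thick "halo" faces between neighboring devices, which is where the framework's hierarchical communication support comes in.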
Agent-Based Model and Simulation of Emergency Department
My PhD thesis, entitled Modeling and Simulation for Healthcare Operations Management using High Performance Computing and Agent-Based Model and supervised by Emilio Luque, concerns high-performance-computing-based simulation for decision support in healthcare operations management: specifically, simulating the Emergency Department (ED) using agent-based modeling techniques so that the simulation can serve as part of a decision support system. I carried out the modeling, implementation, calibration, and validation work.
Since an ED is a typical complex system, agent-based simulation was used to model the ED directly at the individual level, i.e., the behavior of staff, physical resources, and patients. The system-level behavior, that of the system as a whole, is considered to emerge from these individual-level behaviors. In this way, the model represents detailed information bottom-up and can identify the root causes of problems at the level of individual behavior.
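A stripped-down sketch of this bottom-up idea: patient agents arrive one by one, each waits for the earliest-free physician agent, is treated, and leaves, so the system-level waiting time emerges from individual interactions rather than being specified directly. The arrival and treatment distributions and their parameters below are invented for illustration and are not calibrated to any real department.

```python
import heapq
import random

def simulate_ed(n_patients, n_physicians, seed=0):
    """Toy agent-based ED flow: returns the emergent mean waiting time (minutes)."""
    rng = random.Random(seed)
    free_at = [0.0] * n_physicians        # time at which each physician is next free
    heapq.heapify(free_at)
    t, waits = 0.0, []
    for _ in range(n_patients):
        t += rng.expovariate(1 / 5.0)     # arrivals: mean 5 minutes apart (assumed)
        start = max(t, free_at[0])        # wait for the earliest-free physician
        waits.append(start - t)
        # Treatment: mean 20 minutes (assumed); physician becomes busy until done.
        heapq.heapreplace(free_at, start + rng.expovariate(1 / 20.0))
    return sum(waits) / len(waits)

mean_wait = simulate_ed(n_patients=500, n_physicians=5)
```

Even this toy version shows the bottom-up property: changing an individual-level parameter (staffing, treatment time) changes the emergent system-level waiting time, which is exactly what the full model exploits for root-cause analysis.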
The model has been verified and validated for an ED in Catalonia, Spain. High-performance computing was used to simulate multiple scenarios simultaneously and to optimize unknown model parameters under data scarcity. By this means, the simulator can execute a large number of simulation scenarios in an acceptable period of time.
Automatic Model-parameter Calibration
One of the key issues in calibration is the acquisition of valid source information from the target system. We developed an automatic calibration (tuning) tool that is released with the general emergency department model. This tool enables simulation users to calibrate model parameters for their own emergency department without the involvement of the model developers. We believe the tool is promising for promoting the application of simulation in emergency-department-related studies.
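The essence of such a calibration loop can be sketched as a search over candidate parameter values for the one whose simulated output best matches an observed target. The stand-in "simulator" below is a closed-form M/M/1 queue-length formula chosen so the example is self-contained; the real tool instead runs the full ED model for each candidate, typically evaluating many candidates in parallel on an HPC system.

```python
def simulate(service_rate):
    """Stand-in simulator: mean queue length of an M/M/1 queue with a
    fixed (assumed) arrival rate. The real tool runs the agent-based ED model."""
    arrival_rate = 0.8
    rho = arrival_rate / service_rate
    return rho / (1 - rho) if rho < 1 else float("inf")

def calibrate(observed, candidates):
    """Pick the candidate parameter whose simulated output is closest
    to the value observed in the target system."""
    return min(candidates, key=lambda p: abs(simulate(p) - observed))

# Observed mean queue length of 4.0 implies rho = 0.8, i.e. service_rate = 1.0.
best = calibrate(observed=4.0, candidates=[0.9, 1.0, 1.2, 1.5, 2.0])
```

Because each candidate evaluation is independent, the search parallelizes trivially, which is why the multi-scenario HPC execution described earlier fits calibration so well.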
In addition, I am particularly interested in embedded systems. I worked as an embedded engineer in industry for three years and am experienced in both hardware design and embedded software (firmware) development. I am very optimistic about the application of embedded devices, such as FPGAs and DSPs, to high performance computing for application-specific speedup.
I am enthusiastic about using artificial intelligence techniques to solve practical problems and make our lives better, and I keep up with AI-related advances in my spare time, driven solely by interest. Recently, we used machine learning algorithms to explain and predict wide-area data transfer performance. I also worked on using deep reinforcement learning to build a smart data transfer node, which we consider the first step toward a smart HPCC.
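The reinforcement-learning idea can be illustrated with a tabular sketch: an agent that learns which stream concurrency maximizes throughput from noisy reward feedback. The action set, reward function, and hyperparameters below are all invented for illustration; the actual work used deep reinforcement learning, not this single-state (bandit-style) toy.

```python
import random

ACTIONS = [1, 2, 4, 8]   # candidate concurrent-stream counts (assumed)

def reward(streams):
    """Invented reward: throughput rises with streams, then congestion hurts."""
    return {1: 2.0, 2: 4.0, 4: 5.0, 8: 3.0}[streams] + random.uniform(-0.1, 0.1)

def train(episodes=2000, lr=0.1, epsilon=0.2, seed=0):
    """Epsilon-greedy value learning over a single state."""
    random.seed(seed)
    q = {a: 0.0 for a in ACTIONS}
    for _ in range(episodes):
        explore = random.random() < epsilon
        a = random.choice(ACTIONS) if explore else max(q, key=q.get)
        q[a] += lr * (reward(a) - q[a])   # move estimate toward observed reward
    return q

q = train()
best = max(q, key=q.get)   # learned best concurrency level
```

The appeal of the learning approach over a fixed rule is that the node adapts when network conditions shift the reward landscape, with no operator in the loop.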