Extreme Scale Systems for Machine Learning
This project focuses on the greatest challenges in utilizing HPC, especially the upcoming Aurora exascale supercomputer at Argonne National Laboratory, for machine learning. It seeks to optimize methods for exploiting data parallelism, model parallelism, ensembles, and parameter search.
In recent years, the models and data available for machine learning applications have grown dramatically. Extreme-scale HPC systems offer the opportunity to further accelerate performance and deepen understanding of large data sets through machine learning. However, the current literature and public implementations focus mostly on cloud-based or small-scale GPU environments. For example, these implementations do not make the best use of low-latency inter-node communication in HPC environments (e.g., RDMA), one of the biggest advantages of a supercomputer. To leverage extreme-scale systems for ML applications, serious advances are required in both algorithms and their scalable, parallel implementations. Examples include training large models on large scientific data, facilitating very large ensembles, and addressing the inherent problems (e.g., noisy or missing data, scalable ingest) associated with large datasets.
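The data-parallelism pattern mentioned above can be illustrated with a minimal sketch: each worker computes a gradient on its own shard of the data, the gradients are averaged (the role an MPI/RDMA allreduce plays on a real HPC system), and every worker applies the same update. The linear model, function names, and parameter values below are illustrative assumptions, not part of the project's actual codebase.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of mean squared error for a linear model on one data shard."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

def data_parallel_sgd(shards, w, lr=0.1, steps=100):
    for _ in range(steps):
        # Each shard's gradient would be computed on its own node in practice.
        grads = [local_gradient(w, X, y) for X, y in shards]
        g = np.mean(grads, axis=0)   # emulated allreduce: average across workers
        w = w - lr * g               # identical update applied on every worker
    return w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
y = X @ w_true
shards = [(X[i::4], y[i::4]) for i in range(4)]  # split data across 4 "workers"
w = data_parallel_sgd(shards, np.zeros(2))
```

Because every worker sees the same averaged gradient, the replicas stay in lockstep; at scale, the cost of that averaging step is exactly where low-latency interconnects matter.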
RAMSES: Robust Analytic Models for Science at Extreme Scales
At the Mathematics and Computer Science Division of Argonne National Laboratory, I develop end-to-end analytical performance models to transform our understanding of the behavior of science workflows in extreme-scale science environments. These models are developed to predict the behavior of a science workflow before it is implemented, to explain why performance does not meet design goals, and to architect science environments to meet workflow needs.
I focus on:
- Modeling and simulating end-to-end data transfers over wide-area networks;
- Analyzing www.globus.org transfer logs to explain wide-area network data transfer performance;
- Building a modeling and simulation program that can effectively and efficiently explain the behavior of scientific workflows over a distributed infrastructure.
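To make the first bullet concrete, here is a deliberately simple analytic model of an end-to-end transfer: wall time is a fixed startup cost plus bytes over an effective rate, where concurrency improves link utilization with diminishing returns. The functional form and every parameter value are illustrative assumptions, not measured Globus data or the project's actual model.

```python
def transfer_time(size_bytes, rtt_s, bandwidth_bps, concurrency, startup_s=1.0):
    """Predict transfer wall time under a toy analytic model.

    utilization follows a made-up diminishing-returns curve in the number
    of concurrent streams; a real model would be fit to transfer logs.
    """
    utilization = concurrency / (concurrency + 1.0)
    effective_rate = bandwidth_bps * utilization   # bytes per second actually achieved
    return startup_s + rtt_s + size_bytes / effective_rate

# Example: 10 GB over a 10 Gb/s link (1.25e9 B/s), 50 ms RTT, 4 streams.
t = transfer_time(10e9, 0.05, 10e9 / 8, concurrency=4)
```

Even a toy model like this can answer "what if" questions (more streams, bigger files, longer RTT) before a workflow is deployed, which is the point of the end-to-end modeling work.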
AMASE: Architecture and Management for Autonomic Science Ecosystems
Scientific computing systems are becoming significantly more complex, and have reached a critical limit in manageability using current human-in-the-loop techniques. The current state-of-the-art for managing HPC infrastructures does not leverage the remarkable advances in machine learning to more accurately predict, diagnose, and improve computational resources in response to user computation. The DOE science complex consists of thousands of interconnected systems that are geographically distributed. As distributed teams and complex workflows now span resources from telescopes and light sources to fast networks and smart IoT sensor systems, it is clear that a single, centralized, administrative team and software stack cannot coordinate and manage all of the resources. Instead, resources must begin to respond autonomically, managing and tuning their behavior in response to scientific workflows. This research proposal outlines a plan to explore the architecture, methods, and algorithms needed to support future scientific computing systems that self-tune and self-manage. We propose to make the science ecosystem smart by incorporating the functions of sensing, intelligence, and control. Our aim is threefold:
- Design a scalable architecture for smart science ecosystems.
- Embed intelligence in relevant sub-systems via light-weight machine learning.
- Explore methods for distributed and autonomous management of the systems.
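The sensing / intelligence / control functions above can be sketched as a closed loop over one tunable resource. Everything below is a hypothetical stand-in: the tunable (transfer stream count), the threshold policy, and the metric names are invented for illustration, and the "intelligence" step would in practice be a lightweight learned model embedded in the subsystem rather than a fixed rule.

```python
def sense(metrics_history):
    """Sensing: take the most recent observation of the subsystem."""
    return metrics_history[-1]

def decide(observation, target_gbps):
    """Intelligence (toy rule): add a stream when below target,
    drop one when well above it, otherwise hold steady."""
    if observation["throughput_gbps"] < target_gbps:
        return +1
    if observation["throughput_gbps"] > target_gbps * 1.2:
        return -1
    return 0

def act(config, delta):
    """Control: apply the decision to the subsystem's configuration."""
    config["streams"] = max(1, config["streams"] + delta)
    return config

config = {"streams": 2}
history = [{"throughput_gbps": 3.0}]          # observed: below the 8 Gb/s target
config = act(config, decide(sense(history), target_gbps=8.0))
```

Running such a loop locally on each subsystem, rather than in one central controller, is what lets resources "respond autonomically" as the proposal describes.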
We believe the outcome of this research, the design and prototype of a smart distributed science ecosystem, has many benefits:
- Scientists using DOE computing infrastructure will be able to run workflows on automatically selected resources that are dynamically configured and tuned for their application.
- Facility and network operators will have the ability to predict and diagnose problems before they cause downtime.
High Performance Computing and Simulation
At the Computational Sciences and Engineering Division of Oak Ridge National Laboratory, I worked on:
- A framework for efficient simulation on multi-GPU and multi-core clusters, designed to accommodate the hierarchical organization as well as the heterogeneity of state-of-the-art parallel computing platforms. We designed it to support agent-based simulation and 3D finite-difference simulation (stencil computation);
- Interactive, GPU-based evaluation (faster than real time) of evacuation scenarios at the state scale; [code & demo ]
- An earthquake wave propagation model implemented on multiple GPUs using CUDA and the framework described above. [code & demo ]
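For readers unfamiliar with stencil computation, here is a minimal serial reference in NumPy: a 7-point 3D finite-difference sweep in which each interior cell is replaced by the average of its six neighbors. This is only an illustration of the computational pattern the framework parallelizes, not the CUDA implementation itself; the grid size and update rule are chosen for brevity.

```python
import numpy as np

def stencil_step(u):
    """One Jacobi-style 7-point stencil sweep over the interior of a 3D grid.
    Boundary cells are left unchanged (a fixed boundary condition)."""
    out = u.copy()
    out[1:-1, 1:-1, 1:-1] = (
        u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1] +   # +/- x neighbors
        u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1] +   # +/- y neighbors
        u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2]     # +/- z neighbors
    ) / 6.0
    return out

u = np.zeros((8, 8, 8))
u[4, 4, 4] = 1.0          # point source at the center
u = stencil_step(u)       # the source spreads to its six neighbors
```

On a GPU cluster the grid is decomposed into subdomains, and each sweep requires exchanging one-cell-thick "halo" faces between neighboring devices, which is where the framework's hierarchical communication support comes in.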
Agent-Based Model and Simulation of Emergency Department
My PhD thesis, entitled Modeling and Simulation for Healthcare Operations Management using High Performance Computing and Agent-Based Model and supervised by Emilio Luque, concerns high-performance-computing-based simulation for decision support in healthcare operations management: specifically, simulating the Emergency Department (ED) using agent-based modeling techniques so that the simulation can serve as part of a decision support system. I carried out the modeling, implementation, calibration, and validation work.
Since an ED is a typical complex system, agent-based simulation was used to model the ED directly at the individual level, i.e., the behavior of staff, physical resources, and patients. The system-level behavior, that of the system as a whole, is considered to emerge from these individual-level behaviors. In this way, the model represents detailed information bottom-up and can identify the root causes of problems at the level of individual behavior.
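A stripped-down sketch of this bottom-up idea: patient agents arrive one by one, each waits for the earliest-free physician agent, is treated, and leaves, so the system-level waiting time emerges from individual interactions rather than being specified directly. The arrival and treatment distributions and their parameters below are invented for illustration and are not calibrated to any real department.

```python
import heapq
import random

def simulate_ed(n_patients, n_physicians, seed=0):
    """Toy agent-based ED flow: returns the emergent mean waiting time (minutes)."""
    rng = random.Random(seed)
    free_at = [0.0] * n_physicians        # time at which each physician is next free
    heapq.heapify(free_at)
    t, waits = 0.0, []
    for _ in range(n_patients):
        t += rng.expovariate(1 / 5.0)     # arrivals: mean 5 minutes apart (assumed)
        start = max(t, free_at[0])        # wait for the earliest-free physician
        waits.append(start - t)
        # Treatment: mean 20 minutes (assumed); physician becomes busy until done.
        heapq.heapreplace(free_at, start + rng.expovariate(1 / 20.0))
    return sum(waits) / len(waits)

mean_wait = simulate_ed(n_patients=500, n_physicians=5)
```

Even this toy version shows the bottom-up property: changing an individual-level parameter (staffing, treatment time) changes the emergent system-level waiting time, which is exactly what the full model exploits for root-cause analysis.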
The model has been verified and validated for an ED in Catalonia, Spain. High-performance computing was used to simulate multiple scenarios simultaneously and to optimize unknown model parameters under data scarcity. By this means, the simulator can execute a large number of simulation scenarios in an acceptable period of time.
Automatic Model-parameter Calibration
One of the key issues in calibration is the acquisition of valid source information from the target system. We developed an automatic calibration (tuning) tool that is released with the general emergency department model. This tool enables simulation users to calibrate model parameters for their own emergency department without the involvement of the model developers. We believe the tool is promising for promoting the application of simulation in emergency-department-related studies.
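The essence of such a calibration loop can be sketched as a search over candidate parameter values for the one whose simulated output best matches an observed target. The stand-in "simulator" below is a closed-form M/M/1 queue-length formula chosen so the example is self-contained; the real tool instead runs the full ED model for each candidate, typically evaluating many candidates in parallel on an HPC system.

```python
def simulate(service_rate):
    """Stand-in simulator: mean queue length of an M/M/1 queue with a
    fixed (assumed) arrival rate. The real tool runs the agent-based ED model."""
    arrival_rate = 0.8
    rho = arrival_rate / service_rate
    return rho / (1 - rho) if rho < 1 else float("inf")

def calibrate(observed, candidates):
    """Pick the candidate parameter whose simulated output is closest
    to the value observed in the target system."""
    return min(candidates, key=lambda p: abs(simulate(p) - observed))

# Observed mean queue length of 4.0 implies rho = 0.8, i.e. service_rate = 1.0.
best = calibrate(observed=4.0, candidates=[0.9, 1.0, 1.2, 1.5, 2.0])
```

Because each candidate evaluation is independent, the search parallelizes trivially, which is why the multi-scenario HPC execution described earlier fits calibration so well.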
In addition, I am particularly interested in embedded systems. I worked as an embedded engineer in industry for three years and am experienced in both hardware design and embedded software (firmware) development. I am very optimistic about the application of embedded devices, such as FPGAs and DSPs, to high performance computing for application-specific speedup.
I am enthusiastic about using artificial intelligence techniques to solve practical problems and make our lives better, and I keep up with AI-related advances in my spare time, driven solely by interest. Recently, we used machine learning algorithms to explain and predict wide-area data transfer performance. I also worked on using deep reinforcement learning to build a smart data transfer node, which we consider the first step toward a smart HPCC.
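The reinforcement-learning idea can be illustrated with a tabular sketch: an agent that learns which stream concurrency maximizes throughput from noisy reward feedback. The action set, reward function, and hyperparameters below are all invented for illustration; the actual work used deep reinforcement learning, not this single-state (bandit-style) toy.

```python
import random

ACTIONS = [1, 2, 4, 8]   # candidate concurrent-stream counts (assumed)

def reward(streams):
    """Invented reward: throughput rises with streams, then congestion hurts."""
    return {1: 2.0, 2: 4.0, 4: 5.0, 8: 3.0}[streams] + random.uniform(-0.1, 0.1)

def train(episodes=2000, lr=0.1, epsilon=0.2, seed=0):
    """Epsilon-greedy value learning over a single state."""
    random.seed(seed)
    q = {a: 0.0 for a in ACTIONS}
    for _ in range(episodes):
        explore = random.random() < epsilon
        a = random.choice(ACTIONS) if explore else max(q, key=q.get)
        q[a] += lr * (reward(a) - q[a])   # move estimate toward observed reward
    return q

q = train()
best = max(q, key=q.get)   # learned best concurrency level
```

The appeal of the learning approach over a fixed rule is that the node adapts when network conditions shift the reward landscape, with no operator in the loop.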