Future Adaptive Machine Learning Systems
- Dr. Luo Mai, University of Edinburgh
- Time: 2021-05-19 14:00
- Host: Dr. Hao Dong
- Venue: Room 101, Courtyard No.5, Jingyuan + online
When using distributed machine learning (ML) systems to train models on a cluster, users must configure a large number of parameters: hyper-parameters (e.g., the batch size and the learning rate) affect model convergence, while system parameters (e.g., the number of workers and their communication topology) impact training performance. Current systems support adapting such parameters during training poorly: users must fix system parameters at deployment time and hard-code adaptation schedules for hyper-parameters into the training program.
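As a concrete illustration of the status quo, the sketch below shows a fixed hyper-parameter schedule of the kind users must hard-code before training starts. The function name and constants are illustrative, not from any specific framework.

```python
def lr_schedule(epoch, base_lr=0.1, decay=0.5, step=10):
    """Step-decay learning rate: halve the rate every `step` epochs.

    The schedule is decided before training begins and cannot react to
    run-time signals such as gradient noise or changes in cluster size.
    """
    return base_lr * (decay ** (epoch // step))

# The rate only depends on the epoch counter, never on monitored metrics:
# lr_schedule(0)  -> 0.1
# lr_schedule(10) -> 0.05
# lr_schedule(25) -> 0.025
```

This rigidity is exactly what motivates adaptive training: a good batch size or learning rate at epoch 0 may be a poor choice later, but a fixed schedule cannot respond.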
In this talk, we will first introduce KungFu, a distributed ML library for TensorFlow designed to enable adaptive training. KungFu lets users express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g., signal-to-noise ratios and gradient noise scale) as input and trigger control actions (e.g., cluster rescaling or synchronisation-strategy updates). KungFu is an initial step towards an adaptive ML cluster. As a next step, we will explore designs that achieve adaptive resource provisioning for ML systems; realising such designs can benefit greatly from emerging actor frameworks (e.g., Ray) and serverless platforms.
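To make the Adaptation Policy idea concrete, here is a minimal sketch of a policy that reads a monitored metric each step and triggers a control action when a threshold is crossed. This is a hypothetical illustration, not KungFu's actual API; all class and parameter names are invented for exposition.

```python
class NoiseScalePolicy:
    """Hypothetical adaptation policy (not KungFu's real API):
    grow the batch size when the monitored gradient noise scale rises,
    since noisier gradients tolerate, and benefit from, larger batches."""

    def __init__(self, threshold, grow_factor=2, max_batch=4096):
        self.threshold = threshold      # noise-scale level that triggers action
        self.grow_factor = grow_factor  # multiplier applied on each trigger
        self.max_batch = max_batch      # upper bound on the batch size

    def on_step(self, noise_scale, batch_size):
        """Return the (possibly updated) batch size for the next step."""
        if noise_scale > self.threshold and batch_size < self.max_batch:
            return min(batch_size * self.grow_factor, self.max_batch)
        return batch_size

# The training loop consults the policy each step instead of a fixed schedule:
policy = NoiseScalePolicy(threshold=100.0)
bs = 256
bs = policy.on_step(noise_scale=50.0, batch_size=bs)   # below threshold: unchanged
bs = policy.on_step(noise_scale=180.0, batch_size=bs)  # above threshold: batch grows
```

In a real system the same pattern would drive system-parameter actions too, e.g. asking the cluster manager to add workers, which is where actor and serverless platforms come in.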
Luo Mai is an Assistant Professor in the School of Informatics at the University of Edinburgh, where he leads the large-scale AI systems group. His research has led to publications at OSDI, NSDI, VLDB, and USENIX ATC, as well as popular open-source AI libraries such as TensorLayer, HyperPose, and KungFu. Before joining Edinburgh, he was a research associate at Imperial College London and a visiting researcher at Microsoft Research. Luo holds a PhD from Imperial College London; his studies were supported by a Google PhD Fellowship.