ICML 2019 Expo Talk

July 23, 2023

Arion: a next-generation distributed deep learning virtual machine

Sponsor: Petuum Inc.

Hao Zhang (Petuum)


"We present Arion, the next-generation distributed machine learning system developed in-house at Petuum. Arion is a distributed system co-designed across system, model, and algorithm. It draws on insights from many past projects: different models exhibit different runtime characteristics, and different learning algorithms show different computational patterns, so both demand model-aware and algorithm-aware system treatment for optimal distributed execution performance.

At the lower level, Arion supports a variety of communication primitives, ranging from parameter server and MPI to sufficient factor broadcasting and ring allreduce. Given a piece of deep learning (DL) code (e.g., in TensorFlow or PyTorch), Arion analyzes it, extracts the model definition and algorithms, and casts them into an intermediate representation (IR) with distributed execution semantics. We designed and implemented a compiler within Arion that further optimizes the IR and translates it into a set of distributed execution strategies. We propose a variety of rule-based and machine-learning-based methods to generate the optimal execution strategy for the given model, algorithm, cluster specification, and resource constraints, among other factors.
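
For readers unfamiliar with ring allreduce, one of the communication primitives named in the abstract, the following is a minimal single-process sketch of the general technique. It is not Arion's implementation; the function name `ring_allreduce` and the list-of-lists representation of workers are assumptions made purely for illustration. Each simulated worker passes one chunk per step to its right-hand neighbor: a reduce-scatter phase sums the chunks, then an allgather phase circulates the finished sums.

```python
def ring_allreduce(worker_vectors):
    """Simulate a sum ring allreduce over equal-length vectors (one per worker).

    Hypothetical illustration only: real systems exchange chunks over the
    network, whereas this runs every "worker" in a single process.
    """
    n = len(worker_vectors)
    length = len(worker_vectors[0])
    data = [list(v) for v in worker_vectors]  # private copy per worker
    # Split the index range into n contiguous chunks.
    bounds = [(i * length) // n for i in range(n + 1)]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    # Phase 1 (reduce-scatter): at step s, worker r sends chunk (r - s) mod n
    # to worker (r + 1) mod n, which adds it to its own copy.
    for s in range(n - 1):
        # Build all messages first so sends within a step are simultaneous.
        msgs = [((r + 1) % n, (r - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in chunk(c)] for r, (_, c) in enumerate(msgs)]
        for (dst, c), payload in zip(msgs, payloads):
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # Worker r now holds the fully reduced chunk (r + 1) mod n.
    # Phase 2 (allgather): at step s, worker r forwards chunk (r + 1 - s) mod n
    # to its neighbor, which overwrites its stale copy.
    for s in range(n - 1):
        msgs = [((r + 1) % n, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in chunk(c)] for r, (_, c) in enumerate(msgs)]
        for (dst, c), payload in zip(msgs, payloads):
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for rank, vec in enumerate(ring_allreduce(grads)):
        print(rank, vec)
```

In this scheme each worker sends 2(n - 1) chunks of roughly 1/n of the vector, so per-worker traffic stays close to twice the vector size regardless of the number of workers, which is why the primitive is popular for dense gradient synchronization.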

Interface-wise, Arion follows the philosophy of minimizing changes to users' code: it exposes a set of interfaces that allow arbitrary single-node TensorFlow or PyTorch code to be distributed with little to no modification.

The talk will also include a short live demonstration (5 to 10 minutes) of Arion's usage and performance."