Timezone: »

On Efficient Constructions of Checkpoints
Yu Chen · Zhenming LIU · Bin Ren · Xin Jin

Thu Jul 16 07:00 AM -- 07:45 AM & Thu Jul 16 06:00 PM -- 06:45 PM (PDT) @ Virtual #None

Efficient construction of checkpoints/snapshots is a critical tool for training and diagnosing deep learning models. In this paper, we propose a lossy compression scheme for checkpoint constructions (called LC-Checkpoint). LC-Checkpoint simultaneously maximizes the compression rate and optimizes the recovery speed, under the assumption that SGD is used to train the model. LC-Checkpoint uses quantization and priority promotion to store the most crucial information for SGD to recover, and then uses a Huffman coding to leverage the non-uniform distribution of the gradient scales. Our extensive experiments show that LC-Checkpoint achieves a compression rate up to 28× and recovery speedup up to 5.77× over a state-of-the-art algorithm (SCAR).

Author Information

Yu Chen (College of William and Mary)
Zhenming LIU (College of William & Mary)
Bin Ren (College of William and Mary)
Xin Jin (Johns Hopkins University)