"On-device machine learning has been challenging, as it not only has more constraints (power, memory, thermal) compared to the cloud server, but the quality and efficiency bar can be relatively high. With the recent trend of larger language models, it further pushes the limit of the hardware.
ExecuTorch, PyTorch's native on-device solution, is designed so that researchers and ML engineers can fully optimize models for their target hardware within the PyTorch ecosystem. Its design principles are:
- The Executor is focused on execution. While obvious, this means we won't attach any other intent to the program representation or to the runtime's APIs, such as human readability of the program or recomputing a memory plan. Any extra concerns are implemented as support libraries, not in the executable program.
- The program is unidirectional and the runtime's program is immutable. The program starts from the source code, and any change to the runtime program has to start again from the source or some other higher-level representation. It essentially works like a binary (see the export sketch after this list).
- The runtime's behavior is entirely defined by a small instruction set. Any and all capabilities are determined by the instructions the runtime supports.
- The runtime makes no decisions. Its behavior is fully determined by the instructions it supports and the program it runs.
- Debugging and program understanding are handled by dedicated tools that analyze the executable program against its original source (i.e., the PyTorch source).
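As a concrete illustration of this unidirectional flow, the sketch below lowers a toy PyTorch module to an ExecuTorch program file. The module, shapes, and file name are made up for illustration, and the exact `torch.export` / `executorch.exir` calls may vary between ExecuTorch releases:

```python
import torch
from executorch.exir import to_edge


# A toy module standing in for a real model (illustrative only).
class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x + 1.0)


model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# Source -> exported graph -> edge dialect -> ExecuTorch program.
# Any change to the deployed program must start again from this source.
exported = torch.export.export(model, example_inputs)
edge = to_edge(exported)
executorch_program = edge.to_executorch()

# The resulting flatbuffer is the immutable program the runtime executes.
with open("tiny_model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```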
We already enable a list of models (see Appendix 1) in one or more backends, showcase state-of-the-art on-device CPU performance for Llama 2, and have early support with promising numbers for Llama 3 (see Appendix 2). Moving forward, we aim to enable more models, more LLMs, and multimodality, and to further accelerate LLMs by lowering to hardware accelerators.
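One way such lowering can look in practice is delegating parts of the graph to a backend such as XNNPACK for accelerated CPU execution. The sketch below is illustrative; the partitioner import path and `to_backend` usage follow current ExecuTorch documentation but may change between releases:

```python
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner


class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x + 1.0)


exported = torch.export.export(TinyModel().eval(), (torch.randn(1, 8),))
edge = to_edge(exported)

# Delegate the subgraphs XNNPACK supports to that backend; anything the
# partitioner does not claim falls back to the portable CPU operators.
edge = edge.to_backend(XnnpackPartitioner())

with open("tiny_model_xnnpack.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)
```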
We’d like to walk through the Colab notebook to demonstrate how to export a model and deploy it on device.
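As a rough preview of the run side, the sketch below loads an exported program on the host through the Python bindings to the portable runtime, which is handy for verifying a `.pte` file before deploying it to a device. The pybinding module path and `forward` call are assumptions based on current ExecuTorch releases and may differ depending on how ExecuTorch was built or installed:

```python
import torch

# Python bindings to the portable ExecuTorch runtime (module path assumed).
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# Load the immutable .pte program produced by the export step.
program = _load_for_executorch("tiny_model.pte")

# The runtime makes no decisions of its own: it executes the program's
# instructions on the inputs it is given and returns the outputs.
outputs = program.forward([torch.randn(1, 8)])
print(outputs[0])
```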