Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference
Zifan He ⋅ Rui Ma ⋅ Yizhou Sun ⋅ Jason Cong
Abstract
Modern large language model (LLM) serving increasingly depends on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: \emph{Prepare Memory}, \emph{Compute Relevancy}, \emph{Retrieval}, and \emph{Apply to Inference}. Through systematic profiling, we identify a 22\%--97\% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that \textbf{heterogeneous systems} are well suited to accelerating memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system achieves a $1.04\sim2.2\times$ speedup and $1.11\sim4.7\times$ energy reduction over the GPU baseline across multiple LLM inference optimizations (similar results hold on an NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
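To make the four-step pipeline concrete, the sketch below expresses it as a minimal Python interface. All class and function names here are hypothetical illustrations, not the paper's implementation; they only show how sparse attention, RAG, and compressed contextual memory could share the Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference steps.

```python
# Illustrative sketch only: names are hypothetical and not taken from the paper.
# It shows one way the four-step memory processing pipeline could be expressed
# as a common interface over sparse attention, RAG, and compressed memory.
from dataclasses import dataclass
from typing import Any, List

import numpy as np


@dataclass
class MemoryEntry:
    key: np.ndarray   # e.g. a KV-block summary or chunk embedding
    payload: Any      # the full KV block, document chunk, or compressed state


class MemoryPipeline:
    def prepare_memory(self, context: Any) -> List[MemoryEntry]:
        """Step 1: build memory units (KV blocks, chunk embeddings, ...)."""
        raise NotImplementedError

    def compute_relevancy(self, query: np.ndarray,
                          memory: List[MemoryEntry]) -> np.ndarray:
        """Step 2: score each memory unit against the current query."""
        return np.array([float(query @ m.key) for m in memory])

    def retrieve(self, memory: List[MemoryEntry],
                 scores: np.ndarray, top_k: int) -> List[MemoryEntry]:
        """Step 3: select the top-k most relevant memory units."""
        idx = np.argsort(-scores)[:top_k]
        return [memory[i] for i in idx]

    def apply_to_inference(self, selected: List[MemoryEntry]) -> Any:
        """Step 4: feed the selected memory back into decoding,
        e.g. attend only to the retrieved KV blocks."""
        raise NotImplementedError
```

Under this view, Compute Relevancy and Retrieval are the sparse, irregular, memory-bound steps that the paper offloads to the FPGA, while the dense computation in Apply to Inference stays on the GPU.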