InertialAR: Autoregressive 3D Molecule Generation with Inertial Frames
Abstract
Transformer-based autoregressive models have emerged as a unifying paradigm across modalities such as text and images, but their extension to 3D molecule generation remains underexplored. The gap stems from two fundamental challenges: (1) how to tokenize molecules into a canonical 1D sequence of tokens that is invariant to both SE(3) transformations and atom index permutations, and (2) how to design an architecture capable of modeling hybrid atom-based tokens that couple discrete atom types with continuous 3D coordinates. To address these challenges, we introduce InertialAR. It first performs generation-oriented canonical tokenization by aligning each molecule to a canonical inertial frame and reordering atoms, thereby converting arbitrary 3D structures into a unique, SE(3)- and permutation-invariant sequence of tokens for autoregressive generation. Built upon this canonical tokenization, we propose geometric rotary positional encoding (GeoRoPE), which endows Transformer attention with 3D geometric awareness. Finally, InertialAR utilizes a hierarchical autoregressive paradigm to decode the next atom, consecutively predicting the atom type and 3D coordinates via Diffusion Loss. Experimentally, InertialAR achieves state-of-the-art performance on 8 of the 10 evaluation metrics for unconditional generation across QM9, GEOM-Drugs, and B3LYP. Moreover, it significantly outperforms baselines in controllable generation for targeted chemical functionality, attaining state-of-the-art results across all 5 metrics.