Dissecting Post-Training: Uncovering the Complementary Roles of SFT and RL for Document Parsing
Abstract
Document parsing, the task of extracting diverse content from PDFs while preserving their structural integrity, has been significantly advanced by Multimodal Large Language Models (MLLMs). These models have achieved remarkable success, largely driven by extensive post-training on massive datasets. This paper therefore undertakes a deep analysis of the two dominant adaptation strategies, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), prompted by a puzzling observation on the PDF-to-Markdown task: SFT yields negligible gains, especially on parsing complex tables and formulas, while RL achieves substantial overall improvements. To unravel the reasons, we conduct a systematic investigation that reveals a clear and complementary division of labor: SFT primarily operates as a structure learner, biased towards mastering the low-entropy syntax of document layouts. While it learns the format of a table, it struggles to preserve the fidelity of the table's high-entropy cell content. Conversely, RL excels as a content refiner by optimizing a holistic reward that reflects final accuracy. We further ground this phenomenon in the distinct theoretical nature of their respective objective functions. Based on these findings, we introduce a unified strategy that explicitly harnesses their individual strengths while mitigating their weaknesses. This work shows that a deep understanding of post-training methods is key to unlocking performance beyond what data scaling alone can achieve.
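As a minimal sketch of the contrast in objective functions alluded to above (the notation and the choice of reward are ours, not necessarily the paper's exact formulation): given a document image $x$ and a reference Markdown parse $y = (y_1, \dots, y_T)$, SFT minimizes a token-factorized cross-entropy loss, whereas RL maximizes an expected sequence-level reward over sampled parses $\hat{y}$:

\[
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta\!\left(y_t \mid y_{<t},\, x\right), \qquad \mathcal{J}_{\text{RL}}(\theta) = \mathbb{E}_{\hat{y} \sim \pi_\theta(\cdot \mid x)}\!\left[ R(\hat{y},\, y) \right]
\]

The SFT loss credits matching each reference token independently, so the many predictable, low-entropy structural tokens (table and formula markup) dominate the gradient signal; the RL objective scores the completed output as a whole, via a reward $R$ such as a content-fidelity metric, which is consistent with the structure-learner versus content-refiner division described above.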