Poster
in
Workshop: CODEML: Championing Open-source DEvelopment in Machine Learning

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

Jake Poznanski ⋅ Aman Rangapur ⋅ Jon Borchardt ⋅ Jason Dunkelberger ⋅ Christopher Wilhelm ⋅ Kyle Lo ⋅ Luca Soldaini

2025 Poster
in
Workshop: CODEML: Championing Open-source DEvelopment in Machine Learning

Project Page [ Poster] [ OpenReview]

Abstract

PDF documents contain trillions of novel, high-quality tokens valuable for language model training, but their diverse formats and layouts complicate content extraction. Traditional open source tools yield lower quality results than vision-language models (VLMs), yet the best VLMs are costly (e.g., over 6,240 USD per million PDF pages for GPT-4o) or inaccessible when working with proprietary documents. We present olmOCR, an open-source toolkit for converting PDFs into clean, linearized plain text in natural reading order while preserving structure such as sections, tables, and equations. Our toolkit uses a fine-tuned 7B VLM trained on 260,000 pages from over 100,000 varied PDFs, including graphics, handwritten text, and poor scans. olmOCR is optimized for large-scale batch processing, converting a million pages for only 176 USD. We find olmOCR outperforms even top VLMs including GPT-4o, Gemini Flash 2, and Qwen-2.5-VL on OLMOCR-BENCH, a curated set of 1,400 challenging PDFs with fine-grained unit tests that remain difficult even for the best tools and VLMs. We openly release all components of olmOCR: our fine-tuned VLM model, training code and data, an efficient inference pipeline that supports vLLM and SGLang backends, and the benchmark.

Chat is not available.