How2Everything: Mining the Web for How-to Procedures to Evaluate and Improve LLMs
Abstract
Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested from chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework for evaluating and improving goal-conditioned procedure generation. Our pipeline, How2Mine, extracts and rewrites 351K procedures from 980K web pages spanning 14 topics, and can scale to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. We also introduce How2Score, an evaluation protocol in which an LLM judge detects whether a generated procedure contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier judge into an open 8B model that achieves 80.5\% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing useful signal early in pretraining. Finally, reinforcement learning (RL) with How2Score as the reward improves How2Bench performance by more than 10 points across three base models, without systematic regressions on standard benchmarks; the gains are not primarily explained by source-document memorization or superficial format compliance. We will release all code and data upon acceptance.