ICML Expo Talk Panel Model Optimization Flywheel: Continuously Self-Improving LLMs in Production

Expo Talk Panel

Model Optimization Flywheel: Continuously Self-Improving LLMs in Production

Andrew McNamara ⋅ Cody Mazza-Anthony ⋅ Shuying Sun

HALL D1

Sun 5 Jul 7:30 p.m. PDT — 8:30 p.m. PDT

Abstract:

We present Shopify's Model Optimization Flywheel, a practical methodology for turning frontier-quality LLM behavior into faster, cheaper, and continuously improving production systems. The flywheel starts with reliable evaluation: LLM-as-judge evaluators grounded in human-labeled data become the canonical metrics for prompt optimization, distillation, and production regression detection.
Using Tangle-powered experimentation workflows, we optimize frontier-model system prompts, collect training data from production A/B traffic and synthetic merchant/user rollouts, and distill smaller models with SFT, on-policy distillation, and GRPO. These models can replicate, and in some cases exceed, frontier-model behavior at much lower serving cost. We then compress prompts with gist tokens to reduce context overhead and improve latency.
After deployment, the loop continues by sampling low-scoring production conversations, using stronger reasoning models to critique and "heal" them, folding repaired examples back into training, and re-running distillation. This flywheel has reduced serving cost and latency while improving production quality. We will share concrete recipes, quality-cost-latency trade-offs, and a blueprint for building self-improving LLM systems that get better and cheaper over time.
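The post-deployment "heal" step described above can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the presenters' implementation: the `Conversation` record, the `heal_low_scoring` function, the 0-to-1 score scale, the 0.5 threshold, and the toy judge and healer are all hypothetical names and values introduced here. In a real system the judge would be an LLM-as-judge evaluator and the healer a stronger reasoning model.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Conversation:
    prompt: str
    response: str
    score: float  # judge score; the [0, 1] scale is an assumption


def heal_low_scoring(
    conversations: List[Conversation],
    judge: Callable[[str, str], float],
    heal: Callable[[str, str], str],
    threshold: float = 0.5,  # hypothetical cutoff for "low-scoring"
) -> List[Conversation]:
    """Return repaired versions of low-scoring conversations.

    A repair is kept only if the judge scores it above the original
    and at or above the threshold, so failed repairs are not added
    to the training data.
    """
    repaired = []
    for conv in conversations:
        if conv.score >= threshold:
            continue  # already good enough; nothing to heal
        new_response = heal(conv.prompt, conv.response)
        new_score = judge(conv.prompt, new_response)
        if new_score > conv.score and new_score >= threshold:
            repaired.append(replace(conv, response=new_response, score=new_score))
    return repaired


# Toy stand-ins so the sketch runs without any model calls.
def toy_judge(prompt: str, response: str) -> float:
    return 1.0 if response.endswith(".") else 0.2


def toy_heal(prompt: str, response: str) -> str:
    return response.rstrip() + "."
```

The repaired examples returned by such a function would then be appended to the training set before the next distillation run. Re-judging each repair before accepting it is one possible design choice to keep unsuccessful repairs out of training.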
