

Timezone: America/Vancouver

Registration Desk: Registration West Sun 13 Jul 11:00 a.m.  


Expo Talk Panel: Thursday Night Football on Prime Video – Broadcast Innovation Sun 13 Jul 02:00 p.m.  

Tal Darom

This lecture will present the advanced analytics and machine learning-powered features developed by Amazon to enhance the live viewing experience for Thursday Night Football on Prime Video. Leveraging player tracking data, the team developed three novel features: Defensive Vulnerability, which identifies defensive weak spots before the snap; Pressure Alerts, a deep learning model that predicts quarterback pressure; and Coverage Prediction, which forecasts man vs. zone coverage. These features harness Amazon's cloud and edge computing capabilities to deliver real-time insights to both casual and avid fans. The results highlight how sports broadcasting is shifting towards a data-driven approach powered by the latest advancements in artificial intelligence.
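The Pressure Alerts feature described above amounts to a binary prediction problem over player-tracking features. The sketch below is purely illustrative and assumes hand-picked features such as closest-rusher distance and time since snap; it is not the production Prime Video model.

```python
# Illustrative sketch only: a small binary classifier predicting quarterback
# pressure from hand-crafted player-tracking features. The feature set and
# architecture are assumptions, not Amazon's deployed system.
import torch
import torch.nn as nn

class PressureModel(nn.Module):
    def __init__(self, num_features: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 1),  # logit for "quarterback will face pressure"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Hypothetical per-play features: closest-rusher distance, closing speed,
# unblocked rushers, pocket depth, time since snap, number of blockers.
features = torch.tensor([[2.1, 6.3, 1.0, 7.5, 2.4, 5.0]])
pressure_prob = torch.sigmoid(PressureModel()(features))
print(f"predicted pressure probability: {pressure_prob.item():.2f}")
```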


Expo Talk Panel: OmniLong: A Resource-Effective Context Scaling Framework for Multimodal LLM Fine-tuning Sun 13 Jul 02:00 p.m.  

Yin Song · Chen Wu

This presentation introduces OmniLong, a novel computational framework addressing the fundamental challenge of context length scaling in multimodal large language models (MLLMs). While extended contextual understanding across high-frame-rate videos and lengthy documents represents a critical frontier for practical applications, current approaches necessitate substantial computational infrastructure that creates significant barriers to entry. OmniLong offers a paradigm shift through a cohesive architecture that simultaneously extends context across textual and visual modalities while reducing computational requirements. Through advanced sequence parallelism and strategic CPU-GPU memory management techniques, OmniLong demonstrates superior computational efficiency by successfully fine-tuning models on high-density video content comprising up to 2048 sampled frames while utilizing only 8 A100 GPUs. Empirical evaluation shows OmniLong-enhanced models consistently outperform their foundational counterparts on established benchmarks, with OmniLong-Qwen2.5-VL-7B achieving particularly notable results on the VideoMME leaderboard for video analysis tasks. This talk will present a comprehensive analysis of OmniLong's technical architecture, optimization methodology, and broader implications for democratizing access to state-of-the-art multimodal AI capabilities across research institutions and industrial applications with diverse resource constraints.
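To make the CPU-GPU memory-management idea concrete, the toy sketch below processes a long frame-embedding sequence in chunks and offloads results to host memory. It is a minimal illustration of chunked processing with offloading under assumed shapes, not OmniLong's actual sequence-parallel implementation.

```python
# Toy illustration of chunked encoding with CPU offloading for long sequences.
# Model size, chunk size, and sequence length are arbitrary assumptions.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True).to(device)

long_sequence = torch.randn(1, 16384, 256)   # e.g., embeddings of many sampled video frames
chunk_size = 2048
offloaded_chunks = []

with torch.no_grad():
    for start in range(0, long_sequence.shape[1], chunk_size):
        chunk = long_sequence[:, start:start + chunk_size].to(device)
        out = encoder(chunk)                     # encode one chunk on the accelerator
        offloaded_chunks.append(out.to("cpu"))   # offload the result to host memory

full_output = torch.cat(offloaded_chunks, dim=1)  # reassemble on CPU
print(full_output.shape)  # torch.Size([1, 16384, 256])
```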


Expo Talk Panel: Gen AI Applications in Amazon Pharmacy Sun 13 Jul 02:00 p.m.  

Yifu Chen · Cristobal Pais

We present two innovative applications of Large Language Models (LLMs) at Amazon Pharmacy: (1) an AI assistant for customer support and (2) a medication direction copilot for patient safety. The Pharmacy AI Assistant, launched in March 2024 and built on an LLM with retrieval-augmented generation (RAG), led to an 11% reduction in human support contact rate and a 15% improvement in issue resolution rates. We later added a hybrid architecture that integrates multi-armed bandits with LLMs to dynamically suggest follow-up questions, addressing challenges in customer inquiry articulation while balancing exploration and exploitation in conversational flows; this led to an additional 9% reduction in contact rate. (2) MEDIC (medication direction copilot) emulates pharmacist reasoning by fine-tuning an LLM on a compact set of expert-annotated directions to accurately extract and communicate core prescription components. When compared against two state-of-the-art LLM-based competitors, those systems recorded 1.51 and 4.38 times more near-miss events (errors caught before reaching patients) than MEDIC. Production deployment demonstrated a 33% reduction in medication direction errors. These systems demonstrate how LLMs, when enhanced with domain expertise and strategic architectural decisions, can significantly improve both customer experience and patient safety in online pharmacy operations.
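The follow-up-question component pairs an LLM (which generates candidate questions) with a bandit (which learns which candidates reduce contacts). The sketch below shows only the bandit half with an epsilon-greedy policy; the arms, rewards, and LLM integration are assumptions for illustration, not the deployed system.

```python
# Hedged sketch: an epsilon-greedy multi-armed bandit choosing among candidate
# follow-up questions. Candidate questions and the reward signal are invented.
import random

class FollowUpBandit:
    def __init__(self, questions, epsilon=0.1):
        self.questions = questions
        self.epsilon = epsilon
        self.counts = [0] * len(questions)
        self.values = [0.0] * len(questions)  # running mean reward per arm

    def select(self) -> int:
        if random.random() < self.epsilon:    # explore a random follow-up
            return random.randrange(len(self.questions))
        return max(range(len(self.questions)), key=lambda i: self.values[i])  # exploit

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = FollowUpBandit([
    "Is this about an existing order or a new prescription?",
    "Would you like to check your insurance coverage?",
    "Do you need help with delivery timing?",
])
arm = bandit.select()
# reward = 1.0 if the follow-up resolved the issue without human contact, else 0.0
bandit.update(arm, reward=1.0)
```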


Expo Talk Panel: Distributed Computing Architectures as a Solution to AI's Energy Crisis: Empirical Analysis of Decentralized Training Sun 13 Jul 04:00 p.m.  

Greg Osuri

The exponential growth in AI model size has created unprecedented energy demands that challenge traditional computing infrastructure. Recent industry reports have estimated that by 2040, AI inference and training will collectively require 600 terawatt-hours annually—equivalent to the energy consumption of a medium-sized industrial nation. Current hyperscaler architectures introduce critical bottlenecks: geographically concentrated energy demands, transmission constraints, and concerning environmental impacts, with some facilities resorting to fossil fuel consumption to meet power requirements.

Greg Osuri, founder and core contributor of Akash Network, will discuss how decentralized marketplaces efficiently allocate resources across geographically dispersed nodes. He will demonstrate how Akash has achieved approximately 70% resource utilization rates across heterogeneous hardware configurations, including recent breakthroughs in distributed training algorithms that overcome previous limitations in heterogeneous compute environments. The presentation will include a technical analysis of small modular data center architectures optimized for distributed AI workloads, including their integration with renewable energy sources. This will highlight how decentralized approaches can address current energy constraints while democratizing access to compute resources, potentially preventing market concentration that threatens open innovation in AI research.


Expo Talk Panel: JokeEval: Are the Jokes Funny? Review of Computational Evaluation Techniques to improve Joke Generation Sun 13 Jul 04:00 p.m.  

Humor is a nuanced and essential facet of human communication, often relying on incongruity, surprise, and cultural context to elicit amusement. This paper presents JokeEval, a computational framework designed to evaluate the quality of AI-generated jokes. Through empirical experiments on both synthetic and open-source datasets, we demonstrate that machine learning techniques, particularly a hybrid Convolutional Neural Network with recurrent layers, can effectively distinguish between “Funny” and “Not Funny” jokes, achieving a statistically significant F1-score of 71.2% on the ColBERT dataset. Our methodology leverages high-dimensional vector embeddings, crowd-sourced human annotations, and diverse evaluation pipelines, including supervised classifiers, deep neural networks, and LLM-as-a-judge protocols, to assess humor at scale. In doing so, we highlight both the promise and current limitations of AI in understanding and generating humor. The results pave the way for more engaging, human-aligned content generation and offer a feedback loop to iteratively improve joke-writing capabilities in virtual assistants and other AI-driven systems.
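The hybrid convolutional-plus-recurrent classifier the abstract mentions can be sketched as below. Layer sizes, vocabulary, and the GRU choice are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative sketch of a CNN + recurrent text classifier for "Funny" vs.
# "Not Funny" jokes; hyperparameters are placeholders.
import torch
import torch.nn as nn

class JokeClassifier(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)  # local n-gram features
        self.rnn = nn.GRU(64, 64, batch_first=True)                     # sequential context
        self.head = nn.Linear(64, 2)                                    # Funny / Not Funny logits

    def forward(self, token_ids):
        x = self.embed(token_ids)            # (batch, seq, embed)
        x = self.conv(x.transpose(1, 2))     # (batch, 64, seq)
        x = torch.relu(x).transpose(1, 2)    # back to (batch, seq, 64)
        _, h = self.rnn(x)                   # final hidden state (1, batch, 64)
        return self.head(h.squeeze(0))       # (batch, 2)

logits = JokeClassifier()(torch.randint(0, 30000, (4, 40)))  # 4 jokes, 40 tokens each
print(logits.shape)  # torch.Size([4, 2])
```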


Expo Talk Panel: A Unified Framework for Generative AI Safety Sun 13 Jul 04:00 p.m.  

Pin-Yu Chen

Large language models (LLMs) and Generative AI (GenAI) are at the forefront of frontier AI research and technology. With their rapidly increasing popularity and availability, challenges and concerns about their misuse and safety risks are becoming more prominent than ever. In this talk, we introduce a unified computational framework for evaluating and improving a wide range of safety challenges in generative AI. Specifically, we will show new tools and insights to explore and mitigate the safety and robustness risks associated with state-of-the-art LLMs and GenAI models, including (i) safety risks in fine-tuning LLMs, (ii) LLM jailbreak mitigation, (iii) prompt engineering for safety debugging, and (iv) robust detection of AI-generated content.


Expo Talk Panel: Human-Aligned Long-Form Evaluation (HALF-Eval): Framework for Assessing AI-Generated Content Sun 13 Jul 05:00 p.m.  

Evaluating the quality of long-form AI-generated content remains a significant challenge, particularly in achieving consistent alignment with human judgment across diverse formats. This paper presents the Human-Aligned Long-Form Evaluation (HALF-Eval) framework, a generalizable, scalable, and systematic methodology for assessing the quality of AI-generated long-form content such as articles, blogs, and essays. HALF-Eval utilizes a structured checklist-based evaluation to capture essential dimensions of content quality, including depth, coherence, relevance, and evidence support. By leveraging human-annotated data, the framework trains machine learning models to aggregate individual checklist scores into comprehensive quality assessments, enabling automated and reliable classification of content as high- or low-quality. Experimental results demonstrate that HALF-Eval outperforms conventional LLM-based scoring approaches, achieving closer alignment with human evaluators and providing actionable feedback for iterative content improvement. The proposed framework offers a robust foundation for advancing grounded, human-centric evaluation systems and supports the scalable generation of high-quality AI-driven long-form content.
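The aggregation step, mapping per-checklist-item scores to a high/low quality label, can be illustrated with a simple classifier. The checklist items, scores, and labels below are invented placeholders, and the choice of logistic regression is an assumption rather than the framework's actual model.

```python
# Minimal sketch of aggregating checklist scores into a quality label.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: scores for [depth, coherence, relevance, evidence_support] on a 0-1 scale,
# obtained from human annotators or automated checkers (placeholder values).
checklist_scores = np.array([
    [0.9, 0.8, 0.9, 0.7],
    [0.3, 0.4, 0.5, 0.2],
    [0.8, 0.9, 0.7, 0.8],
    [0.2, 0.3, 0.4, 0.1],
])
labels = np.array([1, 0, 1, 0])  # 1 = high quality, 0 = low quality (human judgments)

aggregator = LogisticRegression().fit(checklist_scores, labels)
new_article_scores = np.array([[0.7, 0.6, 0.8, 0.5]])
print(aggregator.predict_proba(new_article_scores)[0, 1])  # predicted P(high quality)
```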


Expo Talk Panel: Distillation Scaling Laws Sun 13 Jul 05:00 p.m.  

Dan Busbridge · Jason Ramapuram · Russ Webb · Floris Weers

Smaller models are cheaper to serve, faster, use less battery, produce less heat, have a lower inference carbon footprint, and are easier to study for academics. Historically, small, capable models have been expensive to train. Knowledge distillation can reduce pretraining costs, yet it is poorly understood. We find a distillation scaling law that enables efficient pretraining strategies, bringing our understanding of distillation closer to our understanding of supervised pretraining.
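For context, the talk builds on the standard knowledge-distillation objective, in which the student matches the teacher's temperature-softened output distribution. The sketch below shows that conventional loss; the scaling-law analysis itself is the subject of the talk and is not reproduced here.

```python
# Standard knowledge-distillation loss: KL divergence between temperature-softened
# teacher and student distributions, scaled by T^2 as is conventional.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

student_logits = torch.randn(8, 32000)   # batch of student next-token logits (placeholder)
teacher_logits = torch.randn(8, 32000)   # teacher logits for the same positions (placeholder)
print(distillation_loss(student_logits, teacher_logits).item())
```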


Expo Talk Panel: Re-Imagine: Symbolic Benchmark Synthesis for Reasoning Evaluation Sun 13 Jul 05:00 p.m.  

Rachel Lawrence

Recent Large Language Models (LLMs) have achieved high reported accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true “reasoning” or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE: a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
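The core mechanism, altering a problem in a symbolic intermediate form so that memorized answers no longer apply, can be illustrated with a toy example. The template and mutation below are invented for illustration and are not RE-IMAGINE's actual intermediate representation or pipeline.

```python
# Toy illustration: represent a word problem symbolically, mutate its constants,
# and recompute the gold answer so each variant is unseen by construction.
import random

TEMPLATE = "Alice has {a} apples and buys {b} more. She gives away {c}. How many are left?"

def answer(a: int, b: int, c: int) -> int:
    return a + b - c

def mutate_problem(seed: int):
    rng = random.Random(seed)
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    c = rng.randint(1, a + b)             # keep the answer non-negative
    return TEMPLATE.format(a=a, b=b, c=c), answer(a, b, c)

for seed in range(3):                      # arbitrarily many problem variations
    question, gold = mutate_problem(seed)
    print(question, "->", gold)
```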