54 Results
Poster | Thu 2:30 | Benchmarking Deletion Metrics with the Principled Explanations | Yipei Wang · Xiaoqian Wang
Poster | Thu 4:30 | CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution | Alex Gu · Baptiste Roziere · Hugh Leather · Armando Solar-Lezama · Gabriel Synnaeve · Sida Wang
Poster | Wed 2:30 | Position: Benchmarking is Limited in Reinforcement Learning Research | Scott Jordan · Adam White · Bruno da Silva · Martha White · Philip Thomas
Poster | Thu 4:30 | FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning | Wenzhe Li · Zihan Ding · Seth Karten · Chi Jin
Poster | Tue 2:30 | OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift | Lin Li · Yifei Wang · Chawin Sitawarin · Michael Spratling
Poster | Thu 2:30 | LCA-on-the-Line: Benchmarking Out of Distribution Generalization with Class Taxonomies | Jia Shi · Gautam Rajendrakumar Gare · Jinjin Tian · Siqi Chai · Zhiqiu Lin · Arun Balajee Vasudevan · Di Feng · Francesco Ferroni · Shu Kong
Poster | Wed 2:30 | Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark | Yihua Zhang · Pingzhi Li · Junyuan Hong · Jiaxiang Li · Yimeng Zhang · Wenqing Zheng · Pin-Yu Chen · Jason Lee · Wotao Yin · Mingyi Hong · Zhangyang “Atlas” Wang · Sijia Liu · Tianlong Chen
Poster | Wed 4:30 | CurBench: Curriculum Learning Benchmark | Yuwei Zhou · Zirui Pan · Xin Wang · Hong Chen · Haoyang Li · Yanwen Huang · Zhixiao Xiong · Fangzhou Xiong · Peiyang Xu · Shengnan liu · Wenwu Zhu
Poster | Wed 2:30 | MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | Kaining Ying · Fanqing Meng · Jin Wang · Zhiqian Li · Han Lin · Yue Yang · Hao Zhang · Wenbo Zhang · Yuqi Lin · Shuo Liu · jiayi lei · Quanfeng Lu · Runjian Chen · Peng Xu · Renrui Zhang · Haozhe Zhang · Peng Gao · Yali Wang · Yu Qiao · Ping Luo · Kaipeng Zhang · WENQI SHAO
Poster | Thu 4:30 | Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks | Guanhua Zhang · Moritz Hardt
Poster | Wed 4:30 | TravelPlanner: A Benchmark for Real-World Planning with Language Agents | Jian Xie · Kai Zhang · Jiangjie Chen · Tinghui Zhu · Renze Lou · Yuandong Tian · Yanghua Xiao · Yu Su
Oral | Thu 7:45 | MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | Dongping Chen · Ruoxi Chen · Shilin Zhang · Yaochen Wang · Yinuo Liu · Huichi Zhou · Qihui Zhang · Yao Wan · Pan Zhou · Lichao Sun