

Poster

LOB-Bench: Benchmarking Generative AI for Finance - an Application to Limit Order Book Data

Peer Nagy · Sascha Frey · Kang Li · Bidipta Sarkar · Svitlana Vyetrenko · Stefan Zohren · Anisoara Calinescu · Jakob Foerster

West Exhibition Hall B2-B3 #W-404
[ Project Page ]
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

While financial data presents one of the most challenging and interesting sequence modelling tasks due to high noise, heavy tails, and strategic interactions, progress in this area has been hindered by the lack of consensus on quantitative evaluation paradigms. To address this, we present LOB-Bench, a benchmark, implemented in Python, designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB) in the LOBSTER format. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting flexible multivariate statistical evaluation. The benchmark includes commonly used LOB statistics, such as spread, order book volumes, order imbalance, and message inter-arrival times, along with scores from a trained discriminator network. Lastly, LOB-Bench contains "market impact metrics", i.e. the cross-correlations and price response functions for specific events in the data. We benchmark generative autoregressive state-space models, a (C)GAN, and a parametric LOB model, and find that the autoregressive GenAI approach beats traditional model classes.
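To make the distributional-evaluation idea concrete, the sketch below compares one unconditional statistic (e.g. the bid-ask spread) between real and generated data via empirical histograms, scored with an L1 distance and the Wasserstein-1 distance. This is an illustrative stand-in, not the LOB-Bench API: the function name and the synthetic stand-in data are assumptions.

```python
# Minimal sketch of histogram-based distributional evaluation (illustrative
# only, not the LOB-Bench interface). Assumes two 1-D arrays holding a
# per-event statistic, e.g. the bid-ask spread, computed on real and on
# generated message sequences.
import numpy as np
from scipy.stats import wasserstein_distance

def l1_histogram_distance(real, generated, bins=100):
    """L1 distance between the two empirical density estimates."""
    lo = min(real.min(), generated.min())
    hi = max(real.max(), generated.max())
    p, edges = np.histogram(real, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(generated, bins=bins, range=(lo, hi), density=True)
    widths = np.diff(edges)
    return np.sum(np.abs(p - q) * widths)

# Hypothetical example: spreads (in ticks) on real vs. model-generated data.
rng = np.random.default_rng(0)
real_spread = rng.exponential(scale=2.0, size=10_000)  # stand-in for real data
gen_spread = rng.exponential(scale=2.3, size=10_000)   # stand-in for generated data
print("L1 distance:", l1_histogram_distance(real_spread, gen_spread))
print("Wasserstein-1:", wasserstein_distance(real_spread, gen_spread))
```

The same recipe applies per statistic; conditional variants would additionally bin the statistic by a conditioning variable (e.g. market state), which the sketch omits for brevity.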

Lay Summary:

High-frequency trading data is an interesting task for models that aim to generate sequential events, i.e. where each generated piece of data depends on all the previously generated data. A number of models attempt to generate this kind of high-frequency financial data using different approaches, but it is very difficult to compare them. This paper provides a series of evaluations that allow data generated by different models to be compared.

The benchmark consists of three main parts. The first considers different features that can be measured from a sequence of generated events, for example the number of orders to buy or sell a stock in a time period. Measuring such a feature across a large number of sequences allows the construction of a distribution, in practice a histogram. This distribution can be constructed for both real and generated data, and metrics can be applied to measure how similar or different these distributions are. We also consider the case where a neural network itself learns the distinguishing features of the data, as sketched below. Secondly, we consider the price impact of different order types: a sign that a generative model reproduces sequences well is that the arrival of a given order type moves the price in an expected way, on average, over some time period (see the second sketch below). Finally, we use generated data to see how it affects learning on a trend-forecasting task.

We compare cutting-edge models for sequence generation with more traditional models and find that the newer models outperform the traditional ones. Having access to this sort of benchmark is very important, as it allows researchers to compare, in a standardised way, how good their models are in this application.
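The "neural network that learns the distinguishing features" is a classifier-based (discriminator) test. The minimal sketch below uses logistic regression on synthetic feature windows as a stand-in for the trained discriminator network in the paper; a held-out AUC near 0.5 means the generated data is hard to distinguish from real data. The data and model choice here are illustrative assumptions, not the paper's setup.

```python
# Sketch of a discriminator-based realism score: train a binary classifier
# to separate real from generated feature windows and report held-out AUC
# (0.5 = indistinguishable). Logistic regression stands in for the neural
# discriminator used in the paper; all inputs are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
real_feats = rng.normal(0.0, 1.0, size=(5000, 16))  # stand-in real windows
gen_feats = rng.normal(0.1, 1.1, size=(5000, 16))   # stand-in generated windows

X = np.vstack([real_feats, gen_feats])
y = np.concatenate([np.ones(len(real_feats)), np.zeros(len(gen_feats))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"Discriminator AUC: {auc:.3f} (0.5 = indistinguishable)")
```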
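For the price-impact part, a standard empirical tool is the event-conditioned response function R(l) = E[eps_t (m_{t+l} - m_t)]: the average mid-price move, signed by the direction of the triggering event, l events after an order arrives. The sketch below is a generic estimator of this quantity, not the benchmark's exact implementation; the input names and shapes are assumptions.

```python
# Illustrative estimator of an empirical price-response function, in the
# spirit of the benchmark's market-impact metrics (interface is assumed,
# not LOB-Bench's actual API). Inputs: mid-prices m_t per event, and signed
# markers eps_t (+1 buy-side, -1 sell-side) for the event type of interest.
import numpy as np

def response_function(mid_prices, event_signs, max_lag=50):
    """Return lags 1..max_lag and R(l) = mean(eps_t * (m_{t+l} - m_t))."""
    mid = np.asarray(mid_prices, dtype=float)
    eps = np.asarray(event_signs, dtype=float)
    lags = np.arange(1, max_lag + 1)
    resp = np.empty(max_lag)
    for i, lag in enumerate(lags):
        resp[i] = np.mean(eps[:-lag] * (mid[lag:] - mid[:-lag]))
    return lags, resp
```

Applied separately to real and generated sequences for each event type, the resulting response curves can be compared directly: a realistic generator should reproduce both the sign and the shape of the average price reaction.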
