
Workshop: Complex feedback in online learning

Dynamical Linear Bandits for Long-Lasting Vanishing Rewards

Marco Mussi · Alberto Maria Metelli · Marcello Restelli


In many real-world sequential decision-making problems, an action might not be immediately reflected in the feedback and can spread its effects over a long time horizon. For instance, in online advertising, investing in a platform produces an increase in awareness, but the actual reward (e.g., a conversion) might occur far in the future. Furthermore, whether a conversion takes place depends on several factors: how fast the awareness grows, possible vanishing effects of awareness, and synergy or interference with other advertising platforms. Previous work has investigated the Multi-Armed Bandit framework with delayed and aggregated feedback, but without assuming a particular structure on how an action propagates into the future, thus disregarding possible hidden dynamical effects. In this paper, we introduce a novel setting, Dynamical Linear Bandits (DLB), an extension of linear bandits characterized by a hidden state. When an action is performed, the learner observes a noisy reward whose mean is a linear function of the hidden state and of the action, and the hidden state evolves according to linear dynamics. In this way, the effects of each action are delayed by the system evolution, persist over time, and the interplay between the action components is taken into account. After introducing the setting and discussing the notion of optimal policy, we provide an anytime optimistic regret-minimization algorithm, Dynamical Linear Upper Confidence Bound (DynLin-UCB), that suffers a regret of order O(c d √T), where c is a constant depending on the properties of the linear dynamical evolution. Finally, we conduct a numerical validation on a synthetic environment to show the effectiveness of DynLin-UCB in comparison with bandit baselines.
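To make the setting concrete, here is a minimal simulation sketch of the kind of environment the abstract describes: a hidden state updated by stable linear dynamics, with a noisy reward that is linear in the hidden state and the action. The specific matrices, weight vectors, and noise level below are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

# Sketch of a Dynamical Linear Bandit environment (assumed parameterization):
# hidden state h_{t+1} = A h_t + B a_t, reward r_t = omega @ h_t + theta @ a_t + noise.
# A, B, omega, theta, and the noise scale are hypothetical choices for illustration.

rng = np.random.default_rng(0)

d, n = 2, 2                      # action dimension, hidden-state dimension
A = np.array([[0.8, 0.0],        # stable dynamics (spectral radius < 1):
              [0.1, 0.7]])       # effects of past actions vanish over time
B = np.eye(n, d)                 # how the action enters the hidden state
omega = np.array([1.0, 0.5])     # reward weights on the hidden state
theta = np.array([0.2, 0.1])     # instantaneous reward weights on the action

h = np.zeros(n)                  # hidden state, unobserved by the learner
rewards = []
for t in range(50):
    a = rng.normal(size=d)
    a /= np.linalg.norm(a)       # play an action on the unit sphere
    r = omega @ h + theta @ a + 0.01 * rng.normal()
    rewards.append(r)
    h = A @ h + B @ a            # delayed, persistent effect of the action

print(len(rewards))
```

Because the spectral radius of A is below one, each action's contribution to the hidden state decays geometrically, which matches the "long-lasting vanishing rewards" behavior the abstract refers to.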
