Bilinear Bandits with Partially Observable Features
Wooseong Cho ⋅ JiHyeong Park ⋅ Min-hwan Oh
Abstract
We study bilinear bandits with partially observable features on both the user and item sides. In each of $T$ rounds, the learner selects an arm and observes only the reward for the chosen pair. The reward model is linear in the user and item features with an unknown parameter matrix. Existing literature commonly reduces this problem to a linear bandit via a Kronecker product representation of user and item features, at the cost of increased dimensionality. We propose \texttt{BiRoLF}, an algorithm robust to latent features, which directly leverages the bilinear structure without such linearization. It enhances feature selection by augmenting the null space of the observed features and employs doubly robust (DR) estimation to impute unobserved rewards for unselected arms, constructing unbiased pseudo-rewards. We estimate the parameters using Lasso regularization, which promotes sparsity in the coefficients of latent components orthogonal to the observed features. \texttt{BiRoLF} achieves a $\tilde{O}(\sqrt{(d_x + d_{h_x})(d_y + d_{h_y}) T})$ regret bound, where $d_x$ and $d_y$ are the dimensions of the observable feature vectors, $d_{h_x}$ and $d_{h_y}$ denote the numbers of nonzero coefficients in the components orthogonal to the observed features. We segment cases by the relationship between observable and latent features and find that \texttt{BiRoLF} achieves strong regret performance while outperforming competing methods in computational metrics, reducing the overhead of feature linearization.
Successful Page Load