Root Cause Analysis of Failures in Microservices via Bayesian Root Cause Discovery
Kenneth Lee ⋅ Zihan Zhou ⋅ Murat Kocaoglu
Abstract
Modern cloud systems rely on architectures with many interconnected microservices, which enable scalability and flexibility but make troubleshooting failures difficult. Identifying the root cause requires navigating complex dependencies, often beyond the capacity of domain experts. Causal models offer a principled approach to root cause analysis (RCA), but prior methods are typically sample inefficient, as they assume access to the full causal graph or require large numbers of post-failure interventions. We introduce Bayesian Root Cause Discovery (BRCD), which leverages a partial causal structure (a CPDAG learned during the pre-failure period) and performs Bayesian inference without enumerating all DAGs from each interventional Markov equivalence class ($\mathcal{I}$-MEC) for each root cause candidate. Using a recent uniform DAG sampling framework (Wienöbst et al., 2023), BRCD provides the first statistical consistency guarantees for nonparametric RCA, with both identifiability and finite-sample posterior bounds under $\varepsilon$-vanishing approximation. Empirically, across synthetic benchmarks and three microservice systems (Online Boutique, Sockshop, Petshop), BRCD achieves state-of-the-art top-$l$ accuracy while remaining effective in low-failure-sample regimes and scaling to large graphs.
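As a rough illustration of two quantities the abstract mentions, a posterior over root cause candidates and top-$l$ accuracy, the following minimal sketch may help. It is not the BRCD algorithm: it assumes a per-candidate log-likelihood has already been computed by some means, assumes a uniform prior unless one is given, and uses hypothetical function names and service names throughout.

```python
import math

def posterior_over_candidates(log_likelihoods, log_prior=None):
    """Turn per-candidate log-likelihoods into a normalized posterior.

    `log_likelihoods` maps a candidate name to log p(failure data | candidate).
    How those values are obtained is outside this sketch (assumption).
    """
    names = list(log_likelihoods)
    if log_prior is None:
        # Assumption: uniform prior over candidates.
        log_prior = {n: 0.0 for n in names}
    scores = {n: log_likelihoods[n] + log_prior[n] for n in names}
    # Log-sum-exp for numerical stability.
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {n: math.exp(s - log_z) for n, s in scores.items()}

def top_l(posterior, l):
    """Return the l candidates with the highest posterior probability."""
    return sorted(posterior, key=posterior.get, reverse=True)[:l]

def top_l_accuracy(rankings, truths, l):
    """Fraction of failure cases whose true root cause is in the top l."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:l])
    return hits / len(truths)

# Hypothetical services and made-up log-likelihood values, for illustration only.
example = {"cart": -10.0, "checkout": -12.0, "payment": -11.0}
post = posterior_over_candidates(example)
ranking = top_l(post, 2)
```

Here `ranking` lists the two candidates with the highest posterior mass. Evaluating `top_l_accuracy` over many failure cases gives the kind of top-$l$ metric reported in the abstract.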