Abstract

Tractable Learning of Large Bayes Net Structures from Sparse Data
Anna Goldenberg - Carnegie Mellon University Andrew Moore - Carnegie Mellon University
This paper addresses three questions. Is it useful to attempt to learn a Bayesian network structure with hundreds of thousands of nodes? How should such structure search proceed practically? The third question arises out of our approach to the second: how can Frequent Sets \cite{agrawal:sigmod93}, which are extremely popular in the area of descriptive data mining, be turned into a probabilistic model? Large sparse datasets with hundreds of thousands of records and attributes appear in social networks, warehousing, supermarket transactions and web logs. The complexity of structural search made learning of factored probabilistic models on such datasets unfeasible. Wepropose to use Frequent Sets to significantly speed up the structural search.Unlike previous approaches, we not only cache n-way sufficient statistics, butalso exploit their local structure. We also present an empirical evaluation ofour algorithm applied to several massive datasets.

Tractable Learning of Large Bayes Net Structures from Sparse Data

Anna Goldenberg - Carnegie Mellon University
Andrew Moore - Carnegie Mellon University

This paper addresses three questions. Is it useful to attempt to learn a Bayesian network structure with hundreds of thousands of nodes? How should such structure search proceed practically? The third question arises out of our approach to the second: how can Frequent Sets \cite{agrawal:sigmod93}, which are extremely popular in the area of descriptive data mining, be turned into a probabilistic model? Large sparse datasets with hundreds of thousands of records and attributes appear in social networks, warehousing, supermarket transactions and web logs. The complexity of structural search made learning of factored probabilistic models on such datasets unfeasible. Wepropose to use Frequent Sets to significantly speed up the structural search.Unlike previous approaches, we not only cache n-way sufficient statistics, butalso exploit their local structure. We also present an empirical evaluation ofour algorithm applied to several massive datasets.