Abstract

Semi-Supervised Learning Using Randomized Mincuts
Avrim Blum - Carnegie Mellon University John Lafferty - Carnegie Mellon University Mugizi Rwebangira - Carnegie Mellon University Rajashekar Reddy - Carnegie Mellon University
In many application domains there is a large amount of unlabeled data but onlya very limited amount of labeled training data. One general approach that hasbeen explored for utilizing this unlabeled data is to construct a graph on allthe data points based on distance relationships among examples, and then touse the known labels to perform some type of graph partitioning. One naturalpartitioning to use is the minimum cut that agrees with the labeled data (Blumand Chawla, 2001), which can be thought of as giving the most probable labelassignment if one views labels as generated according to a Markov Random Fieldon the graph. Zhu et al. (2003) propose a cut based on a relaxation of thisfield, and Joachims (2003) gives an algorithm based on finding an approximatemin-ratio cut. In this paper, we extend the mincut approach by adding randomness to the graphstructure. The resulting algorithm addresses several shortcomings of thebasic mincut approach, and can be given theoretical justification from both aMarkov random field perspective and from sample complexity considerations. Incases where the graph does not have small cuts for a given classificationproblem, randomization may not help. However, our experiments on severaldatasets show that when the structure of the graph supports small cuts, thiscan result in highly accurate classifiers with good accuracy/coveragetradeoffs. In addition, we are able to achieve good performance with a verysimple graph-construction procedure.

Semi-Supervised Learning Using Randomized Mincuts

Avrim Blum - Carnegie Mellon University
John Lafferty - Carnegie Mellon University
Mugizi Rwebangira - Carnegie Mellon University
Rajashekar Reddy - Carnegie Mellon University

In many application domains there is a large amount of unlabeled data but onlya very limited amount of labeled training data. One general approach that hasbeen explored for utilizing this unlabeled data is to construct a graph on allthe data points based on distance relationships among examples, and then touse the known labels to perform some type of graph partitioning. One naturalpartitioning to use is the minimum cut that agrees with the labeled data (Blumand Chawla, 2001), which can be thought of as giving the most probable labelassignment if one views labels as generated according to a Markov Random Fieldon the graph. Zhu et al. (2003) propose a cut based on a relaxation of thisfield, and Joachims (2003) gives an algorithm based on finding an approximatemin-ratio cut. In this paper, we extend the mincut approach by adding randomness to the graphstructure. The resulting algorithm addresses several shortcomings of thebasic mincut approach, and can be given theoretical justification from both aMarkov random field perspective and from sample complexity considerations. Incases where the graph does not have small cuts for a given classificationproblem, randomization may not help. However, our experiments on severaldatasets show that when the structure of the graph supports small cuts, thiscan result in highly accurate classifiers with good accuracy/coveragetradeoffs. In addition, we are able to achieve good performance with a verysimple graph-construction procedure.