Timezone: »
Oral
Categorical Feature Compression via Submodular Optimization
Mohammad Hossein Bateni · Lin Chen · Hossein Esfandiari · Thomas Fu · Vahab Mirrokni · Afshin Rostamizadeh
In the era of big data, learning from categorical features with very large vocabularies (e.g., 28 million for the Criteo click prediction dataset) has become a practical challenge for machine learning researchers and practitioners. We design a highly-scalable vocabulary compression algorithm that seeks to maximize the mutual information between the compressed categorical feature and the target binary labels and we furthermore show that its solution is guaranteed to be within a $1-1/e \approx 63\%$ factor of the global optimal solution. Although in some settings, entropy-based set functions are known to be submodular, this is not the case for the mutual information objective we consider (mutual information with respect to the target labels). To address this, we introduce a novel re-parametrization of the mutual information objective, which we prove is submodular, and also design a data structure to query the submodular function in amortized $O(\log n )$ time (where $n$ is the input vocabulary size). Our complete algorithm is shown to operate in $O(n \log n )$ time. Additionally, we design a distributed implementation in which the query data structure is decomposed across $O(k)$ machines such that each machine only requires $O(\frac n k)$ space, while still preserving the approximation guarantee and using only logarithmic rounds of computation. We also provide analysis of simple alternative heuristic compression methods to demonstrate they cannot achieve any approximation guarantee. Using the large-scale Criteo learning task, we demonstrate better performance in retaining mutual information and also verify competitive learning performance compared to other baseline methods.
Author Information
Mohammad Hossein Bateni (Google Research)
Lin Chen (Yale University)
Hossein Esfandiari (Google Research)
Thomas Fu (Google)
Vahab Mirrokni (Google Research)
Afshin Rostamizadeh (Google)
Related Events (a corresponding poster, oral, or spotlight)
-
2019 Poster: Categorical Feature Compression via Submodular Optimization »
Fri. Jun 14th 01:30 -- 04:00 AM Room Pacific Ballroom #142
More from the Same Authors
-
2021 : Label differential privacy via clustering »
Hossein Esfandiari · Vahab Mirrokni · Umar Syed · Sergei Vassilvitskii -
2022 Poster: Tight and Robust Private Mean Estimation with Few Users »
Shyam Narayanan · Vahab Mirrokni · Hossein Esfandiari -
2022 Oral: Tight and Robust Private Mean Estimation with Few Users »
Shyam Narayanan · Vahab Mirrokni · Hossein Esfandiari -
2022 Poster: Massively Parallel $k$-Means Clustering for Perturbation Resilient Instances »
Vincent Cohen-Addad · Vahab Mirrokni · Peilin Zhong -
2022 Spotlight: Massively Parallel $k$-Means Clustering for Perturbation Resilient Instances »
Vincent Cohen-Addad · Vahab Mirrokni · Peilin Zhong -
2022 : Closing Remarks »
Vahab Mirrokni -
2022 : Private Algorithms Q/A »
Peilin Zhong · Alessandro Epasto · Vahab Mirrokni -
2022 : Graph Mining Q/A »
Vahab Mirrokni -
2022 : New Challenges in Graph Mining: Scalability, Stability, and Privacy Applications »
Vahab Mirrokni -
2022 Expo Talk Panel: Challenges Of Applying Graph Neural Networks »
Bryan Perozzi · Vahab Mirrokni -
2022 : Graph Mining at Google »
Vahab Mirrokni -
2021 Poster: Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time »
Laxman Dhulipala · David Eisenstat · Jakub Łącki · Vahab Mirrokni · Jessica Shi -
2021 Spotlight: Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time »
Laxman Dhulipala · David Eisenstat · Jakub Łącki · Vahab Mirrokni · Jessica Shi -
2021 Poster: Active Covering »
Heinrich Jiang · Afshin Rostamizadeh -
2021 Spotlight: Active Covering »
Heinrich Jiang · Afshin Rostamizadeh -
2021 Poster: Regularized Online Allocation Problems: Fairness and Beyond »
Santiago Balseiro · Haihao Lu · Vahab Mirrokni -
2021 Spotlight: Regularized Online Allocation Problems: Fairness and Beyond »
Santiago Balseiro · Haihao Lu · Vahab Mirrokni -
2021 Poster: Revenue-Incentive Tradeoffs in Dynamic Reserve Pricing »
Yuan Deng · Sébastien Lahaie · Vahab Mirrokni · Song Zuo -
2021 Spotlight: Revenue-Incentive Tradeoffs in Dynamic Reserve Pricing »
Yuan Deng · Sébastien Lahaie · Vahab Mirrokni · Song Zuo -
2020 Poster: Robust Pricing in Dynamic Mechanism Design »
Yuan Deng · Sébastien Lahaie · Vahab Mirrokni -
2020 Poster: Dual Mirror Descent for Online Allocation Problems »
Santiago Balseiro · Haihao Lu · Vahab Mirrokni -
2020 Poster: Bandits with Adversarial Scaling »
Thodoris Lykouris · Vahab Mirrokni · Renato Leme -
2019 Poster: Non-monotone Submodular Maximization with Nearly Optimal Adaptivity and Query Complexity »
Matthew Fahrbach · Vahab Mirrokni · Morteza Zadimoghaddam -
2019 Oral: Non-monotone Submodular Maximization with Nearly Optimal Adaptivity and Query Complexity »
Matthew Fahrbach · Vahab Mirrokni · Morteza Zadimoghaddam -
2019 Poster: Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling »
Shanshan Wu · Alexandros Dimakis · Sujay Sanghavi · Felix Xinnan Yu · Daniel Holtmann-Rice · Dmitry Storcheus · Afshin Rostamizadeh · Sanjiv Kumar -
2019 Poster: Distributed Weighted Matching via Randomized Composable Coresets »
Sepehr Assadi · Mohammad Hossein Bateni · Vahab Mirrokni -
2019 Oral: Distributed Weighted Matching via Randomized Composable Coresets »
Sepehr Assadi · Mohammad Hossein Bateni · Vahab Mirrokni -
2019 Oral: Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling »
Shanshan Wu · Alexandros Dimakis · Sujay Sanghavi · Felix Xinnan Yu · Daniel Holtmann-Rice · Dmitry Storcheus · Afshin Rostamizadeh · Sanjiv Kumar -
2018 Poster: Parallel and Streaming Algorithms for K-Core Decomposition »
Hossein Esfandiari · Silvio Lattanzi · Vahab Mirrokni -
2018 Poster: Accelerating Greedy Coordinate Descent Methods »
Haihao Lu · Robert Freund · Vahab Mirrokni -
2018 Poster: Approximate Leave-One-Out for Fast Parameter Tuning in High Dimensions »
Shuaiwen Wang · Wenda Zhou · Haihao Lu · Arian Maleki · Vahab Mirrokni -
2018 Oral: Approximate Leave-One-Out for Fast Parameter Tuning in High Dimensions »
Shuaiwen Wang · Wenda Zhou · Haihao Lu · Arian Maleki · Vahab Mirrokni -
2018 Oral: Accelerating Greedy Coordinate Descent Methods »
Haihao Lu · Robert Freund · Vahab Mirrokni -
2018 Oral: Parallel and Streaming Algorithms for K-Core Decomposition »
Hossein Esfandiari · Silvio Lattanzi · Vahab Mirrokni -
2018 Poster: Proportional Allocation: Simple, Distributed, and Diverse Matching with High Entropy »
Shipra Agarwal · Morteza Zadimoghaddam · Vahab Mirrokni -
2018 Oral: Proportional Allocation: Simple, Distributed, and Diverse Matching with High Entropy »
Shipra Agarwal · Morteza Zadimoghaddam · Vahab Mirrokni -
2017 Poster: Tight Bounds for Approximate Carathéodory and Beyond »
Vahab Mirrokni · Renato Leme · Adrian Vladu · Sam Wong -
2017 Talk: Tight Bounds for Approximate Carathéodory and Beyond »
Vahab Mirrokni · Renato Leme · Adrian Vladu · Sam Wong