Learning and Data Selection in Big Datasets
Hossein Shokri Ghadikolaei · Hadi Ghauch · Carlo Fischione · Mikael Skoglund

Thu Jun 13th 04:20 -- 04:25 PM @ Room 102

Finding a dataset of minimal cardinality that characterizes the optimal parameters of a model is of paramount importance in machine learning and in distributed optimization over a network. This paper investigates the compressibility of large datasets. More specifically, we propose a framework that jointly learns the input-output mapping and the most representative samples of the dataset (the sufficient dataset). Our analytical results show that the cardinality of the sufficient dataset increases sub-linearly with the size of the original dataset. Numerical evaluations on real datasets reveal large compressibility, up to 95%, without a noticeable drop in learning performance as measured by the generalization error.

Author Information

Hossein Shokri Ghadikolaei (KTH Royal Institute of Technology)
Hadi Ghauch (Royal Institute of Technology, KTH)

I am a postdoc at the School of Electrical Engineering and Computer Science at KTH. I received my MS in Information Networking from Carnegie Mellon University, USA, in 2011 and my PhD in Electrical Engineering from KTH in 2016. My research includes optimization for learning, machine learning for resource allocation, millimeter-wave communication, and distributed optimization of wireless networks.

Carlo Fischione (Royal Inst. of Technology, KTH)
Mikael Skoglund (KTH Royal Institute of Technology)
