We thank the reviewers and the AC for their time and effort. All additional experiments will be added to the paper.

To R1

Q1: Results on more datasets.
-- Additional results are reported on SVHN. MBA has a test error of 1.8%. [ReLU], [Maxout], [NIN], [DropConnect], and [DSN] have errors of 2.55%, 2.47%, 2.35%, 1.94%, and 1.92%, respectively.

Q2: Table 1 is not fair; MBA has many more parameters than APL and the vanilla conv net. Comparisons that make APL deeper or wider are needed.
-- Table 1 is meant to verify how MBA and APL boost the vanilla net. The comparison can be made in three ways. (1) Let MBA and APL have the same number of bias terms; this is our original result in Table 1 (denoted [ ] below). (2) Let MBA, APL, and the vanilla net have the same number of feature maps in each layer (denoted @4x). (3) Increase the number of feature maps of APL (making it wider) so that its parameter size is similar to MBA (denoted @similar). Additional experiments (2) and (3) are those suggested by R1.

Model         Params   MNIST   CIFAR-10
[ Vanilla ]    93k     2.10    34.27
Vanilla@4x    594k     0.95    22.75
[ APL ]       120k     1.08    28.72
APL@4x        620k     1.15    23.80
APL@similar   358k     1.17    31.53
[ MBA ]       369k     0.83    22.39

MBA outperforms APL with a similar parameter size (@similar) and even with less computation (@4x), which is also stated in lines 242, 533, and 542 and shown in Figure 2.

Q3: The vanilla conv net (error = 2.1%) in Table 1 performs much worse than I expect.
-- It was obtained with a learning rate of 0.1, the same as MBA. After carefully tuning the learning rate, a 1.4% error can be achieved, which is still much worse than MBA (0.8%). We will clarify this in the final version.

Q4: Clarify whether the results in Table 2 are copied from papers. The methods use different settings, so the comparison is unfair; it is hard to conclude that the good performance comes from MBA.
-- Comparisons with various baselines under the same experimental settings are reported in Tables 1, 3, and 4. Table 2 serves a different purpose: its results were copied from published papers (we will make this clear), since the goal is to compare with the state of the art, and we assume those works carefully tuned their networks on validation sets to report their best results. Some methods (NIN and DSN) use different structures, and their techniques are orthogonal to ours, so a 'controlled' comparison with them may not be meaningful; if their hyper-parameters were forced to match MBA's, the suboptimal settings would lead to worse results. ReLU and APL are the most relevant works and have been fairly compared in Q2.

Q5: Hyper-parameter selection, early stopping, validation set.
-- Hyper-parameters and early stopping were selected on the validation set. We follow the standard procedure of Maxout (Goodfellow et al., 2013), which will be clarified.

Q6: MBA learns for every feature map, and the disadvantage is that more conv kernels are introduced for subsequent layers.
-- This statement may contradict the reviewer's own comment "MBA is well motivated… A standard CNN … requires more conv kernels. MBA is a cheap way." The key of MBA is to save unnecessary parameters by understanding the working principle of the network and by introducing a better design of local structures. Saving a large portion of redundant filters while keeping the expressive power is a significant contribution; this is similar to why CNNs are better than fully connected nets.

To R3

Q1: Compare thoroughly the computational cost of MBA vs. APL. Having k=4 increases the number of conv kernels by 4x for the layer above.
-- The saving in computational cost arises when MBA and APL produce the same number of output feature maps (which are also the input maps of the next layer): MBA then requires only 1/4 of the filters and thus 1/4 of the convolutions.
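To make the filter-count argument concrete, below is a minimal back-of-the-envelope sketch in Python (not from the paper; the kernel size, channel counts C_in and C, and the conv_weights helper are hypothetical, chosen only to reproduce the roughly 1/4 ratio for k = 4).

# Back-of-the-envelope count illustrating the 1/4 saving discussed above.
# Assumptions (hypothetical, for illustration only): 3x3 kernels, C_in = 96
# input maps, C = 32 conv kernels, and MBA decoupling each conv map into
# k = 4 maps by adding k scalar biases before the nonlinearity.

def conv_weights(c_in, c_out, ksize=3):
    # weights of a standard conv layer (biases counted separately)
    return c_in * c_out * ksize * ksize

C_in, C, k = 96, 32, 4

# A standard net that hands k*C feature maps to the next layer needs k*C kernels.
standard = conv_weights(C_in, k * C) + k * C      # weights + one bias per map
# MBA needs only C kernels; the k*C maps come from k biases per conv map.
mba = conv_weights(C_in, C) + k * C               # weights + k biases per map

print(standard, mba, round(mba / standard, 3))    # -> 110720 27776 0.251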
Q2: Is the MNIST result the best using MBA?
-- No. In Table 1, we did not carefully tune hyper-parameters on MNIST, since we aim to show that MBA is effective rather than to push the state of the art.

Q3: The fractional max-pooling paper got slightly better results.
-- It used a much deeper CNN, model averaging, and various data augmentation tricks. We did not use these tricks in order to single out the effectiveness of MBA more clearly. For a relatively fair comparison, we achieve a 26.14% error rate (Table 2), whereas they report 26.39% on CIFAR-100 without data augmentation.

To R4

Q1: Better results from the link.
-- Please refer to Q3 of R3.

Q2: Mouths of different people will also have different biases.
-- True. How many biases are needed depends on the application. If the goal is to distinguish mouths from eyes, two biases could be enough: the model automatically finds the separation boundary and does not need to learn more distinct bias values to separate different people. Even when given more biases, they converge to similar values after learning. This does not cause much overfitting, since the number of MBA parameters is small. If the goal is face recognition (the case the reviewer considered), more biases are needed, and they help distinguish different people.
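To make the separation argument above concrete, here is a toy numerical sketch (not from the paper; all magnitudes and bias values are hypothetical) showing how two biases applied to the same response map separate a low-magnitude pattern from a high-magnitude one, in the spirit of the mouth/eye example.

import numpy as np

# Toy illustration of the answer above: one shared filter response carries a
# weak pattern (e.g. mouth-like) and a strong pattern (e.g. eye-like); two
# biases before the ReLU are enough to separate them. All values hypothetical.
rng = np.random.default_rng(0)
weak   = rng.uniform(0.2, 0.5, size=10)   # low-magnitude responses
strong = rng.uniform(1.0, 1.5, size=10)   # high-magnitude responses
x = np.concatenate([weak, strong])        # one flattened feature map

relu = lambda v: np.maximum(v, 0.0)
b_low, b_high = 0.0, -0.8                 # two (hypothetical) learned biases

map_low  = relu(x + b_low)                # keeps both patterns
map_high = relu(x + b_high)               # keeps only the strong pattern

print((map_high[:10] == 0).all(), (map_high[10:] > 0).all())   # True True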