Paper ID: 866
Title: Differential Geometric Regularization for Supervised Learning of Classifiers

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a geometric regularization method for binary and multi-class classification. The main novelty lies in regularizing the classifier by minimizing the volume of the submanifold in the product space of input features and class probabilities. Experimental results on eight low-dimensional datasets show clear advantages of the proposed method over competing regularization methods in terms of classification accuracy. The results on high-dimensional data are less encouraging, perhaps because of the limitations of the features used in the experiments.

Clarity - Justification:
The presentation of Section 3.2 is not clear. I suggest detailing the calculation of P_G and \Delta P_G for the RBF kernel, since the calculations are in closed form in that case.

Significance - Justification:
The method of regularizing the classifier in the product space of input features and class probabilities appears new and interesting. Significant improvements in classification accuracy are obtained on eight low-dimensional datasets, compared to other regularization methods.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper presents a novel and interesting regularization method for classification with any finite number of classes. The experimental results on eight low-dimensional datasets clearly demonstrate its superior performance over other regularization methods.

Typos:
- Line 153: "In the follow" => "In the following"
- Equation (5): G^{-1} => G

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a geometric approach to estimating class probabilities. While some concepts of the paper are interesting and somewhat novel, the presentation is not very clear or well directed (see detailed comments).

Clarity - Justification:
The main direction of the paper is more or less well described and clear.

Significance - Justification:
As explained in the detailed comments, I am not very happy with the last 30% of the paper. While the main motivation is OK and interesting, some assumptions may not be very widespread (realistic), some related work is missing, and the evaluation is not only aimed in the wrong direction (classification accuracy should not be the primary focus), but the results are also not very striking.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Major comments:
- 'without any a priori assumptions on the geometry of the submanifold': I suspect you implicitly expect at least some smoothness of the manifold and sufficient data coverage.
- Related work is only weakly covered in Sec. 2.3 and not well aligned with the paper, which makes it hard to judge the novelty and relevance with respect to the field. I wonder how your approach relates to conformal prediction, an approach that also aims at valid probabilistic values for classification decisions (and is not necessarily limited to classification problems) by recalibrating the output probabilities of a given classifier.
- Could your approach also be extended to regression or ordinal classification problems?
- From the motivation and some of the argumentation in the paper, I would expect the paper to be linked to tangential-distance approaches (see the work of Haasdonk and others); some discussion and perhaps a comparison would be meaningful, e.g. by considering tangential distances in a probabilistic classifier.
- I think it is necessary to address the computational cost of the approach (considering the algorithm in Alg. 1).
- In the experiments, the SVM classifier is not a good choice because it is traditionally not a probabilistic method, and the various approaches to squeeze probabilistic outputs from the SVM are not very reliable (see e.g. the work of Vovk and Gammerman around conformal prediction). In my view it would have been better to choose an approach like the Probabilistic Classification Vector Machine (PCVM) by H. Chen, or similarly the Relevance Vector Machine by Tipping.
- The selected UCI datasets are all fairly simple and widely over-analyzed (you can find most of them in the experiments of H. Chen as well). To make the approach relevant, you should not only select more datasets (see e.g. the benchmark used in the PCVM) but also focus on different data characteristics (multi-class, high-dimensional, large-scale, multi-modal). You address this a bit in Sec. 4.2, but the Table 1 results are not very useful.
- You should perhaps also use simulated data such as a checkerboard or spiral to show how your approach behaves in simple cases.
- Table 1 should include standard deviations, aligned with a significance test.
- I also think it is a bit problematic that you focus only on classification errors; I thought the paper is about estimating class probabilities. For example, it would be interesting to see whether data clouds that overlap to some degree (e.g. for simulated data, where you have the ground truth under control) get better probability estimates compared to, say, the squeezed probabilistic outputs of the SVM or the probability estimates obtained from the PCVM. One way to compare this would be to check, e.g., the Kullback-Leibler divergence between the estimated probability distributions and the true class-conditional PDFs (a minimal sketch of such an evaluation is given after this review's comments).

Minor comments:
- Although widespread, I would not say 'one versus all' if you actually mean 'one vs. rest'.
- There are some grammatical mistakes in the paper (e.g. in the list 'In summary, our contributions are:'), and there are more, so please check the paper once again.
- 'to solve for it.' --> 'to solve it.'
- Personally, I would prefer N for the number of samples and e.g. D or d for the feature-space dimension of the data, perhaps with another variable to capture the intrinsic dimensionality of the data.
- Although addressed in the text, P_{\tau M} is given explicitly rather late, in Sec. 3.1; please add a reference already in Sec. 2 indicating where P_{\tau M} can be found.
- Table 2 should be moved to the beginning of the paper.
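As an illustration of the suggested comparison, here is a minimal sketch, assuming NumPy, SciPy, and scikit-learn are available. It simulates overlapping Gaussian classes (so the true Bayes posterior is known in closed form) and reports the average KL divergence between the true and estimated class probabilities; the Platt-scaled SVM is only a stand-in baseline, not the paper's method or the PCVM.

```python
# Illustrative sketch only: simulated overlapping Gaussians with a known Bayes
# posterior, compared against an estimator's probability outputs via KL divergence.
# The Platt-scaled SVM below is merely a stand-in baseline.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_class, d = 500, 2
mu0, mu1, cov = np.zeros(d), np.full(d, 1.5), np.eye(d)  # overlapping class clouds

X = np.vstack([rng.multivariate_normal(mu0, cov, n_per_class),
               rng.multivariate_normal(mu1, cov, n_per_class)])
y = np.r_[np.zeros(n_per_class), np.ones(n_per_class)].astype(int)

def true_posterior(X):
    """Bayes posterior P(y=1 | x) for equal priors and the known Gaussians."""
    p0 = multivariate_normal.pdf(X, mean=mu0, cov=cov)
    p1 = multivariate_normal.pdf(X, mean=mu1, cov=cov)
    return p1 / (p0 + p1)

# "Squeezed" SVM probabilities via Platt scaling (probability=True).
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

X_test = np.vstack([rng.multivariate_normal(mu0, cov, 1000),
                    rng.multivariate_normal(mu1, cov, 1000)])
q1 = true_posterior(X_test)
p_true = np.clip(np.c_[1.0 - q1, q1], 1e-12, 1.0)
p_hat = np.clip(clf.predict_proba(X_test), 1e-12, 1.0)

# Mean KL(true || estimated) over test points: lower means better-aligned estimates.
kl = np.mean(np.sum(p_true * np.log(p_true / p_hat), axis=1))
print(f"mean KL(true || estimated) = {kl:.4f}")
```

The same loop could be repeated for each probability estimator under comparison, with the KL values reported alongside classification error.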
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a novel approach for regularizing classification models using a geometric (volume-based) regularization of the prediction function, where the graph of the prediction function is viewed as a manifold. The approach is motivated via robustness arguments, where manifolds with smaller local volume are preferred. The authors explain why this differs from the more common Sobolev-norm-based regularization. The problem is solved using a geometric flow-based optimization and implemented for radial basis function networks.

Clarity - Justification:
Clear explanation and justification of the proposed approach, and a useful conceptual comparison to Sobolev-based approaches.

Significance - Justification:
I think the idea of geometric volume-based regularization is quite clever and has some potential to improve predictive modelling in many domains. A more detailed assessment of significance will require further empirical evaluation and/or theoretical analysis. The results of this paper are a promising start.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Issues:
** I am confused by the discussion of the simplex constraint in lines 521-542. While the gradient of the function "f" may not lie on the simplex, the geometric flow is implemented via the Jacobian with respect to the RBF function "h", which is a Euclidean matrix. Further, the RBF guarantees that any change to H corresponds to a point on the simplex. If my understanding is correct, why is the additional simplex projection required? (A brief sketch of the projection operation in question is appended after this review.)
** In my view, the weakest part of this paper is the insufficient real-world experiments. The experiments choose only one of several potential baselines and do not make it easy to fairly compare the benefits of the proposed approach. It would also be good to compare with the Laplacian- and Sobolev-based regularization approaches, some of which have publicly available code.

Minor writing issues:
** The symbol II is used in line 489 without prior definition.
** I suggest fixing the http links in the references to avoid overruns.

Suggestions:
** I suggest the authors extend the analysis to general function approximation, e.g. including regression, since (beyond the simplex representation in the final layer) the approach is not specific to classification. The authors might also consider other input manifolds beyond Euclidean space.
** I think "robustness" is a loaded term in statistics and machine learning, and the authors should avoid it unless they specifically address robustness issues, either via experiments or theoretical arguments.

=====
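For reference, a minimal sketch of the Euclidean projection onto the probability simplex, the kind of "additional simplex projection" step questioned in Review #3. It uses the standard sort-based construction, assumes only NumPy, and is not claimed to reproduce the authors' implementation.

```python
# Reference sketch: Euclidean projection onto the probability simplex,
# i.e. the extra projection step discussed in Review #3 (standard sort-based
# construction, not the authors' implementation).
import numpy as np

def project_to_simplex(v):
    """Return the point of {p : p >= 0, sum(p) = 1} closest to v in the L2 norm."""
    u = np.sort(v)[::-1]                              # coordinates in descending order
    css = np.cumsum(u)                                # running sums of the sorted values
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, v.size + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)            # shift enforcing the constraints
    return np.maximum(v - theta, 0.0)

# An unconstrained gradient update can leave the simplex; the projection restores it.
p = project_to_simplex(np.array([0.7, 0.5, -0.1]))
print(p, p.sum())  # non-negative entries summing to 1
```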