Paper ID: 135
Title: Asymmetric Multi-task Learning based on Task Relatedness and Confidence

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a method for multi-task learning, i.e. learning different tasks simultaneously, where the amount of transfer from task A to task B can be larger than the other way around. The authors refer to this setting as asymmetric multi-task learning. To achieve this, the authors focus on linear models (linear regression and logistic regression), which are learned in a regularization framework. The main idea of this framework is that the weights of one task are encouraged to be a sparse linear combination of the other tasks' weights, and the amount of transfer between tasks (given by the corresponding rows/columns of the weight matrix) can be asymmetric. The resulting formulation (given in Equation 1) yields a biconvex optimization problem, and the paper proposes two different approaches: (a) an alternating optimization framework, where one set of parameters is optimized at a time; and (b) a curriculum-learning approach, where the tasks are learned in an ordered fashion. The main theoretical analysis of the proposed objective function is given in Theorem 1, which states that the task with the smaller loss will have larger amounts of transfer.

Clarity - Justification:
The paper is well written, clearly specifies the claimed contributions, and has a very concise and well-covered related work section.

Significance - Justification:
This paper addresses an important problem in MTL, which has so far been under-explored: minimizing negative transfer across tasks.
The framework seems technically sound in terms of what it aims to achieve, although weighting the loss function in the way designed by Equation (1) can yield undesirable results, such as a loss in performance for some tasks, at least when compared to other multi-task learning settings. This is, in fact, reflected in the experiments, where other benchmarks such as GO-MTL can perform better on some tasks. The authors claim that their main objective is minimizing negative transfer, but this can come at the expense of not achieving the best performance for some of the tasks.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
(a) Overall, unfortunately, although the experiments emphasize the point of minimizing negative transfer, they are not very convincing when the method is evaluated as a multi-task learning approach. This is seen, for example, in Figure 1, where GO-MTL can be significantly better than the proposed approach. Also, in Table 1, the reported improvements do not seem to be of either statistical or practical significance.
(b) The paper claims to address multi-task learning problems, but the real datasets focus mainly on multi-class classification, and only the School dataset is well recognized in the previous multi-task learning literature.
(c) The analysis of the learned graphs in Figure 3 does not seem very relevant, as the optimization process can yield different solutions if run multiple times (it is a non-convex problem in general).
(d) It is unclear why the authors claim their approach is based on 'confidence', as probabilities are not directly used to drive their MTL approach.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
In this paper the authors propose a multi-task method that allows for asymmetric transfer, i.e. the amount of transfer from task a to task b can be different from that from b to a, and thus reduces the effects of negative transfer.

Clarity - Justification:
I find the clarity of the presentation and, in particular, of the experimental evaluation rather low. First, the motivation for the method is to encourage transfer from more confident tasks to less confident ones. However, the optimisation problem (1) encourages transfer from tasks with low training error, which is not per se an indicator of high confidence of the prediction. Therefore, while I do find the idea of multiplying the regulariser with the loss quite interesting, I would suggest not using the word "confidence" where what is actually meant is "training error". Second, the authors state (in line 127) that they provide a theoretical justification of the proposed AMTL method. However, Theorem 1 does not provide such a justification, but rather states the properties of the optimisation (1) as a biconvex problem. In addition, the level of detail in the description of the experimental evaluation limits the reproducibility of the results. In particular, the real datasets used in the experiments are multi-class rather than multi-task (apart from the School dataset), yet there is no description of what the tasks are and how the test/train data is split between the tasks. I would assume that the classification tasks are binary (a one-versus-all reduction of the multi-class problem), as this is the only case discussed in the paper (lines 514-518); but then it is not clear how the test error can be above 50% (AwA experiment). Lastly, it is not clear how exactly the task weights are specified in the last experiment with imbalanced tasks, and why only a subset of the baselines is evaluated there. In addition, the authors mostly highlight the asymmetry of the relatedness matrix B as an advantage of their method.
However, in the experimental evaluation, the performance of the AMTL-noLoss method, which (if I understand correctly) is a modification of AMTL with the regularisation constant mu fixed to 0, is not better than that of the symmetric method SMTL. Therefore I would conclude that the asymmetry of the matrix B alone is not enough, and that the multiplication of the sparsity regulariser with the loss is more important. This is indeed an interesting idea, and I would suggest that the authors highlight its importance more explicitly.

Significance - Justification:
The problem of negative transfer, addressed in the paper, is important and, to the best of my knowledge, the proposed objective function is new.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I find the addressed problem important for multi-task learning and the proposed solution interesting. However, I am concerned about the quality of the presentation and the empirical evaluation.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a method to integrate asymmetric transfer into multi-task learning. The basic idea is that not all tasks are equally related, and in each pair of tasks there can be a preferred direction for information sharing. The proposed algorithm optimizes both over the task models and over the asymmetric matrix expressing their relations. The problem is solved either by iterative alternating optimization or by searching for the best task order with a curriculum learning strategy.

Clarity - Justification:
The paper is clear and well written. The theoretical part is accurately explained, and the experiments contain enough details to reproduce the results.

Significance - Justification:
The idea of asymmetric transfer in MTL is valuable, and the paper introduces some novelty with respect to the previous approaches of Kumar et al. 2012 and Kang et al. 2011.
The proposed approach is theoretically sound. Overall, the experiments show an advantage with respect to the baselines: in some cases it appears to be quite small, but it is consistent.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Although I appreciate the idea of integrating MTL with asymmetric transfer, I wonder whether this setting provides a real advantage over treating each task separately and allowing it to transfer from all the others. In more practical terms, together with the STL baseline it would be interesting to check an STTL (single-task transfer learning) baseline, where each task is solved separately with a transfer learning algorithm (e.g. using Tommasi et al. 2010), without curriculum learning.
- Minor: clarify \rho in Theorem 1.

=====
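[Reviewer note] The method the reviews describe can be made concrete with a small sketch. The objective below is my own illustrative reconstruction from the reviews, not the paper's exact Equation (1): each task's weight vector is pushed toward a nonnegative combination of the other tasks' weights via an asymmetric matrix B with zero diagonal, and the sparsity regulariser on the outgoing transfer of a task is multiplied by that task's training loss (so low-loss tasks transfer out more cheaply). The squared-error loss, step sizes, and variable names are my assumptions.

```python
import numpy as np

def amtl_sketch(X_list, y_list, lam=1.0, mu=0.1, n_iter=50, lr=0.01):
    """Illustrative alternating optimization for an AMTL-style objective:

        sum_t L_t(w_t) * (1 + mu * sum_s B[t, s]) + lam * ||W - W B||_F^2

    with B >= 0 and zero diagonal, so transfer is asymmetric and outgoing
    transfer from high-loss tasks is penalized. A sketch, not the paper's
    exact formulation.
    """
    T = len(X_list)
    d = X_list[0].shape[1]
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(d, T))   # column t: weights of task t
    B = np.zeros((T, T))                      # asymmetric transfer matrix

    def task_loss(t):
        r = X_list[t] @ W[:, t] - y_list[t]   # squared-error training loss
        return 0.5 * np.mean(r ** 2)

    for _ in range(n_iter):
        # --- W-step: gradient descent on each task's weights (B fixed) ---
        R = W - W @ B                          # reconstruction residual
        for t in range(T):
            scale = 1.0 + mu * B[t].sum()      # loss weighted by outgoing transfer
            n_t = len(y_list[t])
            g_loss = X_list[t].T @ (X_list[t] @ W[:, t] - y_list[t]) / n_t
            # gradient of lam * ||W - W B||_F^2 w.r.t. column t of W
            g_rec = 2 * lam * (R[:, t] - R @ B[t])
            W[:, t] -= lr * (scale * g_loss + g_rec)

        # --- B-step: projected gradient (W fixed), keep B >= 0, diag(B) = 0 ---
        losses = np.array([task_loss(t) for t in range(T)])
        R = W - W @ B
        grad_B = -2 * lam * (W.T @ R) + mu * losses[:, None]
        B = np.maximum(B - lr * grad_B, 0.0)   # nonnegativity projection
        np.fill_diagonal(B, 0.0)               # no self-transfer

    return W, B
```

This makes Review #2's point visible: setting mu = 0 (the AMTL-noLoss variant) removes the loss weighting entirely, leaving only the asymmetric reconstruction term, which is why the loss multiplication, rather than asymmetry alone, appears to carry the benefit.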