We thank the reviewers for their efforts and constructive feedback. All reviewers find the idea and method novel and interesting. We address general comments as well as individual comments from each reviewer below.

General comments

Use of the term "task confidence" is confusing as probabilities are not used. Better to replace the term with "training error".
- We agree with the reviewers and will replace the term with "task loss" (or "error").

The real datasets focus mainly on multi-class classification.
- We mainly followed the experimental settings of well-cited prior multi-task learning papers, such as [Kang et al. 2011] and [Kumar et al. 2012].

R1: Theorem 1 does not provide a theoretical justification of AMTL.
- We meant a justification of the form of the method: by the bi-convexity of the problem, it is guaranteed that a task with a smaller loss will make larger outgoing transfers, as stated in Theorem 1.

The authors seem to highlight the advantage of the method as the asymmetry of the relatedness matrix B.
- We tried to emphasize the importance of considering both task relatedness and task loss when learning the asymmetric transfer graph, by making this explicit in the title. We further support this point by showing that the variants with symmetric task relations, or without task loss, are both less effective than our full model (Table 1).

The level of description of the experimental evaluation limits the reproducibility of the results.
- We report classification errors as multi-class classification error, following the conventions of [Kang et al. 2011] and [Kumar et al. 2012]. This is a standard reporting procedure and is more reasonable in the multi-class setting, since the one-vs-all prediction error for each class is meaningless when most of the instances are negative. We will clarify this point.
- For the imbalanced-set experiments in Table 2, the task weight c_t was chosen from 1/sqrt(n_t), 1/n_t, and 1/(n_t^2) using cross-validation (see L341 and L847).
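As an illustration of this selection procedure, a minimal sketch is given below. The helper `train_and_validate` is a hypothetical stand-in for one cross-validation run of the model under a given task weight; only the three candidate weightings come from the paper.

```python
import math

def select_task_weight(n_t, train_and_validate):
    """Pick the task-weight rule c_t from the three candidate weightings
    by cross-validation error.

    n_t: number of training instances for task t.
    train_and_validate: callable mapping a candidate weight c_t to a
    validation error (hypothetical stand-in for fitting the model once).
    """
    candidates = {
        "1/sqrt(n_t)": 1.0 / math.sqrt(n_t),
        "1/n_t": 1.0 / n_t,
        "1/n_t^2": 1.0 / (n_t ** 2),
    }
    # Evaluate each candidate and keep the one with the lowest error.
    errors = {name: train_and_validate(c) for name, c in candidates.items()}
    best = min(errors, key=errors.get)
    return best, candidates[best]
```

The weighting is chosen per experiment, not per task, so in practice the same rule would be applied to all tasks once selected.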
- We did not include other baselines for this experiment simply because we consider GO-MTL to be our main competitor throughout the entire experimental section. As shown below, AMTL-imbal significantly outperforms all baselines on both imbalanced datasets.

                   Synth-Imbal   Room
STL                54.05±0.82    45.85±1.36
MTFL               55.44±0.94    47.95±1.20
Curriculum-simple  54.05±0.82    45.25±1.26
SMTL               54.09±0.85    46.00±1.16
AMTL-noLoss        53.93±0.76    48.20±1.33
GO-MTL             55.18±0.85    47.05±1.35
AMTL-imbal         53.09±0.76    40.00±1.71

- Regarding the training/test splits, this is a misunderstanding, since we describe them in L577-578 for the synthetic dataset and in L626-654 for all four real datasets.
- We will release the code and the synthetic dataset if the paper is accepted, for even better reproducibility.

R3: Clarify \rho in Theorem 1.
- We apologize for the confusion. \rho is a typo and should be removed.

Does AMTL provide a real advantage over the model that treats each task separately and allows each model to transfer from all others?
- This is indeed an insightful suggestion, and we implemented the STTL baseline as suggested, by first training STL models and then performing a single transfer step using our relatedness+loss-based transfer method. We report the STTL performance below.

STTL performance
Synthetic    20.33±0.35
MNIST        13.80±0.55
USPS         12.08±0.62
School       10.33±0.06
AWA          58.08±0.46
Synth-imbal  53.83±0.81
Room         41.15±1.88

STTL outperforms STL without transfer, but still significantly underperforms AMTL on all datasets, which clearly shows the advantage of jointly learning the predictors in a multi-task learning framework.

R4: GO-MTL can be significantly better (Figure 1).
- Please note that Fig. 1 (and Fig. 2) compares the two methods on *each* task. While GO-MTL achieves greater improvement on *some* tasks, it suffers from negative transfer. In terms of overall performance, it is worse (19.18 RMSE) than AMTL (18.80).
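The two-stage STTL baseline described under R3 above can be sketched as follows. This is an illustrative sketch only: it uses plain least-squares single-task models and an inverse-loss transfer weighting as stand-ins, not the exact relatedness+loss formulation of the paper.

```python
import numpy as np

def sttl(tasks, alpha=0.5):
    """Two-stage baseline sketch: (1) fit each task separately by least
    squares; (2) refine each task's weights once as a convex combination
    of its own solution and a loss-weighted average of the other tasks'
    solutions, so that low-loss tasks contribute more.

    tasks: list of (X, y) pairs sharing the same feature dimension.
    alpha: strength of the single transfer step (illustrative assumption).
    """
    # Stage 1: single-task least-squares solutions and training losses.
    W, losses = [], []
    for X, y in tasks:
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        W.append(w)
        losses.append(np.mean((X @ w - y) ** 2))
    W, losses = np.array(W), np.array(losses)

    # Stage 2: one transfer step from all other tasks, weighted by
    # inverse training loss (lower loss => larger outgoing transfer).
    refined = []
    for t in range(len(tasks)):
        others = [s for s in range(len(tasks)) if s != t]
        b = 1.0 / (losses[others] + 1e-8)
        b = b / b.sum()
        refined.append((1 - alpha) * W[t] + alpha * (b @ W[others]))
    return np.array(refined)
```

Because the transfer happens only once, after independent training, this baseline cannot propagate improvements between tasks the way joint training does, which matches the reported gap between STTL and AMTL.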
AMTL's better performance does not seem statistically significant.
- AMTL's improvements over the existing MTL methods are statistically significant on all datasets except the School dataset. The reported deviations are 95% confidence intervals, and are thus 1.96 times larger than the standard errors that are often reported. Further, AMTL's improvement over STL is even larger, and thus AMTL could be highly practical in situations where losing performance on any of the tasks is undesirable.

The graphs in Figure 3 do not seem to be very relevant.
- Although obtaining exactly the same plots at every run is not guaranteed due to non-convexity, we consistently obtain graphs in which transfers are made from tasks with lower training errors to those with higher errors, as confirmed in Theorem 1. Figure 3 is one example of such cases.
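For concreteness, the 95%-interval convention mentioned above is just the normal-approximation arithmetic; a minimal sketch (the sample values in the comment are only for illustration):

```python
import math

def ci95_halfwidth(std_dev, n):
    """95% confidence half-width of a sample mean under the normal
    approximation: 1.96 * standard_error, where SE = std / sqrt(n)."""
    se = std_dev / math.sqrt(n)
    return 1.96 * se

# Conversely, a reported deviation of 0.82 corresponds to a plain
# standard error of 0.82 / 1.96, i.e. roughly 0.418.
```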