We thank all the reviewers for their valuable comments. We would like to clarify a few points as follows.

@reviewer_1

Regarding the choice of datasets: We mainly focus on cases where the matrix A is explicitly given. It would be very interesting to experiment with datasets where A is computed from feature vectors, which we leave to future work.

Regarding the clarity of Section 3: Thanks for the suggestion. We agree that it would be clearer to describe the objective before the sampling algorithm, and will make the corresponding changes in our final version.

@reviewer_2

Regarding novelty (Questions 1 and 2): Our novel contributions are threefold:
1. Previous work on SSL largely depends on graph Laplacian regularization. We propose a novel approach that jointly trains classification and graph context prediction.
2. It is difficult to generalize graph embeddings to novel instances. Perozzi et al. addressed this issue by retraining the embeddings incrementally, which is time-consuming, does not scale, and is essentially not inductive. We instead propose a novel approach that conditions the embeddings on input features.
3. We empirically show substantial improvement over existing methods (up to 8.5 points and on average 4.1 points), and even larger improvement in the inductive setting (up to 18.7 points and on average 7.8 points).
We will clarify the contributions of the paper in the final version; a minimal sketch of the joint objective appears at the end of this response.

Regarding inductive vs. transductive (Question 3): There are two main reasons why we emphasize the difference:
1. Since learning graph embeddings is transductive in nature, it is not straightforward to apply it in an inductive setting. This is also one of the challenges we address in this work.
2. Applicability to the inductive setting is important in large-scale real-world tasks. For example, machine reading systems frequently encounter novel entities on the Web, and it is not practical to retrain an SSL system on the entire Web.

Regarding the clarity of Section 3 (Question 4): Thanks for the comment. We will clarify parts of Section 3, including describing the objective before presenting the sampling algorithm, and elaborating on the sampling algorithm itself.

Regarding NLP vs. learning conferences (Question 5): Although each of the distinct tasks we consider is NLP-related, we believe the learning framework is generally useful for the machine learning community, as shown by the experimental improvements obtained on the various benchmarks. We note that our formalization follows the widely adopted GSSL definition (Zhu et al., ICML'03; Zhou et al., NIPS'04).

@reviewer_3

Regarding the ablation study: Comparing Feat with Planetoid-I (row 1 vs. row 4 in Table 4) shows the relative contribution of embeddings. Comparing Planetoid-T with Planetoid-G (row 8 vs. row 9 in Table 4) shows the relative contribution of input features. Comparing Planetoid-G with GraphEmb (row 7 vs. row 8 in Table 4) shows the relative contribution of joint training. We will revise the final draft to highlight these results.

Regarding model robustness: We have experimented with several of the model hyperparameters, including different architectures and the hyperparameters of Algorithm 1, and found that our results are not very sensitive to these choices. We will provide a more systematic robustness analysis in our final submission.
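
To make contributions 1 and 2 above more concrete for reviewer_2, the following is a minimal illustrative sketch, not our actual implementation. Names such as `PlanetoidSketch`, `joint_loss`, and the weight `lam` are ours for illustration only, and the context pairs `(idx, ctx_idx, gamma)` are assumed to come from a sampler such as Algorithm 1; in practice the supervised and unsupervised losses would be computed on separate labeled/unlabeled batches.

```python
# Illustrative sketch (PyTorch) of two ideas: (1) jointly training
# classification and graph context prediction, rather than using the graph
# only as a Laplacian regularizer, and (2) conditioning embeddings on input
# features so the model is defined for novel (inductive) instances.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlanetoidSketch(nn.Module):
    def __init__(self, num_feats, num_classes, embed_dim, num_nodes,
                 inductive=True):
        super().__init__()
        self.inductive = inductive
        # Inductive variant: the embedding is a parametric function of the
        # input features, so it applies to instances unseen during training.
        self.feat_encoder = nn.Sequential(
            nn.Linear(num_feats, embed_dim), nn.ReLU())
        # Transductive variant: a free embedding per known node.
        self.node_embed = nn.Embedding(num_nodes, embed_dim)
        # Context embeddings used only by the unsupervised loss.
        self.context_embed = nn.Embedding(num_nodes, embed_dim)
        self.classifier = nn.Linear(num_feats + embed_dim, num_classes)

    def embed(self, x, idx):
        return self.feat_encoder(x) if self.inductive else self.node_embed(idx)

    def forward(self, x, idx):
        # Classify from both the raw features and the learned embedding.
        e = self.embed(x, idx)
        return self.classifier(torch.cat([x, e], dim=-1))

def joint_loss(model, x, idx, y, ctx_idx, gamma, lam=1.0):
    # Supervised classification loss on labeled instances.
    l_s = F.cross_entropy(model(x, idx), y)
    # Unsupervised context-prediction loss with negative sampling:
    # gamma is +1 for sampled (instance, context) pairs and -1 for
    # negative pairs, in the style of skip-gram training.
    e = model.embed(x, idx)
    c = model.context_embed(ctx_idx)
    score = (e * c).sum(dim=-1)
    l_u = -F.logsigmoid(gamma * score).mean()
    # Joint objective: both terms are optimized together, so the embedding
    # is shaped by the labels as well as by the graph context.
    return l_s + lam * l_u
```

The key point of the sketch is that both loss terms share the embedding `e`, and in the inductive variant `e = feat_encoder(x)` is well-defined even for nodes never seen during training, which is exactly what incremental retraining of free embeddings cannot provide.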