We thank the reviewers for their detailed feedback. In response to specific comments:

Reviewer_2: In general, we can use any flexible nonlinear function class for the message mapping operator. We use neural networks in the paper because they are universal function approximators and can be trained easily on large datasets. The formulation in Eq (12) takes the neighborhood embeddings and node features into account, which follows from the embedded version of mean field inference. We'll include a more detailed explanation of the neural network parameterization. Song et al. 2009 & 2011 also embed the messages into a feature space, but *fix* the nonlinear message mapping operator (called the conditional embedding operator) beforehand. In contrast, we *learn* the nonlinear message mapping operator from supervised information. Furthermore, our goal is to measure the similarity between structured data points, which is different from the graphical model inference problem in their work. It would be interesting future work to adapt kernel BP to our classification and regression tasks.

Reviewer_4: We clarify the misunderstandings below. Our work differs from Chen et al. in several key aspects:

1. The learning tasks are set up differently. In Chen et al., the graphical model is defined on the *structured output* values, and the inputs are fixed-dimensional data (e.g., images). In contrast, we deal with prediction problems with a single scalar output, and the graphical model is built on the *structured input* data. Furthermore, we allow different data points to have different graphical models.

2. Chen et al. parameterize the potentials of the graphical model via neural networks and learn the *original* graphical model with this parameterization using message passing. In contrast, our key idea is to target the ultimate task directly and *avoid* learning the original graphical model. We only need to learn the message mapping operator induced by the conditional independence structure and the chosen inference algorithm. This is closer to Song et al. 2009 & 2011, which also embed messages into feature spaces and use nonlinear message mapping operators, but Song et al. *fix* the nonlinear message mapping operator while we *learn* it with supervision signals.

3. Chen et al. consider a feedforward neural network parameterization of the graphical model potentials. We consider a nonlinear, neural network parameterization of the message passing steps, which are carried out iteratively. Thus, our model and learning algorithm are more similar to those of recurrent neural networks.

The graphical model embedding idea is crucial for avoiding the learning of the original graphical model. As we emphasized, we are interested in the ultimate discriminative tasks rather than the original graphical models. With the embedded form of graphical models, we intentionally bypass learning the original graphical model, which might yield parameters that are sub-optimal for discriminative performance, and directly target the ultimate goal with a flexible parameterization. The embedding procedure also dictates the computation: the structure of a data point determines the conditional independence structure of the graphical model, and for the same graphical model, different inference algorithms use different parts of the structured input and of the messages from the previous iteration, as illustrated in the sketch below.
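As a minimal sketch of this point (not the exact parameterization of Eq (12); W1, W2, the nonlinearity sigma, and the adjacency-list representation adj are illustrative placeholders), the embedded mean field update for a node aggregates the previous embedded marginals of all its neighbors, whereas the embedded loopy BP message from node i to node j excludes the message coming back from j:

```python
import numpy as np

# Hypothetical sketch: the same graph, two inference schemes, two different
# message mapping operators. W1, W2 and sigma stand in for the learned
# neural network parameterization.
def sigma(z):
    return np.maximum(z, 0.0)  # any flexible nonlinearity

def mean_field_step(x, mu, adj, W1, W2):
    # Node i uses its own feature x[i] and the embedded marginals mu[j]
    # of all its neighbors j from the previous iteration.
    d = W2.shape[1]
    new_mu = np.zeros_like(mu)
    for i, neighbors in enumerate(adj):
        agg = sum((mu[j] for j in neighbors), np.zeros(d))
        new_mu[i] = sigma(W1 @ x[i] + W2 @ agg)
    return new_mu

def loopy_bp_step(x, nu, adj, W1, W2):
    # The embedded message i -> j instead aggregates the incoming messages
    # k -> i from all neighbors k of i *except* j.
    d = W2.shape[1]
    new_nu = {}
    for i, neighbors in enumerate(adj):
        for j in neighbors:
            agg = sum((nu[(k, i)] for k in neighbors if k != j), np.zeros(d))
            new_nu[(i, j)] = sigma(W1 @ x[i] + W2 @ agg)
    return new_nu
```

Iterating either update a few times and aggregating the resulting node embeddings yields the representation used for the discriminative task.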
In turn, these inference algorithms allow us to work out the exact dependence of the message mapping operators on parts of the input and on the embedded messages from previous iterations; changing the inference algorithm therefore leads to a different dependence structure. We explained only two inference algorithms in the paper, which already exhibit this difference. Following the same recipe, many other, more advanced inference algorithms, such as generalized BP, can also be used and lead to new algorithms. We will include the reference and the discussion above in the paper.

We have added the Fisher kernel experiment using the Shogun implementation. It achieves AUC = 0.86623 +/- 0.0879 on the SCOP dataset, which is inferior to our algorithm. We'll compare more thoroughly in the final version.

We did not just take a small subsample of the data for training. We apply weighted (importance) sampling to the training data (90% of the 2.3M samples) to make the output values more balanced. That is, we learn the model parameters using mini-batch stochastic gradients, and data points with rare output values have a higher chance of being sampled than under uniform sampling (see the sketch at the end of this response).

Reviewer_5: Thanks for the overall positive assessment of our work. In Section 4 we derive the function mappings of the embedded marginals from the two approximate inference algorithms, and we use neural networks to represent the message mapping operators. We'll explain this section in more detail and also address the word division problem.
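For concreteness, a minimal sketch of the weighted mini-batch sampling mentioned above; the binning scheme, number of bins, and batch size are illustrative assumptions, not the exact settings we used:

```python
import numpy as np

def sampling_weights(y, n_bins=50):
    # Inverse-frequency weights over binned output values: data points in
    # rare bins receive a larger sampling probability than under uniform
    # sampling. The number of bins is an illustrative choice.
    bins = np.digitize(y, np.linspace(y.min(), y.max(), n_bins))
    counts = np.bincount(bins)
    w = 1.0 / counts[bins]
    return w / w.sum()

def sample_minibatch(rng, y, batch_size=128):
    # Draw a mini-batch with probability proportional to the weights; a
    # stochastic gradient step is then taken on the selected indices.
    p = sampling_weights(y)
    return rng.choice(len(y), size=batch_size, replace=True, p=p)

# e.g., idx = sample_minibatch(np.random.default_rng(0), y_train)
```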