We are thankful to all reviewers for their careful and constructive comments.

1) On Reviewer_4's concern about the technical novelty of the relaxation: The contribution of this paper lies not only in the relaxation step (Eq 6 to Eq 7) but also in the discrete reformulation (Eq 6), which represents the domain of MSA (and MD) as the intersection of two atomic sets. To our knowledge, this is the first link between the literature on Multiple Sequence Alignment (MSA) and Motif Discovery (MD) and that on atomic-norm optimization.

2) On Reviewer_4's concern that our relaxation "Conv(A) \intersect Conv(P)" is a superset of the tighter relaxation "Conv(A \intersect P)": The problem obtained from the relaxation "W \in Conv(A \intersect P)" is as hard as the original NP-hard problem "W \in A \intersect P". In particular, when minimizing a linear objective there is always an optimal solution at a corner point of Conv(A \intersect P), and by perturbing the objective to induce a unique solution, one would recover the exact solution of the NP-hard problem from this relaxation; hence the tighter relaxation cannot be tractable unless P = NP. On the other hand, although we have not yet established an approximation guarantee for the rounding of our relaxation, strong empirical evidence is provided for its superior performance over local search and a greedy heuristic, which validates our contribution of the novel formulation and algorithm.

3) On Reviewer_4's questions about the convergence result (Theorem 2) of the Greedy Direction Method of Multiplier (GDMM): (I) The expectation in Theorem 2 comes from the randomness of step 1 in the main loop of Algorithm 1, where we randomize the order of solving sub-problems (14), (15) for ease of analysis. (II) Regarding the approximation quality on the original convex program (Eq 11), note that Theorem 2 gives the number of iterations required to achieve \Delta_p + \Delta_d \leq \eps, and since \Delta_d \leq \eps, the iterate Y^t is \eps-suboptimal for the dual problem. At the same time, it implies O(\sqrt{\eps}) precision in the primal infeasibility \|W_1 - W_2\| and the objective (D \cdot W_1), via the following reasoning. With the augmented term \frac{\rho}{2}\|W_1 - W_2\|^2, the dual objective has gradient \nabla d(Y) = W_1^* - W_2^*, where (W_1^*, W_2^*) is the minimizer of L(W_1, W_2, Y) given Y. Moreover, \nabla d(Y) is (1/\rho)-Lipschitz-continuous (see, for example, Lemma 2.2 of [1]). Therefore, d(Y') - d(Y) \geq (\nabla d(Y))^T (Y' - Y) - \frac{1}{2\rho}\|Y' - Y\|^2. Maximizing both sides w.r.t. Y' gives \Delta_d := d^* - d(Y) \geq \frac{\rho}{2}\|\nabla d(Y)\|^2, which implies \|W_1^* - W_2^*\|^2 \leq \frac{2}{\rho}\Delta_d \leq \frac{2\eps}{\rho}. Now, to bound \|W_1^t - W_2^t\|^2, let W = [W_1, W_2], AW = W_1 - W_2, and let f(W) be the objective that takes value infinity for infeasible W and value (D \cdot W_1) otherwise. We have \|AW^t\|^2 - \|AW^*\|^2 = 2(A^T A W^*)^T (W^t - W^*) + \|AW^t - AW^*\|^2 and f(W^t) - f(W^*) \geq g^* \cdot (W^t - W^*) for some sub-gradient g^* \in \partial f(W^*) satisfying g^* + A^T Y^t + \rho A^T A W^* = 0 (by optimality of W^* for L(\cdot, Y^t)). Combining the three conditions above (the Y^t terms cancel), we obtain \Delta_p := L(W^t, Y^t) - L(W^*, Y^t) \geq \frac{\rho}{2}\|A(W^t - W^*)\|^2. Therefore, \|(W_1^t - W_2^t) - (W_1^* - W_2^*)\|^2 \leq \frac{2\eps}{\rho}, and thus \|W_1^t - W_2^t\| \leq \|W_1^* - W_2^*\| + \|(W_1^t - W_2^t) - (W_1^* - W_2^*)\| \leq 2\sqrt{2\eps/\rho}, which verifies \|W_1^t - W_2^t\| = O(\sqrt{\eps}).
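To make the dual-smoothness step concrete, here is a minimal numerical sketch (ours, not from the paper): a hypothetical 1-D toy in which two intervals stand in for Conv(A) and Conv(P), d(Y) is evaluated by alternating exact minimization, and the printed values confirm \Delta_d \geq \frac{\rho}{2}\|\nabla d(Y)\|^2 at several multipliers. All constants and names (rho, c, dual) are illustrative assumptions.

```python
# Toy check (not the paper's setting) of Delta_d = d* - d(Y) >= (rho/2)*||grad d(Y)||^2,
# where grad d(Y) = W_1^* - W_2^* is the constraint residual at the minimizer of L.
import numpy as np

rho, c = 2.0, 1.5
lo1, hi1 = 0.0, 1.0   # interval standing in for Conv(A)
lo2, hi2 = 0.3, 0.8   # interval standing in for Conv(P)

def dual(y, iters=100):
    """d(y) = min_{w1, w2} c*w1 + y*(w1 - w2) + (rho/2)*(w1 - w2)^2,
    solved by alternating exact minimization over the two intervals."""
    w1, w2 = 0.5, 0.5
    for _ in range(iters):
        w1 = np.clip(w2 - (c + y) / rho, lo1, hi1)  # argmin in w1 given w2
        w2 = np.clip(w1 + y / rho, lo2, hi2)        # argmin in w2 given w1
    return c * w1 + y * (w1 - w2) + 0.5 * rho * (w1 - w2) ** 2, w1 - w2

# Grid estimate of d*; the grid contains the exact dual optimum y = -1.5.
d_star = max(dual(y)[0] for y in np.linspace(-5, 5, 1001))

for y in (-3.0, -1.0, 0.0, 1.0, 3.0):
    d_y, grad = dual(y)
    print(f"y={y:+.1f}  Delta_d={d_star - d_y:.4f}  "
          f"(rho/2)*grad^2={0.5 * rho * grad ** 2:.4f}")
```

Alternating exact minimization suffices here because the inner problem is smooth, convex, and two-block; on this toy the inequality holds at every test point and is tight at y = -1.0.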
Furthermore, the primal objective satisfies (D \cdot W_1^t) - (D \cdot W_1^*) \leq (D \cdot W_1^t) - L(W^*, Y^t) \leq L(W^t, Y^t) - L(W^*, Y^t) + \|Y^t\|\|W_1^t - W_2^t\| \leq \eps + 2\|Y^t\|\sqrt{2\eps/\rho}; that is, W^t has an O(\sqrt{\eps})-approximate primal objective. [1] M. Hong and Z.-Q. Luo, "On the linear convergence of the alternating direction method of multipliers," arXiv (2012). (III) The constant \kappa > 0 in Theorem 2 depends on the "condition number (diameter over pyramidal width) [2]" of the polyhedral set {(W_1, W_2) : W_1 \in Conv(A), W_2 \in Conv(P)}. The diameter is linear in the problem size O(NTL), while finding the pyramidal width is a non-trivial task that requires considerable effort even for simple polytopes such as the simplex and the cube [2]. Investigating the pyramidal width of our convex relaxation is part of our ongoing work. [2] S. Lacoste-Julien and M. Jaggi, "On the Global Linear Convergence of Frank-Wolfe Optimization Variants," NIPS 2015.

4) On Reviewer_1 and Reviewer_5's concern about the scalability of the algorithm: Our current implementation is not optimized for efficiency: the cube in Fig 1 is densely represented even though most of its entries are 0, so it has O(NTL|\Sigma|^2) complexity and scales only up to N, T in the hundreds. However, efficiency can be improved considerably by exploiting the sparse (atomic) representation of the iterates, as sketched below. We will state and discuss this limitation, and include a link to our improved implementation, in the next version of this paper.
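As a concrete illustration of the last point, the following is a hypothetical sketch (class and method names are ours, not the paper's code) of keeping the iterate as a convex combination of the few sparse atoms selected so far, rather than as a dense O(NTL|\Sigma|^2) cube:

```python
# Hypothetical sparse (atomic) representation of an iterate W = sum_k weight_k * atom_k.
# Each atom is itself sparse and stored as a tuple of (flat_index, value) pairs.

class AtomicIterate:
    def __init__(self):
        self.weights = {}  # atom_id -> convex-combination weight
        self.atoms = {}    # atom_id -> tuple of (flat_index, value) entries

    def fw_step(self, atom_id, entries, step):
        """Frank-Wolfe style update W <- (1 - step) * W + step * atom."""
        for k in self.weights:
            self.weights[k] *= 1.0 - step
        self.weights[atom_id] = self.weights.get(atom_id, 0.0) + step
        self.atoms.setdefault(atom_id, tuple(entries))

    def inner(self, grad):
        """Evaluate <grad, W>, touching only the nonzero entries of each atom."""
        return sum(w * sum(v * grad[i] for i, v in self.atoms[a])
                   for a, w in self.weights.items())

# Usage: after t iterations, memory is O(t * nonzeros per atom), independent of
# the dense cube size, so N and T can grow well beyond the hundreds.
W = AtomicIterate()
W.fw_step("align_1", [(0, 1.0), (7, 1.0)], step=1.0)  # first atom
W.fw_step("align_2", [(3, 1.0), (7, 1.0)], step=0.5)  # blend in a second atom
print(W.inner({0: 0.2, 3: -0.1, 7: 0.4}))             # 0.5*0.6 + 0.5*0.3 = 0.45
```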