Trees are widely used as interpretable models. However, when they are greedily trained they can yield suboptimal predictive performance. Training soft trees, with probabilistic splits rather than deterministic ones, provides a way to supposedly globally optimize tree models. For interpretability purposes, a hard tree can be obtained from a soft tree by binarizing the probabilistic splits, called hardening. Unfortunately, the good performance of the soft model is often lost after hardening. We systematically study two factors contributing to the performance drop: first, the loss surface of the soft tree loss has many local optima (and thus the logic for using the soft tree loss becomes less clear), and second, the relative values of the soft tree loss do not correspond to relative values of the hard tree loss. We also demonstrate that simple mitigation methods in literature do not fully mitigate the performance drop.