Segment-driven Structural Induction and Semantic Alignment for Heterogeneous Tabular Representation
Abstract
Tabular data within a domain often exhibit heterogeneous schemas yet shared semantics. This poses a key challenge: determining what should remain invariant across tables and what should preserve instance-level distinctions. Existing token- or row-centric encoders conflate these two roles, which leads to schema sensitivity or weakened discriminability. We introduce the segment, a header–value pair, as an atomic unit that captures both functional role and semantic content. Using the entropy of a segment's values, we treat low-entropy segments as domain anchors and high-entropy segments as entity-specific signals. We realize this design through Masked Segment Modeling and Entropy-driven Segment Alignment, which jointly enforce structured header–value coupling and selective semantic alignment. Experiments on in-domain heterogeneous tables show improved performance on both discriminative and generative tasks, and the learned representations are stable and interpretable.
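To make the entropy-based split concrete, the following is a minimal sketch and not the paper's implementation. It assumes that value entropy means the Shannon entropy of the empirical value distribution under each header, and that a fixed cutoff separates anchors from entity-specific signals. The function names (`value_entropy`, `split_segments`), the `threshold` of 1.0 bit, and the toy rows are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def value_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def split_segments(rows, threshold=1.0):
    """Group values by header, score each header by value entropy, and split.

    `threshold` is an illustrative assumption; the abstract does not state
    how the low/high-entropy boundary is chosen.
    """
    by_header = defaultdict(list)
    for row in rows:
        for header, value in row.items():
            by_header[header].append(value)

    entropies = {h: value_entropy(v) for h, v in by_header.items()}
    # Low-entropy headers: candidate domain anchors.
    anchors = sorted(h for h, e in entropies.items() if e <= threshold)
    # High-entropy headers: candidate entity-specific signals.
    signals = sorted(h for h, e in entropies.items() if e > threshold)
    return entropies, anchors, signals


# Toy table (hypothetical data): each row is a dict of header -> value.
rows = [
    {"country": "US", "name": "Alice", "status": "active"},
    {"country": "US", "name": "Bob", "status": "active"},
    {"country": "US", "name": "Carol", "status": "inactive"},
    {"country": "DE", "name": "Dana", "status": "active"},
]

entropies, anchors, signals = split_segments(rows)
print(anchors)  # headers whose values repeat across rows
print(signals)  # headers whose values are mostly unique per row
```

In this toy table, `country` and `status` take few distinct values and fall below the cutoff, while `name` is unique per row and lands on the entity-specific side. A real system would likely need a normalized or data-dependent cutoff rather than a fixed number of bits.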