Poster in Workshop: 2nd Workshop on Generative AI and Law (GenLaw ’24)
Building a Long-Text Privacy Policy Corpus with Multi-Class Labels
David Stein · Florencia Marotta-Wurgler
This work introduces a new hand-coded dataset for the interpretation of privacy policies. The dataset codes the contents of 162 privacy policies, including the documents they incorporate by reference, along 64 dimensions that map onto commonly found terms and applicable legal rules. The coding approach is designed to capture complexities inherent in legal interpretation that current privacy policy datasets do not represent: textual ambiguity, indeterminate meaning, interdependent clauses, contractual silence, and the effect of legal defaults. This work also introduces the suite of open-source, online tools we developed to build the dataset. The tools are explicitly designed to allow non-technical domain experts to create similar datasets.