Position: Enabling Fair Revenue Sharing for Data Providers in GenAI Systems
Abstract
Generative AI (GenAI) systems, particularly large language models (LLMs), rely heavily on vast amounts of publicly available digital content as training data. A significant portion of this content is protected by copyright. While large-scale data scraping may be lawful in certain jurisdictions, the use of copyrighted works to generate outputs that compete with or replicate original creations raises unresolved legal, economic, and ethical concerns. In this position paper, we argue that data providers should be fairly compensated based on their measurable contribution to inference-time outcomes, rather than through coarse, one-time licensing or blanket agreements. We examine alternative perspectives on data ownership, fair use, and model training, and discuss why existing approaches fail to align incentives between GenAI developers and content creators. We then outline concrete roadmaps for developing decentralized systems that enable contribution-aware revenue sharing, including mechanisms for attribution, accounting, and payout at scale. We contend that fair revenue distribution for data providers will not only help resolve ongoing legal disputes surrounding GenAI systems but also foster a new era of collaboration, rather than competition, between model developers and data creators. By incentivizing the production and sharing of high-quality datasets, such mechanisms can ultimately accelerate the development of more robust, trustworthy, and socially sustainable GenAI systems.