The Jurist (法学家), 2025, Vol. 0, Issue (5): 27-41.

• Special Topic I: Multidimensional Perspectives on Digital Law Research •

On Fair Use of Training Data for Large Language Models (论大模型训练数据的合理使用)

LI Mingxuan (李铭轩)

  • Publication date: 2025-09-15  Online date: 2025-09-15
  • About the author: *LI Mingxuan, Ph.D. in Law, Lecturer at the School of Interdisciplinary Studies and the Big Data and Artificial Intelligence Innovation Platform for National Governance, Renmin University of China.
  • Funding:
    This article is a staged research output of the Supreme People's Court major judicial research project "Research on the Tort Liability of Generative Artificial Intelligence Service Providers" (GFZDKT2024C08-1).

On Fair Use of Training Data for Large Language Models

LI Mingxuan   

  • Online: 2025-09-15  Published: 2025-09-15
  • About the author: LI Mingxuan, Ph.D. in Law, Lecturer at the School of Interdisciplinary Studies and Big Data and Responsible Artificial Intelligence for National Governance, Renmin University of China.

Abstract: The main source of training data for large language models is publicly available data on the internet; developers typically obtain training data at scale by crawling public web pages and collecting open-source datasets. As the protection of data property rights and interests strengthens, this principal means of acquiring massive training data faces challenges to its legality. Because data rightsholders are numerous and data usage is difficult to trace, transaction costs rise, and large-model developers cannot secure licenses from data rightsholders through market mechanisms to ensure the legality of their training data. Under such market failure, permitting developers to make fair use of data for large-model training increases social welfare and generally does not harm the market interests of data rightsholders. Alternatives such as collective management or statutory licensing would bring data rightsholders only very limited gains while incurring higher institutional costs and adversely affecting the development of large models in China. China should therefore establish a fair use regime for large-model training data to provide legal certainty for technological development. In terms of rule design, fair use of training data should be limited to publicly available data; its purpose should be limited to pretraining; its scope should cover the data processing activities involved in training; and data rightsholders should be allowed to opt out of fair use through technical measures.

Keywords: Large Language Models, Training Data, Data Property Rights and Interests, Fair Use, Market Failure

Abstract: The primary source of training data for large language models is publicly available data on the internet. Developers typically collect these data on a large scale through web crawling and the aggregation of open-source datasets. However, as the protection of data property rights becomes increasingly reinforced, the legality of this approach faces growing legal challenges. The large number of data rightsholders and the difficulty of tracing data usage significantly increase transaction costs, making it impractical for developers to obtain individual licenses through market mechanisms to ensure the lawful use of training data. In this context of market failure, permitting the fair use of data for training large language models can increase social welfare and generally does not harm the market interests of data rightsholders. Alternatives such as collective management or statutory licensing offer limited benefits to rightsholders while imposing higher institutional costs and potentially hindering the development of large language models in China. Therefore, a fair use regime for training data should be established to provide legal certainty for technological innovation. In terms of rule design, fair use should be limited to publicly available data, be solely for the purpose of pretraining, cover the data processing activities involved in training, and allow data rightsholders to opt out through technical measures.

Key words: Large Language Models, Training Data, Data Property Rights and Interests, Fair Use, Market Failure