Abstract: Large language models are trained primarily on publicly available data from the internet, which developers typically acquire at scale by crawling public web pages and aggregating open-source datasets. As the protection of data property rights and interests strengthens, this principal means of obtaining massive training data faces growing challenges to its legality. Because data rightsholders are numerous and data usage is difficult to trace, transaction costs rise sharply, making it impractical for developers to obtain licenses from rightsholders through market mechanisms to ensure the lawfulness of their training data. In this context of market failure, permitting developers to make fair use of data for training large language models can increase social welfare and generally does not harm the market interests of data rightsholders. Alternatives such as collective management or statutory licensing would offer only limited benefits to rightsholders while imposing higher institutional costs and hindering the development of large language models in China. China should therefore establish a fair use regime for training data to provide legal certainty for technological development. In terms of rule design, fair use of training data should be limited to publicly available data and to the purpose of pretraining; it should cover the data processing involved in training; and data rightsholders should be allowed to opt out of fair use through technical measures.
Key words: Large Language Models; Training Data; Data Property Rights and Interests; Fair Use; Market Failure
LI Mingxuan. On Fair Use of Training Data for Large Language Models[J]. The Jurist (法学家), 2025(5): 27-41.