On August 28, 2025, the Guidelines for High-Quality Dataset Development (hereinafter referred to as the “Guidelines”) were officially released during the “High-Quality Dataset Exchange Event” at the 2025 China International Big Data Industry Expo. Under the guidance of the National Data Administration (NDA), the Guidelines were jointly compiled by the China Academy of Information and Communications Technology (CAICT), the National Institute of Data Development (NIDD), the China Electronics Standardization Institute (CESI), the State Information Center, the Innovation-Driven Development Center of the National Development and Reform Commission (NDRC), and the China Center for Information Development (CCID).
The Guidelines outline a reference pathway for high-quality dataset development, addressing key areas such as the background, application requirements, current status, methodologies and practices, operational frameworks, and implementation strategies. Designed to steer and advance the development of high-quality datasets, they aim to support the sustained growth and deep advancement of artificial intelligence. Specifically, the reference pathway consists of one set of methodological frameworks and one integrated operational system. By analyzing typical models, core processes, key technologies, and quality evaluation, the pathway sets out a concrete approach that gives enterprises clear and actionable practices for constructing high-quality datasets.
The methodological frameworks include three core components: six core processes, five key technologies, and one comprehensive quality system. The core processes are data requirements, data planning, data collection, data preprocessing, data labeling, and model validation. The highlighted technologies cover data collection, data transformation, data cleaning, feature selection, and data labeling. The quality system comprises an implementation procedure, a set of evaluation metrics, and a management system. In addition, the integrated operational system is built on three pillars: systematic planning, engineering implementation, and operational management. Systematic planning comes first: constructing knowledge indexes, inventorying data resources, and establishing standards to create a clear development blueprint. Engineering implementation then provides technical assurance, focusing on data development, delivery, and maintenance while advancing cutting-edge technologies to overcome application constraints and scale production. Finally, operational management ensures sustainable growth through end-to-end governance, enabling timely demand response, precise cost control, trusted quality and security, and collaborative ecosystem value creation.
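For readers who want to see the structure at a glance, the six core processes and five key technologies named above can be written down as a small data model. This is only an illustrative sketch, not something taken from the Guidelines: the names `CoreProcess`, `KEY_TECHNOLOGIES`, and `next_process` are hypothetical, and the strict one-after-another ordering is an assumption based on the order in which the processes are listed.

```python
from enum import Enum


class CoreProcess(Enum):
    """The six core processes, in the order the summary lists them.

    Treating the listing order as an execution order is an assumption.
    """
    DATA_REQUIREMENT = 1
    DATA_PLANNING = 2
    DATA_COLLECTION = 3
    DATA_PREPROCESSING = 4
    DATA_LABELING = 5
    MODEL_VALIDATION = 6


# The five key technologies highlighted in the summary.
KEY_TECHNOLOGIES = (
    "data collection",
    "data transformation",
    "data cleaning",
    "feature selection",
    "data labeling",
)


def next_process(current):
    """Return the process listed after `current`, or None after the last one."""
    members = list(CoreProcess)  # Enum members iterate in definition order
    idx = members.index(current)
    return members[idx + 1] if idx + 1 < len(members) else None


if __name__ == "__main__":
    step = CoreProcess.DATA_REQUIREMENT
    while step is not None:
        print(step.value, step.name.replace("_", " ").lower())
        step = next_process(step)
```

Running the script prints the six processes in listed order; a real project would likely loop back from model validation to earlier steps, which this sketch does not model.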
In summary, the Guidelines advocate a systematic approach to optimize the planning of high-quality dataset development, infrastructure-based solutions to facilitate data circulation and utilization, and an ecosystem-oriented environment to ensure sustainable growth. The ultimate goal is to establish a comprehensive and integrated framework that covers the entire lifecycle and interconnects all phases of high-quality dataset construction. Moving forward, the NDA will continue to guide stakeholders to participate actively in the development of high-quality datasets, enhance the quality of data supply, and strengthen the foundation for artificial intelligence advancement.
Original source: https://xbitly.com/NzDlT