
The Ministry of Digital Affairs has proposed the “Data Innovation and Utilization Promotion Act” and held a briefing on July 15 to explain key aspects of the draft legislation. Among the highlights is the plan to build Taiwan’s sovereign AI corpus by establishing standard licensing terms and conducting data inventories across government agencies. The corpus is scheduled to launch in the fourth quarter.
The public consultation period for the “Data Innovation and Utilization Promotion Act” runs until August 15. According to the Ministry, the main objectives of the act include: (1) enhancing the quality of open data to support AI model training; (2) fostering cross-sectoral data sharing to increase data value—such as by reducing data acquisition costs and implementing guidance, incentives, or subsidies; (3) requiring each government agency to adopt measures that promote innovative data use; and (4) cultivating a data innovation ecosystem.
Deputy Minister Lin Yi-jing emphasized that AI is progressing rapidly and that Taiwan has long aspired to develop AI models that reflect a local perspective and can be applied domestically. Training AI, however, requires vast amounts of data. Through the legislative change, the government would be able to release copyrighted content it holds without compromising personal data privacy.
The Ministry stressed that the draft stipulates the use of standardized licensing terms for government open data to facilitate development and use in emerging technologies like AI. It also mandates that data shared by the government be provided under non-exclusive licenses rather than granted exclusively to any specific individual or entity.
In response to public interest in the progress of the sovereign AI corpus, Director-General of the Department of Data Innovation, Chuang Ming-fen, stated that the approach mirrors the early stages of open data development more than a decade ago, when each agency started by releasing five datasets and gradually built momentum. A similar phased strategy is being applied now, but it will take time.
Chuang explained that while over 50,000 open datasets are currently available, training large language models (LLMs) requires a different type of data: coherent, complete textual content rather than the structured format common to open data. Such textual content often involves copyright issues, so the government is establishing a licensing framework specifically for the sovereign corpus. Agencies will first inventory their holdings and then release applicable content under the new licensing terms.
The upcoming corpus will incorporate both newly released content and existing open data, including approximately 1,000 textual datasets suitable for LLM training. These include resources like TAIDE (Trustworthy AI Dialogue Engine) that are freely usable. Chuang added that government policy reports, plans, and key publications—many of which are high-quality texts suitable for training—will also be included. From July to August, the Ministry will actively engage with other government bodies to assist in the process. Agencies such as the Hakka Affairs Council, Ministry of Education, Council of Indigenous Peoples, and Ministry of Culture are already conducting inventories. The sovereign corpus is expected to be released in Q4.
Source: 數發部拚主權AI語料庫 Q4上線 ("Ministry of Digital Affairs pushes for sovereign AI corpus, to launch in Q4")
