
China’s AI Chip Ecosystem Needs Improvement, Says Member of Chinese Academy of Engineering

By xinyue, Jul 09, 2024, 3:27 a.m. ET

AsianFin -- AI large models are evolving from single-modality to multi-modality, and their growing range of applications is driving explosive growth in demand for computing power, said Zheng Weimin, an academician of the Chinese Academy of Engineering and a professor in the Department of Computer Science and Technology at Tsinghua University.

Compared with Nvidia's, the domestic AI chip ecosystem is far less developed, he pointed out.

Zheng made the remarks at the 2024 annual seminar of the China Information Technology Association (ChinaInfo100) on Sunday.

He explained that the computing power required by large models spans four main stages: model development, model training, model fine-tuning, and model inference. Consequently, computing power is essential throughout the lifecycle of a large model.

He highlighted the high cost of computing power. For instance, GPT-4's development used 800 Nvidia A100 GPUs at a monthly cost of $2 million; training with 10,000 A100 GPUs costs about $200 million; and ChatGPT's daily inference cost is $700,000. At large model companies, computing power accounts for 70% of training costs and 95% of inference costs.
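The scale of these figures can be sanity-checked with simple arithmetic. A minimal sketch, using only the numbers as reported; the per-GPU rate and annualized inference figure are derived here for illustration, not stated in the remarks:

```python
# Back-of-the-envelope check of the reported cost figures.
# Inputs are the article's numbers; derived rates are illustrative only.

dev_gpus = 800                # A100 GPUs reported for GPT-4 development
dev_cost_monthly = 2_000_000  # USD per month, as reported

# Implied monthly cost per A100 in the development cluster
per_gpu_monthly = dev_cost_monthly / dev_gpus
print(f"~${per_gpu_monthly:,.0f} per GPU per month")

# Annualizing the reported $700,000/day ChatGPT inference cost
inference_daily = 700_000     # USD per day, as reported
inference_yearly = inference_daily * 365
print(f"~${inference_yearly / 1e6:.1f}M per year on inference")
```

At these rates, a year of inference alone (roughly $255.5 million) exceeds the reported $200 million cost of a 10,000-GPU training run, which is consistent with Zheng's point that computing power dominates both phases of the lifecycle.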

At the model training level, Zheng identified three support systems. The first is GPU systems built on Nvidia chips. They offer excellent hardware performance and a robust programming ecosystem, but they are not sold to China, making them hard to obtain and significantly more expensive.

The second is the systems based on domestic AI chips. Although domestic chips have made significant progress in both software and hardware, users are reluctant to adopt them due to the underdeveloped ecosystem.

Zheng had elaborated on this issue at a sub-forum of the 2024 World Artificial Intelligence Conference. Despite the efforts of over 20 domestic companies, including Shanghai Tianjic and MetaX, the ecosystem for domestic AI systems remains immature, especially in software.

Zheng defined a good ecosystem as one in which CUDA-based AI software originally written for Nvidia hardware can be ported to domestic systems easily and with similar methods. If the porting process takes one to two years, the ecosystem is not good.

Currently, the domestic ecosystem is inadequate, leading to low user adoption. Zheng emphasized the need for robust system design and software optimization across ten areas: programming frameworks, parallel acceleration, communication libraries, operator libraries, AI compilers, programming languages, schedulers, memory allocation systems, fault-tolerant systems, and storage systems.

He argued that manufacturers of AI chips must excel in these ten areas to gain user acceptance. In his view, if domestic AI chips achieve 60% of the performance of foreign chips but have a well-developed software ecosystem, customers would still be satisfied.

Zheng noted that most tasks do not noticeably suffer from a 40% performance deficit in chips; the primary issue is the ecosystem. Even if the hardware performance is 120% of that of competitors, poor software support will deter potential users.

The third is supercomputer-based systems. China has 14 national supercomputing centers, but their machines often have high idle rates.

Zheng suggested that using supercomputers for large model training is feasible with coordinated hardware and software design, potentially reducing training costs. Demonstrations with Llama-7B and Baichuan large models showed that training costs could be reduced by about 82% compared to Nvidia clusters.

Apart from computing power, storage is also critical throughout the lifecycle of large models, including data acquisition, preprocessing, training, and inference. Zheng stressed the importance of memory for AI inference, noting that improvements in storage systems could significantly enhance performance, reducing the need for many GPUs.

He argued that domestic chips should balance half-precision (FP16) and double-precision (FP64) floating-point performance, with an FP16-to-FP64 ratio of about 100:1, to accommodate a broader range of AI algorithms. Additionally, large model tasks often require multi-GPU interconnection, making network parameters, architecture, and storage performance increasingly crucial at the chip level.

Zheng identified several technical challenges that domestic AI chips must address, including network balance design, I/O subsystem balance design, and memory design.

Despite recent releases and the mass production of new products by GPU startups such as Tianjic, MetaX, Moore Threads, and Baidu Kunlun, sales remain in the doldrums because of the underdeveloped software ecosystem.

Meanwhile, according to Yicai Global, Nvidia is expected to deliver over one million "China-special" H20 chips this year, with its total AI chip sales in China reaching about $12 billion. The H20 is not subject to U.S. export controls and is priced between $12,000 and $13,000 per chip.

Zheng also stressed the importance of balanced design in large model infrastructure. Proper design could reduce the required number of GPUs from 10,000 to 9,000, whereas poor design could increase it to 30,000 to achieve similar results.
