NEWS  /  Analysis

Chinese AI Startup DeepSeek Unveils Open-Source Optical Compression Model for LLM Training

By xinyue | Oct 22, 2025, 3:47 a.m. ET

Chinese artificial intelligence firm DeepSeek has released DeepSeek-OCR, an open-source model designed to extract and compress text from images and PDFs, aiming to provide large-scale, high-quality datasets for training large language models (LLMs) and vision-language models (VLMs) while dramatically reducing computational requirements.

The model was made publicly available on GitHub yesterday, accompanied by a research paper titled DeepSeek-OCR: Contexts Optical Compression.

The technology behind DeepSeek-OCR leverages optical compression: textual information is encoded into compact visual representations (rendered images), which the model can later decode back into text.

According to the company, this approach addresses the major computational bottlenecks LLMs face when processing long-form content such as research papers, legal contracts, financial reports, and dialogue histories. By converting text into images, the system allows models to process extensive documents more efficiently, simulating a gradual forgetting mechanism similar to human memory.
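The token economics behind this approach can be sketched in a few lines. The figures below (characters per token, page length, compression ratio) are illustrative assumptions, not DeepSeek's actual numbers, and the function names are hypothetical:

```python
# Illustration of the optical-compression idea: instead of feeding an LLM
# N text tokens, a document page is rendered to an image and encoded into
# far fewer vision tokens. All constants here are rough assumptions.

def text_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough text-token estimate (~4 characters per token is a common rule of thumb)."""
    return round(num_chars / chars_per_token)

def vision_tokens(num_text_tokens: int, compression_ratio: float) -> int:
    """Vision tokens needed if the optical encoder compresses by `compression_ratio`."""
    return max(1, round(num_text_tokens / compression_ratio))

# A hypothetical 20-page contract at ~3,000 characters per page:
doc_chars = 20 * 3000
t = text_tokens(doc_chars)    # 15000 text tokens
v = vision_tokens(t, 10.0)    # 1500 vision tokens at 10x compression
print(t, v)                   # prints "15000 1500"
```

At a 10x ratio, the same contract occupies a tenth of the context window, which is the source of the efficiency gains the company describes.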

Performance metrics shared in the research indicate that DeepSeek-OCR achieves over 96% decoding accuracy at roughly tenfold compression, about 90% at compression ratios of 10–12x, and around 60% at a 20-fold reduction.
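The reported figures trace a trade-off curve between compression ratio and decoding accuracy. A minimal sketch, using only the data points from the article (the linear interpolation between them is my assumption, not a claim from the paper):

```python
# Accuracy-vs-compression points as reported in the article:
# ~96% at 10x, ~90% at 12x, ~60% at 20x.
POINTS = [(10.0, 0.96), (12.0, 0.90), (20.0, 0.60)]

def approx_accuracy(ratio: float) -> float:
    """Piecewise-linear estimate of decoding accuracy at a given compression ratio.

    Interpolation between reported points is an illustrative assumption.
    """
    if ratio <= POINTS[0][0]:
        return POINTS[0][1]
    if ratio >= POINTS[-1][0]:
        return POINTS[-1][1]
    for (x0, y0), (x1, y1) in zip(POINTS, POINTS[1:]):
        if x0 <= ratio <= x1:
            t = (ratio - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

print(approx_accuracy(16.0))  # midpoint of 12x-20x segment -> 0.75
```

The curve makes the design choice visible: accuracy degrades gracefully up to around 10–12x, then falls off sharply toward 20x.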

This demonstrates that compact language models can effectively decode compressed visual text, potentially enabling larger models to adopt similar capabilities with fewer resources. The model is also highly scalable: a single A100-40G GPU can reportedly generate more than 200,000 pages of training data per day.
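To put the throughput claim in perspective, a quick back-of-the-envelope calculation from the reported 200,000+ pages per day per A100-40G (the fleet-sizing scenario below is my own illustrative assumption):

```python
import math

# Throughput figure reported in the article (pages of training data per day):
PAGES_PER_GPU_PER_DAY = 200_000

def gpus_needed(target_pages: int, days: int) -> int:
    """GPUs required to generate `target_pages` within `days`, at the reported rate."""
    return math.ceil(target_pages / (PAGES_PER_GPU_PER_DAY * days))

# Hypothetical scenario: a 1-billion-page corpus in 30 days.
print(gpus_needed(1_000_000_000, 30))  # prints "167"
```

By this arithmetic, a modest cluster of a few hundred A100s could produce a billion-page OCR corpus in a month, which is why the figure matters for LLM training pipelines.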

DeepSeek-OCR’s ability to compress long-form textual content opens new possibilities for LLM training, particularly for scenarios requiring the processing of massive amounts of data. By converting dialogues, research materials, and multi-page documents into images, the approach reduces token counts and computational overhead, potentially allowing models to handle larger datasets without a corresponding spike in GPU demand.

The open-source release has already attracted attention within the AI community, with DeepSeek-OCR garnering over 1,400 stars on GitHub shortly after its debut.

Analysts note that while the model represents a significant technical advance, DeepSeek has been relatively slow to roll out successor models such as R2, which some experts read as a sign that the company is temporarily falling behind in the rapidly evolving AI field.

Others, however, interpret the cautious pace as a deliberate strategy to strengthen internal capabilities and lay the groundwork for a next-generation AI model.
