1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
2Meituan
3School of Automation, Beijing Institute of Technology
4MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences
(1) The first token-level image-text dataset (TokenIT) is proposed, consisting of 20M images and 1.8B high-quality token-mask pairs.
(2) The first token-level text image foundation model, TokenFD, is proposed to support various downstream tasks.
(3) The image-as-text semantic capability inspires us to develop TokenVL, a VQA-based MLLM tailored for document understanding.
(4) Extensive experiments demonstrate the effectiveness of the proposed TokenFD and TokenVL. Specifically, TokenFD exhibits exceptional "zero-shot" capability and flexibility compared to other VFMs such as CLIP, SAM, and InternViT2.5. TokenVL with 8B parameters, which incorporates TokenFD as the VFM, achieves gains of 38 points on OCRBench and an average improvement of 8.8% across ten VQA tasks. Similarly, TokenVL with 2B parameters achieves gains of 17 points on OCRBench and an average improvement of 13.34% on the ten VQA tasks.
We develop a token-level foundation model, TokenFD, specifically tailored for text-image-related tasks (path 2), in contrast to previous general foundation models (path 1). TokenFD is trained on a substantial self-built dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. This well-trained model can replace other VFMs in related downstream tasks.
In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors on downstream text-image-related tasks, i.e., perception, understanding, and reasoning over images containing small and dense text. To bridge this gap, we develop TokenFD, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD, we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with its exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenFD to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks.
Note: The feature map shown on this page consists of token-level image features and token-level language features aligned within the same semantic space.
An overview of the self-constructed token-level TokenIT dataset, comprising 20 million images and 1.8 billion token-mask pairs. (a) provides a detailed description of each sample, including the raw image, a mask, and a JSON file that records BPE token information. We also present (b) the data distribution, (c) the number of selected BPE tokens, and (d) a word cloud highlighting the top 100 BPE tokens.
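For concreteness, the sketch below shows how one TokenIT-style sample (raw image, token-level mask, and BPE token JSON) could be loaded and inspected. The field names `token`, `token_id`, and `mask_value`, as well as the file names, are hypothetical placeholders rather than the released schema.

```python
# Minimal sketch of reading one TokenIT-style sample; field names and file
# names are assumptions, not the official dataset schema.
import json

import numpy as np
from PIL import Image

def load_sample(image_path, mask_path, json_path):
    """Return the raw image, the token-level mask, and the BPE token records."""
    image = Image.open(image_path).convert("RGB")
    # Assumption: each pixel of the mask stores the region id of the BPE token
    # covering it, with 0 reserved for background.
    mask = np.array(Image.open(mask_path))
    with open(json_path, "r", encoding="utf-8") as f:
        tokens = json.load(f)  # assumed: [{"token": str, "token_id": int, "mask_value": int}, ...]
    return image, mask, tokens

if __name__ == "__main__":
    image, mask, tokens = load_sample("sample.jpg", "sample_mask.png", "sample.json")
    for t in tokens:
        region = mask == t["mask_value"]  # pixels belonging to this BPE token
        print(t["token"], t["token_id"], int(region.sum()), "pixels")
```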
Pipeline. An overview of the proposed TokenFD, where the token-level image features and token-level language features are aligned within the same semantic space. This "image-as-text" alignment seamlessly facilitates user-interactive applications, including text segmentation, retrieval, and visual question answering.
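As a concrete illustration of this image-as-text alignment, the sketch below scores every spatial position of the token-level image features against the embedding of a queried BPE token in the shared space; thresholding the resulting map gives a crude segmentation or retrieval signal. The shapes and the random tensors standing in for encoder outputs are assumptions for illustration only.

```python
# Illustrative only: cosine similarity between token-level image features and
# a BPE token embedding in the shared semantic space (shapes are assumed).
import torch
import torch.nn.functional as F

def token_similarity_map(image_feats: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
    """image_feats: (H*W, C); text_embed: (C,). Returns a (H*W,) similarity map."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    return image_feats @ text_embed  # cosine similarity per spatial position

# Toy usage with random tensors standing in for real encoder outputs.
H, W, C = 32, 32, 768
sim = token_similarity_map(torch.randn(H * W, C), torch.randn(C)).reshape(H, W)
segmentation = (sim > sim.mean() + sim.std()).float()  # crude threshold for a text mask
```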
The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial-wise text perception capabilities by integrating localization prompts to predict coordinates. However, this implicit approach makes it difficult for these models to achieve precise spatial understanding. In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with the corresponding pixels in the input image, enhancing the MLLM's localization awareness.
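To make the explicit alignment concrete, here is one plausible form of a mask-guided alignment objective: visual features are pooled inside each BPE token's mask and matched against the LLM's embeddings of those tokens with a symmetric contrastive loss. This is a hedged sketch under assumed shapes and a fixed temperature, not the exact training objective.

```python
# A hedged sketch of mask-guided token alignment (not the exact objective):
# pool image features inside each BPE token's mask, then contrast the pooled
# visual tokens against the LLM embeddings of those tokens.
import torch
import torch.nn.functional as F

def token_alignment_loss(image_feats, token_masks, token_embeds, temperature=0.07):
    """
    image_feats:  (H*W, C) token-level image features.
    token_masks:  (N, H*W) float {0, 1} masks, one per BPE token in the image.
    token_embeds: (N, C)   LLM embeddings of the same BPE tokens.
    """
    weights = token_masks / token_masks.sum(dim=1, keepdim=True).clamp(min=1)
    pooled = F.normalize(weights @ image_feats, dim=-1)   # (N, C) mask-pooled visual tokens
    token_embeds = F.normalize(token_embeds, dim=-1)
    logits = pooled @ token_embeds.t() / temperature      # (N, N) similarity matrix
    target = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: each visual token should match its own BPE token and vice versa.
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))
```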
Implementation Details. To pre-train the TokenFD foundation model, we employ the AdamW optimizer with a cosine learning rate schedule and a base learning rate of 5e-4. The model is pre-trained for two epochs on the TokenIT dataset. For TokenVL, we use the well-trained TokenFD as the visual foundation model and InternLM as the language model. Specifically, during the LLM-guided token alignment stage, InternLM remains frozen while we train TokenFD and the newly introduced modality connector; this stage trains for one epoch on the TokenIT dataset with a base learning rate of 2e-4. In the subsequent supervised instruction tuning stage, all parameters are fully trainable, with a base learning rate of 1e-5. These experiments are executed on 64 H800 GPUs. Additional implementation details for the other downstream experiments are provided in the respective subtask sections and the Supplementary Material.
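The following sketch encodes the two TokenVL training stages described above (frozen InternLM with a base learning rate of 2e-4 for token alignment, then fully trainable parameters with 1e-5 for instruction tuning) using PyTorch's AdamW and a cosine schedule. Module names such as `vision_model`, `connector`, and `language_model`, as well as the weight decay value, are assumptions for illustration.

```python
# Sketch of the two-stage TokenVL recipe; module names and weight decay are
# assumptions, while the learning rates follow the description above.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_stage(model, stage: str, total_steps: int):
    if stage == "token_alignment":
        # Stage 1: freeze InternLM, train TokenFD and the modality connector (lr 2e-4).
        for p in model.language_model.parameters():
            p.requires_grad = False
        params = list(model.vision_model.parameters()) + list(model.connector.parameters())
        base_lr = 2e-4
    else:
        # Stage 2: supervised instruction tuning with all parameters trainable (lr 1e-5).
        for p in model.parameters():
            p.requires_grad = True
        params, base_lr = model.parameters(), 1e-5
    optimizer = torch.optim.AdamW(params, lr=base_lr, weight_decay=0.05)  # weight decay assumed
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)           # cosine decay over the stage
    return optimizer, scheduler
```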
Text segmentation tasks
| Tasks | Method | #Param | TextSeg | TotalText | HierText | Average |
|---|---|---|---|---|---|---|
| ZS | CLIP-L-336px | 304M | 19.71 | 13.56 | 13.39 | 15.55 |
| ZS | CLIP-L-448px | 304M | 20.50 | 13.91 | 13.19 | 15.86 |
| ZS | CLIP-L-1024px | 304M | 21.35 | 14.33 | 11.77 | 15.81 |
| ZS | TokenFD-448px | 323M | 38.27 | 33.10 | 26.46 | 32.61 |
| ZS | TokenFD-1024px | 323M | 38.28 | 33.54 | 31.95 | 34.59 |
| LP | SAM-H | 632M | 40.82 | 36.83 | 25.87 | 34.51 |
| LP | InternViT2.5 | 300M | 49.77 | 42.54 | 34.31 | 42.21 |
| LP | TokenFD | 323M | 55.66 | 47.53 | 43.11 | 48.77 |
Text segmentation experiments of various visual foundation models. "ZS" refers to the zero-shot task. "LP" denotes the linear probe experiment.
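For reference, the linear-probe ("LP") protocol can be sketched as follows: the visual foundation model stays frozen and only a single 1x1 convolutional head on top of its features is trained for binary text/background segmentation. The backbone output shape and the upsampling step are assumptions.

```python
# Hedged sketch of a linear probe for text segmentation: the VFM is frozen and
# only a 1x1 convolution head is trained (backbone output shape is assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProbeSegmenter(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():  # keep the foundation model frozen
            p.requires_grad = False
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # the only trainable parameters

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(images)      # assumed to return (B, C, h, w) feature maps
        logits = self.head(feats)              # (B, 1, h, w) text vs. background logits
        return F.interpolate(logits, size=images.shape[-2:], mode="bilinear", align_corners=False)
```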
VQA tasks
| Method | #Param | DocVQA | InfoVQA | TextVQA | ChartQA | Average |
|---|---|---|---|---|---|---|
| SAM-H | 632M | 17.0 | 23.1 | 33.1 | 30.1 | 25.82 |
| CLIP-L | 304M | 64.9 | 38.6 | 80.7 | 65.2 | 62.36 |
| InternViT2.5 | 300M | 77.3 | 49.3 | 84.4 | 74.0 | 71.25 |
| TokenFD | 323M | 78.9 | 50.0 | 85.6 | 74.4 | 72.21 |
The ANLS results of various visual foundation models on VQA tasks.
Linear probe tasks
| Tasks | Method | #Param | CTR (EN) | CSVTRv2 (CH) | Average |
|---|---|---|---|---|---|
| LP | CLIP-L | 304M | 1.21 | 6.03 | 3.62 |
| LP | InternViT2.5 | 300M | 4.21 | 22.37 | 13.29 |
| LP | TokenFD | 323M | 43.04 | 84.19 | 63.62 |
Linear probe experiments of various VFMs on text retrieval tasks. All VFMs are frozen.
OCRBench benchmark

| 8B-Model | ShareGPT4V | Cambrian | MM1.5 | POINT1.5 | GPT-4o | Gemini-1.5-Pro | GLM-4v | Claude3.5 | InternVL2.5 | TextMonkey | DocOwl-1.5 | TextHawk2 | TokenVL (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 398 | 614 | 635 | 720 | 736 | 754 | 776 | 788 | 822 | 561 | 599 | 784 | 860 |

| 2B-Model | MiniMonkey | InternVL2.5 | TokenVL (ours) |
|---|---|---|---|
| Score | 802 | 804 | 821 |

Comparison results of our TokenVL with other MLLMs on the OCRBench benchmark.
Text-rich image understanding tasks
| Model | Size | Venue | DocVQA | InfoVQA | DeepForm | ChartQA | TextVQA (Val) | WTQ | TabFact | FUNSD | SROIE | KLC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniCPM-V | 3B | COLM'24 | 71.9 | - | - | 55.6 | 74.1 | - | - | - | - | - |
| Mini-Monkey | 2B | ICLR'25 | 87.4 | 60.1 | - | 76.5 | 75.7 | - | - | 42.9 | 70.3 | - |
| InternVL2.5 | 2B | arxiv'24 | 88.7 | 60.9 | 15.2 | 79.2 | 74.3 | 38.7 | 58.1 | 37.9 | 68.1 | 16.1 |
| TokenVL | 2B | - | 89.9 | 61.0 | 71.9 | 81.1 | 76.4 | 49.0 | 76.9 | 43.0 | 82.6 | 38.8 |
| Claude-3.5 Sonnet | - | Closed-source | 88.5 | 59.1 | 31.4 | 51.8 | 71.4 | 47.1 | 53.5 | - | - | 24.8 |
| GeminiPro-1.5 | - | Closed-source | 91.2 | 73.9 | 32.2 | 34.7 | 80.4 | 50.3 | 71.2 | - | - | 24.1 |
| GPT4o 20240806 | - | Closed-source | 92.8 | 66.4 | 38.4 | 85.7 | 70.5 | 46.6 | 81.1 | - | - | 29.9 |
| DocPedia | 7B | arxiv'23 | 47.1 | 15.2 | - | 46.9 | 60.2 | - | - | 29.9 | 21.4 | - |
| DocOwl | 7B | arxiv'23 | 62.2 | 38.2 | 42.6 | 57.4 | 52.6 | 26.9 | 67.6 | 0.5 | 1.7 | 30.3 |
| LLaVA1.5 | 7B | NeurIPS'23 | - | - | - | 9.3 | - | - | - | 0.2 | 1.7 | - |
| UReader | 7B | EMNLP'23 | 65.4 | 42.2 | 49.5 | 59.3 | 57.6 | 29.4 | 67.6 | - | - | 32.8 |
| CHOPINLLM | 7B | arxiv'24 | - | - | - | 70.0 | - | - | - | - | - | - |
| TextHawk | 7B | arxiv'24 | 76.4 | 50.6 | - | 66.6 | - | 34.7 | 71.1 | - | - | - |
| DocKylin | 7B | arxiv'24 | 77.3 | 46.6 | - | 66.8 | - | 32.4 | - | - | - | - |
| MM1.5 | 7B | arxiv'24 | 88.1 | 59.5 | - | 78.6 | 76.8 | 46.0 | 75.9 | - | - | - |
| DocOwl-1.5 | 8B | EMNLP'24 | 81.6 | 50.4 | 68.8 | 70.5 | 68.8 | 39.8 | 80.4 | - | - | 37.9 |
| DocOwl-1.5-Chat | 8B | EMNLP'24 | 82.2 | 50.7 | 68.8 | 70.2 | 68.6 | 40.6 | 80.2 | - | - | 38.7 |
| CogAgent | 17B | CVPR'24 | 81.6 | 44.5 | - | 68.4 | 76.1 | - | - | - | - | - |
| Monkey | 10B | CVPR'24 | 66.5 | 36.1 | 40.6 | 65.1 | 67.6 | 25.3 | - | - | - | - |
| Vary | 7B | ECCV'24 | 76.3 | - | - | 66.1 | - | - | - | - | - | - |
| TextHawk2 | 7B | arxiv'24 | 89.6 | 67.8 | - | 81.4 | 75.1 | 46.2 | 78.1 | - | - | - |
| PDF-WuKong | 9B | arxiv'24 | 76.9 | - | - | - | - | - | - | - | - | - |
| LLaVA-NeXT-7B | 7B | arxiv'24 | 63.5 | 30.9 | 1.3 | 52.1 | 65.1 | 20.1 | 52.8 | - | - | 5.35 |
| Llama-3.2-11B | 11B | arxiv'24 | 82.7 | 36.6 | 1.78 | 23.8 | 54.3 | 23.0 | 58.3 | - | - | 3.47 |
| Pixtral-12B | 12B | arxiv'24 | 87.7 | 49.5 | 27.4 | 71.8 | 76.1 | 45.2 | 73.5 | - | - | 24.1 |
| Ovis | 9B | arxiv'24 | 88.8 | 74.0 | 45.2 | 81.4 | 77.7 | 50.7 | 76.7 | - | - | 23.9 |
| InternVL2.5 | 8B | arxiv'24 | 93.0 | 77.6 | 37.9 | 84.8 | 79.1 | 52.7 | 74.8 | 38.26 | 71.7 | 22.9 |
| AlignVLM | 8B | arxiv'25 | 81.2 | 53.8 | 63.3 | 75.0 | 64.6 | 45.3 | 83.0 | - | - | 35.5 |
| TextMonkey | 8B | arxiv'24 | 73.0 | 28.6 | - | 66.9 | 65.6 | - | - | 32.3 | 47.0 | - |
| HRVDA | 7B | CVPR'24 | 72.1 | 43.5 | 63.2 | 67.6 | 73.3 | 31.2 | 72.3 | - | - | 37.5 |
| InternVL2 | 8B | CVPR'24 | 91.6 | 74.8 | - | - | 77.4 | - | - | - | - | - |
| Park et al. | 7B | NeurIPS'24 | 72.7 | 45.9 | 53.0 | 36.7 | 59.2 | 34.5 | 68.2 | - | - | 36.7 |
| MOAI | 7B | ECCV'24 | - | - | - | - | 67.8 | - | - | - | - | - |
| TokenVL w/o TA | 8B | - | 93.8 | 75.3 | 72.4 | 86.5 | 79.3 | 57.2 | 83.6 | 41.5 | 79.0 | 39.6 |
| TokenVL | 8B | - | 94.2 | 76.5 | 72.9 | 86.6 | 79.9 | 61.4 | 85.2 | 42.2 | 81.9 | 39.9 |
Comparisons on various types of text-rich image understanding tasks. All evaluation benchmarks use their officially designated metrics. "Size" refers to the number of parameters in the model, and "Val" refers to the validation set.
@inproceedings{guan2025TokenFD,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
  booktitle={arXiv preprint},
  year={2025}
}