
How OCR Training Works in OneVision-Encoder

Explaining multi-label Partial FC for OCR-based supervision

💬 Issue #105 opened by JerryPW

Hi, thank you for releasing this excellent work. While reading the paper, one point remained unclear: how the OCR annotations are actually incorporated into training.

From the paper, the following points are clear:

  • PaddleOCR is applied to images from OBELICS and Zero250M
  • The recognized text is tokenized
  • 100 fine-grained tags are constructed for each image
  • OCR data is introduced in Stage 2 together with video supervision

However, the paper does not seem to explicitly describe how these OCR-derived tags are optimized in the training objective.

1. Data Sources

OCR supervision uses two image datasets that naturally contain embedded text:

  • OBELICS (15M) — document images from interleaved web pages
  • Zero250M (15M) — curated high-quality image collection

These images are not paired with text captions. Instead, PaddleOCR extracts the visible text and converts it into classification labels.

2. PaddleOCR Preprocessing

Each image is processed offline (before training) through the following pipeline:

Raw Image (OBELICS / Zero250M)
  → PaddleOCR: text region detection + recognition
  → Word Segmentation: tokenize recognized text into words
  → Multi-label Tags: 100 word-tags per image

Key point: The segmented words are mapped to a global vocabulary of 365,187 classes. Each image gets exactly 100 tags from this vocabulary. These 100 tags are stored as the image's multi-label annotation in the .rec training file.

Handling variable word counts: In practice, PaddleOCR may extract fewer or more than 100 words from an image. If fewer than 100 words are detected, oversampling is applied — existing tags are duplicated to fill the 100 slots. If more than 100 words are detected, downsampling is applied — a random subset of 100 tags is selected. This ensures every image has a fixed-size label vector of exactly 100.
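The fixed-size tagging described above can be sketched as follows. This is a minimal illustration of the over/downsampling logic; `build_tag_vector`, the vocabulary dict, and the specific sampling calls are my own assumptions, not code from the repository.

```python
import random

def build_tag_vector(words, vocab, num_tags=100, seed=None):
    """Map OCR words to class IDs and pad/trim to exactly num_tags entries.

    words: list of segmented OCR words for one image
    vocab: dict mapping word -> class ID (365,187 classes in the real model)
    """
    rng = random.Random(seed)
    ids = [vocab[w] for w in words if w in vocab]   # drop out-of-vocabulary words
    if len(ids) >= num_tags:
        # Downsample: keep a random subset of num_tags tags.
        return rng.sample(ids, num_tags)
    # Oversample: duplicate existing tags to fill the remaining slots.
    return ids + rng.choices(ids, k=num_tags - len(ids))
```

Either branch guarantees a fixed-size label vector of exactly 100 entries per image, which is what gets stored in the .rec training file.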

3. Multi-label Partial FC (Core Mechanism)

This is the key innovation for OCR training. Unlike standard single-label classification, each image has 100 labels simultaneously. The system uses a modified Partial FC that handles multi-label training through parallel independent softmax losses.

Step-by-step: How one OCR image is trained

Step 1: 100 Tags

Each image has 100 stored OCR tags (word IDs from the 365K vocabulary).

Step 2: Sample 8

Randomly shuffle all 100 tags and keep the first 8; the sampled subset is different every iteration:

# fused_partial_fc_v2_multi_res.py line 180-183
noise = torch.rand(batch_size, random_diff, device="cuda")  # random_diff=100
ids_shuffle = torch.argsort(noise, dim=1)
ids_keep = ids_shuffle[:, :8]
local_labels = torch.gather(local_labels, dim=1, index=ids_keep)  # [B, 8]
Step 3: 8x Classify

Each of the 8 labels gets its own independent classifier. For each label i:

  1. Sample class centers: Keep the positive class center + randomly sample 10% of the 365K negative centers (sample_rate=0.1)
  2. Compute cosine similarity: logit_j = cos θ_j = ⟨W_j, x⟩ / (‖W_j‖ ‖x‖) between the embedding x and each sampled class center W_j
  3. Apply ArcFace margin: Add angular margin m to the positive logit, replacing cos θ_y with cos(θ_y + m)
  4. Scale and softmax: Multiply all logits by s, then apply softmax cross-entropy against the positive class

Multi-label ArcFace Visualization

(Interactive figure.) Each column corresponds to one of the 8 sampled labels. For each label, the embedding is pulled toward its positive class center under an angular margin m; only the ~10% randomly sampled negative centers participate in the softmax, and the remaining ~90% are skipped.

Step 4: Average

The 8 independent CE losses are averaged (where m is the angular margin and s = 64 is the scale factor used in the steps above):

# fused_partial_fc_v2_multi_res.py line 497
loss = (loss_0 + loss_1 + loss_2 + loss_3 + loss_4 + loss_5 + loss_6 + loss_7) / 8.0

Why 8 parallel classifiers instead of multi-label BCE?

Each label gets a full softmax over the entire vocabulary (~365K classes). This forces the model to discriminate each OCR word against all other possible words — much harder than binary "present/not present". The ArcFace margin further pushes embeddings apart in angular space, producing more discriminative representations.
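The difference in training signal can be seen in a toy comparison. This is illustrative only: the vocabulary is shrunk from 365K to 1,000 classes, and the tensors are random.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V = 1000                           # toy vocabulary (365,187 in the real model)
logits = torch.randn(2, V, requires_grad=True)
y = torch.tensor([7, 42])          # one positive word per sample

# Softmax CE: each positive must win against every other word in the vocabulary.
ce = F.cross_entropy(logits, y)
ce.backward()
grad_softmax = logits.grad.clone()

# Multi-label BCE: each word is an independent present/absent decision.
logits.grad = None
targets = torch.zeros(2, V)
targets[0, 7] = targets[1, 42] = 1.0
bce = F.binary_cross_entropy_with_logits(logits, targets)
```

Under softmax, the gradient on each logit is coupled to all competitors (it is p_j minus the one-hot target, where p is the softmax over the whole class set), so raising the positive necessarily pushes down every negative; under BCE each class receives an isolated sigmoid gradient, a much weaker per-class discrimination signal.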

4. Multi-Head Training Architecture

In Stage 2, three data heads share a single ViT backbone. Each head has its own Partial FC classifier with separate class centers.

Shared ViT Backbone (OneVision-Encoder)
  Inputs:  Image [B, 3, 448, 448]  |  OCR [B, 3, 448, 448]  |  Video [B, 3, 8, 224, 224]
  Output:  pooler_output → [B, 1024] embedding

Head 1: Image (SI)
  • COYO-400M / LAION-260M
  • num_classes = 2,000,000
  • random_diff = 10
  • batch_size = 16

Head 2: OCR
  • OBELICS / Zero250M
  • num_classes = 365,187
  • random_diff = 100
  • batch_size = 8

Head 3: Video Codec
  • HowTo100M / Panda-70M / K710
  • Separate PFC head
  • Codec + Sampling + Collage
  • batch_size = 16
Config                     Image Head           OCR Head             Video Head
Data Source                COYO / LAION         OBELICS / Zero250M   HowTo100M / Panda-70M / K710
num_classes                2,000,000            365,187              Separate
random_diff                10                   100                  N/A
Labels sampled per image   8 of 10              8 of 100             8
sample_rate (PFC)          0.1                  0.1                  0.1
Input processing           2D image (448x448)   2D image (448x448)   Video (8x224x224)
loss_weight                1.0                  1.0                  1.0
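The per-head hyperparameters above can be captured in a small config sketch. The dict layout and key names here are my own assumptions for illustration, not the repository's actual config schema.

```python
# Hypothetical per-head hyperparameters transcribed from the table above.
HEAD_CONFIGS = {
    "image": {"num_classes": 2_000_000, "random_diff": 10,   "labels_per_image": 8,
              "sample_rate": 0.1, "batch_size": 16, "loss_weight": 1.0},
    "ocr":   {"num_classes": 365_187,   "random_diff": 100,  "labels_per_image": 8,
              "sample_rate": 0.1, "batch_size": 8,  "loss_weight": 1.0},
    # The video head uses a separate PFC head; its class count is not an
    # explicit number in the table, so it is left as None here.
    "video": {"num_classes": None,      "random_diff": None, "labels_per_image": 8,
              "sample_rate": 0.1, "batch_size": 16, "loss_weight": 1.0},
}
```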

5. Total Training Loss

All three losses are weighted equally (weight = 1.0). The shared backbone receives gradients from all three heads simultaneously.
# training/train.py line 641-657
for head_id, pfc in enumerate(list_module_pfc):
    head_embedding = list_embedding[head_id]          # [B, 1024]
    head_label = list_data_batch[head_id]["labels"]   # [B, random_diff]
    head_label = head_label[:, label_select : label_select + random_diff]
    head_loss = pfc(head_embedding, head_label, random_diff) * loss_weight
    list_loss.append(head_loss)

scaled_loss = sum(list_loss) / backward_passes_per_step
scaled_loss.backward()

6. Two-Stage Training Summary

Stage 1: Image-only Pretraining

Train the backbone on COYO/LAION images with 2M cluster centers. The ViT learns general visual representations.

Stage 2: Joint Multi-Modal Training

Add OCR and Video heads. The OCR head forces the encoder to become text-aware — by classifying word-level tags, the ViT learns to represent text regions with high fidelity. This is why OneVision-Encoder excels at DocVQA, ChartQA, and OCRBench despite being a general-purpose vision encoder.

Key Insight

The OCR tags are not used for text decoding. They serve as a discriminative classification signal: by forcing the encoder to distinguish between 365K possible words for each detected text token, the learned representations become highly sensitive to text content. This is fundamentally different from contrastive image-text approaches (like CLIP) which only align global image and sentence embeddings.