Explaining multi-label Partial FC for OCR-based supervision
Hi, thank you for releasing this excellent work. While reading the paper, one point remained unclear to me: how the OCR annotations are actually incorporated into training.
From the paper, I understand the data-collection part. However, the paper does not seem to explicitly describe how these OCR-derived tags are optimized in the training objective.
OCR supervision uses two image datasets that naturally contain embedded text: OBELICS and Zero250M.
These images are not paired with text captions. Instead, PaddleOCR extracts the visible text and converts it into classification labels.
Each image is processed offline (before training): PaddleOCR detects and recognizes the visible text, the recognized strings are segmented into words, and each word is mapped to a vocabulary ID.
Key point: The segmented words are mapped to a global vocabulary of 365,187 classes. Each image gets exactly 100 tags from this vocabulary. These 100 tags are stored as the image's multi-label annotation in the .rec training file.
Handling variable word counts: In practice, PaddleOCR may extract fewer or more than 100 words from an image. If fewer than 100 words are detected, oversampling is applied — existing tags are duplicated to fill the 100 slots. If more than 100 words are detected, downsampling is applied — a random subset of 100 tags is selected. This ensures every image has a fixed-size label vector of exactly 100.
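A minimal sketch of this padding/truncation rule (the helper name and seeding are hypothetical; the repo's actual preprocessing code may differ):

```python
import random

def normalize_tags(tag_ids, target=100, seed=None):
    """Pad or truncate a list of OCR word IDs to exactly `target` tags.

    Illustrative helper for the oversampling/downsampling rule described
    above — not the repo's preprocessing code.
    """
    rng = random.Random(seed)
    if len(tag_ids) >= target:
        # Downsample: keep a random subset of `target` tags.
        return rng.sample(tag_ids, target)
    # Oversample: duplicate existing tags to fill the remaining slots.
    return tag_ids + [rng.choice(tag_ids) for _ in range(target - len(tag_ids))]

labels = normalize_tags([17, 42, 99], target=100, seed=0)  # 3 words -> 100 tags
```

Either way, every image ends up with a fixed-length label vector, which is what lets the labels be stored as a dense field in the `.rec` file.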
This is the key innovation for OCR training. Unlike standard single-label classification, each image has 100 labels simultaneously. The system uses a modified Partial FC that handles multi-label training through parallel independent softmax losses.
Each image has 100 stored OCR tags (word IDs from the 365K vocabulary):
Randomly shuffle all 100, keep first 8. Different every iteration:
```python
# fused_partial_fc_v2_multi_res.py, lines 180-183
noise = torch.rand(batch_size, random_diff, device="cuda")  # random_diff = 100
ids_shuffle = torch.argsort(noise, dim=1)
ids_keep = ids_shuffle[:, :8]
local_labels = torch.gather(local_labels, dim=1, index=ids_keep)  # [B, 8]
```
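The argsort-of-noise trick is just a fresh random permutation per row; a self-contained NumPy equivalent (illustrative only, independent of the repo's CUDA path):

```python
import numpy as np

def sample_labels(local_labels, keep=8, rng=None):
    """Keep a random `keep`-subset of each image's stored tags — a NumPy
    re-implementation of the argsort(rand) trick (illustrative only)."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.random(local_labels.shape)          # one uniform draw per slot
    ids_keep = np.argsort(noise, axis=1)[:, :keep]  # random permutation, keep first 8
    return np.take_along_axis(local_labels, ids_keep, axis=1)  # [B, keep]

stored = np.tile(np.arange(100), (4, 1))  # 4 images x 100 stored OCR tags
kept = sample_labels(stored, keep=8, rng=np.random.default_rng(0))
```

Because the subset is redrawn every iteration, the model eventually sees all 100 tags of an image across epochs while each forward pass only pays for 8 classifier columns.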
Each of the 8 labels gets its own independent classifier. For each label i:
Each label is fed to a Partial FC softmax with `sample_rate=0.1`: the positive class center is always kept, a random 10% of the remaining centers are sampled as negatives, and the other 90% are skipped. An ArcFace margin is added to the positive logit, the logits are scaled, and softmax cross-entropy is computed. Each of the 8 label columns runs this procedure independently.
The 8 independent CE losses are averaged:
```python
# fused_partial_fc_v2_multi_res.py, line 497
loss = (loss_0 + loss_1 + loss_2 + loss_3 + loss_4 + loss_5 + loss_6 + loss_7) / 8.0
```
Why 8 parallel classifiers instead of multi-label BCE?
Each label gets a full softmax over the entire vocabulary (~365K classes). This forces the model to discriminate each OCR word against all other possible words — much harder than binary "present/not present". The ArcFace margin further pushes embeddings apart in angular space, producing more discriminative representations.
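As a sketch of the margin mechanics (the margin and scale values here are illustrative ArcFace defaults, not necessarily the repo's config):

```python
import numpy as np

def arcface_logits(cos_theta, labels, margin=0.5, scale=64.0):
    """Add the ArcFace angular margin to the positive class only:
    cos(theta + m) for the target logit, cos(theta) elsewhere, then scale.
    cos_theta: [B, C] cosine similarities between embeddings and centers.
    Illustrative sketch; margin=0.5, scale=64 are common defaults.
    """
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    out = cos_theta.copy()
    rows = np.arange(len(labels))
    # Penalize the positive logit: same angle now scores lower, so the
    # embedding must move closer to its center to compensate.
    out[rows, labels] = np.cos(theta[rows, labels] + margin)
    return scale * out
```

At equal cosine similarity, the target class now scores strictly lower than a non-target class, which is exactly the pressure that tightens each word's embedding cluster in angular space.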
In Stage 2, three data heads share a single ViT backbone. Each head has its own Partial FC classifier with separate class centers.
- Image head: `num_classes = 2,000,000`, `random_diff = 10`, `batch_size = 16`
- OCR head: `num_classes = 365,187`, `random_diff = 100`, `batch_size = 8`
- Video head: `batch_size = 16`
| Config | Image Head | OCR Head | Video Head |
|---|---|---|---|
| Data Source | COYO / LAION | OBELICS / Zero250M | HowTo100M / Panda-70M / K710 |
| num_classes | 2,000,000 | 365,187 | Separate |
| random_diff | 10 | 100 | N/A |
| Labels sampled per image | 8 of 10 | 8 of 100 | 8 |
| sample_rate (PFC) | 0.1 | 0.1 | 0.1 |
| Input processing | 2D image (448x448) | 2D image (448x448) | Video (8x224x224) |
| loss_weight | 1.0 | 1.0 | 1.0 |
```python
# training/train.py, lines 641-657
for head_id, pfc in enumerate(list_module_pfc):
    head_embedding = list_embedding[head_id]         # [B, 1024]
    head_label = list_data_batch[head_id]["labels"]  # [B, random_diff]
    head_label = head_label[:, label_select : label_select + random_diff]
    head_loss = pfc(head_embedding, head_label, random_diff) * loss_weight
    list_loss.append(head_loss)
scaled_loss = sum(list_loss) / backward_passes_per_step
scaled_loss.backward()
```
Stage 1: Train the backbone on COYO/LAION images with 2M cluster centers. The ViT learns general visual representations.
Stage 2: Add the OCR and Video heads. The OCR head forces the encoder to become text-aware — by classifying word-level tags, the ViT learns to represent text regions with high fidelity. This is why OneVision-Encoder excels at DocVQA, ChartQA, and OCRBench despite being a general-purpose vision encoder.
The OCR tags are not used for text decoding. They serve as a discriminative classification signal: by forcing the encoder to distinguish between 365K possible words for each detected text token, the learned representations become highly sensitive to text content. This is fundamentally different from contrastive image-text approaches (like CLIP) which only align global image and sentence embeddings.