LLaVA-OneVision Family

A research suite of fully open multimodal models, from foundational vision encoders to the latest generation of video-language models.

Releases

MAY 2026

LLaVA-OneVision-2

Video MLLM with codec-aligned dense input

An 8B-class video MLLM trained with a four-stage progressive pipeline that scales video comprehension from 30-second clips to 15-minute footage. It adds codec-aligned dense input as a new video input mode alongside image input and uniform frame sampling, preserving the native temporal signal. Every dataset, training recipe, and checkpoint ships as part of a fully reproducible release.

Latest Visit →

JAN 2026

OneVision-Encoder

Codec-aligned vision encoder for video and image

A vision backbone built around codec-style patch selection: only the 3% to 25% of patches that carry real motion or semantic change are kept. The kept patches are unified under a shared 3D RoPE, and the encoder is trained with cluster discrimination over 2M concepts. It outperforms Qwen3-ViT and SigLIP2 across 16 image and video benchmarks while using a fraction of the visual tokens.

Released Visit →
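To make the patch-selection idea in the OneVision-Encoder description concrete, here is a minimal sketch that keeps only the patches with the largest change between two frames. It is an illustration under stated assumptions, not the encoder's actual method: the mean absolute frame difference stands in for the codec-style signal, and the names `patch_change_scores`, `select_patches`, and `keep_ratio` are hypothetical.

```python
from typing import Dict, List, Tuple

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def patch_change_scores(prev: Frame, curr: Frame, patch: int) -> Dict[Tuple[int, int], float]:
    """Mean absolute pixel difference per patch between two frames.

    Assumption: frame differencing is a stand-in for the codec-style
    motion/semantic-change signal mentioned in the description.
    """
    rows, cols = len(curr), len(curr[0])
    scores: Dict[Tuple[int, int], float] = {}
    for top in range(0, rows, patch):
        for left in range(0, cols, patch):
            total, count = 0.0, 0
            for r in range(top, min(top + patch, rows)):
                for c in range(left, min(left + patch, cols)):
                    total += abs(curr[r][c] - prev[r][c])
                    count += 1
            scores[(top // patch, left // patch)] = total / count
    return scores


def select_patches(prev: Frame, curr: Frame, patch: int, keep_ratio: float) -> List[Tuple[int, int]]:
    """Return (row, col) indices of the top `keep_ratio` fraction of patches."""
    scores = patch_change_scores(prev, curr, patch)
    k = max(1, round(keep_ratio * len(scores)))  # always keep at least one patch
    ranked = sorted(scores, key=lambda idx: scores[idx], reverse=True)
    return sorted(ranked[:k])


# Usage: a 4x4 frame split into four 2x2 patches; only one patch changes.
previous = [[0.0] * 4 for _ in range(4)]
current = [row[:] for row in previous]
current[2][2] = 1.0
current[3][3] = 1.0
kept = select_patches(previous, current, patch=2, keep_ratio=0.25)
print(kept)
```

A `keep_ratio` between 0.03 and 0.25 would correspond to the 3% to 25% range quoted above; how the real encoder scores and thresholds patches is not specified on this page.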

NOV 2025

LLaVA-OneVision-1.5-RL

Reasoning-focused RL post-training

A reinforcement-learning post-training stage on top of LLaVA-OneVision-1.5. It uses 67K discrepancy-curated examples (high Pass@N, low Pass@1) and a rule-based reward system spanning STEM, grounding, counting, OCR, code, and diagrams. A two-stage curriculum first stabilizes outputs with answer-only RL, then unlocks deeper reasoning with chain-of-thought RL.

Released Visit →
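The discrepancy criterion in the LLaVA-OneVision-1.5-RL description (high Pass@N, low Pass@1) can be sketched as a simple filter over per-example rollout outcomes. This is a hedged illustration, not the project's curation code: the function names, the `max_pass_at_1` threshold, and the use of "at least one correct rollout" as the Pass@N test are all assumptions.

```python
from typing import Dict, List


def pass_at_1(outcomes: List[bool]) -> float:
    """Fraction of rollouts that were correct (estimate of single-sample accuracy)."""
    return sum(outcomes) / len(outcomes)


def pass_at_n(outcomes: List[bool]) -> float:
    """1.0 if any of the N rollouts was correct, else 0.0."""
    return 1.0 if any(outcomes) else 0.0


def curate_by_discrepancy(
    rollouts: Dict[str, List[bool]],
    max_pass_at_1: float = 0.5,  # hypothetical threshold, not from the source
) -> List[str]:
    """Keep examples the model can solve (Pass@N) but usually does not (Pass@1)."""
    kept = []
    for example_id, outcomes in rollouts.items():
        if not outcomes:
            continue  # no rollouts recorded, nothing to measure
        if pass_at_n(outcomes) == 1.0 and pass_at_1(outcomes) <= max_pass_at_1:
            kept.append(example_id)
    return kept


# Usage: only "gap" is solvable yet unreliable, so only it is kept.
rollouts = {
    "easy": [True, True, True, True],     # already reliable
    "hard": [False, False, False, False], # never solved in N tries
    "gap": [True, False, False, False],   # solvable but inconsistent
}
selected = curate_by_discrepancy(rollouts)
print(selected)
```

The intuition is that such examples carry the most learning signal for RL: a correct answer is reachable, so rewards are not uniformly zero, but the model does not yet produce it consistently.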

OCT 2025

LLaVA-OneVision-1.5

Fully open vision-language flagship

An 8B vision-language model trained on a fully open recipe: an 85M-sample concept-balanced pretraining corpus and a 22M-sample instruction set, paired with the in-house RICE-ViT encoder and offline parallel data packing (~11x compression). Stage-1.5 finishes in about 3.7 days on 128 A800 GPUs, and the model matches or beats Qwen2.5-VL on a broad benchmark suite.

Released Visit →
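Offline data packing, mentioned in the LLaVA-OneVision-1.5 description, generally means grouping variable-length samples into fixed-length sequences ahead of training so that less of each sequence is padding. The sketch below uses first-fit decreasing bin packing as one common way to do this. It is an assumption for illustration: the page does not say which packing algorithm the project uses or how the ~11x figure is measured, and this version is single-process rather than parallel.

```python
from typing import List


def pack_first_fit_decreasing(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into packed sequences of at most `max_len` tokens.

    Samples are placed longest first into the first packed sequence that
    still has room; a new sequence is opened when none fits.
    """
    bins: List[list] = []  # each entry: [remaining_capacity, [sample indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, more than max_len={max_len}")
        for b in bins:
            if b[0] >= n:
                b[0] -= n
                b[1].append(i)
                break
        else:
            # No existing sequence has room for this sample.
            bins.append([max_len - n, [i]])
    return [b[1] for b in bins]


# Usage: five samples fit into two packed sequences of up to 10 tokens each.
sample_lengths = [6, 3, 5, 2, 4]
packed = pack_first_fit_decreasing(sample_lengths, max_len=10)
print(packed)
print(len(sample_lengths) / len(packed))  # samples per packed sequence
```

Because packing runs offline, the grouping is computed once and reused across epochs, so the training loop reads sequences that are already full.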

Blog

COMING SOON

Posts coming soon

Deep dives, training notes, and engineering write-ups from the LLaVA-OneVision team will be published here.

Community

Contributors

Ranked by commit count.

Stargazers

Ranked by star time, earliest first.