Video LMMs · Agentic RL · Long Video Understanding

ParaVT: Taming the Tool Prior Paradox
for Parallel Tool Use in Agentic Video Reinforcement Learning

1 MiroMind, 2 Nanyang Technological University, 3 The University of Hong Kong, 4 Hong Kong University of Science and Technology (Guangzhou), 5 Tsinghua University, 6 LMMs-Lab
Corresponding authors

Can a large multimodal model learn to natively invoke tool calls in parallel under agentic RL — when the very prior that enables tool use also destabilizes it?

69.4
VideoMME w/ sub — open-source 7-8B SOTA
+7.9
average gain over Qwen3-VL-8B base (%)
0.64
peak parallel tool-call format compliance
6 of 7
benchmarks topped at 7-8B
0
SFT cold-start samples
4.4K
PARA-GRPO RL samples

Motivation

Long-video understanding is increasingly framed as agentic video reasoning: a large multimodal model (LMM) is post-trained with reinforcement learning to invoke video-processing tools. Prior native-RL methods, including our earlier LongVT (CVPR 2026), dispatch these tool calls sequentially, one per turn. This makes them brittle to a single mis-localization, prone to multi-turn context drift, and linear in inference cost.

Failure modes
Two failure modes that share a cause: Format Fragility and Tool Necessity Gap
Two failure modes of vanilla GRPO on a tool-native LMM. Format Fragility (left): under temperature sampling, the policy still generates reasoning content, but its SFT-learned <think>/<tool_call>/<answer> boundary closures decay, leaving rollouts that look almost parseable but are mis-tagged. Tool Necessity Gap (right): with 64 overview frames, many prompts can be answered without tools, so the GRPO advantage between calling and skipping is near zero. The parallel tool-call rate then collapses within a few RL steps while accuracy oscillates around a flat level.
Tool Prior Paradox
Cross-model contrast: a weaker-prior LMM keeps format stable but RL elicits zero tool calls
The shared cause: pretrained tool priors. A weaker-prior LMM stays format-stable but RL elicits zero tool calls; a stronger-prior LMM explores tools but loses format. Prior strength drives both failure modes, so a workable recipe must preserve format and credit tool use at the same time. The contrast pairs Qwen2.5-VL-7B (weak prior: fτ=0.85 but 0% tool calls over 520 RL steps) with Qwen3-VL-8B (strong prior: 14.4% tool usage, but fτ collapses quickly to 0.13). PARA-GRPO recovers fτ to 0.41 on the latter, isolating the prior as the shared driver of both failures.
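To make Format Fragility concrete, the sketch below checks whether a rollout keeps well-formed <think>/<tool_call>/<answer> boundaries. Only the tag names come from the text above; the function name `is_parseable` and the specific rules (the rollout starts with <think>, and every tag is closed before the next one opens) are assumptions for illustration, not the authors' parser.

```python
import re

# Matches opening and closing structural tags, e.g. <think> or </answer>.
STRUCT_TAG = re.compile(r"<(/?)(think|tool_call|answer)>")

def is_parseable(rollout: str) -> bool:
    """Return True if the rollout starts with <think> and all tags are balanced and non-nested."""
    open_tag = None
    for match in STRUCT_TAG.finditer(rollout):
        closing, name = match.group(1) == "/", match.group(2)
        if not closing:
            if open_tag is not None:   # a new block opened before the previous one closed
                return False
            open_tag = name
        else:
            if open_tag != name:       # closing tag does not match the open block
                return False
            open_tag = None
    return open_tag is None and rollout.lstrip().startswith("<think>")

good = "<think>look at 10-20s</think><tool_call>crop</tool_call><answer>B</answer>"
mis_tagged = "<think>look at 10-20s<tool_call>crop</tool_call><answer>B</answer>"
unclosed = "<think>x</think><answer>B"
```

The `mis_tagged` example still contains all of its reasoning content; only the missing `</think>` closure makes it unparseable, which is the "almost parseable but mis-tagged" pattern described in the caption.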

We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling: a main agent emits multiple temporal-window crops in a single turn, dispatches them to weight-sharing sub-agents, and aggregates the parallel evidence into a final answer. Applying standard GRPO to ParaVT surfaces two coupled failures driven by the same pretrained tool prior — Format Fragility (the SFT-learned structural tags collapse under temperature sampling) and the Tool Necessity Gap (the skip-tool reward shortcut). We name this trade-off the Tool Prior Paradox and tame it with PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO): a targeted format reward applied only at the structural-token positions most prone to collapse, paired with a per-prompt frame-budget randomization that lets calling the tool earn measurable RL credit. Across seven evaluation splits spanning six long-video benchmarks plus a temporal-grounding split, ParaVT sets a new open-source 7-8B SOTA on six of the seven, improving over the Qwen3-VL-8B base by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64.

01 / Method

Parallel hierarchical agent + PARA-GRPO

A main agent emits one or more <tool_call> blocks in a single turn; each call is handled by an independent sub-agent that shares weights with the main agent and returns a short summary; the main agent aggregates the summaries into a final answer. PARA-GRPO targets the structural-format collapse and the skip-tool reward shortcut that vanilla GRPO surfaces on a tool-native LMM.

Architecture
ParaVT architecture: sequential vs. parallel tool dispatch
Sequential (left) vs. Parallel (right) tool dispatch. ParaVT replaces the one-tool-per-turn pattern with single-turn parallel dispatch: the main agent emits all required crop_video calls in one turn, each is processed by an independent sub-agent, and a gather-and-reason pass produces the final answer. Visual-token density inside one turn stays bounded as the number of crops grows.
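The single-turn parallel dispatch can be sketched as a parse, fan-out, and gather loop. This is a minimal illustration under stated assumptions: the JSON payload shape (`name`, `arguments`, `start`, `end`), the helper names, and the placeholder `sub_agent` are hypothetical. Only the <tool_call> tag and the crop_video tool name appear in the text above, and a real sub-agent would run the weight-sharing model on the cropped window.

```python
import json
import re
from concurrent.futures import ThreadPoolExecutor

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_tool_calls(turn: str) -> list[dict]:
    """Extract every JSON payload wrapped in <tool_call> tags from one turn."""
    return [json.loads(body) for body in TOOL_CALL.findall(turn)]

def sub_agent(call: dict) -> str:
    """Placeholder sub-agent: a real one would crop the video and summarize it."""
    args = call["arguments"]
    return f"summary of [{args['start']}s, {args['end']}s]"

def dispatch_parallel(turn: str) -> list[str]:
    """Run one sub-agent per tool call concurrently; results keep call order."""
    calls = parse_tool_calls(turn)
    if not calls:
        return []
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        return list(pool.map(sub_agent, calls))

turn = (
    "<think>Two windows look relevant.</think>"
    '<tool_call>{"name": "crop_video", "arguments": {"start": 10, "end": 20}}</tool_call>'
    '<tool_call>{"name": "crop_video", "arguments": {"start": 95, "end": 110}}</tool_call>'
)
summaries = dispatch_parallel(turn)
```

The main agent would then append `summaries` to its context for the gather-and-reason pass. Because each sub-agent returns a short text summary, the main agent's context grows with the number of summaries rather than with the visual tokens of every crop.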
$$R_{\mathrm{fmt}}(o) = \sum_{t \in \mathrm{struct}(o)} \log \pi_\theta(o_t \mid o_{<t})$$
Exploration Anchoring. Two cooperating mechanisms repair the structural-boundary collapse without restricting reasoning content. (i) Constrained Generation pins a Think Prefix (<think>) on every response and credits the presence of a final <answer> block via an Answer Suffix term, ruling out blind direct answers and tool-only rollouts. (ii) Selective Anchoring applies the targeted format reward only at the structural-token positions struct(o) most prone to collapse, leaving tool-call content tokens unconstrained.
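A minimal sketch of the targeted format reward: sum the policy's token log-probabilities over the structural positions only, leaving every other token out of the sum. The function name, the toy log-probabilities, and the rule that marks any token starting with `<` as structural are assumptions for illustration; the source does not specify how struct(o) is computed.

```python
def format_reward(token_logprobs, struct_mask):
    """Sum log pi(o_t | o_<t) over positions flagged as structural tokens."""
    if len(token_logprobs) != len(struct_mask):
        raise ValueError("log-probs and mask must have the same length")
    return sum(lp for lp, is_struct in zip(token_logprobs, struct_mask) if is_struct)

# Toy rollout with hypothetical per-token log-probabilities.
tokens = ["<think>", "reasoning", "</think>", "<answer>", "B", "</answer>"]
logprobs = [-0.1, -2.0, -0.3, -0.2, -1.5, -0.4]

# Assumed rule: tag tokens are the structural positions.
mask = [tok.startswith("<") for tok in tokens]
r_fmt = format_reward(logprobs, mask)
```

In this toy case the low-probability content tokens ("reasoning", "B") do not affect `r_fmt`, which reflects the intent that reasoning and tool-call content stay unconstrained while tag boundaries are anchored.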
$$K \sim \mathrm{Uniform}(\{4, 8, 16, 32, 64\})$$
nFrames Gating. Randomize the per-prompt overview frame budget K so that, for a controllable fraction of GRPO groups, calling the tool earns measurable credit over skipping on prompts where the overview alone is insufficient. The mixture of budgets in a batch keeps the cross-rollout advantage of tool use non-degenerate, creating the gradient signal that the Tool Necessity Gap otherwise eliminates. Exploration Anchoring must take effect first so the gating gradient is credited to parseable tool-using rollouts.
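The sketch below illustrates why a degenerate reward group gives no gradient signal and how a frame-limited group restores one. It assumes the common group-relative advantage (reward minus group mean, divided by group standard deviation); the source does not state the exact normalization, and the reward values and helper names are hypothetical.

```python
import random
import statistics

FRAME_BUDGETS = (4, 8, 16, 32, 64)

def sample_frame_budget(rng: random.Random) -> int:
    """Draw the per-prompt overview frame budget K uniformly from the fixed set."""
    return rng.choice(FRAME_BUDGETS)

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: center by the group mean, scale by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

rng = random.Random(0)
budgets = [sample_frame_budget(rng) for _ in range(8)]

# Overview suffices: every rollout is correct, so no rollout is preferred.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])

# Small budget: suppose only the first two (tool-using) rollouts answer correctly.
gated = group_advantages([1.0, 1.0, 0.0, 0.0])
```

In the `flat` group all advantages are zero, so calling and skipping the tool are indistinguishable to the optimizer. In the `gated` group the tool-using rollouts receive positive advantage and the others negative, which is the non-degenerate signal the gating is meant to create.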
02 / Results

SOTA results across six long-video benchmarks

All open-source rows were re-evaluated under a unified protocol (image_url channel, 64 frames, per-baseline native prompt). Best result in bold; underlined marks ParaVT's value when it is not the best column-wise. * withheld due to benchmark–training-data overlap. An absent Charades-STA value for an open agentic model means its native tool-call schema could not be reconciled with the Charades-STA grounding output.

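| Model | VideoMME w/o sub | VideoMME w/ sub | LongVideoBench | LVBench | MLVU | MMVU | Charades-STA test (mIoU) |
|---|---|---|---|---|---|---|---|
| Proprietary LMMs: best-setting numbers from official reports | | | | | | | |
| GPT-4o | 71.9 | 77.2 | 66.7 | 34.7 | 64.6 | 66.7 | - |
| Gemini 1.5 Pro | 75.0 | 81.3 | 64.4 | 33.1 | 74.3 | 65.8 | - |
| Open Instruct LMMs: direct answer | | | | | | | |
| Qwen2.5-VL-7B | 55.7 | 64.5 | 46.4 | 32.2 | 47.8 | 65.4 | 31.6 |
| Open Reasoning LMMs: <think><answer> | | | | | | | |
| Video-R1-7B | 57.6 | 66.0 | 57.4 | 36.9 | 61.6 | 61.3 | 25.4 |
| VideoChat-R1-7B | 50.4 | 58.2 | 49.2 | 23.8 | 58.7 | 65.0 | 31.5 |
| VideoRFT-7B | 58.5 | 65.6 | 55.1 | 38.0 | 44.9 | 42.7 | 18.7 |
| Time-R1-7B | 58.9 | 66.2 | 56.0 | 38.2 | 60.5 | 63.4 | 34.7 |
| ReWatch-R1-7B | 58.8 | 65.0 | 53.6 | 38.5 | 60.1 | 59.8 | 20.2 |
| Video-Thinker-7B | 61.9 | 65.3 | 56.0 | * | 65.2 | 64.5 | 29.0 |
| Open Agentic LMMs: <think><tool_call><answer> | | | | | | | |
| Qwen3-VL-8B | 59.9 | 68.4 | 52.2 | 33.1 | 58.3 | 68.0 | 49.3 |
| Conan-7B | 55.5 | 62.8 | 54.5 | 38.2 | 59.2 | 64.0 | 25.4 |
| LongVT-RFT-7B | 59.5 | 66.0 | 54.7 | 37.9 | 59.4 | 63.4 | 23.4 |
| SAGE-7B | 44.1 | 52.4 | 37.4 | 31.8 | 49.7 | 55.7 | 28.9 |
| VideoZoomer-7B | 45.3 | 48.3 | 39.6 | 22.9 | 46.2 | 61.6 | - |
| ParaVT-8B (Ours) | 62.1 | 69.4 | 60.4 | 39.8 | 65.0 | 68.6 | 50.1 |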

Higher is better. Among the open-source rows, ParaVT tops 6 of 7 columns and is within 0.2 pt of the best on MLVU. Averaged over the seven columns, ParaVT scores 59.3 vs. 55.6 for the Qwen3-VL-8B base; the +7.9% figure is the mean of the seven per-column relative gains.
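The summary figures can be reproduced from the ParaVT-8B and Qwen3-VL-8B rows of the table. The short check below is only arithmetic on the reported scores; the variable names are illustrative. It also shows that averaging per-column relative gains and taking the ratio of the two averages give different numbers, so the two should not be conflated.

```python
# Scores in column order: VideoMME w/o sub, VideoMME w/ sub, LongVideoBench,
# LVBench, MLVU, MMVU, Charades-STA (mIoU).
paravt = [62.1, 69.4, 60.4, 39.8, 65.0, 68.6, 50.1]
base = [59.9, 68.4, 52.2, 33.1, 58.3, 68.0, 49.3]

avg_paravt = sum(paravt) / len(paravt)
avg_base = sum(base) / len(base)

# Mean of the per-column relative gains, in percent.
mean_relative_gain = sum((p / b - 1) * 100 for p, b in zip(paravt, base)) / len(base)

# Relative gain of the averaged score, in percent (a different quantity).
ratio_of_averages = (avg_paravt / avg_base - 1) * 100

print(f"averages: {avg_paravt:.1f} vs {avg_base:.1f}")
print(f"mean per-column relative gain: {mean_relative_gain:.1f}%")
print(f"relative gain of the average: {ratio_of_averages:.1f}%")
```

The per-column mean weights the low-scoring columns (LVBench, LongVideoBench) more heavily, since the same absolute gain is a larger fraction of a smaller base score.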

03 / Ablation

PARA-GRPO restores format compliance while incentivizing tool exploration

Each row reports mean training-time format reward fτ at sampling temperature τ=0.7 and mean training-time tool-call rate per rollout κ. Curves on the right plot the same four runs step-by-step. Best result in bold; rows shaded grey mark the full PARA-GRPO recipe.

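| Setting | fτ | κ | VideoMME w/o sub | VideoMME w/ sub | LVBench | MLVU |
|---|---|---|---|---|---|---|
| (A) Training Stage | | | | | | |
| Qwen3-VL-8B | 0.03 | 0.45 | 59.9 | 68.4 | 33.1 | 58.3 |
| + SFT Cold-Start | 0.13 | 2.50 | 60.7 | 69.0 | 39.1 | 63.7 |
| + SFT + GRPO | 0.13 | 0.02 | 62.0 | 68.6 | 39.3 | 64.5 |
| + SFT + PARA-GRPO | 0.41 | 0.21 | 62.1 | 69.4 | 39.8 | 65.0 |
| (B) Component Effectiveness | | | | | | |
| SFT + GRPO | 0.13 | 0.02 | 62.0 | 68.6 | 39.3 | 64.5 |
| + Exploration Anchoring | 0.35 | 0.19 | 61.7 | 68.7 | 39.3 | 64.1 |
| + nFrames Gating | 0.10 | 1.36 | 61.3 | 68.7 | 39.1 | 63.6 |
| Full PARA-GRPO | 0.41 | 0.21 | 62.1 | 69.4 | 39.8 | 65.0 |
| − Tool Reward R_tool | 0.33 | 0.04 | 61.9 | 68.5 | 38.7 | 64.3 |
| − Penalty Term γ | 0.36 | 0.27 | 61.6 | 69.0 | 38.8 | 64.5 |
| (C) Dispatch Mode | | | | | | |
| Sequential Tool Calling | - | - | 61.4 | 68.8 | 37.5 | 64.1 |
| Parallel Tool Calling | - | - | 62.1 | 69.4 | 39.8 | 65.0 |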
Training dynamics
Training curves: vanilla GRPO format reward falls to 0.13; PARA-GRPO rises to 0.64
Vanilla GRPO vs. PARA-GRPO across RL steps. Vanilla GRPO's format reward collapses within 20 steps; PARA-GRPO lifts it above 0.6 and holds it through the run while keeping the tool-call rate non-degenerate. Reasoning-content tokens stay free to evolve under RL.

Reading the table. Block (A) shows that SFT cold-start mostly transfers tool-call usage (κ: 0.45 → 2.50) but leaves the format reward low (fτ=0.13), and that GRPO on top of cold-start collapses κ to 0.02. PARA-GRPO recovers fτ to 0.41 and κ to 0.21, and posts the best score on every benchmark column. Block (B) confirms neither component alone suffices: Exploration Anchoring on its own holds format up but suppresses tool calls; nFrames Gating on its own keeps tools active but tanks format. Only their composition wins on both axes. Block (C) shows the same trained checkpoint dispatched in parallel beats sequential calling by +0.6 to +2.3 across benchmarks, with no extra training.
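The Block (C) gap can be read directly off the two dispatch-mode rows. The snippet below is plain arithmetic on the reported scores, with illustrative variable names, and lists the per-benchmark gain of parallel over sequential dispatch.

```python
columns = ["VideoMME w/o sub", "VideoMME w/ sub", "LVBench", "MLVU"]
sequential = [61.4, 68.8, 37.5, 64.1]
parallel = [62.1, 69.4, 39.8, 65.0]

# Per-benchmark gain of parallel over sequential dispatch, rounded to one decimal.
gains = {c: round(p - s, 1) for c, s, p in zip(columns, sequential, parallel)}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

The largest gain falls on LVBench and the smallest on VideoMME with subtitles.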

Citation

If you find this project helpful, please consider citing our paper

@misc{yang2026paravt,
  title={{ParaVT}: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning},
  author={Zuhao Yang and Kaichen Zhang and Sudong Wang and Keming Wu and Zhongyu Yang
          and Bo Li and Xiaojuan Qi and Shijian Lu and Xingxuan Li and Lidong Bing},
  year={2026},
  eprint={2605.20342},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}