Codec-Guided Preprocessing Pipeline

Interactive explainer for Issue #112 — uniform temporal sampling + sparse patch selection

💬 Issue #112 opened by RoyChen

Hi, I am currently reviewing the data preprocessing pipeline and had a question regarding tools/tools_for_hevc/step3_generate_video_mv_residual_index.py. Specifically, I noticed that the script does not explicitly define a fixed interval for sampling full frames (Intra-frames). Could you please clarify:

  1. Is there a default interval being used (e.g., every 32 or 64 frames)?
  2. Does the script process the video entirely without full-frame sampling, or does it rely on the GOP structure predefined during the HEVC encoding step?

Understanding this sampling frequency is critical for me to align the training data correctly with the model architecture.

Short Answer

A friendly way to read the pipeline is this: it is built around 64 uniformly sampled temporal positions plus sparse patch selection. In other words, Step 3 is mainly producing a top-k patch index that tells us where the informative content is, while the codec-side GOP setting stays in the background to keep motion-vector and residual extraction stable.

Direct Answers to the Issue

Question 1

Is there a fixed I-frame interval like 32 or 64?

A more accurate way to think about it is that Step 3 does not hard-code a fixed full-frame interval such as 32 or 64. Instead, it builds 64 uniformly spaced sample positions over the whole video using np.linspace(..., num=seq_len), so the actual spacing naturally depends on the video length.
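A minimal sketch of that behavior (the helper name sample_positions is illustrative, not from the script; only the linspace call mirrors Step 3):

```python
import numpy as np

def sample_positions(duration: int, seq_len: int = 64) -> list[int]:
    # Uniformly spaced frame indices over the whole video, as in Step 3:
    # frame_id_list = np.linspace(0, duration - 1, num=seq_len, dtype=int).tolist()
    return np.linspace(0, duration - 1, num=seq_len, dtype=int).tolist()

# The spacing adapts to the length: roughly 2 frames apart for a 128-frame
# clip, roughly 20 frames apart for a 1280-frame clip, always 64 positions.
print(sample_positions(128)[:5])
print(sample_positions(1280)[:5])
```

Note that the first and last positions always land on frame 0 and frame duration - 1, so the whole video is covered regardless of length.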

Question 2

Does Step 3 rely on the HEVC GOP structure?

Yes, but in a fairly gentle sense: Step 3 reads MV and residual from an already encoded HEVC stream, so the GOP structure matters as a compressed-domain prerequisite. At the same time, it is still helpful to separate that from the training-side temporal layout, which is organized by the 64 sampled positions.

So if you want one compact mental model, it is usually safest to think: first sample 64 temporal positions, then use those positions to guide top-k patch selection.

What the Model Actually Receives

RGB video + top-k patch index -> sparse patch input
  • The input is not a dense stack of decoded frames.
  • Instead, the input pair is: raw video + top-k patch index (produced by Step 3).
  • During training: decode video → extract patches → keep only patches selected by the index → feed them into the model.

In other words, Step 3 is a scout that looks at compressed-domain signals (MV + residual) to decide which space-time patches are informative. The actual pixel data is decoded later, and only the chosen patches are retained.
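Under illustrative assumptions (the shapes, patch size, and tensor layout below are made up; the repository's actual gather step may differ), the "keep only the chosen patches" idea can be sketched as:

```python
import numpy as np

# Hypothetical shapes: 64 sampled positions, 224x224 RGB frames,
# 16x16 patches, 32 patches kept per position.
T, H, W, C = 64, 224, 224, 3
patch_size, K = 16, 32
ph = pw = H // patch_size

frames = np.zeros((T, H, W, C), dtype=np.float32)  # decoded RGB at sampled positions
vis_idx = np.tile(np.arange(K), (T, 1))            # stand-in for Step 3's .visidx.npy

# cut each frame into non-overlapping patches: (T, ph*pw, patch, patch, C)
patches = frames.reshape(T, ph, patch_size, pw, patch_size, C)
patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(T, ph * pw, patch_size, patch_size, C)

# gather only the patches named by the selection index
selected = np.take_along_axis(patches, vis_idx[:, :, None, None, None], axis=1)
print(selected.shape)  # (64, 32, 16, 16, 3): sparse patches, not dense frames
```

The point of the sketch is the last line: the model-facing tensor is indexed by the selection signal, so most of the decoded pixels are discarded immediately.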

Temporal Sampling

  • seq_len = 64 — every video is represented by 64 temporal positions.
  • These positions are uniformly spaced across the entire video duration.
  • Sampling is independent of GOP boundaries.
  • Goal: build a fixed-length training representation regardless of original video length.
If a video has duration frames, the average spacing is approximately duration / 64 frames, so it changes from one video to another.

(Figure: 64 sampled positions drawn uniformly from a 320-frame video, shown along the temporal scaffold.)

Pipeline Overview

1. Raw Video -> HEVC Transcoding: the codec-side setup fixes GOP=16 so motion vectors and residuals stay stable and easy to read later.
2. Uniform Temporal Sampling (64 positions): each video is reduced to a fixed temporal scaffold so later selection happens on a consistent sequence length.
3. MV / Residual Extraction: compressed-domain signals are extracted only at the sampled positions instead of decoding every frame densely.
4. Score Map Computation: motion and residual cues are fused into a spatial-temporal importance map.
5. Top-K Patch Index Generation: the pipeline converts the score map into sparse patch indices (saved as .visidx.npy) rather than dense frame tensors.
6. Decode Video -> Keep Selected Patches: RGB pixels are only retained where the sparse selection index says they matter.
7. Model Input: the final input is a sparse representation guided by compressed-domain selection.

GOP vs. Sampled Sequence

Codec-side GOP

  • Defines the compressed-domain representation
  • Determines how MV and residual are structured in the bitstream
  • Background setting (default GOP = 16)

Training-side Sampled Sequence

  • Defines the temporal scaffold for patch selection
  • Fixed 64 positions, uniform over video length
  • Operates independently of GOP boundaries

Key takeaway: The codec GOP defines the compressed-domain representation, while the sampled sequence defines the temporal scaffold used for patch selection. They operate at different levels.
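The independence is easy to check numerically (the 320-frame length is illustrative):

```python
import numpy as np

# 64 uniform positions over a 320-frame video, checked against GOP=16 boundaries.
positions = np.linspace(0, 319, num=64, dtype=int)
gop_aligned = [int(p) for p in positions if p % 16 == 0]
print(f"{len(gop_aligned)} of {len(positions)} sampled positions land on a GOP boundary")
```

Most sampled positions fall between I-frames, which is fine: the scaffold never needs to coincide with the codec's keyframe grid.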

Local Anchor vs. Real I-frame

1. Real codec I-frame

Created by the HEVC encoder in Step 2. With the default config, the script requests GOP_SIZE=16 and keyint=16:min-keyint=16.

Role: define compressed-stream structure

2. Sampled temporal position

Constructed in Step 3 by np.linspace. These 64 positions are the training-side temporal scaffold, independent of GOP boundaries.

Role: define where we evaluate patch importance

3. Local anchor

A local convention in Step 3: I_pos = {0}. It simply forces the first sampled position to have zero fusion energy.

Role: initialization convention, not real keyframe sampling

These three concepts live at different levels. The confusion in Issue #112 comes from treating the local anchor as if it were a real codec I-frame. The code does not do that.

Why Not Sample Every I-frame?

  • I-frame spacing depends on encoding parameters and can vary if settings change.
  • It does not guarantee a fixed-length input (videos of different lengths would yield different numbers of I-frames).
  • Sampling I-frames is not aligned with the goal of patch selection.
  • Uniform sampling + scoring is simpler, more flexible, and length-invariant.

Concrete Example

  • Video length: 320 frames
  • Temporal sampling: seq_len = 64 (one position roughly every 5 frames)
  • Local anchor: position 0 (first sampled frame) → energy set to 0
  • Step 3 output: top-k patch index across the 64 sampled positions
  • Training usage: decode the same 64 positions, keep only selected patches

The right interpretation

Even if the original 320-frame video contains multiple true I-frames in the codec-side GOP, the training representation is still organized as 64 temporal positions for sparse selection. The practical rule is not "one dense full frame every 16 steps," but "64 sampled positions followed by top-k patch selection."

What Step 3 Actually Does

tools/tools_for_hevc/step3_generate_video_mv_residual_index.py

  • Line 256: constructs uniformly sampled temporal positions across the video.
    frame_id_list = np.linspace(0, duration - 1, num=seq_len, dtype=int).tolist()
  • Line 258: hard-codes the first sampled position as the local anchor.
    I_pos = {0}
  • Lines 306–308: skip MV/residual fusion for the anchor by setting energy to 0.
    if i in I_pos:
        fused_list[i] = np.zeros((H, W), dtype=np.float32)
        continue
  • Line 399: generates the top-k patch index.
    vis_idx = mask_by_residual_topk(res_torch, K, patch_size)
  • Lines 619–620: saves the index as .visidx.npy — a selection signal, not RGB tensors.

Step 3 does not re-encode the video or insert new keyframes. It also does not sample real I-frames explicitly; it only uses I_pos = {0} as a local convention for the sampled clip.
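Putting the quoted pieces together, here is a self-contained sketch of the selection logic. The fusion input is random noise standing in for MV/residual maps, and mask_by_residual_topk is a hypothetical re-implementation, not the repository's function:

```python
import numpy as np

seq_len, patch_size, K = 64, 16, 32
H, W, duration = 224, 224, 320
rng = np.random.default_rng(0)

# Line 256 analogue: uniform temporal positions over the whole video
frame_id_list = np.linspace(0, duration - 1, num=seq_len, dtype=int).tolist()
I_pos = {0}  # line 258 analogue: local anchor convention

fused_list = []
for i, fid in enumerate(frame_id_list):
    if i in I_pos:
        # lines 306-308 analogue: zero fusion energy at the anchor
        fused_list.append(np.zeros((H, W), dtype=np.float32))
        continue
    # stand-in for MV/residual fusion at this sampled position
    fused_list.append(rng.random((H, W)).astype(np.float32))

def mask_by_residual_topk(fused: np.ndarray, K: int, patch_size: int) -> np.ndarray:
    """Hypothetical: keep the K highest-energy patches per sampled position."""
    T, H, W = fused.shape
    ph, pw = H // patch_size, W // patch_size
    energy = fused.reshape(T, ph, patch_size, pw, patch_size).sum(axis=(2, 4))
    return np.argsort(-energy.reshape(T, ph * pw), axis=1)[:, :K]

vis_idx = mask_by_residual_topk(np.stack(fused_list), K, patch_size)
np.save("example.visidx.npy", vis_idx)  # a selection signal, not RGB tensors
```

The output is just a (64, K) table of patch indices, which matches the description of the .visidx.npy artifact above.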

Codec-side Background: GOP Setting

For completeness, the HEVC transcoding script tools/tools_for_hevc/step2_convert_videos2hevc.py uses a default GOP size of 16 (line 42) and passes keyint=16:min-keyint=16 to FFmpeg (line 117). This is a background parameter that stabilizes MV/residual extraction; it does not dictate the training sampling strategy.

FAQ for Readers

  • Does Step 3 re-encode the video? No. Re-encoding happens in Step 2. Step 3 only reads the existing HEVC bitstream and computes selection indices.
  • Does the model receive every sampled frame densely? No. The sampled positions are used to score space-time regions, and only the selected patches are kept.
  • If the video is longer, does the interval become larger? Yes. Because the code samples 64 positions over the whole duration, longer videos naturally produce a larger average gap between sampled positions.

Looking Ahead

For the next stage, we are thinking about moving from fully uniform candidate-frame selection to a more codec-aware strategy. The intuition is that different temporal regions do not carry the same amount of compressed-domain activity, so it is reasonable to spend more sampling budget on the bins where the packet-size signal is stronger.

Concretely, we can first summarize pkt_size over short time bins, turn those energies into a cumulative distribution, and then sample frame locations according to that distribution. This still keeps the pipeline sparse, but it makes the selected frames more likely to land in the segments that contain stronger motion or structural change.
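A minimal sketch of that idea, with made-up bin energies (how pkt_size is summarized, the bin granularity, and the per-bin uniform offset are all assumptions, not a committed design):

```python
import numpy as np

rng = np.random.default_rng(0)
pkt_energy = np.array([1.0, 1.0, 2.0, 8.0, 8.0, 4.0, 1.0, 1.0])  # per-bin pkt_size energy
frames_per_bin = 40  # 8 bins * 40 frames = a 320-frame video
seq_len = 64

# turn bin energies into a sampling distribution, then a CDF
probs = pkt_energy / pkt_energy.sum()
cdf = np.cumsum(probs)

# inverse-CDF sampling: draw a bin per sample, then a frame offset within it
u = rng.random(seq_len)
bins = np.searchsorted(cdf, u)
offsets = rng.integers(0, frames_per_bin, size=seq_len)
frame_ids = np.sort(bins * frames_per_bin + offsets)
# high-energy bins (here, bins 3-4) receive more samples on average
```

Compared with np.linspace, this still produces exactly seq_len sparse positions, but their density now follows the compressed-domain activity instead of being constant.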

The sketch below is only meant to convey the intuition: the bars represent temporal bins with different compressed-domain energy, the cyan curve shows the accumulated sampling preference, and the yellow markers indicate that more samples are naturally drawn from the higher-energy part of the video.

Takeaway for Training Alignment

  • Think of the training-side sequence as 64 uniformly sampled positions.
  • Think of Step 3 as producing a top-k patch index, not dense full-frame input.
  • Treat the codec GOP as a representation constraint that stabilizes MV/residual extraction, not as the direct temporal layout of model input.