Interactive explainer for Issue #112 — uniform temporal sampling + sparse patch selection
Hi, I am currently reviewing the data preprocessing pipeline and had a question regarding tools/tools_for_hevc/step3_generate_video_mv_residual_index.py. Specifically, I noticed that the script does not explicitly define a fixed interval for sampling full frames (Intra-frames). Could you please clarify what the intended sampling interval is?
Understanding this sampling frequency is critical for me to align the training data correctly with the model architecture.
A friendly way to read the pipeline is this: it is built around 64 uniformly sampled temporal positions plus sparse patch selection. In other words, Step 3 is mainly producing a top-k patch index that tells us where the informative content is, while the codec-side GOP setting stays in the background to keep motion-vector and residual extraction stable.
A more accurate way to think about it is that Step 3 does not hard-code a fixed full-frame interval such as 32 or 64. Instead, it builds 64 uniformly spaced sample positions over the whole video using np.linspace(..., num=seq_len), so the actual spacing naturally depends on the video length.
Does the codec-side GOP structure still matter, then? Yes, but in a fairly gentle sense: Step 3 reads MV and residual from an already encoded HEVC stream, so the GOP structure matters as a compressed-domain prerequisite. At the same time, it is still helpful to separate that from the training-side temporal layout, which is organized by the 64 sampled positions.
In other words, Step 3 is a scout that looks at compressed-domain signals (MV + residual) to decide which space-time patches are informative. The actual pixel data is decoded later, and only the chosen patches are retained.
For a video of duration frames, the average spacing is approximately (duration - 1) / 63 frames, so it changes from one video to another.
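A quick sketch makes the length-dependent spacing concrete. This reproduces the np.linspace call from Step 3 for a few hypothetical video lengths; only seq_len = 64 is fixed, the spacing is not:

```python
import numpy as np

SEQ_LEN = 64  # fixed temporal scaffold length used by Step 3

def sample_positions(duration, seq_len=SEQ_LEN):
    """Uniformly spaced frame indices over [0, duration - 1], as in Step 3."""
    return np.linspace(0, duration - 1, num=seq_len, dtype=int).tolist()

# The average spacing (duration - 1) / (seq_len - 1) varies with video length.
for duration in (128, 320, 1000):
    pos = sample_positions(duration)
    print(duration, pos[:4], "...", pos[-1], (duration - 1) / (SEQ_LEN - 1))
```

Note that the last sampled index is always duration - 1, because np.linspace includes the endpoint.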
[Animation: 64 sampled positions drawn uniformly from a 320-frame video, highlighted across the temporal scaffold.]
1. Raw Video -> HEVC Transcoding: The codec-side setup fixes GOP=16 so motion vectors and residuals stay stable and easy to read later.
2. Uniform Temporal Sampling (64 positions): Each video is reduced to a fixed temporal scaffold so later selection happens on a consistent sequence length.
3. MV / Residual Extraction: Compressed-domain signals are extracted only at the sampled positions instead of decoding every frame densely.
4. Score Map Computation: Motion and residual cues are fused into a spatial-temporal importance map.
5. Top-K Patch Index Generation: The pipeline converts the score map into sparse patch indices (stored as .visidx.npy) rather than dense frame tensors.
6. Decode Video -> Keep Selected Patches: RGB pixels are only retained where the sparse selection index says they matter.
7. Model Input: The final input is a sparse representation guided by compressed-domain selection.

Key takeaway: The codec GOP defines the compressed-domain representation, while the sampled sequence defines the temporal scaffold used for patch selection. They operate at different levels.
Codec-side I-frames: Created by the HEVC encoder in Step 2. With the default config, the script requests GOP_SIZE=16 and keyint=16:min-keyint=16.
Sampled positions: Constructed in Step 3 by np.linspace. These 64 positions are the training-side temporal scaffold, independent of GOP boundaries.
Local anchor: A local convention in Step 3: I_pos = {0}. It simply forces the first sampled position to have zero fusion energy.
These three concepts live at different levels. The confusion in Issue #112 comes from treating the local anchor as if it were a real codec I-frame. The code does not do that.
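To make the three levels concrete, here is a small sketch that computes each set of positions for a hypothetical 320-frame video with GOP=16; the numbers are illustrative, not read from the repo:

```python
import numpy as np

duration, gop, seq_len = 320, 16, 64

# 1. Codec-side I-frames: fixed by the Step 2 encoder settings (keyint=16).
codec_i_frames = list(range(0, duration, gop))

# 2. Training-side temporal scaffold: Step 3's uniform sampling.
sampled = np.linspace(0, duration - 1, num=seq_len, dtype=int).tolist()

# 3. Local anchor: a convention inside the sampled clip, not a codec I-frame.
I_pos = {0}  # index into `sampled`, i.e. the first sampled position

print(codec_i_frames[:4])  # [0, 16, 32, 48]
print(sampled[:4])         # [0, 5, 10, 15]
print(sorted(I_pos))       # [0]
```

The three lists answer different questions: where the encoder placed full frames, where training looks at the video, and which sampled slot is zeroed by convention.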
| Aspect | Value |
| --- | --- |
| Video length | 320 frames |
| Temporal sampling | seq_len = 64 (positions 0, 5, 10, ..., 319) |
| Local anchor | Position 0 (first sampled frame) → energy set to 0 |
| Step 3 output | Top-k patch index across the 64 sampled positions |
| Training usage | Decode the same 64 positions, keep only selected patches |
Even if the original 320-frame video contains multiple true I-frames in the codec-side GOP, the training representation is still organized as 64 temporal positions for sparse selection. The practical rule is not "one dense full frame every 16 steps," but "64 sampled positions followed by top-k patch selection."
tools/tools_for_hevc/step3_generate_video_mv_residual_index.py
frame_id_list = np.linspace(0, duration - 1, num=seq_len, dtype=int).tolist()
I_pos = {0}
if i in I_pos:
    fused_list[i] = np.zeros((H, W), dtype=np.float32)
    continue
vis_idx = mask_by_residual_topk(res_torch, K, patch_size)
Step 3 does not re-encode the video or insert new keyframes. It also does not sample real I-frames explicitly; it only uses
I_pos = {0}
as a local convention for the sampled clip.
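The body of mask_by_residual_topk is not shown above. A minimal sketch of what such a function might do, assuming it pools residual energy into patch_size × patch_size cells and keeps the K most energetic ones, could look like this (the NumPy version here is an assumption; the repo operates on torch tensors):

```python
import numpy as np

def mask_by_residual_topk(res, K, patch_size):
    """Sketch: return flat indices of the K most energetic patches.

    res is a (H, W) residual-energy map; the real signature is assumed.
    """
    H, W = res.shape
    gh, gw = H // patch_size, W // patch_size
    # Average residual energy per non-overlapping patch.
    pooled = res[:gh * patch_size, :gw * patch_size]
    pooled = pooled.reshape(gh, patch_size, gw, patch_size).mean(axis=(1, 3))
    # Flat indices of the K highest-energy patches (order unspecified).
    return np.argpartition(pooled.ravel(), -K)[-K:]

rng = np.random.default_rng(0)
res = rng.random((64, 64)).astype(np.float32)
vis_idx = mask_by_residual_topk(res, K=8, patch_size=16)
print(sorted(vis_idx.tolist()))  # 8 patch indices from a 4 x 4 grid
```

The important property for the pipeline is only that the output is a sparse index set, not a dense frame tensor.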
For completeness, the HEVC transcoding script tools/tools_for_hevc/step2_convert_videos2hevc.py uses a default GOP size of 16 (line 42) and passes keyint=16:min-keyint=16 to FFmpeg (line 117). This is a background parameter that stabilizes MV/residual extraction; it does not dictate the training sampling strategy.
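For readers who want to reproduce the codec-side setup, an invocation of the kind Step 2 issues might look like the following; this is an illustration assuming libx265 via FFmpeg, and the exact flags live in step2_convert_videos2hevc.py:

```shell
# Illustrative only -- the actual command is built inside step2_convert_videos2hevc.py.
# keyint=16:min-keyint=16 pins every GOP to exactly 16 frames, so I-frames land at
# 0, 16, 32, ... and MV/residual extraction sees a regular structure.
ffmpeg -i input.mp4 -c:v libx265 \
       -x265-params "keyint=16:min-keyint=16" \
       -an output_hevc.mp4
```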
For the next stage, we are thinking about moving from fully uniform candidate-frame selection to a more codec-aware strategy. The intuition is that different temporal regions do not carry the same amount of compressed-domain activity, so it is reasonable to spend more sampling budget on the bins where the packet-size signal is stronger.
Concretely, we can first summarize pkt_size over short time bins, turn those energies into a cumulative distribution, and then sample frame locations according to that distribution. This still keeps the pipeline sparse, but it makes the selected frames more likely to land in the segments that contain stronger motion or structural change.
The sketch below is only meant to convey the intuition: the bars represent temporal bins with different compressed-domain energy, the cyan curve shows the accumulated sampling preference, and the yellow markers indicate that more samples are naturally drawn from the higher-energy part of the video.
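The binned-energy idea described above can be sketched with inverse-CDF sampling. This is a toy illustration under assumed names (energy_weighted_positions, per-frame pkt_sizes as input), not code from the repo:

```python
import numpy as np

def energy_weighted_positions(pkt_sizes, seq_len=64, bin_len=8):
    """Sketch: sample frame positions according to binned pkt_size energy.

    Bins with larger total pkt_size receive more of the sampling budget.
    """
    pkt_sizes = np.asarray(pkt_sizes, dtype=np.float64)
    n = len(pkt_sizes)
    # 1. Summarize pkt_size over short temporal bins.
    energy = np.add.reduceat(pkt_sizes, np.arange(0, n, bin_len))
    # 2. Turn bin energies into a cumulative distribution.
    cdf = np.cumsum(energy)
    cdf = cdf / cdf[-1]
    # 3. Invert the CDF at uniform quantiles: high-energy bins get more hits.
    quantiles = (np.arange(seq_len) + 0.5) / seq_len
    bins = np.searchsorted(cdf, quantiles)
    # Map each chosen bin to the frame index at its center.
    return np.minimum(bins * bin_len + bin_len // 2, n - 1).tolist()

# Toy stream: quiet first half, busy second half.
pkt = [100] * 160 + [1000] * 160
pos = energy_weighted_positions(pkt, seq_len=64)
# Most sampled positions land in the high-energy second half.
print(sum(p >= 160 for p in pos), "of", len(pos), "samples in the busy half")
```

The output still has exactly seq_len positions, so the rest of the sparse pipeline would be unchanged; only where the budget is spent differs from the uniform np.linspace layout.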