Offline Data Preparation and I-Frame Retention

Interactive explainer for Issue #113 — aligned with the offline preprocessing pipeline used in practice

💬 Issue #113 opened by Wenbo-Nie

Hi, thanks for the great work on this HEVC-based token selection pipeline. I have a question about how I-frames are handled in ap_dataloader_dali_codec.py.

My understanding from the paper is that all tokens from I-frames are preserved, while Top-K selection is only applied to P-frame patches based on codec-derived saliency. In particular, Equation (2) seems to describe the HEVC input as keeping the full patchified I-frame and applying the visibility mask only to decoded P-frames.

However, in get_frame_id_list, I noticed that residuals at I-frame positions are explicitly zeroed out:

if pos in I_pos_set:
    residuals_y[pos] = np.zeros((H0, W0), dtype=dtype0 or np.uint8)

Since patch scores in compute_visible_indices_cpu are computed from residual energy, this seems to imply that all I-frame patches receive a score of 0 and therefore would not be selected by Top-K, except possibly through tie-breaking or the static_fallback path.

So I wanted to check whether I am misunderstanding the implementation, or whether the current code is using a different behavior from what I inferred from the paper. If I-frame tokens are indeed intended to be fully preserved, could you clarify where that happens in the pipeline?

Thanks!

Short Answer

This behavior is easiest to understand by starting from the offline data preparation pipeline. Videos are first re-encoded with GOP=16, and 64 frames are then uniformly sampled to form the downstream sequence, following the same overall idea explained in Issue #112. With GOP=16, one frame in every 16 is kept as a complete frame, so the preserved full-frame structure is best understood as a property of offline preprocessing rather than as a separate online rule inside the dataloader.
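The re-encoding step can be sketched as a small helper that builds an ffmpeg command. This is only an illustration: the text above does not say which encoder tool or flags the actual pipeline uses, so the choice of ffmpeg with libx265, the helper name build_reencode_cmd, the file names, and the decision to drop audio are all assumptions.

```python
from typing import List


def build_reencode_cmd(src: str, dst: str, gop: int = 16) -> List[str]:
    """Build an ffmpeg argument list that re-encodes `src` to HEVC with a fixed GOP.

    Hypothetical helper: the real pipeline's tool and flags are not given in the text.
    """
    # keyint/min-keyint pin the keyframe interval; scenecut=0 stops x265 from
    # inserting extra I-frames at scene changes, so the spacing stays periodic.
    x265_params = f"keyint={gop}:min-keyint={gop}:scenecut=0"
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-c:v", "libx265",
        "-g", str(gop),
        "-x265-params", x265_params,
        "-an",  # assumption: audio is not needed downstream
        dst,
    ]


cmd = build_reencode_cmd("input.mp4", "output_gop16.mp4")
print(" ".join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) on a machine where ffmpeg is built with libx265.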

How to Read the Pipeline

  • Treat this as an offline-prepared data path, not an online codec decision that is recomputed from scratch inside each batch.
  • Step 1: videos are transcoded with GOP=16, which gives a periodic full-frame structure at the codec level.
  • Step 2: 64 frames are sampled uniformly from that prepared video to build the downstream sequence.
  • “Full I-frame preservation” here therefore refers to this offline, codec-guided preparation: one complete frame is retained every 16 frames, and the final sampled sequence is built on top of that structure.
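The two steps above can be sketched with a few lines of NumPy. This is a minimal sketch under stated assumptions: every GOP is assumed to start with an I-frame at a multiple of 16, and evenly spaced, rounded indices are one possible reading of "uniform sampling". The function names and the example clip length of 1024 frames are hypothetical, and the exact index formula used by the real pipeline is not given in the text.

```python
import numpy as np

GOP = 16          # GOP length stated in the text
NUM_SAMPLES = 64  # number of uniformly sampled frames stated in the text


def iframe_positions(num_frames: int, gop: int = GOP) -> np.ndarray:
    """Frame indices of I-frames, assuming each fixed-length GOP starts with one."""
    return np.arange(0, num_frames, gop)


def uniform_sample(num_frames: int, num_samples: int = NUM_SAMPLES) -> np.ndarray:
    """Evenly spaced frame indices over [0, num_frames - 1] (assumed sampling rule)."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)


num_frames = 1024  # hypothetical clip length
i_pos = iframe_positions(num_frames)
sampled = uniform_sample(num_frames)

# Mark which sampled frames coincide with an I-frame position.
is_iframe = np.isin(sampled, i_pos)
print("sampled indices:", sampled[:8], "...")
print("sampled frames that are I-frames:", int(is_iframe.sum()))
```

The sketch only reports which sampled indices coincide with I-frame positions; how the two line up in practice depends on the clip length and on the sampling rule the pipeline actually applies.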

Relation to Issue #112

This page is best read together with Issue #112. The two issues look at the same preparation logic from different angles: Issue #112 focuses on the order of offline preprocessing (transcode first, then sample 64 frames), while this issue asks how to interpret the preserved full-frame structure. Both point back to the same pipeline: GOP=16 first, then uniform 64-frame sampling.

Takeaway

  • Start from the offline-prepared data rather than reading one dataloader branch in isolation.
  • The preparation rule is: re-encode with GOP=16, then uniformly sample 64 frames.
  • Under this setup, one full frame is retained every 16 frames; this is the interpretation this page highlights.