Interactive explainer for Issue #113 — aligned with the offline preprocessing pipeline used in practice
Hi, thanks for the great work on this HEVC-based token selection pipeline. I have a question about how I-frames are handled in ap_dataloader_dali_codec.py.
My understanding from the paper is that all tokens from I-frames are preserved, while Top-K selection is only applied to P-frame patches based on codec-derived saliency. In particular, Equation (2) seems to describe the HEVC input as keeping the full patchified I-frame and applying the visibility mask only to decoded P-frames.
However, in get_frame_id_list, I noticed that residuals at I-frame positions are explicitly zeroed out:
    if pos in I_pos_set:
        residuals_y[pos] = np.zeros((H0, W0), dtype=dtype0 or np.uint8)
Since patch scores in compute_visible_indices_cpu are computed from residual energy, this seems to imply that all I-frame patches receive a score of 0 and therefore would not be selected by Top-K, except possibly through tie-breaking or the static_fallback path.
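To make the concern concrete, here is a minimal sketch of why zeroed residuals lead to zero scores and why such patches can only enter Top-K through tie-breaking. It is not the repository's compute_visible_indices_cpu. The helper names (patch_scores, top_k), the use of summed absolute residual as the "energy" measure, the single Top-K across both frames, and the lowest-index tie-break are all assumptions made for illustration.

```python
# Hypothetical stand-ins for illustration; not the repository's scoring code.

def patch_scores(residual, patch):
    """Sum of absolute residual values per non-overlapping patch, in row-major order."""
    height, width = len(residual), len(residual[0])
    scores = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            scores.append(sum(abs(residual[y][x])
                              for y in range(top, top + patch)
                              for x in range(left, left + patch)))
    return scores

def top_k(scores, k):
    """Indices of the k largest scores; ties go to the lower index (assumed rule)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]

# Two 4x4 frames split into 2x2 patches.
# Frame 0 plays the role of an I-frame whose residual was zeroed out.
i_residual = [[0] * 4 for _ in range(4)]
# Frame 1 plays the role of a P-frame with some residual energy.
p_residual = [[0, 0, 5, 5],
              [0, 0, 5, 5],
              [1, 1, 0, 0],
              [1, 1, 9, 0]]

# Patches 0-3 belong to the I-frame, patches 4-7 to the P-frame.
scores = patch_scores(i_residual, 2) + patch_scores(p_residual, 2)
selected = top_k(scores, 3)
```

In this toy setup every I-frame patch scores 0, so the three selected patches all come from the P-frame. Raising k to 4 admits an I-frame patch only because it ties at 0 with the remaining P-frame patch and happens to have the lower index.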
So I wanted to check whether I am misunderstanding the implementation, or whether the current code is using a different behavior from what I inferred from the paper. If I-frame tokens are indeed intended to be fully preserved, could you clarify where that happens in the pipeline?
Thanks!
This behavior is easiest to understand by starting from the offline data preparation pipeline. Videos are first re-encoded with GOP=16, and 64 frames are then uniformly sampled for the downstream sequence, following the same overall idea explained in Issue #112. Under that setup, one frame in every 16 is kept as a complete frame, so the preserved full-frame structure is best understood as part of offline preprocessing rather than as a separate online rule inside the dataloader.
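The arithmetic behind this setup can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the GOP is assumed to be strictly fixed (no extra keyframes at scene cuts), and the integer-stride sampling rule is one possible convention for "uniform sampling", not necessarily the one used in the repository.

```python
# Hypothetical helpers for illustration; not taken from the repository.

def i_frame_positions(num_frames, gop=16):
    """Frame indices that start a GOP, assuming a strictly fixed GOP length."""
    return list(range(0, num_frames, gop))

def uniform_sample(num_frames, num_samples=64):
    """Evenly spaced source indices using integer stride arithmetic (assumed convention)."""
    return [i * num_frames // num_samples for i in range(num_samples)]

def sampled_i_frames(num_frames, gop=16, num_samples=64):
    """Positions in the sampled sequence whose source frame is an I-frame."""
    i_set = set(i_frame_positions(num_frames, gop))
    sampled = uniform_sample(num_frames, num_samples)
    return [k for k, src in enumerate(sampled) if src in i_set]

# A 64-frame clip: sampling keeps every frame, so I-frames sit at 0, 16, 32, 48.
short_clip = sampled_i_frames(64)

# A 128-frame clip: sampling keeps every second frame, so every 8th sampled frame is an I-frame.
long_clip = sampled_i_frames(128)
```

For a 64-frame clip this gives four complete frames among the 64 sampled ones. For longer clips, how many sampled frames coincide with I-frames depends on the clip length and the sampling convention, which is why the numbers here should be read as an example rather than as a property of the pipeline.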
This page is best read together with Issue #112, since both issues examine the same preparation logic from different angles. Issue #112 focuses on the order of the offline preprocessing steps (transcode first, then sample 64 frames), while this issue asks how to interpret the preserved full-frame structure. Both point back to the same pipeline: re-encode with GOP=16 first, then uniformly sample 64 frames.