Issue #113 Explainer

💬 Issue #113 opened by Wenbo-Nie

Hi, thanks for the great work on this HEVC-based token selection pipeline. I have a question about how I-frames are handled in ap_dataloader_dali_codec.py.

My understanding from the paper is that all tokens from I-frames are preserved, while Top-K selection is only applied to P-frame patches based on codec-derived saliency. In particular, Equation (2) seems to describe the HEVC input as keeping the full patchified I-frame and applying the visibility mask only to decoded P-frames.

However, in get_frame_id_list, I noticed that residuals at I-frame positions are explicitly zeroed out:

if pos in I_pos_set:
    residuals_y[pos] = np.zeros((H0, W0), dtype=dtype0 or np.uint8)

Since patch scores in compute_visible_indices_cpu are computed from residual energy, this seems to imply that all I-frame patches receive a score of 0 and therefore would not be selected by Top-K, except possibly through tie-breaking or the static_fallback path.

So I wanted to check whether I am misunderstanding the implementation, or whether the current code is using a different behavior from what I inferred from the paper. If I-frame tokens are indeed intended to be fully preserved, could you clarify where that happens in the pipeline?

Thanks!
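
For readers following along, the concern raised in the issue can be reproduced in isolation with a minimal sketch. It assumes that a patch score is the sum of squared residual values inside the patch; the actual scoring in compute_visible_indices_cpu is not shown in the issue, so the helper patch_energy, the 16-pixel patch size, the frame shapes, and the value of k below are all hypothetical.

```python
import numpy as np

def patch_energy(residual, patch=16):
    """Sum of squared residual values in each non-overlapping patch (hypothetical score)."""
    h, w = residual.shape
    gh, gw = h // patch, w // patch
    # Cast to float before squaring so uint8 values cannot overflow.
    r = residual[:gh * patch, :gw * patch].astype(np.float64)
    blocks = r.reshape(gh, patch, gw, patch)
    return (blocks ** 2).sum(axis=(1, 3)).ravel()

rng = np.random.default_rng(0)
H0, W0 = 64, 64  # hypothetical luma residual size
residuals_y = [rng.integers(0, 32, size=(H0, W0), dtype=np.uint8) for _ in range(4)]

# Mirror the snippet quoted in the issue: zero the residual at I-frame positions.
I_pos_set = {0}
for pos in range(len(residuals_y)):
    if pos in I_pos_set:
        residuals_y[pos] = np.zeros((H0, W0), dtype=np.uint8)

# Score every patch of every frame, then take the k highest-scoring patches.
scores = np.concatenate([patch_energy(r) for r in residuals_y])
k = 8
topk = np.argsort(-scores, kind="stable")[:k]

patches_per_frame = (H0 // 16) * (W0 // 16)
frames_of_topk = topk // patches_per_frame  # frame index of each selected patch
```

Under these assumptions, every patch of frame 0 scores exactly zero, so a pure Top-K over residual energy never picks an I-frame patch while any P-frame patch has non-zero energy. This only restates the question; it does not show how the real pipeline treats I-frames.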

Short Answer

The behavior is easiest to understand from the offline data preparation pipeline. Videos are first re-encoded with GOP=16, and 64 frames are then uniformly sampled to form the downstream sequence, in the same order described in Issue #112. With GOP=16, the re-encoded video contains one complete frame every 16 frames, so the preserved full-frame structure is best understood as part of offline preprocessing rather than as a separate online rule inside the dataloader.

How to Read the Pipeline

Treat this as an offline-prepared data path rather than an online codec decision that is recomputed from scratch for each batch.
Step 1: videos are transcoded with GOP=16, which gives a periodic full-frame structure at the codec level.
Step 2: 64 frames are uniformly sampled from the transcoded video to build the downstream sequence.
So “full I-frame preservation” here refers in practice to this offline, codec-guided preparation: one complete frame is retained every 16 frames, and the final sampled sequence is built on top of that structure.
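
The two steps above can be sketched as index arithmetic. This is an illustration under stated assumptions, not the repository's code: it assumes a fixed GOP whose first I-frame is frame 0, and it assumes uniform sampling is done with np.linspace over the full video. The helper names iframe_positions and uniform_sample and the 1024-frame video length are hypothetical.

```python
import numpy as np

GOP = 16          # GOP length used for the offline re-encode
NUM_SAMPLES = 64  # number of frames in the downstream sequence

def iframe_positions(num_frames, gop=GOP):
    """Indices of I-frames, assuming a fixed GOP that starts at frame 0."""
    return np.arange(0, num_frames, gop)

def uniform_sample(num_frames, num_samples=NUM_SAMPLES):
    """Evenly spaced frame indices from the first to the last frame (assumed sampler)."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

num_frames = 1024  # hypothetical video length
i_pos = iframe_positions(num_frames)
sampled = uniform_sample(num_frames)

# Mark which sampled frames coincide with an I-frame position.
is_iframe = np.isin(sampled, i_pos)
print("sampled frames that are I-frames:", sampled[is_iframe])
```

Which sampled frames land on I-frame positions depends on the video length and on the exact sampling rule, so it is worth running this check with the real values rather than assuming alignment.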

Relation to Issue #112

This page is best read together with Issue #112. Both issues examine the same preparation logic from different angles: Issue #112 focuses on the order of offline preprocessing (transcode first, then sample 64 frames), while this issue asks how to interpret the preserved full-frame structure. Both point back to the same pipeline: GOP=16 first, then uniform 64-frame sampling.

Takeaway

Start from the offline-prepared data rather than reading one dataloader branch in isolation.
The preparation rule is: re-encode with GOP=16, then uniformly sample 64 frames.
Under this setup, one full frame is retained every 16 frames, which is the practical interpretation of full I-frame preservation highlighted here.