Interactive explainer for Issue #116 — why we normalize per frame before selecting patches globally
If each frame is normalized independently, why do a global Top-K afterwards? Doesn't per-frame normalization flatten the differences across frames?
A simple way to read this design is that frame-wise normalization and global Top-K are doing different jobs. Normalization helps put different frames onto a more comparable scale, so one unusually strong frame does not dominate just because its raw energy is larger. Global Top-K then uses the shared sparse budget to keep the most informative patches across the whole sampled sequence.
Normalization does not flatten those differences in the way that matters for selection. What it mainly changes is the overall scale of each frame, not the internal pattern of which regions are more or less informative. So after normalization, patches that were relatively important within a frame are still likely to stay important.
And the global Top-K is still needed because, in the end, the model works with a single shared patch budget across all 64 sampled frames. Global Top-K is the step that enforces that budget: it ranks all candidate patches together and keeps only the strongest ones. Without it, each frame would effectively choose patches on its own, and the total count would be hard to control.
All fused frames are stacked into a tensor of shape (T, H, W) and passed to mask_by_residual_topk.
Inside mask_by_residual_topk, patch scores are computed for the entire sequence and a single torch.topk call selects the global top-K patches.
In other words, normalization happens before fusion (per modality, per frame), while Top-K happens after fusion (across all frames).
The two small examples below are meant to show the contrast directly: the left side keeps raw magnitudes, while the right side first normalizes each frame and then applies the same global Top-K budget.
Frame 2 has much larger raw values, so it dominates the final Top-K almost by itself.
After per-frame normalization, frames become more comparable. Global Top-K selects the best patches overall, but no single high-energy frame dominates by scale alone.
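The same contrast can be sketched numerically. The values below are toy numbers, not outputs of the real pipeline, and max-normalization stands in for the 95th-percentile rule:

```python
import numpy as np

# Two toy "frames" of 4 patch scores each; frame 2 has roughly 10x the raw
# energy, e.g. due to camera shake. Values are illustrative only.
frame1 = np.array([0.9, 0.7, 0.2, 0.1])
frame2 = np.array([8.0, 6.0, 3.0, 1.0])

K = 4  # shared global patch budget

def global_topk(frames, k):
    """Rank all candidate patches together and keep the k strongest."""
    scores = np.concatenate(frames)
    idx = np.argsort(scores)[::-1][:k]
    return sorted(idx.tolist())

# Raw magnitudes: frame 2 (indices 4-7) takes every slot -> [4, 5, 6, 7].
raw_pick = global_topk([frame1, frame2], K)

# Per-frame normalization first (here by each frame's max), then the same
# global Top-K -> [0, 1, 4, 5]: both frames contribute their best patches.
norm_pick = global_topk([frame1 / frame1.max(), frame2 / frame2.max()], K)
```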
1. Extract MV & residual per frame: read motion vectors and residual signals from the HEVC stream at each sampled temporal position.
2. Per-frame normalization: scale MV magnitude and residual energy to [0, 1] based on their own 95th percentile (per frame, per modality). (_mv_energy_norm / _residual_energy_norm)
3. Fuse normalized energies: combine MV and residual cues into a single spatial importance map for each frame. (_fuse_energy)
4. Global Top-K patch selection: stack all fused frames, score every patch across the whole sequence, and select the top-K patches globally. (mask_by_residual_topk)
5. Sparse patch index: output a single index file (.visidx.npy) that tells the training loader which patches to keep.

Without per-frame normalization: a frame with camera shake or a large scene transition can have 10x higher raw energy, so its patches would swamp the global Top-K and push out informative patches from calmer frames. Result: unfair, frame-scale-dominated selection.

Without global Top-K: if each frame independently kept its own top patches, the total number of patches would scale with the number of frames, and the model would receive a variable-length input, breaking the fixed token budget. Result: no shared budget, variable length.

With both: every frame competes on a level playing field, and the final selection still respects a single global patch budget. Informative regions from any frame can win, regardless of the original raw scale. Result: fair + budget-controlled.

Source: tools/tools_for_hevc/step3_generate_video_mv_residual_index.py
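The budget argument can be made concrete with a small sketch. The frame count, patch count, and per-frame k below are illustrative, not the pipeline's actual settings:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 512  # single shared patch budget (illustrative)

def global_topk_count(T, P=196, k=K):
    """Global Top-K over all T*P candidates: always exactly k patches kept,
    no matter how many frames are sampled."""
    scores = rng.random((T, P))
    keep = np.argpartition(scores.reshape(-1), -k)[-k:]
    return keep.size

def per_frame_topk_count(T, k_per_frame=8):
    """Each frame keeps its own top patches: the total scales with T,
    so the model would see a variable-length input."""
    return T * k_per_frame

# The global budget is invariant to the number of sampled frames...
assert global_topk_count(32) == global_topk_count(64) == K
# ...while per-frame selection doubles when the frame count doubles.
print(per_frame_topk_count(32), per_frame_topk_count(64))
```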
_residual_energy_norm — normalizes residual energy per frame using the 95th percentile.
    a = float(np.percentile(x, pct))
    a = max(a, 1.0)
    norm = np.clip(x / a, 0.0, 1.0)
_mv_energy_norm — computes MV magnitude and normalizes it per frame using the 95th percentile.
    a = float(np.percentile(mag, pct))
    a = max(a, 1e-6)
    norm = np.clip(mag / a, 0.0, 1.0)
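For illustration, the two snippets above can be folded into one self-contained helper. `percentile_norm` is a hypothetical name; only the percentile rule and the floors (1.0 for residual, 1e-6 for MV) come from the source:

```python
import numpy as np

def percentile_norm(x, pct=95.0, floor=1.0):
    """Hypothetical helper mirroring the snippets above: scale one frame's
    energy map into [0, 1] by its own pct-th percentile. floor guards the
    denominator on flat frames (1.0 for residual, 1e-6 for MV magnitude)."""
    a = max(float(np.percentile(x, pct)), floor)
    return np.clip(x / a, 0.0, 1.0)

# One strong outlier: the 95th percentile, not the max, sets the scale,
# so the outlier saturates at 1.0 instead of crushing the other values.
frame = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
normed = percentile_norm(frame)
```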
    mv_norm, _ = _mv_energy_norm(...)
    res_norm, _ = _residual_energy_norm(...)
    fused = _fuse_energy(mv_norm, res_norm, ...)
    vis_idx = mask_by_residual_topk(res_torch, K, patch_size)
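The body of _fuse_energy is not reproduced on this page. One plausible sketch, with a hypothetical weight `w_mv`, is a convex combination of the two normalized maps; the real function may use different weights or a different rule:

```python
import numpy as np

def fuse_energy(mv_norm, res_norm, w_mv=0.5):
    """Hypothetical fusion rule: convex combination of the normalized MV and
    residual maps. Inputs in [0, 1] keep the fused map in [0, 1]."""
    return w_mv * mv_norm + (1.0 - w_mv) * res_norm

mv = np.array([0.2, 0.9, 0.1])   # normalized MV energy (toy values)
res = np.array([0.8, 0.1, 0.1])  # normalized residual energy (toy values)
fused = fuse_energy(mv, res)     # -> [0.5, 0.5, 0.1]
```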
mask_by_residual_topk — reshapes all patch scores across the full sequence and selects the global top-K.
    scores = res_abs.reshape(B, T, hb, ph, wb, pw).sum(dim=(3, 5)).reshape(B, L)
    topk_idx = torch.topk(scores, k=K, dim=1, largest=True, sorted=False).indices
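For readers without torch at hand, the same scoring-and-selection logic can be mirrored in NumPy. `mask_by_residual_topk_np` is a hypothetical name; the patch-sum reshape follows the snippet above, and np.argpartition plays the role of torch.topk with sorted=False (unordered indices):

```python
import numpy as np

def mask_by_residual_topk_np(res_abs, K, patch):
    """NumPy mirror of the torch logic above: sum |residual| inside each
    patch, flatten the patch scores across the whole sequence, and keep
    the global top-K. res_abs: (B, T, H, W), H and W divisible by patch."""
    B, T, H, W = res_abs.shape
    hb, wb = H // patch, W // patch
    scores = (res_abs.reshape(B, T, hb, patch, wb, patch)
                     .sum(axis=(3, 5))       # per-patch energy: (B, T, hb, wb)
                     .reshape(B, -1))        # (B, L) with L = T * hb * wb
    # unordered top-K per batch element, like torch.topk(..., sorted=False)
    return np.argpartition(scores, -K, axis=1)[:, -K:]

# Tiny check: 1 batch, 2 frames of 4x4, patch=2 -> 8 candidate patches.
x = np.zeros((1, 2, 4, 4))
x[0, 1, 0:2, 0:2] = 5.0    # one clearly dominant patch in frame 1
idx = mask_by_residual_topk_np(x, K=1, patch=2)
# flat index = t*4 + hb*2 + wb, so the dominant patch is index 4
```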
One thing we have also been thinking about is how the current patch score is constructed. Right now, MV and residual are still two different kinds of codec signals. They come from different sources, behave differently numerically, and do not naturally live on exactly the same scale. In the current pipeline, we first normalize them separately and then fuse them into one value with manually chosen weights.
That design works reasonably well in practice, but part of the scoring rule is still hand-crafted. A direction we find more promising for the future is to use the entropy-coded bit cost of codec-side signals such as MV / residual directly as the score, or at least as a much more central signal.
The reason this feels attractive to us is that bit cost already lives in a more unified coding-space unit. Compared with manually balancing different variables, it may give us a cleaner and more consistent way to measure how much information a region really carries from the codec's point of view.
We are still actively exploring this direction, and we plan to keep updating this part of the work as it becomes more mature.