Interactive explainer for Issue #116 — why we normalize per frame before selecting patches globally
If each frame is normalized independently, why do a global Top-K afterwards? Doesn't per-frame normalization flatten the differences across frames?
A simple way to read this design is that frame-wise normalization and global Top-K are doing different jobs. Normalization helps put different frames onto a more comparable scale, so one unusually strong frame does not dominate just because its raw energy is larger. Global Top-K then uses the shared sparse budget to keep the most informative patches across the whole sampled sequence.
Normalization does not flatten those differences in the way that matters for selection. What it mainly changes is the overall scale of each frame, not the internal pattern of which regions are more or less informative. So after normalization, patches that were relatively important within a frame are still likely to stay important.
And the global Top-K is still needed because, in the end, the model works with a single shared patch budget across all 64 sampled frames. Global Top-K is the step that enforces that budget: it ranks all candidate patches together and keeps only the strongest ones. Without it, each frame would effectively choose patches on its own, and the total count would be hard to control.
All fused frames are stacked into a tensor of shape (T, H, W) and passed to mask_by_residual_topk.
Inside mask_by_residual_topk, patch scores are computed for the entire sequence and a single torch.topk call selects the global top-K patches.
In other words, normalization happens before fusion (per modality, per frame), while Top-K happens after fusion (across all frames).
The two small examples below are meant to show the contrast directly: the left side keeps raw magnitudes, while the right side first normalizes each frame and then applies the same global Top-K budget.
Frame 2 has much larger raw values, so it dominates the final Top-K almost by itself.
After per-frame normalization, frames become more comparable. Global Top-K selects the best patches overall, but no single high-energy frame dominates by scale alone.
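The same contrast can be sketched numerically. The values below are toy numbers, not outputs of the real pipeline, and max-normalization stands in for the 95th-percentile rule:

```python
import numpy as np

# Two toy "frames" of 4 patch scores each; frame 2 has roughly 10x the raw
# energy, e.g. due to camera shake. Values are illustrative only.
frame1 = np.array([0.9, 0.7, 0.2, 0.1])
frame2 = np.array([8.0, 6.0, 3.0, 1.0])

K = 4  # shared global patch budget

def global_topk(frames, k):
    """Rank all candidate patches together and keep the k strongest."""
    scores = np.concatenate(frames)
    idx = np.argsort(scores)[::-1][:k]
    return sorted(idx.tolist())

# Raw magnitudes: frame 2 (indices 4-7) takes every slot -> [4, 5, 6, 7].
raw_pick = global_topk([frame1, frame2], K)

# Per-frame normalization first (here by each frame's max), then the same
# global Top-K -> [0, 1, 4, 5]: both frames contribute their best patches.
norm_pick = global_topk([frame1 / frame1.max(), frame2 / frame2.max()], K)
```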
1. Extract MV & residual per frame: read motion vectors and residual signals from the HEVC stream at each sampled temporal position.
2. Per-frame normalization: scale MV magnitude and residual energy to [0, 1] based on their own 95th percentile (per frame, per modality). (_mv_energy_norm / _residual_energy_norm)
3. Fuse normalized energies: combine MV and residual cues into a single spatial importance map for each frame. (_fuse_energy)
4. Global Top-K patch selection: stack all fused frames, score every patch across the whole sequence, and select the top-K patches globally. (mask_by_residual_topk)
5. Sparse patch index: output a single index file (.visidx.npy) that tells the training loader which patches to keep.

Without per-frame normalization: a frame with camera shake or a large scene transition can have 10x higher raw energy, so its patches would swamp the global Top-K and push out informative patches from calmer frames. Result: unfair, frame-scale-dominated selection.

Without global Top-K: if each frame independently kept its own top patches, the total number of patches would scale with the number of frames, and the model would receive a variable-length input, breaking the fixed token budget. Result: no shared budget, variable length.

With both: every frame competes on a level playing field, and the final selection still respects a single global patch budget. Informative regions from any frame can win, regardless of the original raw scale. Result: fair + budget-controlled.

Source: tools/tools_for_hevc/step3_generate_video_mv_residual_index.py
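The budget argument can be made concrete with a small sketch. The frame count, patch count, and per-frame k below are illustrative, not the pipeline's actual settings:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 512  # single shared patch budget (illustrative)

def global_topk_count(T, P=196, k=K):
    """Global Top-K over all T*P candidates: always exactly k patches kept,
    no matter how many frames are sampled."""
    scores = rng.random((T, P))
    keep = np.argpartition(scores.reshape(-1), -k)[-k:]
    return keep.size

def per_frame_topk_count(T, k_per_frame=8):
    """Each frame keeps its own top patches: the total scales with T,
    so the model would see a variable-length input."""
    return T * k_per_frame

# The global budget is invariant to the number of sampled frames...
assert global_topk_count(32) == global_topk_count(64) == K
# ...while per-frame selection doubles when the frame count doubles.
print(per_frame_topk_count(32), per_frame_topk_count(64))
```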
_residual_energy_norm — normalizes residual energy per frame using the 95th percentile.
    a = float(np.percentile(x, pct))
    a = max(a, 1.0)
    norm = np.clip(x / a, 0.0, 1.0)
_mv_energy_norm — computes MV magnitude and normalizes it per frame using the 95th percentile.
    a = float(np.percentile(mag, pct))
    a = max(a, 1e-6)
    norm = np.clip(mag / a, 0.0, 1.0)
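For illustration, the two snippets above can be folded into one self-contained helper. `percentile_norm` is a hypothetical name; only the percentile rule and the floors (1.0 for residual, 1e-6 for MV) come from the source:

```python
import numpy as np

def percentile_norm(x, pct=95.0, floor=1.0):
    """Hypothetical helper mirroring the snippets above: scale one frame's
    energy map into [0, 1] by its own pct-th percentile. floor guards the
    denominator on flat frames (1.0 for residual, 1e-6 for MV magnitude)."""
    a = max(float(np.percentile(x, pct)), floor)
    return np.clip(x / a, 0.0, 1.0)

# One strong outlier: the 95th percentile, not the max, sets the scale,
# so the outlier saturates at 1.0 instead of crushing the other values.
frame = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
normed = percentile_norm(frame)
```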
    mv_norm, _ = _mv_energy_norm(...)
    res_norm, _ = _residual_energy_norm(...)
    fused = _fuse_energy(mv_norm, res_norm, ...)
    vis_idx = mask_by_residual_topk(res_torch, K, patch_size)
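The body of _fuse_energy is not reproduced on this page. One plausible sketch, with a hypothetical weight `w_mv`, is a convex combination of the two normalized maps; the real function may use different weights or a different rule:

```python
import numpy as np

def fuse_energy(mv_norm, res_norm, w_mv=0.5):
    """Hypothetical fusion rule: convex combination of the normalized MV and
    residual maps. Inputs in [0, 1] keep the fused map in [0, 1]."""
    return w_mv * mv_norm + (1.0 - w_mv) * res_norm

mv = np.array([0.2, 0.9, 0.1])   # normalized MV energy (toy values)
res = np.array([0.8, 0.1, 0.1])  # normalized residual energy (toy values)
fused = fuse_energy(mv, res)     # -> [0.5, 0.5, 0.1]
```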
mask_by_residual_topk — reshapes all patch scores across the full sequence and selects the global top-K.
    scores = res_abs.reshape(B, T, hb, ph, wb, pw).sum(dim=(3, 5)).reshape(B, L)
    topk_idx = torch.topk(scores, k=K, dim=1, largest=True, sorted=False).indices
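For readers without torch at hand, the same scoring-and-selection logic can be mirrored in NumPy. `mask_by_residual_topk_np` is a hypothetical name; the patch-sum reshape follows the snippet above, and np.argpartition plays the role of torch.topk with sorted=False (unordered indices):

```python
import numpy as np

def mask_by_residual_topk_np(res_abs, K, patch):
    """NumPy mirror of the torch logic above: sum |residual| inside each
    patch, flatten the patch scores across the whole sequence, and keep
    the global top-K. res_abs: (B, T, H, W), H and W divisible by patch."""
    B, T, H, W = res_abs.shape
    hb, wb = H // patch, W // patch
    scores = (res_abs.reshape(B, T, hb, patch, wb, patch)
                     .sum(axis=(3, 5))       # per-patch energy: (B, T, hb, wb)
                     .reshape(B, -1))        # (B, L) with L = T * hb * wb
    # unordered top-K per batch element, like torch.topk(..., sorted=False)
    return np.argpartition(scores, -K, axis=1)[:, -K:]

# Tiny check: 1 batch, 2 frames of 4x4, patch=2 -> 8 candidate patches.
x = np.zeros((1, 2, 4, 4))
x[0, 1, 0:2, 0:2] = 5.0    # one clearly dominant patch in frame 1
idx = mask_by_residual_topk_np(x, K=1, patch=2)
# flat index = t*4 + hb*2 + wb, so the dominant patch is index 4
```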
One thing we have also been thinking about is how the current patch score is constructed. Right now, MV and residual are still two different kinds of codec signals. They come from different sources, behave differently numerically, and do not naturally live on exactly the same scale. In the current pipeline, we first normalize them separately and then fuse them into one value with manually chosen weights.
That design works reasonably well in practice, but part of the scoring rule is still hand-crafted. A direction we find more promising for the future is to use the entropy-coded bit cost of codec-side signals such as MV / residual directly as the score, or at least as a much more central signal.
The reason this feels attractive to us is that bit cost already lives in a more unified coding-space unit. Compared with manually balancing different variables, it may give us a cleaner and more consistent way to measure how much information a region really carries from the codec's point of view.
We are still actively exploring this direction, and we plan to keep updating this part of the work as it becomes more mature.