LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

TL;DR: We propose LiteFrame, a highly efficient video encoder for Video Large Language Models that unlocks scalable, long-form video understanding by resolving inefficiencies in both the LLM and the ViT.

Abstract

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction—reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder.

To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.

Main Results

LiteFrame redefines the performance-latency trade-off across multiple video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench.

Performance comparison across benchmarks

Unlocking Frame Scaling: By offloading the prefilling bottleneck from the LLM and lowering visual encoding costs, LiteFrame enables the processing of 8x more frames within restricted computing budgets.
End-to-End Efficiency: LiteFrame achieves up to a 35% reduction in total inference latency (vision encoding + LLM prefilling) while consistently improving average video understanding accuracy.
Parameter Reduction: LiteFrame utilizes only 87M parameters, a massive reduction from the 304M parameters of the teacher model.
Zero-Shot Spatial Resolution Scaling: LiteFrame's inherent token efficiency enables scaling in high-resolution videos, achieving a state-of-the-art score on HLVid without any high-resolution training.

Methodology

To train LiteFrame, we propose Compressed Token Distillation (CTD) and Language Model Adaptation (LMA).

Compressed Token Distillation (CTD): The student encoder is trained to directly predict information-dense, spatio-temporally compressed supervision targets generated by applying Weighted Average Pooling (WAP) to a large teacher model's output.
Language Model Adaptation (LMA): A lightweight fine-tuning stage aligns the compressed latent space with the downstream LLM, allowing it to seamlessly handle extended temporal contexts (up to 512 frames).
Spatio-Temporal Token Compressive Architecture: Our lightweight student encoder significantly reduces FLOPs and latency by employing depth-wise 1D convolutions for temporal modeling and strided convolutions for downsampling.

Citation

@article{kim2026liteframe,
  title={LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs},
  author={Kim, Jihwan and Parthasarathy, Nikhil and Qin, Danfeng and Hur, Junhwa and Sun, Deqing and Han, Bohyung and Yang, Ming-Hsuan and Gong, Boqing},
  journal={arXiv preprint arXiv:2605.17260},
  year={2026}
}