We derive an attention-based inference method, termed highlight spot, that enables real-time proactive generation in video LLMs without additional training.
To equip the model with real-time interactive modeling capabilities, we propose incorporating interruption detection and parallel decoding.
\[ p(x_{n+k} \mid x_1, x_2, \ldots, x_{n+k-1}) > \beta \cdot \exp \left( -S\left( p(\cdot \mid x_1, x_2, \ldots, x_{n+k-1}) \right) \right) \]
where \(\beta\) is a scaling factor and \(S(\cdot)\) is the entropy function. The threshold for noise detection thus depends on the model's perplexity: when the perplexity is larger, the threshold is reduced, indicating the query is more likely noise that does not require a response.
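The criterion above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `exceeds_threshold`, the choice of `beta`, and the interpretation that satisfying the inequality marks the query as warranting a response are assumptions for illustration.

```python
import math

def exceeds_threshold(prob_dist, token, beta=0.5):
    """Entropy-adaptive check (sketch; `beta` is an assumed tuning factor).

    Implements p(token | context) > beta * exp(-S(p)), where S is the
    Shannon entropy of the next-token distribution. Since exp(S) is the
    perplexity, the threshold equals beta / perplexity: higher model
    perplexity lowers the threshold.
    """
    # Shannon entropy of the predictive distribution (natural log).
    entropy = -sum(p * math.log(p) for p in prob_dist.values() if p > 0)
    threshold = beta * math.exp(-entropy)
    return prob_dist[token] > threshold

# A confident distribution: the top token clears the threshold.
dist = {"yes": 0.9, "no": 0.1}
print(exceeds_threshold(dist, "yes"))  # high-probability token
print(exceeds_threshold(dist, "no"))   # low-probability token
```

Note that with a uniform distribution over \(K\) tokens and \(\beta = 1\), every token's probability equals the threshold \(1/K\) exactly, so no token strictly exceeds it; \(\beta\) controls how far above the perplexity-scaled baseline a token must be.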