Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored. We introduce CAVALRY-V, a novel framework for generating adversarial attacks against V-MLLMs. Our approach achieves a 22.8% average improvement over the best existing attacks across 7 state-of-the-art V-MLLMs, including GPT-4.1 and Gemini 2.0.
CAVALRY-V directly targets the critical interface between visual perception and language generation in V-MLLMs through two key innovations:
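The paper's exact objective is not reproduced here, but attacking the perception-generation interface can be illustrated with a toy joint loss: a visual-feature term plus a language-logit term. All function names, the weighting `lam`, and the cosine-distance choice below are our assumptions for the sketch, not the paper's formulation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two flattened feature tensors.
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cross_modal_attack_loss(vis_clean, vis_adv, logits_clean, logits_adv, lam=1.0):
    """Hypothetical joint objective: push adversarial visual features AND
    output logits away from their clean counterparts, so that maximizing
    the loss disrupts both perception and language generation."""
    visual_term = 1.0 - cosine_sim(vis_clean, vis_adv)        # feature drift
    language_term = 1.0 - cosine_sim(logits_clean, logits_adv)  # output drift
    return visual_term + lam * language_term

# Toy check: identical inputs give ~0 loss; perturbed inputs give a positive loss.
rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))
l = rng.normal(size=(4, 32))
print(cross_modal_attack_loss(v, v, l, l))  # ~0.0
print(cross_modal_attack_loss(v, v + 0.5 * rng.normal(size=v.shape), l, l) > 0)  # True
```

An attack generator trained to maximize such a loss would degrade both what the model sees and what it says, which is the interface CAVALRY-V targets.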
Commercial: GPT-4.1, Gemini 2.0
Open-source: 5 state-of-the-art open-source V-MLLMs
Average Improvement: 22.8% over the best baseline attacks across 7 V-MLLMs
Target Models: 7, including GPT-4.1, Gemini 2.0, and 5 open-source V-MLLMs
Two-stage generator training with cross-modal disruption
CAVALRY-V vs. baseline attacks on MMBench-Video
Attack Mechanism Analysis
Generalizable: works on both image and video MLLMs
Efficient: single forward pass, no iterative optimization
Transferable: effective across different architectures
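A single-forward-pass, generator-based attack (as opposed to iterative PGD-style optimization) can be sketched as follows. The toy linear generator, its random weights `W`, and the epsilon budget are stand-ins for illustration, not the paper's trained model.

```python
import numpy as np

def generate_perturbation(frames, W, eps=8 / 255):
    """One forward pass through a toy linear 'generator': map each frame to
    a perturbation, squash it with tanh, and scale into the L-inf ball of
    radius eps. A trained generator network would replace the weights W."""
    flat = frames.reshape(frames.shape[0], -1)        # (T, H*W*C)
    delta = np.tanh(flat @ W).reshape(frames.shape)   # values bounded in (-1, 1)
    return eps * delta                                # ||delta||_inf <= eps

rng = np.random.default_rng(1)
frames = rng.random((4, 32, 32, 3)).astype(np.float32)  # T video frames in [0, 1]
W = rng.normal(scale=0.01, size=(32 * 32 * 3, 32 * 32 * 3)).astype(np.float32)

# One forward pass yields the full perturbation; clip keeps valid pixel range.
adv = np.clip(frames + generate_perturbation(frames, W), 0.0, 1.0)
print(np.abs(adv - frames).max() <= 8 / 255)  # True: perturbation stays within budget
```

Because the perturbation comes from one generator pass rather than hundreds of gradient steps per video, the attack cost stays constant regardless of the victim model's size.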
@article{zhang2025cavalry,
  title={CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs},
  author={Zhang, Jiaming and Hu, Rui and Guo, Qing and Lim, Wei Yang Bryan},
  journal={arXiv preprint arXiv:2507.00817},
  year={2025}
}