CAVALRY-V

A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs

NTU · BJTU · A*STAR

🚀 TL;DR

We introduce CAVALRY-V, a novel framework for generating adversarial attacks against Video Multimodal Large Language Models (V-MLLMs). Our approach achieves a 22.8% average improvement over existing attacks across 7 state-of-the-art V-MLLMs, including GPT-4.1 and Gemini 2.0.

📄 What is CAVALRY-V?

Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored.

CAVALRY-V directly targets the critical interface between visual perception and language generation in V-MLLMs through two key innovations:

  • Dual-objective loss function that simultaneously disrupts text generation and visual representations
  • Two-stage generator training combining large-scale pre-training with specialized fine-tuning

🎯 Target Models

Commercial:

  • GPT-4.1
  • Gemini 2.0

Open-source:

  • QwenVL-2.5
  • InternVL-2.5
  • LLaVA-Video
  • Aria
  • MiniCPM-o-2.6

🎯 Key Results

  • 22.8% average improvement over the best baseline attacks across 7 V-MLLMs
  • 7 target models, including GPT-4.1, Gemini 2.0, and 5 open-source V-MLLMs

🔧 Framework Overview

Two-stage generator training with cross-modal disruption

[Figure: Framework architecture]

🎯 Cross-Modal Disruption

  • Semantic Loss: Disrupts text generation
  • Visual Loss: Corrupts visual representations
  • Auxiliary Loss: Enhances transferability
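
A minimal PyTorch sketch of how a three-term objective like this could be wired together. Everything below is an illustrative assumption rather than the paper's actual code: the `model(adv_frames, prompt_ids)` interface, the `vision_encoder` attribute, and the weights `w_sem`, `w_vis`, `w_aux` are placeholders.

```python
import torch
import torch.nn.functional as F

def cavalry_style_loss(model, frames, delta, prompt_ids, target_ids,
                       clean_vis_feats, w_sem=1.0, w_vis=0.5, w_aux=0.1):
    """Illustrative combined objective; interfaces and weights are assumed."""
    adv_frames = (frames + delta).clamp(0, 1)   # keep pixels in valid range

    # Semantic term: minimizing the negated cross-entropy *maximizes* the
    # loss on the clean answer tokens, i.e. disrupts text generation.
    logits = model(adv_frames, prompt_ids)                  # (B, T, vocab)
    sem = -F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())

    # Visual term: minimizing the negated MSE pushes perturbed visual
    # features away from the clean ones, i.e. corrupts visual representations.
    adv_feats = model.vision_encoder(adv_frames)
    vis = -F.mse_loss(adv_feats, clean_vis_feats)

    # Auxiliary term: a simple perturbation-energy regularizer standing in
    # for the paper's transferability-enhancing auxiliary loss (an assumption).
    aux = delta.pow(2).mean()

    return w_sem * sem + w_vis * vis + w_aux * aux
```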

⚡ Two-Stage Training

  • Stage 1, pre-training: large-scale foundation on LAION-400M
  • Stage 2, specialized fine-tuning: visual (LLaVA-Instruct-150K), then temporal (Video-MME)
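
A high-level sketch of what such a staged schedule could look like as a training driver. The loader names, epoch counts, and learning rate are placeholders rather than the paper's configuration; `loss_fn` stands for an objective like the hypothetical `cavalry_style_loss` sketched above.

```python
import torch

def train_generator(generator, victim, stages, loss_fn):
    """`stages` is a list of (name, dataloader, epochs) tuples; each batch
    yields (frames, prompt_ids, target_ids). All settings are assumptions."""
    for name, loader, epochs in stages:
        opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
        for _ in range(epochs):
            for frames, prompt_ids, target_ids in loader:
                delta = generator(frames)       # one forward pass per batch
                loss = loss_fn(victim, frames, delta, prompt_ids, target_ids)
                opt.zero_grad()
                loss.backward()
                opt.step()

# Hypothetical usage mirroring the staged recipe above (the loaders are
# placeholders for the three corpora named in the list):
# train_generator(G, victim,
#                 [("pretrain/LAION-400M", laion_loader, 3),
#                  ("visual-ft/LLaVA-Instruct-150K", llava_loader, 1),
#                  ("temporal-ft/Video-MME", video_mme_loader, 1)],
#                 loss_fn=my_loss)
```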

📊 Main Results

CAVALRY-V vs. baseline attacks on MMBench-Video

[Figure: Performance comparison chart; lower scores = more successful attacks]

🔬 Further Analysis

Attack Mechanism Analysis

[Figure: Further analysis results]

🎯 Key Findings

  • 🔍 Visual Misidentification (A1): a Mercedes-Benz is misidentified as a Fiat, demonstrating corrupted object recognition under adversarial perturbation
  • 🧠 Reasoning Disruption (A2-A3): visual perception remains accurate, but answers are contextually incorrect, showing a disrupted visual-language connection
  • 💡 Knowledge-Based Resistance (A4): the attack fails when answers rely on stored knowledge rather than visual evidence (e.g., the non-Newtonian fluid question)

✨ Key Features

Universal Attack

Works on both image and video MLLMs

Efficient Generation

Single forward pass, no iterative per-sample optimization (see the sketch after this list)

Cross-Model Transfer

Effective across different architectures
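
A minimal sketch of what single-pass attack generation might look like at inference time. The generator interface and the L-infinity budget `eps` are assumptions, not values from the paper.

```python
import torch

@torch.no_grad()
def attack_video(generator, frames, eps=8 / 255):
    """One generator forward pass per clip; no per-sample optimization.
    `frames`: float tensor in [0, 1] with shape (T, C, H, W)."""
    delta = generator(frames)            # single forward pass
    delta = delta.clamp(-eps, eps)       # bound the perturbation (assumed)
    return (frames + delta).clamp(0, 1)  # keep pixels in the valid range
```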

📄 Citation

@article{zhang2025cavalry,
  title={CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs},
  author={Zhang, Jiaming and Hu, Rui and Guo, Qing and Lim, Wei Yang Bryan},
  journal={arXiv preprint arXiv:2507.00817},
  year={2025}
}