AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-Language Models

1Hong Kong University of Science and Technology
2Beijing Jiaotong University
3Fudan University
4Singapore Management University

TL;DR

We developed AnyAttack, a framework that turns ordinary images into targeted adversarial examples that fool Vision-Language Models. By pre-training on the LAION-400M dataset, our method can make a benign image (e.g., a photo of a dog) trick VLMs into generating any specified output (e.g., "this is violent content"), and it works across both open-source and commercial models.

Abstract

Vision-Language Models (VLMs) have revolutionized multimodal AI applications, yet their vulnerability to adversarial manipulation presents significant security challenges. Traditional targeted attacks require predefined labels, severely limiting their scalability and real-world impact.

We introduce AnyAttack, a novel self-supervised framework that achieves unprecedented attack flexibility through large-scale foundation model training. By pre-training an adversarial noise generator on the LAION-400M dataset without label supervision, our approach can transform any benign image into an attack vector that targets any desired output across different VLM architectures.

Our comprehensive evaluation demonstrates AnyAttack's effectiveness across five open-source VLMs (CLIP, BLIP, BLIP2, InstructBLIP, MiniGPT-4) on diverse multimodal tasks including retrieval, classification, and image captioning. Most notably, AnyAttack successfully transfers to commercial systems (Google Gemini, Claude Sonnet, Microsoft Copilot, OpenAI GPT), revealing systemic vulnerabilities in the VLM ecosystem.

This work establishes the first foundation model for adversarial attacks, fundamentally reshaping the threat landscape and highlighting the urgent need for robust defense mechanisms against this new class of scalable, transferable attacks.

Framework Overview

AnyAttack Framework Overview
Figure 1: Overview of the AnyAttack framework, illustrating both the self-supervised adversarial noise pre-training phase (top) and the self-supervised adversarial noise fine-tuning phase (bottom).

Our proposed framework, AnyAttack, introduces a novel two-phase approach to generating targeted adversarial examples without label supervision:

Adversarial Noise Pre-Training

In the pre-training phase, we leverage the large-scale LAION-400M dataset (𝒟p) to develop a universal understanding of adversarial patterns. We train a decoder network F to produce adversarial noise δ while using a frozen encoder E as the surrogate model:

  1. Given a batch of images x, we extract their embeddings using the frozen image encoder E
  2. These normalized embeddings z are fed into the decoder F, which generates adversarial noise δ
  3. To enhance generalization and computational efficiency, we introduce a K-augmentation strategy that creates multiple shuffled versions of the original images within each mini-batch
  4. The adversarial noise is added to the shuffled original images to produce the adversarial examples, and the decoder is trained so that the surrogate encoder maps each adversarial example close to its target embedding z (a minimal sketch of this loop follows below)
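The following is a minimal PyTorch sketch of this pre-training loop. Several details are assumptions made for illustration rather than the authors' released implementation: a cosine-similarity objective between adversarial and target embeddings, a tanh-bounded L∞ budget eps, and a surrogate encoder that is frozen but differentiable with respect to its input.

```python
import torch
import torch.nn.functional as F_fn


def pretrain_step(encoder, decoder, optimizer, images, K=2, eps=8 / 255):
    """One self-supervised step: each clean image serves as its own target."""
    with torch.no_grad():
        # Target embeddings z from the frozen surrogate encoder E.
        z = F_fn.normalize(encoder(images), dim=-1)

    # Decoder F maps embeddings to image-shaped noise; tanh keeps it in the budget (assumption).
    delta = eps * torch.tanh(decoder(z))

    loss = 0.0
    for _ in range(K):  # K-augmentation: K shuffled copies of the batch
        perm = torch.randperm(images.size(0), device=images.device)
        x_adv = (images[perm] + delta).clamp(0, 1)       # noise delta_i pasted onto a shuffled carrier
        z_adv = F_fn.normalize(encoder(x_adv), dim=-1)
        loss = loss - (z_adv * z).sum(dim=-1).mean()     # pull adversarial embeddings toward targets z
    loss = loss / K

    optimizer.zero_grad()
    loss.backward()     # gradients flow only into the decoder; the encoder stays frozen
    optimizer.step()
    return loss.item()
```

Because the target embedding for each noise vector comes from the image that produced it, while the carrier image is a shuffled one, the decoder learns noise that steers any carrier toward the target, which is the self-supervised signal driving the framework.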

Adversarial Noise Fine-Tuning

In the fine-tuning phase, we adapt the pre-trained decoder F to specific downstream tasks and datasets (𝒟f):

  1. We use an unrelated random image xr from an external dataset (𝒟e) as the clean image
  2. The pre-trained decoder generates adversarial noise δ specific to the target task
  3. By adding this noise to the clean image (xr + δ), we create adversarial examples that can effectively manipulate target VLMs (see the sketch below)
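As a rough illustration, the sketch below shows how a single targeted adversarial example could be crafted with the fine-tuned decoder. The function name, the tanh-based budget eps, and the use of a target image to supply the desired embedding are assumptions for illustration, not the released interface.

```python
import torch
import torch.nn.functional as F_fn


@torch.no_grad()
def craft_adversarial(encoder, decoder, x_target, x_r, eps=8 / 255):
    """Embed the desired target, decode it into noise, and add the noise to an
    unrelated clean image x_r. One forward pass through E and F, no gradients."""
    z_t = F_fn.normalize(encoder(x_target), dim=-1)  # embedding of the desired target
    delta = eps * torch.tanh(decoder(z_t))           # adversarial noise for that target
    return (x_r + delta).clamp(0, 1)                 # adversarial example x_r + delta
```

Since no iterative optimization is involved, crafting each example costs only a single forward pass, which is what makes the approach scale to large datasets.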

This two-phase approach enables AnyAttack to achieve unprecedented flexibility: any benign image can be transformed into an adversarial example capable of inducing any desired output from target VLMs. By pre-training on massive-scale data, our method develops transferable adversarial capabilities that generalize across models and tasks.

Experimental Results

We evaluated AnyAttack across both open-source and commercial Vision-Language Models, demonstrating its unprecedented transferability and effectiveness.

Results on Open-Source VLMs

Results on Open-Source VLMs
Figure 2: AnyAttack successfully manipulates five mainstream open-source VLMs (CLIP, BLIP, BLIP2, InstructBLIP, and MiniGPT-4) across three multimodal tasks, achieving high attack success rates while maintaining imperceptible perturbations.

Results on Commercial VLMs

Results on Commercial VLMs
Figure 3: AnyAttack demonstrates remarkable transferability to commercial systems, successfully manipulating Google Gemini, Claude Sonnet, Microsoft Copilot, and OpenAI GPT using the same adversarial images.

Our results reveal a systemic vulnerability across the entire VLM ecosystem. Despite differences in training data and architecture, both open-source and commercial models remain susceptible to our self-supervised attack approach. This highlights the urgent need for robust defense mechanisms against this new class of transferable adversarial attacks.

Conclusion and Future Directions

AnyAttack demonstrates that self-supervised adversarial learning at scale creates a fundamentally new security challenge for Vision-Language Models, requiring urgent development of robust defenses against this class of transferable attacks.

Looking Forward

We have open-sourced our LAION-400M pre-trained adversarial image generator, which can produce targeted adversarial examples with just a single forward pass. This is a significant efficiency improvement over traditional attack methods, which require costly iterative gradient computations for every example.

Most importantly, our pre-trained generator may offer a promising alternative to conventional adversarial training on large models. By generating diverse adversarial examples efficiently, this approach could enable more practical and scalable robustness enhancements for the next generation of multimodal AI systems.

BibTeX

@inproceedings{zhang2025anyattack,
    title={Anyattack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models},
    author={Zhang, Jiaming and Ye, Junhong and Ma, Xingjun and Li, Yige and Yang, Yunfan and Chen, Yunhao and Sang, Jitao and Yeung, Dit-Yan},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2025}
}