核心摘要
This paper names and systematizes indirect prompt injection: attackers do not prompt the model directly, but plant instructions in data that an LLM-integrated application later retrieves. Once the model reads that data, the injected text can steer model behavior, tool calls, search queries, emails, memory, or user-facing summaries. The paper builds a security taxonomy and demonstrates the issue on Bing Chat, GitHub Copilot, and synthetic GPT-4/tool-using applications.
1. What Problem Does This Paper Identify?
Classic prompt injection assumes the attacker directly prompts the same model instance they want to manipulate. This paper changes the threat model: the attacker can be remote and indirect. They place instructions into content that an LLM-integrated application is likely to read during inference.
The core security observation is that retrieval-augmented applications blur the boundary between data and instructions. A search result, email, document, website, or code file is supposed to be data, but the model may interpret text inside it as a higher-priority instruction.
2. Threat Model: Who Does What?
The paper separates the attacker, the benign user, the LLM-integrated application, external data sources, and tools/APIs. The attacker does not need direct access to the victim's chat. Instead, the attacker controls or influences content that the application may retrieve.
3. Injection Delivery Methods
The paper organizes delivery methods by how the malicious instruction enters the model context. This is useful because defenses at the chat input box alone do not cover content pulled in by retrieval or tools.
Passive retrieval
Prompts are placed in public sources such as web pages, social media posts, documentation, or code repositories that may be retrieved later.
Active delivery
Prompts are sent through channels the application processes, such as email for an LLM-augmented assistant.
User-driven delivery
Users can be tricked into copying or entering text that secretly contains adversarial instructions.
Hidden or staged payloads
The paper discusses hidden text, multi-stage retrieval, visual injections, and encoded prompts as harder-to-detect variants.
4. Threat Taxonomy
Rather than treating indirect prompt injection as one trick, the paper maps it to a broader security landscape. The taxonomy is threat-based so it can generalize as models and integration patterns evolve.
| Threat | What the injection tries to change |
|---|---|
| Information gathering | Collect or leak user data, chat content, credentials, or private context |
| Fraud | Make the application recommend scams, phishing pages, or credential requests |
| Intrusion | Turn tool access, memory, or code completion into a backdoor-like capability |
| Malware-like spreading | Use prompts as self-propagating instructions in email, memory, or other shared channels |
| Manipulated content | Bias summaries, hide facts, promote disinformation, or alter search behavior |
| Availability | Disrupt the model's usefulness, API calls, search queries, or response behavior |
5. Evaluation: What Did They Test?
The evaluation is qualitative rather than a benchmark with success rates. The authors built synthetic LLM-integrated applications with tool interfaces and also tested real-world systems. Their goal was to show practical viability and map the attack surface, not to claim a universal exploit rate.
- Synthetic applications: chat apps with tools for search, page viewing, URL retrieval, email read/send, address book access, and key-value memory.
- Bing Chat: tested as a black-box real-world LLM-integrated search application, including the Edge sidebar's ability to read the current page.
- GitHub Copilot: tested for prompt injection attacks that aim to manipulate code auto-completion from surrounding code context.
6. Why Tool Use Changes the Risk
A plain chatbot can mostly harm the conversation. A tool-using LLM-integrated application can affect external operations. The paper's synthetic setup gives the agent interfaces such as search, view, retrieve URL, read/send email, read address book, and memory. This turns prompt-following errors into security-relevant actions.
The important design lesson is not that every tool is dangerous. It is that tool authority must be separated from untrusted retrieved text. If the same model context contains user goals, system rules, tool descriptions, and arbitrary web content, the application needs strong boundaries around which text can authorize action.
7. Limitations and Responsible Framing
The authors avoided planting prompts into public sources that could be retrieved by other users. For real-world systems, they used local HTML files and controlled demonstrations where possible. They also disclose that exact reproducibility is hard for black-box systems whose models, prompts, filters, and retrieval environments can change.
The paper's strongest claims are therefore about attack surface and practical feasibility, not precise prevalence. That distinction matters: the work is a security warning and framework for evaluation, not a complete measurement of deployed risk.
8. Mitigation Ideas
The paper argues that simple filtering is brittle, especially when injections are hidden, staged, encoded, or embedded in modalities beyond text. It discusses several directions without presenting a foolproof defense.
- Filter or classify retrieved inputs before passing them to an instruction-following model.
- Use supervisory or moderator models that detect malicious goals without directly executing retrieved instructions.
- Constrain tool calls so retrieved text cannot authorize sensitive actions by itself.
- Verify model outputs against retrieved sources, while recognizing that this can itself create new prompt-injection pitfalls.
9. Why It Matters Now
This paper is a foundation for thinking about RAG and agent security. It reframes prompt injection from a jailbreak curiosity into a systems problem: untrusted inputs, confused deputies, authority boundaries, side effects, persistence, and propagation.
For builders, the practical takeaway is clear: once an LLM can retrieve data and act through tools, prompt handling becomes part of the application's security boundary. Treat retrieved text as hostile input, not as trusted instruction.