SecurityPaperTrace — AI Security Deep-Dives

核心摘要

This paper names and systematizes indirect prompt injection: attackers do not prompt the model directly, but plant instructions in data that an LLM-integrated application later retrieves. Once the model reads that data, the injected text can steer model behavior, tool calls, search queries, emails, memory, or user-facing summaries. The paper builds a security taxonomy and demonstrates the issue on Bing Chat, GitHub Copilot, and synthetic GPT-4/tool-using applications.

Attacker controls external content

Web page, email, document, code context, hidden text, encoded payload

plants

User asks a normal question

The user never types the attack prompt

triggers

LLM-integrated app retrieves data

Retrieval makes untrusted data part of model context

ingests

Model treats data as instruction

The paper argues that data and instruction modalities are not disentangled

executes

Tools turn behavior drift into security impact

Search, view, URL retrieval, email, address book, memory

Remote delivery

Inference-time compromise

Tool/API amplification

1. What Problem Does This Paper Identify?

Classic prompt injection assumes the attacker directly prompts the same model instance they want to manipulate. This paper changes the threat model: the attacker can be remote and indirect. They place instructions into content that an LLM-integrated application is likely to read during inference.

The core security observation is that retrieval-augmented applications blur the boundary between data and instructions. A search result, email, document, website, or code file is supposed to be data, but the model may interpret text inside it as a higher-priority instruction.

2. Threat Model: Who Does What?

The paper separates the attacker, the benign user, the LLM-integrated application, external data sources, and tools/APIs. The attacker does not need direct access to the victim's chat. Instead, the attacker controls or influences content that the application may retrieve.

Step 1

Attacker plants hostile instructions in external data

The paper studies prompts placed in web pages, emails, retrieved documents, code context, hidden comments, images, or encoded payloads.

3. Injection Delivery Methods

The paper organizes delivery methods by how the malicious instruction enters the model context. This is useful because defenses at the chat input box alone do not cover content pulled in by retrieval or tools.

Passive retrieval

Prompts are placed in public sources such as web pages, social media posts, documentation, or code repositories that may be retrieved later.

Active delivery

Prompts are sent through channels the application processes, such as email for an LLM-augmented assistant.

User-driven delivery

Users can be tricked into copying or entering text that secretly contains adversarial instructions.

Hidden or staged payloads

The paper discusses hidden text, multi-stage retrieval, visual injections, and encoded prompts as harder-to-detect variants.

4. Threat Taxonomy

Rather than treating indirect prompt injection as one trick, the paper maps it to a broader security landscape. The taxonomy is threat-based so it can generalize as models and integration patterns evolve.

Threat	What the injection tries to change
Information gathering	Collect or leak user data, chat content, credentials, or private context
Fraud	Make the application recommend scams, phishing pages, or credential requests
Intrusion	Turn tool access, memory, or code completion into a backdoor-like capability
Malware-like spreading	Use prompts as self-propagating instructions in email, memory, or other shared channels
Manipulated content	Bias summaries, hide facts, promote disinformation, or alter search behavior
Availability	Disrupt the model's usefulness, API calls, search queries, or response behavior

5. Evaluation: What Did They Test?

The evaluation is qualitative rather than a benchmark with success rates. The authors built synthetic LLM-integrated applications with tool interfaces and also tested real-world systems. Their goal was to show practical viability and map the attack surface, not to claim a universal exploit rate.

Synthetic applications: chat apps with tools for search, page viewing, URL retrieval, email read/send, address book access, and key-value memory.
Bing Chat: tested as a black-box real-world LLM-integrated search application, including the Edge sidebar's ability to read the current page.
GitHub Copilot: tested for prompt injection attacks that aim to manipulate code auto-completion from surrounding code context.

6. Why Tool Use Changes the Risk

A plain chatbot can mostly harm the conversation. A tool-using LLM-integrated application can affect external operations. The paper's synthetic setup gives the agent interfaces such as search, view, retrieve URL, read/send email, read address book, and memory. This turns prompt-following errors into security-relevant actions.

The important design lesson is not that every tool is dangerous. It is that tool authority must be separated from untrusted retrieved text. If the same model context contains user goals, system rules, tool descriptions, and arbitrary web content, the application needs strong boundaries around which text can authorize action.

7. Limitations and Responsible Framing

The authors avoided planting prompts into public sources that could be retrieved by other users. For real-world systems, they used local HTML files and controlled demonstrations where possible. They also disclose that exact reproducibility is hard for black-box systems whose models, prompts, filters, and retrieval environments can change.

The paper's strongest claims are therefore about attack surface and practical feasibility, not precise prevalence. That distinction matters: the work is a security warning and framework for evaluation, not a complete measurement of deployed risk.

8. Mitigation Ideas

The paper argues that simple filtering is brittle, especially when injections are hidden, staged, encoded, or embedded in modalities beyond text. It discusses several directions without presenting a foolproof defense.

Filter or classify retrieved inputs before passing them to an instruction-following model.
Use supervisory or moderator models that detect malicious goals without directly executing retrieved instructions.
Constrain tool calls so retrieved text cannot authorize sensitive actions by itself.
Verify model outputs against retrieved sources, while recognizing that this can itself create new prompt-injection pitfalls.

9. Why It Matters Now

This paper is a foundation for thinking about RAG and agent security. It reframes prompt injection from a jailbreak curiosity into a systems problem: untrusted inputs, confused deputies, authority boundaries, side effects, persistence, and propagation.

For builders, the practical takeaway is clear: once an LLM can retrieve data and act through tools, prompt handling becomes part of the application's security boundary. Treat retrieved text as hostile input, not as trusted instruction.

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection