TL;DR
Oedipus shows a systems-level failure in reasoning-CAPTCHA security: GPT-4V and Gemini understand many challenges yet fail as end-to-end solvers, while the same tasks become much easier once decomposed into LLM-friendly steps. The paper's core contribution is a CAPTCHA DSL plus a local verifier that turns a monolithic visual puzzle into constrained search, reasoning, location, and operation steps. The measured result is not that every CAPTCHA is broken, but that the human-only boundary of the studied reasoning CAPTCHAs weakens once a multimodal model is wrapped with decomposition and verification.
1. What Problem Does This Paper Study?
The paper studies reasoning CAPTCHAs: commercial CAPTCHA designs that require spatial reasoning, logical selection, or multi-step manipulation rather than simple OCR or image classification. The authors focus on challenges from Arkose Labs, GeeTest, and NetEase Yidun, including rotation tasks, grid-based matching tasks, and 3D logical selection tasks.
The underlying security assumption is familiar: a task can be intuitive for humans yet difficult for automated systems. Oedipus questions that assumption in the LLM era. It asks whether a modern multimodal LLM, once given the right decomposition and orchestration, can turn a human-easy, AI-hard challenge into a sequence of simpler, AI-easy operations.
2. Threat Model and Security Boundary
The attacker has access to a reasoning CAPTCHA challenge as a user would see it: rendered visual content and visible task instructions. The paper does not rely on provider-side secrets, source code, DOM internals, or side-channel bugs. The attacker capability is model orchestration: use a multimodal LLM, generate a structured solution plan, and optionally connect the resulting answer to a GUI executor.
The protected asset is the CAPTCHA provider's human-verification decision. The security boundary is the assumption that a visual reasoning task can separate humans from automated agents. Oedipus attacks that boundary by changing the task representation, not by exploiting a software vulnerability in the provider.
| Boundary element | Paper-specific reading |
|---|---|
| Source | CAPTCHA image, visible question, and sampled challenge variants. |
| Capability | Multimodal LLM reasoning plus DSL generation, verification, and guided solving. |
| Policy point | The local DSL verifier enforces syntactic and task-complexity constraints before solving. |
| Sink | A final selection or operation that can be submitted to the CAPTCHA workflow. |
| Defender boundary | The paper evaluates visual challenge hardness, not a complete production anti-bot stack. |
3. CAPTCHA Categories
The paper first categorizes deployed reasoning CAPTCHAs by the kind of reasoning task they require. This matters because the weakness of direct LLM solving is not uniform: object rotation, grid manipulation, and 3D attribute reasoning fail for different reasons.
| Category | Task shape | Why it stresses LLMs |
|---|---|---|
| Rotation | Align an object with a reference direction. | Requires precise orientation recognition, not just object naming. |
| Bingo / grid | Swap or arrange items so identical elements form a line. | Requires object localization plus multi-step board manipulation. |
| 3D logical | Select an object based on color, shape, size, orientation, and spatial relations. | Requires multi-attribute binding and spatial relation reasoning. |
4. Direct LLM Solving Fails, But Not Randomly
The authors evaluate GPT-4V and Gemini with zero-shot prompting and chain-of-thought prompting. The result is a useful negative baseline: the models often understand the task, but they do not reliably finish it. GPT-4V and Gemini reach task-understanding rates of 90.1% and 86.2% under CoT, yet their end-to-end CoT success rates are only 21.0% and 17.1%.
This gap is the paper's most important empirical observation. It says the problem is not simply semantic comprehension. The failure comes from fragile intermediate operations: multi-attribute object search, orientation recognition, multi-condition judgement, and maintaining a long reasoning chain without compounding errors.
The empirical study isolates four unit operations where these failures concentrate:
- Single-criteria object search
- Multi-criteria object search
- Single-condition judgement
- Multi-condition judgement
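The single-versus-multi-criteria distinction can be made concrete. The sketch below, with hypothetical object fields and a toy scene, shows how a multi-attribute query ("the red cube") is rewritten as a chain of single-attribute filters, each one a bounded subtask of the kind the paper finds models handle far more reliably:

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    # Hypothetical output of a per-object vision query; the field names
    # are illustrative, not the paper's schema.
    name: str
    color: str
    shape: str

def search(objects, attribute, value):
    # Single-criteria search: exactly one attribute per pass.
    return [o for o in objects if getattr(o, attribute) == value]

# "Select the red cube" as chained single-criteria searches instead of
# one combined multi-attribute question.
scene = [
    DetectedObject("a", "red", "sphere"),
    DetectedObject("b", "red", "cube"),
    DetectedObject("c", "blue", "cube"),
]
candidates = search(search(scene, "color", "red"), "shape", "cube")
```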
5. Why Fine-tuning Is Not the Main Answer
The authors test supervised fine-tuning on Gemini using manually written solutions. The fully fine-tuned model improves only modestly: zero-shot success rises to 15.1% and CoT success to 18.9%, compared with the original Gemini baseline of 12.6% and 17.1%. More importantly, when one CAPTCHA category is absent from training, transfer to that category remains weak.
This motivates a system-level approach. Instead of training a model to internalize every provider-specific challenge, Oedipus externalizes the solution structure into a DSL and uses verification to keep the model inside safer, simpler operations.
6. The CAPTCHA DSL
The CAPTCHA DSL is the paper's central mechanism. It has four operational keywords: search, reason, locate, and operate. The design is intentionally restrictive. For example, search is bounded to single-attribute filtering, because the empirical study shows that multi-attribute search is where LLMs frequently hallucinate or bind attributes to the wrong object.
- search: Find objects under a simple attribute constraint.
- reason: Perform bounded reasoning, usually producing a boolean, a three-valued logic result, or an orientation attribute.
- locate: Map the selected object to a concrete position in the image.
- operate: Represent a high-level interaction such as selection or manipulation.
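To make the restriction tangible, here is a minimal sketch of a DSL script and the kind of check a local verifier could apply. The tuple encoding of steps and the verifier rules are illustrative assumptions, not the paper's actual grammar:

```python
ALLOWED_KEYWORDS = {"search", "reason", "locate", "operate"}

def verify(script):
    # Local verifier sketch: reject unknown keywords and multi-attribute
    # search steps, mirroring the single-attribute restriction on search.
    for keyword, args in script:
        if keyword not in ALLOWED_KEYWORDS:
            return False, f"unknown keyword: {keyword}"
        if keyword == "search" and len(args) > 1:
            return False, "search must filter on exactly one attribute"
    return True, "ok"

# An illustrative script for a rotation challenge.
script = [
    ("search", {"shape": "arrow"}),
    ("reason", {"question": "current orientation of the arrow"}),
    ("locate", {"target": "arrow"}),
    ("operate", {"action": "rotate until aligned with reference"}),
]
ok, msg = verify(script)
```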
7. Oedipus Architecture
Oedipus has two phases. Offline, the authors prepare DSL examples and tune or prompt a model to generate CAPTCHA DSL scripts. Online, the system generates a DSL solution for a new challenge, checks it with a local verifier, translates the verified script into natural-language instructions, and gives those instructions plus the challenge image to the multimodal solver model.
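The online phase above can be wired together as a short pipeline. The stage functions below are deterministic stand-ins for the paper's components (the real planner and solver are multimodal LLM calls); only the control flow is the point:

```python
def generate_dsl(challenge, task_hint):
    # Stand-in for the DSL-generating LLM call; returns a fixed plan here.
    return [("search", "shape=arrow"), ("reason", "orientation"),
            ("locate", "arrow"), ("operate", "rotate")]

def verify_dsl(script):
    # Local verifier: the policy enforcement point before any solving.
    allowed = {"search", "reason", "locate", "operate"}
    return all(keyword in allowed for keyword, _ in script)

def to_instructions(script):
    # Verified script -> numbered natural-language steps for the solver.
    return [f"Step {i}: {kw} {arg}" for i, (kw, arg) in enumerate(script, 1)]

def solve(challenge, steps):
    # Stand-in for the multimodal solver; echoes the plan it would follow.
    return {"challenge": challenge, "plan": steps}

def oedipus_online(challenge, task_hint):
    script = generate_dsl(challenge, task_hint)
    if not verify_dsl(script):
        raise ValueError("script rejected by local verifier")
    return solve(challenge, to_instructions(script))

result = oedipus_online("rotation.png", "rotate the animal upright")
```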
The paper also describes a possible executor layer that can submit the final answer through GUI actions. For defensive reading, the key point is not the mechanics of submission; it is that the model-facing task has been made smaller, more typed, and easier to verify.
8. Security Abstraction
The clean abstraction is source, planner, verifier, solver, and sink. The CAPTCHA image is an untrusted challenge source; the DSL generator is a planner; the local verifier is the policy enforcement point; the multimodal solver is the capability that interprets verified instructions; and the final answer is the sink that can affect the verification outcome.
This abstraction separates the paper from generic prompt engineering. Oedipus improves reliability by inserting an enforcement point between free-form model planning and final action, then constraining the plan to operations backed by empirical subtask success rates.
9. Main Evaluation Results
The evaluation uses four commercial reasoning CAPTCHA types, 100 samples per type, 10 repetitions per setting, and several multimodal LLM backends. The paper reports 28,000 total experiments and 1565.26 USD in total LLM API cost. The strongest reported configuration is Oedipus with Claude-3.7, reaching 73.8% average success. GPT-4V reaches 63.5%, GPT-4o reaches 62.7%, Gemini-2.0-Flash reaches 65.0%, and the reconstructed VTT baseline reaches 25.1%.
| System | Average success rate | Reading |
|---|---|---|
| Oedipus + Claude-3.7 | 73.8% | Best reported result; stronger reasoning model gives a clear lift. |
| Oedipus + Gemini-2.0-Flash | 65.0% | Lower-cost newer model, competitive success. |
| Oedipus + GPT-4V | 63.5% | The headline average also used in the paper conclusion. |
| Oedipus + GPT-4o | 62.7% | Similar accuracy to GPT-4V with faster and cheaper inference. |
| VTT reconstructed | 25.1% | Vision-only baseline is much weaker and does not cover all tasks. |
10. Ablations: What Actually Matters?
The ablations support the paper's design. First-attempt DSL generation succeeds only 59.8% of the time on average. With verifier feedback, DSL generation success rises to 93.7%, and the corresponding solving success rises from 39.5% to 63.5%. This shows that local structured feedback is not decoration; it is central to the system.
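The regenerate-on-rejection loop behind these numbers can be sketched in a few lines. Everything here is illustrative: a toy generator that only conforms after seeing the verifier's complaint stands in for the real LLM consuming structured feedback:

```python
def generate_with_feedback(generate, verify, max_attempts=3):
    # Regenerate until the verifier accepts, passing the verifier's
    # error message back into the next generation attempt.
    feedback = None
    for _ in range(max_attempts):
        script = generate(feedback)
        ok, error = verify(script)
        if ok:
            return script
        feedback = error
    return None

def toy_generate(feedback):
    # First attempt uses a forbidden keyword; conforms once corrected.
    if feedback is None:
        return [("grab", "object")]
    return [("search", "color=red"), ("operate", "click")]

def toy_verify(script):
    allowed = {"search", "reason", "locate", "operate"}
    for keyword, _ in script:
        if keyword not in allowed:
            return False, f"unknown keyword: {keyword}"
    return True, ""

accepted = generate_with_feedback(toy_generate, toy_verify)
```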
Task breakdown also matters. In Table 8, Gobang rises from 2.6% to 80.2% for GPT-4V and from 3.2% to 78.2% for GPT-4o when direct solving is replaced by task breakdown. For spatial tasks, direct prompting already has some success, but decomposition still improves accuracy.
11. Transferability and Cost
The authors test two newer CAPTCHA types introduced in 2023: Arkose-AngularV2 and GeeTest-IconCrush. Oedipus transfers partially, with Claude-3.7 reaching 63.8% average and GPT-4o reaching 53.6%. This is important because it suggests the DSL abstraction can survive some provider-side design changes as long as the new task is composed of covered unit operations.
The cost result is also strategically important. The paper states that Oedipus costs as little as 1.03 USD per 100 CAPTCHAs solved, and Table 6 reports Gemini-2.0-Flash costs of 0.4, 3.6, 0.06, and 0.07 USD per 100 successful solutions across the four evaluated challenge types. Latency remains a limitation: Table 10 reports average solving times of 146.5 seconds for GPT-4V, 92.4 for GPT-4o, 128.6 for Gemini, 63.6 for Gemini-2.0-Flash, and 40.4 for Claude-3.7.
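Per-success cost figures like these follow from a simple expectation: each success needs on average 1/p attempts at success rate p. A small sketch with illustrative numbers (not taken from the paper's tables):

```python
def expected_cost_per_100_successes(cost_per_attempt_usd, success_rate):
    # Each success takes 1 / success_rate attempts in expectation,
    # assuming independent attempts; scale by 100 for a per-100 figure.
    return 100 * cost_per_attempt_usd / success_rate

# Illustrative: $0.007 per attempt at a 65% success rate.
cost = expected_cost_per_100_successes(0.007, 0.65)
```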
12. Cautious Interpretation
The strongest security takeaway is about decomposition. A CAPTCHA can be hard when viewed as one visual reasoning puzzle, yet fragile when decomposed into single-attribute searches, local judgements, and simple operations. Defenders should therefore evaluate whether a challenge can be represented as a sequence of bounded model-friendly primitives, not only whether the full puzzle looks cognitively rich.
The paper also implies that CAPTCHA security is now coupled to model progress. Oedipus does not need a new architecture to benefit from a stronger multimodal model; swapping the backend from older models to newer ones already improves success and cost. This makes static visual challenge design a moving-target defense.
13. Defense Implications
The paper proposes three CAPTCHA design directions: longer reasoning chains, deceptive object recognition through adversarial examples, and unit operations outside current LLM capabilities such as intuitive physical dynamics. These may raise cost, but the broader implication is that visual puzzle hardness should be only one layer.
A more robust anti-abuse stack should combine visual challenges with behavior signals, device and session risk, retry limits, anomaly detection, and adaptive challenge selection. Once LLM agents can reason over pixels and tools, relying on a static puzzle alone is increasingly brittle.
14. Builder Takeaways
- Evaluate CAPTCHA designs under decomposition, not only direct prompting. A challenge is weaker if it can be expressed as stable single-attribute searches and local judgements.
- Treat visual challenge difficulty as one signal in an anti-abuse stack, not as the complete security boundary.
- Use enforcement points outside the model when building AI systems. Oedipus is offensive research, but its verifier lesson generalizes: constrain plans before they reach sensitive sinks.
- Monitor model capability drift. A CAPTCHA that was AI-hard against one model generation may become solvable when the backend model improves.
15. Relationship to This Site
This page belongs next to the site's CAPTCHA and agent-security readings. Compared with Halligan-style visual GUI agents, Oedipus emphasizes a different security abstraction: not broad UI-state search, but a constrained planning language that inserts a verifier between model reasoning and the final action sink. Both papers point to the same PaperTrace theme: once models can reason over visual interfaces and tools, security depends on explicit boundaries, not on hoping a task remains intuitively hard.