TL;DR
Oedipus shows a systems-level failure in reasoning-CAPTCHA security: GPT-4V and Gemini understand many challenges yet fail as end-to-end solvers, while the same tasks become much easier once decomposed into LLM-friendly steps. The paper's core contribution is a CAPTCHA DSL plus a local verifier that turns a monolithic visual puzzle into constrained search, reasoning, location, and operation steps. The measured result is not that every CAPTCHA is broken, but that the human-only boundary of the studied reasoning CAPTCHAs weakens once a multimodal model is wrapped with decomposition and verification.
1. What Problem Does This Paper Study?
The paper studies reasoning CAPTCHAs: commercial CAPTCHA designs that require spatial reasoning, logical selection, or multi-step manipulation rather than simple OCR or image classification. The authors focus on challenges from Arkose Labs, GeeTest, and NetEase Yidun, including rotation tasks, grid-based matching tasks, and 3D logical selection tasks.
The underlying security assumption is familiar: a task can be intuitive for humans yet difficult for automated systems. Oedipus questions that assumption in the LLM era. It asks whether a modern multimodal LLM, once given the right decomposition and orchestration, can turn a human-easy, AI-hard challenge into a sequence of simpler, AI-easy operations.
2. Threat Model and Security Boundary
The attacker has access to a reasoning CAPTCHA challenge as a user would see it: rendered visual content and visible task instructions. The paper does not rely on provider-side secrets, source code, DOM internals, or side-channel bugs. The attacker capability is model orchestration: use a multimodal LLM, generate a structured solution plan, and optionally connect the resulting answer to a GUI executor.
The protected asset is the CAPTCHA provider's human-verification decision. The security boundary is the assumption that a visual reasoning task can separate humans from automated agents. Oedipus attacks that boundary by changing the task representation, not by exploiting a software vulnerability in the provider.
| Boundary element | Paper-specific reading |
|---|---|
| Source | CAPTCHA image, visible question, and sampled challenge variants. |
| Capability | Multimodal LLM reasoning plus DSL generation, verification, and guided solving. |
| Policy point | The local DSL verifier enforces syntactic and task-complexity constraints before solving. |
| Sink | A final selection or operation that can be submitted to the CAPTCHA workflow. |
| Defender boundary | The paper evaluates visual challenge hardness, not a complete production anti-bot stack. |
3. CAPTCHA Categories
The paper first categorizes deployed reasoning CAPTCHAs by the kind of reasoning task they require. This matters because the weakness of direct LLM solving is not uniform: object rotation, grid manipulation, and 3D attribute reasoning fail for different reasons.
| Category | Task shape | Why it stresses LLMs |
|---|---|---|
| Rotation | Align an object with a reference direction. | Requires precise orientation recognition, not just object naming. |
| Bingo / grid | Swap or arrange items so identical elements form a line. | Requires object localization plus multi-step board manipulation. |
| 3D logical | Select an object based on color, shape, size, orientation, and spatial relations. | Requires multi-attribute binding and spatial relation reasoning. |
4. Direct LLM Solving Fails, But Not Randomly
The authors evaluate GPT-4V and Gemini with zero-shot prompting and chain-of-thought prompting. The result is a useful negative baseline: the models often understand the task, but they do not reliably finish it. GPT-4V and Gemini reach task-understanding rates of 90.1% and 86.2% under CoT, yet their end-to-end CoT success rates are only 21.0% and 17.1%.
This gap is the paper's most important empirical observation. It says the problem is not simply semantic comprehension. The failure comes from fragile intermediate operations: multi-attribute object search, orientation recognition, multi-condition judgement, and maintaining a long reasoning chain without compounding errors.
The empirical study isolates four unit operations where these failures concentrate:
- Single-criteria object search
- Multi-criteria object search
- Single-condition judgement
- Multi-condition judgement
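The single-versus-multi-criteria distinction can be made concrete. The sketch below, with hypothetical object fields and a toy scene, shows how a multi-attribute query ("the red cube") is rewritten as a chain of single-attribute filters, each one a bounded subtask of the kind the paper finds models handle far more reliably:

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    # Hypothetical output of a per-object vision query; the field names
    # are illustrative, not the paper's schema.
    name: str
    color: str
    shape: str

def search(objects, attribute, value):
    # Single-criteria search: exactly one attribute per pass.
    return [o for o in objects if getattr(o, attribute) == value]

# "Select the red cube" as chained single-criteria searches instead of
# one combined multi-attribute question.
scene = [
    DetectedObject("a", "red", "sphere"),
    DetectedObject("b", "red", "cube"),
    DetectedObject("c", "blue", "cube"),
]
candidates = search(search(scene, "color", "red"), "shape", "cube")
```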
5. Why Fine-tuning Is Not the Main Answer
The authors test supervised fine-tuning on Gemini using manually written solutions. The fully fine-tuned model improves only modestly: zero-shot success rises to 15.1% and CoT success to 18.9%, compared with the original Gemini baseline of 12.6% and 17.1%. More importantly, when one CAPTCHA category is absent from training, transfer to that category remains weak.
This motivates a system-level approach. Instead of training a model to internalize every provider-specific challenge, Oedipus externalizes the solution structure into a DSL and uses verification to keep the model inside safer, simpler operations.
6. The CAPTCHA DSL
The CAPTCHA DSL is the paper's central mechanism. It has four operational keywords: search, reason, locate, and operate. The design is intentionally restrictive. For example, search is bounded to single-attribute filtering, because the empirical study shows that multi-attribute search is where LLMs frequently hallucinate or bind attributes to the wrong object.
- search: Find objects under a simple attribute constraint.
- reason: Perform bounded reasoning, usually producing a boolean, a three-valued logic result, or an orientation attribute.
- locate: Map the selected object to a concrete position in the image.
- operate: Represent a high-level interaction such as selection or manipulation.
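To make the restriction tangible, here is a minimal sketch of a DSL script and the kind of check a local verifier could apply. The tuple encoding of steps and the verifier rules are illustrative assumptions, not the paper's actual grammar:

```python
ALLOWED_KEYWORDS = {"search", "reason", "locate", "operate"}

def verify(script):
    # Local verifier sketch: reject unknown keywords and multi-attribute
    # search steps, mirroring the single-attribute restriction on search.
    for keyword, args in script:
        if keyword not in ALLOWED_KEYWORDS:
            return False, f"unknown keyword: {keyword}"
        if keyword == "search" and len(args) > 1:
            return False, "search must filter on exactly one attribute"
    return True, "ok"

# An illustrative script for a rotation challenge.
script = [
    ("search", {"shape": "arrow"}),
    ("reason", {"question": "current orientation of the arrow"}),
    ("locate", {"target": "arrow"}),
    ("operate", {"action": "rotate until aligned with reference"}),
]
ok, msg = verify(script)
```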
7. Oedipus Architecture
Oedipus has two phases. Offline, the authors prepare DSL examples and tune or prompt a model to generate CAPTCHA DSL scripts. Online, the system generates a DSL solution for a new challenge, checks it with a local verifier, translates the verified script into natural-language instructions, and gives those instructions plus the challenge image to the multimodal solver model.
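The online phase above can be wired together as a short pipeline. The stage functions below are deterministic stand-ins for the paper's components (the real planner and solver are multimodal LLM calls); only the control flow is the point:

```python
def generate_dsl(challenge, task_hint):
    # Stand-in for the DSL-generating LLM call; returns a fixed plan here.
    return [("search", "shape=arrow"), ("reason", "orientation"),
            ("locate", "arrow"), ("operate", "rotate")]

def verify_dsl(script):
    # Local verifier: the policy enforcement point before any solving.
    allowed = {"search", "reason", "locate", "operate"}
    return all(keyword in allowed for keyword, _ in script)

def to_instructions(script):
    # Verified script -> numbered natural-language steps for the solver.
    return [f"Step {i}: {kw} {arg}" for i, (kw, arg) in enumerate(script, 1)]

def solve(challenge, steps):
    # Stand-in for the multimodal solver; echoes the plan it would follow.
    return {"challenge": challenge, "plan": steps}

def oedipus_online(challenge, task_hint):
    script = generate_dsl(challenge, task_hint)
    if not verify_dsl(script):
        raise ValueError("script rejected by local verifier")
    return solve(challenge, to_instructions(script))

result = oedipus_online("rotation.png", "rotate the animal upright")
```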
The paper also describes a possible executor layer that can submit the final answer through GUI actions. For defensive reading, the key point is not the mechanics of submission; it is that the model-facing task has been made smaller, more typed, and easier to verify.
8. Security Abstraction
The clean abstraction is source, planner, verifier, solver, and sink. The CAPTCHA image is an untrusted challenge source; the DSL generator is a planner; the local verifier is the policy enforcement point; the multimodal solver is the capability that interprets verified instructions; and the final answer is the sink that can affect the verification outcome.
This abstraction separates the paper from generic prompt engineering. Oedipus improves reliability by inserting an enforcement point between free-form model planning and final action, then constraining the plan to operations backed by empirical subtask success rates.
9. Main Evaluation Results
The evaluation uses four commercial reasoning CAPTCHA types, 100 samples per type, 10 repetitions per setting, and several multimodal LLM backends. The paper reports 28,000 total experiments and 1565.26 USD in total LLM API cost. The strongest reported configuration is Oedipus with Claude-3.7, reaching 73.8% average success. GPT-4V reaches 63.5%, GPT-4o reaches 62.7%, Gemini-2.0-Flash reaches 65.0%, and the reconstructed VTT baseline reaches 25.1%.
| System | Average success rate | Reading |
|---|---|---|
| Oedipus + Claude-3.7 | 73.8% | Best reported result; stronger reasoning model gives a clear lift. |
| Oedipus + Gemini-2.0-Flash | 65.0% | Lower-cost newer model, competitive success. |
| Oedipus + GPT-4V | 63.5% | The headline average also used in the paper conclusion. |
| Oedipus + GPT-4o | 62.7% | Similar accuracy to GPT-4V with faster and cheaper inference. |
| VTT reconstructed | 25.1% | Vision-only baseline is much weaker and does not cover all tasks. |
10. Ablations: What Actually Matters?
The ablations support the paper's design. First-attempt DSL generation succeeds only 59.8% of the time on average. With verifier feedback, DSL generation success rises to 93.7%, and the corresponding solving success rises from 39.5% to 63.5%. This shows that local structured feedback is not decoration; it is central to the system.
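The regenerate-on-rejection loop behind these numbers can be sketched in a few lines. Everything here is illustrative: a toy generator that only conforms after seeing the verifier's complaint stands in for the real LLM consuming structured feedback:

```python
def generate_with_feedback(generate, verify, max_attempts=3):
    # Regenerate until the verifier accepts, passing the verifier's
    # error message back into the next generation attempt.
    feedback = None
    for _ in range(max_attempts):
        script = generate(feedback)
        ok, error = verify(script)
        if ok:
            return script
        feedback = error
    return None

def toy_generate(feedback):
    # First attempt uses a forbidden keyword; conforms once corrected.
    if feedback is None:
        return [("grab", "object")]
    return [("search", "color=red"), ("operate", "click")]

def toy_verify(script):
    allowed = {"search", "reason", "locate", "operate"}
    for keyword, _ in script:
        if keyword not in allowed:
            return False, f"unknown keyword: {keyword}"
    return True, ""

accepted = generate_with_feedback(toy_generate, toy_verify)
```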
Task breakdown also matters. In Table 8, Gobang rises from 2.6% to 80.2% for GPT-4V and from 3.2% to 78.2% for GPT-4o when direct solving is replaced by task breakdown. For spatial tasks, direct prompting already has some success, but decomposition still improves accuracy.
11. Transferability and Cost
The authors test two newer CAPTCHA types introduced in 2023: Arkose-AngularV2 and GeeTest-IconCrush. Oedipus transfers partially, with Claude-3.7 reaching 63.8% average and GPT-4o reaching 53.6%. This is important because it suggests the DSL abstraction can survive some provider-side design changes as long as the new task is composed of covered unit operations.
The cost result is also strategically important. The paper states that Oedipus costs as little as 1.03 USD per 100 CAPTCHAs solved, and Table 6 reports Gemini-2.0-Flash costs of 0.4, 3.6, 0.06, and 0.07 USD per 100 successful solutions across the four evaluated challenge types. Latency remains a limitation: Table 10 reports average solving times of 146.5 seconds for GPT-4V, 92.4 for GPT-4o, 128.6 for Gemini, 63.6 for Gemini-2.0-Flash, and 40.4 for Claude-3.7.
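Per-success cost figures like these follow from a simple expectation: each success needs on average 1/p attempts at success rate p. A small sketch with illustrative numbers (not taken from the paper's tables):

```python
def expected_cost_per_100_successes(cost_per_attempt_usd, success_rate):
    # Each success takes 1 / success_rate attempts in expectation,
    # assuming independent attempts; scale by 100 for a per-100 figure.
    return 100 * cost_per_attempt_usd / success_rate

# Illustrative: $0.007 per attempt at a 65% success rate.
cost = expected_cost_per_100_successes(0.007, 0.65)
```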
12. Cautious Interpretation
The strongest security takeaway is about decomposition. A CAPTCHA can be hard when viewed as one visual reasoning puzzle, yet fragile when decomposed into single-attribute searches, local judgements, and simple operations. Defenders should therefore evaluate whether a challenge can be represented as a sequence of bounded model-friendly primitives, not only whether the full puzzle looks cognitively rich.
The paper also implies that CAPTCHA security is now coupled to model progress. Oedipus does not need a new architecture to benefit from a stronger multimodal model; swapping the backend from older models to newer ones already improves success and cost. This makes static visual challenge design a moving-target defense.
13. Defense Implications
The paper proposes three CAPTCHA design directions: longer reasoning chains, deceptive object recognition through adversarial examples, and unit operations outside current LLM capabilities such as intuitive physical dynamics. These may raise cost, but the broader implication is that visual puzzle hardness should be only one layer.
A more robust anti-abuse stack should combine visual challenges with behavior signals, device and session risk, retry limits, anomaly detection, and adaptive challenge selection. Once LLM agents can reason over pixels and tools, relying on a static puzzle alone is increasingly brittle.
14. Builder Takeaways
- Evaluate CAPTCHA designs under decomposition, not only direct prompting. A challenge is weaker if it can be expressed as stable single-attribute searches and local judgements.
- Treat visual challenge difficulty as one signal in an anti-abuse stack, not as the complete security boundary.
- Use enforcement points outside the model when building AI systems. Oedipus is offensive research, but its verifier lesson generalizes: constrain plans before they reach sensitive sinks.
- Monitor model capability drift. A CAPTCHA that was AI-hard against one model generation may become solvable when the backend model improves.
15. Relationship to This Site
This page belongs next to the site's CAPTCHA and agent-security readings. Compared with Halligan-style visual GUI agents, Oedipus emphasizes a different security abstraction: not broad UI-state search, but a constrained planning language that inserts a verifier between model reasoning and the final action sink. Both papers point to the same PaperTrace theme: once models can reason over visual interfaces and tools, security depends on explicit boundaries, not on hoping a task remains intuitively hard.