Core Thesis
The paper argues that visual CAPTCHAs are no longer reliably bot-hard once a VLM is wrapped as an agent with tools. Halligan does not train a custom solver for one CAPTCHA family. It turns each visual challenge into a search problem: infer the objective, abstract the interface into frames and interactable entities, explore candidate states with click/drag/slide/swap tools, and use GPT-4o-based visual evaluation to pick the best state.
1. What Problem Does This Paper Identify?
Traditional CAPTCHA security relies on an asymmetry: humans can solve the puzzle cheaply, while bots cannot. The paper claims this asymmetry is eroding. Earlier attacks usually targeted one family, such as text CAPTCHAs, reCAPTCHA image grids, sliders, or rotations. Defenders could switch challenge types and invalidate specialized solvers.
Halligan changes the attack shape. It tries to solve previously unseen visual challenges through general visual reasoning, GUI abstraction, and search. The security question is therefore no longer 'can a model learn this CAPTCHA distribution?', but 'can a general visual agent infer and manipulate this task at run time?'.
2. Threat Model
The attacker can use local or online VLMs and build an agentic orchestration around them. The attacker does not need CAPTCHA source code, DOM internals, provider-side secrets, or side-channel bugs. The solver operates at the vision level with screen coordinates, mouse, and keyboard actions.
The defender may create new visual challenges and add adversarial visual or prompt-based defenses. The paper explicitly excludes behavioral-only and proof-of-work systems from the main benchmark, because the target is visual challenge solving.
3. Method: From CAPTCHA to Search
The method has three stages. First, objective identification prompts the VLM to describe frames, infer semantic relations such as instruct/act/refer/terminate, and produce the target condition. Second, abstraction extracts frames, elements, and keypoints; it labels interactables as clickable, inputtable, pointable, swappable, selectable, draggable, slideable, or next. Third, solving uses exploration tools to generate neighboring states and evaluation tools to rank or compare them.
| Component | Role in the system |
|---|---|
| Frame extractor | Partitions the CAPTCHA screenshot into nested or disjoint visual regions. |
| Element and keypoint extractors | Use segmentation and saliency to locate visible objects and candidate click points. |
| Space exploration tools | Type, click, swap, slide, and drag to create candidate CAPTCHA states. |
| Solution evaluation tools | Rank or compare candidate states using the inferred objective. |
| Enhancing tools | Mark, focus, ask, and match to make visual comparison easier for the VLM. |
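The three stages above reduce to a generic search loop: enumerate neighbor states with exploration tools, score them with a visual evaluator, and keep the best. The sketch below is illustrative only; `State`, `explore`, and the pluggable `evaluate` callback are hypothetical stand-ins, not Halligan's actual tool API (in the real system, evaluation is a GPT-4o visual comparison of rendered screenshots).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """An abstract CAPTCHA state: the tool actions applied so far."""
    actions: tuple = ()


def explore(state, candidate_actions):
    """Generate neighbor states by applying one more action.
    Stand-in for the click/drag/slide/swap exploration tools."""
    return [State(state.actions + (act,)) for act in candidate_actions]


def solve(objective, candidate_actions, evaluate, max_depth=3):
    """Greedy best-first search: repeatedly take the neighbor the
    evaluator scores highest until no neighbor improves on the
    current state. `evaluate(state, objective)` stands in for the
    VLM-based visual ranking of candidate states."""
    best = State()
    best_score = evaluate(best, objective)
    for _ in range(max_depth):
        neighbors = explore(best, candidate_actions)
        if not neighbors:
            break
        candidate = max(neighbors, key=lambda s: evaluate(s, objective))
        score = evaluate(candidate, objective)
        if score <= best_score:
            break  # local optimum under the evaluator: stop searching
        best, best_score = candidate, score
    return best
```

The design point this captures is that the solver is CAPTCHA-agnostic: only the action set and the evaluator change per challenge, while the search loop itself is reused.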
4. Closed-world Evaluation
The authors build an interactive benchmark with 26 CAPTCHA types and 2,600 unique challenges. Halligan uses GPT-4o as the VLM agent and PyAutoGUI for screen observation and interaction. It is compared with a CoT GUI-agent ablation, WebVoyager, ShowUI, and specialized solvers such as VTTSolver, GeeSolver, and PhishDecloaker.
- Halligan solved 1,577 of 2,600 benchmark challenges, a 60.7% overall solve rate.
- It solved at least some challenges in every one of the 26 CAPTCHA types.
- Discrete tasks such as click, swap, and point show a substantially higher average solve rate than continuous tasks such as drag and slide.
The performance distribution is uneven. Halligan is strong on visual reasoning and object detection tasks, but weak on precise 3D spatial reasoning and continuous-control puzzles. This matters defensively: harder interaction geometry can still raise cost, but it does not restore a clean human-bot boundary.
5. Intermediate-stage Analysis
The paper separately measures two intermediate steps on 260 sampled challenges. CAPTCHA abstraction reaches average precision 0.71 and recall 0.79 for interactable annotations. Objective identification scores 1367 out of 1421, or 96% accuracy. This shows that most failures are not because the model cannot read the task at all, but because abstraction, tool outputs, solution comparison, and continuous search still introduce compounding errors.
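As a rough back-of-the-envelope illustration of compounding (not the paper's analysis): if each pipeline stage succeeded independently, end-to-end success would be the product of per-stage rates. Only the first two figures below echo the paper's intermediate measurements; the solving-stage rate is a made-up placeholder.

```python
# Hypothetical independence model; the third rate is illustrative.
stages = {
    "objective_identification": 0.96,  # reported: 1367/1421
    "abstraction_recall": 0.79,        # reported recall for interactables
    "search_and_evaluation": 0.80,     # placeholder, not measured in the paper
}

end_to_end = 1.0
for rate in stages.values():
    end_to_end *= rate

print(f"end-to-end ~ {end_to_end:.2f}")  # ~ 0.61
```

Under these assumptions the product lands near the benchmark's ~60% overall solve rate, which is consistent with (though not proof of) the compounding-error reading.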
6. Adversarial Defenses
The authors test Gaussian noise, motion blur, typographic prompt injection, refusal-style visual warnings, and distracting banners. These defenses reduce solve rates by 15.4% to 49.6%. With attacker-side preprocessing and filtering, Halligan recovers to 56.5% to 85.4%. The strongest single disruption is typographic visual prompt injection, because the VLM follows visible text even when it is not part of the CAPTCHA task.
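Attacker-side preprocessing against a noise defense can be as simple as smoothing the screenshot before it reaches the VLM. A minimal sketch, assuming NumPy and a naive mean filter; this is an illustration of the idea, not the paper's actual recovery pipeline:

```python
import numpy as np


def box_blur(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Naive k-by-k mean filter: averages each pixel with its neighbors,
    which suppresses high-frequency additive (e.g. Gaussian) noise."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    # Sum the k*k shifted copies of the image, then normalize.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)
```

A real pipeline would reach for a proper denoiser (median filtering, non-local means), but even this cheap filter shrinks independent per-pixel noise by roughly a factor of k, which is the general reason purely visual perturbation defenses degrade under preprocessing.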
The important security reading is not that these defenses are useless. They do raise cost and cause failures. The problem is that CAPTCHA services often allow repeated attempts, so even a substantially lowered single-attempt success rate translates into a high eventual success rate across retries.
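The retry economics are easy to make concrete: with single-attempt success probability p and n independent attempts, eventual success is 1 − (1 − p)^n. The numbers below are illustrative, not taken from the paper:

```python
def eventual_success(p: float, n: int) -> float:
    """Probability of at least one success in n independent attempts."""
    return 1 - (1 - p) ** n


# A defense that cuts a solver to 30% per attempt still loses quickly
# if the service tolerates a handful of retries:
for n in (1, 3, 5, 10):
    print(f"n={n:2d}: {eventual_success(0.30, n):.3f}")
# n= 1: 0.300 / n= 3: 0.657 / n= 5: 0.832 / n=10: 0.972
```

This is why per-attempt degradation alone is a weak defense: rate limiting and cross-attempt risk scoring are needed to keep n small.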
7. Field Study
The paper's most provocative result is the 30-day field study. The authors infiltrated a human CAPTCHA-solving platform as a worker and attempted 3,000 live tasks. Halligan obtained provider tokens for 2,117 tasks, an average success rate of 70.6%. It saw 9 CAPTCHA providers and 25 variants; 17 variants were not in the benchmark and made up 75.2% of the tasks.
| Provider | Tasks | Halligan success |
|---|---|---|
| prosopo | 652 | 560 (85.9%) |
| amazon | 549 | 518 (94.4%) |
| 2captcha | 672 | 498 (74.1%) |
| hcaptcha | 250 | 197 (78.8%) |
| recaptcha | 181 | 130 (71.8%) |
| geetest | 390 | 99 (25.4%) |
| arkose | 218 | 72 (33.0%) |
8. User Study
Five CS participants with LLM prompting experience were asked to craft GPT-4o prompts for three CAPTCHA types. Human-written prompts overfit training samples and often timed out. Halligan generated stable Python code much faster and achieved higher test-set solve rates. This supports the claim that the key contribution is not a clever prompt, but the decomposition into abstraction, search, and reusable action code.
9. Limitations
- Temporal challenges remain hard: tasks with time-dependent dynamic elements are outside Halligan's current comfort zone.
- Continuous geometry is a bottleneck: dragging, sliding, 3D spatial rotation, and precise coordinates have much lower success rates.
- Only GPT-4o is evaluated as the main VLM, so model-to-model generality is argued but not fully measured.
- The benchmark reproduces deployed challenge styles, but any benchmark can lag behind rapidly changing CAPTCHA providers.
10. Defensive Implications
The paper does not claim every anti-bot mechanism is dead. It specifically weakens the assumption that visual puzzles alone can separate humans from bots. The authors point toward puzzle-less alternatives: user interaction patterns, device or possession signals, social or strong-network verification, proof-of-work, and honeypots. In practice, the path forward is likely layered risk scoring rather than harder visual puzzles alone.