Core Thesis
The paper argues that visual CAPTCHAs are no longer reliably bot-hard once a VLM is wrapped as an agent with tools. Halligan does not train a custom solver for one CAPTCHA family. It turns each visual challenge into a search problem: infer the objective, abstract the interface into frames and interactable entities, explore candidate states with click/drag/slide/swap tools, and use GPT-4o-based visual evaluation to pick the best state.
1. What Problem Does This Paper Identify?
Traditional CAPTCHA security relies on an asymmetry: humans can solve the puzzle cheaply, while bots cannot. The paper claims this asymmetry is eroding. Earlier attacks usually targeted one family, such as text CAPTCHAs, reCAPTCHA image grids, sliders, or rotations. Defenders could switch challenge types and invalidate specialized solvers.
Halligan changes the attack shape. It tries to solve previously unseen visual challenges through general visual reasoning, GUI abstraction, and search. The security question is therefore no longer 'can a model learn this CAPTCHA distribution?', but 'can a general visual agent infer and manipulate this task at run time?'.
2. Threat Model
The attacker can use local or online VLMs and build an agentic orchestration around them. The attacker does not need CAPTCHA source code, DOM internals, provider-side secrets, or side-channel bugs. The solver operates at the vision level with screen coordinates, mouse, and keyboard actions.
The defender may create new visual challenges and add adversarial visual or prompt-based defenses. The paper explicitly excludes behavioral-only and proof-of-work systems from the main benchmark, because the target is visual challenge solving.
3. Method: From CAPTCHA to Search
The method has three stages. First, objective identification prompts the VLM to describe frames, infer semantic relations such as instruct/act/refer/terminate, and produce the target condition. Second, abstraction extracts frames, elements, and keypoints; it labels interactables as clickable, inputtable, pointable, swappable, selectable, draggable, slideable, or next. Third, solving uses exploration tools to generate neighboring states and evaluation tools to rank or compare them.
| Component | Role in the system |
|---|---|
| Frame extractor | Partitions the CAPTCHA screenshot into nested or disjoint visual regions. |
| Element and keypoint extractors | Use segmentation and saliency to locate visible objects and candidate click points. |
| Space exploration tools | Type, click, swap, slide, and drag to create candidate CAPTCHA states. |
| Solution evaluation tools | Rank or compare candidate states using the inferred objective. |
| Enhancing tools | Mark, focus, ask, and match to make visual comparison easier for the VLM. |
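The three stages above reduce to a generic search loop: enumerate neighbor states with exploration tools, score them with a visual evaluator, and keep the best. The sketch below is illustrative only; `State`, `explore`, and the pluggable `evaluate` callback are hypothetical stand-ins, not Halligan's actual tool API (in the real system, evaluation is a GPT-4o visual comparison of rendered screenshots).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """An abstract CAPTCHA state: the tool actions applied so far."""
    actions: tuple = ()


def explore(state, candidate_actions):
    """Generate neighbor states by applying one more action.
    Stand-in for the click/drag/slide/swap exploration tools."""
    return [State(state.actions + (act,)) for act in candidate_actions]


def solve(objective, candidate_actions, evaluate, max_depth=3):
    """Greedy best-first search: repeatedly take the neighbor the
    evaluator scores highest until no neighbor improves on the
    current state. `evaluate(state, objective)` stands in for the
    VLM-based visual ranking of candidate states."""
    best = State()
    best_score = evaluate(best, objective)
    for _ in range(max_depth):
        neighbors = explore(best, candidate_actions)
        if not neighbors:
            break
        candidate = max(neighbors, key=lambda s: evaluate(s, objective))
        score = evaluate(candidate, objective)
        if score <= best_score:
            break  # local optimum under the evaluator: stop searching
        best, best_score = candidate, score
    return best
```

The design point this captures is that the solver is CAPTCHA-agnostic: only the action set and the evaluator change per challenge, while the search loop itself is reused.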
4. Closed-world Evaluation
The authors build an interactive benchmark with 26 CAPTCHA types and 2,600 unique challenges. Halligan uses GPT-4o as the VLM agent and PyAutoGUI for screen observation and interaction. It is compared with a CoT GUI-agent ablation, WebVoyager, ShowUI, and specialized solvers such as VTTSolver, GeeSolver, and PhishDecloaker.
- Halligan solved 1,577 of 2,600 benchmark challenges, a 60.7% overall solve rate.
- It solved at least some challenges in every one of the 26 CAPTCHA types.
- Discrete tasks such as click, swap, and point show a substantially higher average solve rate than continuous tasks such as drag and slide.
The performance distribution is uneven. Halligan is strong on visual reasoning and object detection tasks, but weak on precise 3D spatial reasoning and continuous-control puzzles. This matters defensively: harder interaction geometry can still raise cost, but it does not restore a clean human-bot boundary.
5. Intermediate-stage Analysis
The paper separately measures two intermediate steps on 260 sampled challenges. CAPTCHA abstraction reaches average precision 0.71 and recall 0.79 for interactable annotations. Objective identification scores 1367 out of 1421, or 96% accuracy. This shows that most failures are not because the model cannot read the task at all, but because abstraction, tool outputs, solution comparison, and continuous search still introduce compounding errors.
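As a rough back-of-the-envelope illustration of compounding (not the paper's analysis): if each pipeline stage succeeded independently, end-to-end success would be the product of per-stage rates. Only the first two figures below echo the paper's intermediate measurements; the solving-stage rate is a made-up placeholder.

```python
# Hypothetical independence model; the third rate is illustrative.
stages = {
    "objective_identification": 0.96,  # reported: 1367/1421
    "abstraction_recall": 0.79,        # reported recall for interactables
    "search_and_evaluation": 0.80,     # placeholder, not measured in the paper
}

end_to_end = 1.0
for rate in stages.values():
    end_to_end *= rate

print(f"end-to-end ~ {end_to_end:.2f}")  # ~ 0.61
```

Under these assumptions the product lands near the benchmark's ~60% overall solve rate, which is consistent with (though not proof of) the compounding-error reading.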
6. Adversarial Defenses
The authors test Gaussian noise, motion blur, typographic prompt injection, refusal-style visual warnings, and distracting banners. These defenses reduce solve rates by 15.4% to 49.6%. With attacker-side preprocessing and filtering, Halligan recovers to 56.5% to 85.4%. The strongest single disruption is typographic visual prompt injection, because the VLM follows visible text even when it is not part of the CAPTCHA task.
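Attacker-side preprocessing against a noise defense can be as simple as smoothing the screenshot before it reaches the VLM. A minimal sketch, assuming NumPy and a naive mean filter; this is an illustration of the idea, not the paper's actual recovery pipeline:

```python
import numpy as np


def box_blur(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Naive k-by-k mean filter: averages each pixel with its neighbors,
    which suppresses high-frequency additive (e.g. Gaussian) noise."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    # Sum the k*k shifted copies of the image, then normalize.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)
```

A real pipeline would reach for a proper denoiser (median filtering, non-local means), but even this cheap filter shrinks independent per-pixel noise by roughly a factor of k, which is the general reason purely visual perturbation defenses degrade under preprocessing.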
The important security reading is not that these defenses are useless. They do raise cost and cause failures. The problem is that CAPTCHA services often allow repeated attempts, so even a substantially lowered single-attempt success rate translates into a high eventual success rate across retries.
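The retry economics are easy to make concrete: with single-attempt success probability p and n independent attempts, eventual success is 1 − (1 − p)^n. The numbers below are illustrative, not taken from the paper:

```python
def eventual_success(p: float, n: int) -> float:
    """Probability of at least one success in n independent attempts."""
    return 1 - (1 - p) ** n


# A defense that cuts a solver to 30% per attempt still loses quickly
# if the service tolerates a handful of retries:
for n in (1, 3, 5, 10):
    print(f"n={n:2d}: {eventual_success(0.30, n):.3f}")
# n= 1: 0.300 / n= 3: 0.657 / n= 5: 0.832 / n=10: 0.972
```

This is why per-attempt degradation alone is a weak defense: rate limiting and cross-attempt risk scoring are needed to keep n small.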
7. Field Study
The paper's most provocative result is the 30-day field study. The authors infiltrated a human CAPTCHA-solving platform as a worker and attempted 3,000 live tasks. Halligan obtained provider tokens for 2,117 tasks, an average success rate of 70.6%. It saw 9 CAPTCHA providers and 25 variants; 17 variants were not in the benchmark and made up 75.2% of the tasks.
| Provider | Tasks | Halligan success |
|---|---|---|
| prosopo | 652 | 560 (85.9%) |
| amazon | 549 | 518 (94.4%) |
| 2captcha | 672 | 498 (74.1%) |
| hcaptcha | 250 | 197 (78.8%) |
| recaptcha | 181 | 130 (71.8%) |
| geetest | 390 | 99 (25.4%) |
| arkose | 218 | 72 (33.0%) |
8. User Study
Five CS participants with LLM prompting experience were asked to craft GPT-4o prompts for three CAPTCHA types. Human-written prompts overfit training samples and often timed out. Halligan generated stable Python code much faster and achieved higher test-set solve rates. This supports the claim that the key contribution is not a clever prompt, but the decomposition into abstraction, search, and reusable action code.
9. Limitations
- Temporal challenges remain hard: tasks with time-dependent dynamic elements are outside Halligan's current comfort zone.
- Continuous geometry is a bottleneck: dragging, sliding, 3D spatial rotation, and precise coordinates have much lower success rates.
- Only GPT-4o is evaluated as the main VLM, so model-to-model generality is argued but not fully measured.
- The benchmark reproduces deployed challenge styles, but any benchmark can lag behind rapidly changing CAPTCHA providers.
10. Defensive Implications
The paper does not claim every anti-bot mechanism is dead. It specifically weakens the assumption that visual puzzles alone can separate humans from bots. The authors point toward puzzle-less alternatives: user interaction patterns, device or possession signals, social or strong-network verification, proof-of-work, and honeypots. In practice, the path forward is likely layered risk scoring rather than harder visual puzzles alone.