PaperTrace

Into the Dark: Unveiling Internal Site Search Abused for Black Hat SEO

Yunyi Zhang, Mingxuan Liu, Baojun Liu, Yiming Zhang, Haixin Duan, Min Zhang, Hui Jiang, Yanzhe Li, Fan Shi · USENIX Security 2024 · USENIX page

Advanced (research-level) · Web Security · Black Hat SEO · Search Abuse
Prereqs: Search engine indexing · Web trust boundaries · Abuse measurement

TL;DR

This paper studies Internal site Search Abuse Promotion (ISAP), a black-hat SEO technique that turns a website's own search result pages into indexable promotion carriers. The attacker does not need to compromise the site. They inject promotional text into internal-search queries, rely on flawed no-result pages that reflect the query in URLs and titles, then use distribution sites to get those reflection URLs indexed by search engines. The security boundary failure is source confusion: search engines and users see abusive text under reputable domains even though the reputable site did not create or endorse it.

ISAP Detection and Measurement Pipeline

  • URL snapshot data: Baidu snapshot pairs of URL and HTML content.
  • URL Screener (reduce scale): filter by distribution-site patterns and promotion-target features.
  • Semantic Detector (classify): BERT-based promotion detection and category classification.
  • Ecosystem measurement: abused sites, distribution sites, targets, PVs, and cross-engine checks.
  • Website Finder (defend): proactively test whether internal search can become an ISAP carrier.

Key properties: no site compromise required, the search index is the sink, and website and engine controls both matter.
ISAP Attack Chain
Step 1: Black-hat SEO starts from a target to promote

The paper observes targets such as adult sites, gambling domains, SEO services, anonymous servers, and contact handles. The attack goal is visibility in search results, not code execution on the victim site.

Trust boundary

The victim site does not endorse the promotion; its internal search page is only reflecting attacker-controlled query text.

Policy point

Search engines and website operators can both break the chain: remove abusive indexed URLs or prevent no-result pages from becoming indexable carriers.

1. Problem Framing

Internal site search is a normal usability feature: users type a keyword and get resources within the site. The paper cites prior measurement that 47.23% of Alexa Top 1M websites use internal site search. ISAP abuses this feature when a no-result search page reflects attacker-controlled search keywords into the generated URL and webpage, especially the title.

The result is not classic web compromise. The victim site remains intact, but its search endpoint becomes a reflection surface for promotional text. Once search engines index these reflected pages, a high-reputation domain can appear to carry adult, gambling, SEO, anonymous-server, or other black-hat promotions.
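To make the reflection surface concrete, here is a minimal sketch of the vulnerable pattern, written as a hypothetical Flask endpoint; the handler, parameter name, and lookup stub are illustrative, not from the paper. Note that HTML-escaping prevents XSS but not ISAP: the promotion is plain text, so a "safe" page is still a carrier.

```python
# Hypothetical Flask endpoint showing the vulnerable pattern: a GET-based
# internal search whose no-result page reflects the raw query into both a
# crawlable URL and the page title.
from flask import Flask, request
from markupsafe import escape

app = Flask(__name__)

def lookup(q):
    return []  # stand-in for the site's real index lookup; [] means "no results"

@app.route("/search")  # GET: the keyword lands in a stable, indexable URL
def search():
    q = request.args.get("q", "")
    if not lookup(q):
        # Reflection surface: attacker-controlled text becomes the title of
        # /search?q=<promotional text and contact handle>. Escaping blocks
        # XSS but not ISAP, because the promotion is plain text.
        return (f"<html><head><title>No results for {escape(q)}</title></head>"
                f"<body>Nothing matched '{escape(q)}'.</body></html>")
    return "render real results here"
```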

2. Threat Model

The adversary is a black-hat SEO operator whose goal is to inject promotional content into search engine results and increase visibility for a target. The attack requires two resources: many websites whose internal search can generate reflection URLs, and distribution websites that can lead search engine crawlers to those URLs.

The victim website has two relevant properties: it exposes an internal search function, and its no-result page includes the search keyword in both the URL and webpage. The search engine is the amplification layer: once the reflection URL is indexed, users may see the promotional text under a trusted domain.

Roles and their capabilities or boundaries:

  • Adversary: can craft search keywords, generate reflection URLs, and place links on distribution websites.
  • Victim website: does not need to be compromised; its internal search endpoint reflects attacker-controlled text.
  • Search engine: crawls, indexes, ranks, and displays the reflected URL as a search result.
  • User: sees promotional content associated with a reputable site and may follow it to risky services.

3. Security Abstraction

The clean model is source, reflection surface, distribution channel, indexing sink, and user-facing sink. The attacker-controlled query is the source. The no-result internal search page is the reflection surface. Distribution websites are the crawler entry points. Search engine indexing and result pages are the sinks where the abuse becomes visible.

This abstraction explains why both website operators and search engines have enforcement points. Websites can prevent reflection pages from becoming indexable carriers. Search engines can identify and remove already indexed ISAP URLs and distribution sites.
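As an illustration only (not an artifact of the paper), the abstraction can be written down as a tiny stage model that records who controls each stage and where enforcement can sit:

```python
# Illustrative model of the ISAP chain: stage name, who controls it, and
# whether it is an enforcement point.
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    controlled_by: str
    enforcement_point: bool

ISAP_CHAIN = [
    Stage("source (search keyword)", "attacker", False),
    Stage("reflection surface (no-result page)", "victim website", True),  # POST, noindex
    Stage("distribution channel (spam sites)", "attacker", False),
    Stage("indexing sink (crawler + index)", "search engine", True),       # remove ISAP URLs
    Stage("user-facing sink (result page)", "search engine", True),        # rank / delist
]

for s in ISAP_CHAIN:
    flag = "ENFORCE" if s.enforcement_point else "       "
    print(f"{flag}  {s.name:40s} controlled by {s.controlled_by}")
```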

4. Empirical Observations Used for Detection

The detector is not built by directly running heavy NLP over all traffic. The paper first extracts ISAP-specific regularities from a ground-truth dataset, then uses them as a funnel to reduce traffic before semantic classification.

Observation 1

Illegal distribution websites embed many external reflection URLs with limited distinct promotion keywords. Only 2.84% of illegal distribution websites had fewer than 4 external links, while 99.95% of benign websites had no more than 3.

Observation 2

Promotion keyword length falls in a characteristic range: 90% of illegal search keywords are between 14 and 108 characters, with an average length of 59.27 characters.

Observation 3

The promotion keywords vary, but the promotion target remains constant. This enables co-occurrence analysis to recover target strings.

5. ISAP Detector

The ISAP Detector has two stages. The URL Screener first reduces traffic based on distribution website patterns and promotion information. It flags potential distribution sites when they exceed 3 external URLs and 3 unique keywords, filters out search keywords shorter than 5 characters, and uses character co-occurrence plus stop words to identify candidate promotion targets.
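A sketch of that funnel logic, reconstructed from the observations above: the thresholds follow the text, while the data shapes, query-parameter name, and whitespace tokenization are simplifying assumptions (real promotion keywords are mostly Chinese and would need proper segmentation):

```python
# Funnel heuristics reconstructed from the paper's observations; thresholds
# follow the text, everything else (shapes, tokenization) is an assumption.
from collections import Counter
from urllib.parse import urlparse, parse_qs

def is_candidate_distribution_site(external_urls, keywords):
    # Observation 1: distribution sites embed many external reflection URLs
    # carrying more than a handful of distinct promotion keywords.
    return len(set(external_urls)) > 3 and len(set(keywords)) > 3

def extract_keyword(reflection_url, param="q"):
    # The search-parameter name varies per site; "q" is only an example.
    return (parse_qs(urlparse(reflection_url).query).get(param) or [""])[0]

def keep_keyword(kw):
    # Screener filter: drop keywords shorter than 5 characters. Observation 2:
    # 90% of illegal keywords are 14-108 characters (average 59.27).
    return len(kw) >= 5

def candidate_targets(keywords, min_count=3, stop_words=frozenset()):
    # Observation 3: keywords vary but the target string recurs, so frequent
    # co-occurring tokens are candidate promotion targets. Whitespace splitting
    # is a simplification; Chinese text needs real segmentation.
    counts = Counter(t for kw in keywords for t in kw.split() if t not in stop_words)
    return [t for t, n in counts.items() if n >= min_count]
```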

The Semantic Detector then applies BERT-based models only to the much smaller candidate set. The binary classifier detects promotional semantics, and the five-category classifier labels business categories: gambling, adult content, SEO, anonymous servers, and others.
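A sketch of this second stage using Hugging Face transformers, assuming BERT checkpoints fine-tuned as in the paper; the checkpoint paths and label strings below are placeholders, not released artifacts:

```python
# Two-stage semantic classification on reflected search keywords.
from transformers import pipeline

binary = pipeline("text-classification", model="./isap-bert-binary")      # hypothetical path
category = pipeline("text-classification", model="./isap-bert-category")  # hypothetical path

def classify_keyword(keyword: str):
    # Stage 1: does the reflected keyword carry promotional semantics at all?
    if binary(keyword)[0]["label"] != "promotional":  # label name assumed
        return None
    # Stage 2: gambling, adult content, SEO, anonymous servers, or others.
    return category(keyword)[0]["label"]
```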

Key detector numbers:

  • 60 million: average daily URLs in the snapshot dataset.
  • ~10K: approximate daily URLs left after the URL Screener.
  • 95.02% / 95.22%: binary classifier test / validation accuracy.
  • 99.47% / 99.40%: five-category classifier test / validation accuracy.
  • 2.14%: false-negative rate on 1,400 real-traffic samples.
  • 1.86%: false-positive rate on 1,400 real-traffic samples.

6. Website Finder for Proactive Defense

The paper also builds ISAP Website Finder for website operators. It crawls HTML, identifies explicit and implicit search boxes, simulates a benign test search, then checks whether the keyword appears in both the redirected URL and page title. Sites satisfying both criteria are treated as potentially affected.

The finder uses Pyppeteer and headless Chromium for dynamic pages. In manual validation over 582 websites, the false-positive rate is 1.68% and the false-negative rate is 11.58%. The authors present this as a lower-bound measurement of risk, because missed dynamic or unusual search implementations can hide additional affected sites.
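A condensed sketch of the finder's core check using Pyppeteer, assuming an explicit search box reachable through a simple selector; the selectors and probe keyword are illustrative, and the real tool also detects implicit search boxes:

```python
# Core Website Finder check: submit a benign probe keyword and test whether
# it is reflected in BOTH the resulting URL and the page title.
import asyncio
from pyppeteer import launch

PROBE = "papertrace-isap-probe-keyword"  # benign marker unlikely to match content

async def is_potential_isap_carrier(url: str) -> bool:
    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url)
        box = await page.querySelector("input[type=search], input[name=q]")
        if box is None:
            return False  # real tool also handles implicit search boxes
        await box.type(PROBE)
        await asyncio.gather(page.keyboard.press("Enter"),
                             page.waitForNavigation())
        return PROBE in page.url and PROBE in (await page.title())
    finally:
        await browser.close()

# asyncio.run(is_potential_isap_carrier("https://example.com"))
```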

7. Measurement Evidence

The largest measurement uses Baidu URL snapshots from May 1 to September 19, 2023. The dataset covers 125 available days because 17 days were lost due to storage issues. In total, it contains more than 7 billion URL and HTML pairs, with an average of 60 million URLs per day.

Key measurement findings:

  • ISAP URLs: 3,222,864 detected reflection URLs in Baidu traffic.
  • Abused websites: 10,209 victim domains whose internal search was abused.
  • Distribution websites: 4,458 sites used to spread reflection URLs to crawlers.
  • Tranco Top 1M overlap: 3,607 abused sites also appeared in the Tranco Top 1M.
  • Education / government: 228 education and 162 government websites among the abused sites.
  • User-side impact: about 6 million ISAP URL page views (PVs) in 4 days, measured through Baidu search logs from September 30 to October 3, 2023.

8. Abuse Ecosystem

Adult content and gambling dominate the observed abuse, accounting for 77.44% and 20.41% of identified promotional activity. The paper also identifies self-promotion for SEO and anonymous-server promotion as emerging businesses. Some abused websites remain affected for long periods: 909 websites were abused for more than 30 days, and the paper gives examples of sites abused for 100 or 106 days.

The attacker strategy differs by site reputation. Popular domains are heavily exploited once found because their reputation can be valuable. Less-popular domains are used in aggregate: each may host fewer reflection URLs, but many such domains together provide scale.

9. Google, Bing, and Potential Risk

To test whether ISAP is specific to Baidu, the authors sampled 182 detected promotion targets and queried Google and Bing. On Google, first-page results for 98 targets, or 53.84%, contained 801 ISAP URLs. On Bing, first-page results for 75 targets, or 41.21%, contained 387 ISAP URLs. For exposure checks with attractive keywords, 27 of 50 Google queries and 9 of 50 Bing queries showed ISAP URLs on the first page.

For proactive risk assessment, the authors evaluated high-profile domains. Out of 50,762 evaluated domains from Tranco Top 10K, EDU, and GOV lists, 9,233 were potentially exploitable, or 18.19%. The per-list rates are 11.76% for evaluated Top 10K domains, 24.59% for EDU, and 14.10% for GOV.

10. Cautious Interpretation

The paper strongly establishes that ISAP is real, large-scale, and cross-engine. It does not prove that every potentially exploitable website is actively abused, nor that every search engine has the same exposure level. The Google and Bing evaluation is based on simulated user searches because the authors do not have those engines' internal URL databases.

The authors frame several results as lower bounds. The Website Finder misses some implementations, the active assessment focuses on apex domains rather than every subdomain, and the Baidu snapshot dataset stores only first-layer URL and HTML content rather than deep-crawling all embedded links.

11. Limitations

  • The main measurement relies on one search engine's internal data, although Google and Bing checks support cross-engine prevalence.
  • The BERT model is Chinese-oriented because Baidu primarily serves Chinese users; other languages may affect detector performance.
  • Website Finder cannot cover every internal-search implementation, so active risk numbers are conservative.
  • Attackers may adapt, but the paper argues that evasion either weakens promotional information or raises promotion cost.

12. Defensive Lessons

For website operators, the most direct fixes are to avoid generating abusable reflection URLs and prevent no-result search pages from being indexed. The paper recommends using POST for parameter transmission, excluding search keywords from result-page URLs and titles, and adding noindex on empty-result pages. Google's Search Central guidance similarly warns site owners to prevent parts of a site from being abused by spam.
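Applying those fixes to the hypothetical endpoint from Section 1 yields a sketch like the following (the Flask specifics remain illustrative): POST keeps the keyword out of the URL, the no-result title is generic, and noindex keeps the page out of the index.

```python
# The same hypothetical endpoint with the paper's operator-side fixes applied.
from flask import Flask, request, make_response

app = Flask(__name__)

def lookup(q):
    return []  # stand-in for the site's real index lookup

@app.route("/search", methods=["POST"])  # fix 1: POST keeps the keyword out of the URL
def search():
    q = request.form.get("q", "")
    if not lookup(q):
        resp = make_response(
            "<html><head><title>No results</title>"          # fix 2: generic title
            '<meta name="robots" content="noindex"></head>'  # fix 3: not indexable
            "<body>Nothing matched your search.</body></html>")
        resp.headers["X-Robots-Tag"] = "noindex"  # header form of the same directive
        return resp
    return "render real results here"
```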

For search engines, the enforcement point is the index and ranking pipeline. Engines can detect distribution-site patterns, identify promotional semantics after funneling, remove abusive ISAP URLs, and avoid punishing the reputation of victim domains that only reflected attacker-controlled queries.

13. Builder Takeaways

  • Do not let user-controlled search terms become source-authored titles, canonical URLs, or indexable no-result pages.
  • Treat search-result pages as untrusted reflections unless they contain real site resources and stable editorial meaning.
  • At search-engine scale, pair cheap structural filters with semantic models; do not start with expensive NLP over all traffic.
  • Measure both actual abuse and potential exposure. The paper's split between detector and website finder is a useful pattern for abuse-defense programs.

14. Relationship to This Site

This page extends PaperTrace beyond LLM-specific attacks while staying on the same AI security line: untrusted input crosses a trust boundary, is amplified by an automated ranking system, and reaches users through a high-authority interface. For LLM and RAG builders, the analogy is direct: never confuse reflected user input with trusted source-authored content, especially when downstream systems rank, summarize, or act on it.
