Why it is so hard to block content scraping attacks

Of all of the automated business logic abuse attacks, the simple act of copying and pasting content from one web page to another is the most difficult for any technology to stop. Content scraping was one of the problems we designed Cequence Security API Spartan to address, and we purposely avoided relying on agent-based techniques found in 1st generation bot mitigation tools. The reliance on JavaScript, Cookies, and Tokens to collect end-user device telemetry makes it difficult for 1st generation tools to block scraping attacks with any sustained efficacy.

Recently I had a chance to observe a 1st generation bot mitigator in a content scraping technology bakeoff and thought it worthwhile to share the outcome of this particularly nasty attack campaign.

The Problems 1st Generation Bot Mitigators Face

When a scraping attack targets an API endpoint, 1st generation tools struggle because there is no web page or mobile SDK on which to install the necessary agent. Let’s start with why that is a problem:

1. Scraping typically happens using HTTP GET requests – JavaScript-based bot mitigation techniques rely on sending a JavaScript agent in the response to the first HTTP GET request. They then collect device telemetry in the subsequent HTTP POST request. But how do you get the JavaScript onto the end-user device before the 1st HTTP GET request, which is when the copying occurs? (You don't.)
2. Scraping can happen on any web URL – Because scraping occurs everywhere within your web domain, you have to instrument and monitor the entire site, which could be millions of web pages. Not only is it hard to instrument JavaScript across an entire domain, it also slows performance of the entire site, leading to a less than desirable experience for legitimate users. (The impact is substantial.)
3. Scraping can happen through APIs – You can't get the device telemetry that the older bot mitigators require from an API consumer, because you cannot install anything on the API. (This leads to some devastating outcomes.)
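A minimal sketch of the first problem, using only the Python standard library. The HTML document, the `/agent.js` path, and the `TextScraper` class are hypothetical stand-ins, not taken from any real site or product. The sketch shows that a client which merely parses the first response already has the content, while the embedded agent is seen but never executed.

```python
from html.parser import HTMLParser

# Hypothetical response to a first HTTP GET: page content plus a telemetry agent.
FIRST_RESPONSE = """
<html><body>
<script src="/agent.js"></script>
<h1 id="title">Blue Widget</h1>
<p class="price">$19.99</p>
</body></html>
"""


class TextScraper(HTMLParser):
    """Collects visible text. Script tags are parsed but never executed."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text = []
        self.scripts_seen = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts_seen += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script element.
        if not self.in_script and data.strip():
            self.text.append(data.strip())


def scrape(html):
    """Return (visible_text, number_of_script_tags_seen) for one response."""
    parser = TextScraper()
    parser.feed(html)
    return parser.text, parser.scripts_seen


if __name__ == "__main__":
    print(scrape(FIRST_RESPONSE))
```

Under these assumptions, `scrape(FIRST_RESPONSE)` returns the product title and price along with a script count of 1, and no telemetry is ever produced because nothing ran the agent.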

The Situation Observed

We were working with a large eCommerce site that kept finding its original content on a competitor's site and wanted a quick solution. The company deployed both a 1st generation bot mitigation tool and the Cequence Application Security Platform to see how well each solution could prevent content scraping attacks.

To mitigate problems 1 and 2 above, the 1st generation bot mitigation tool was heavily customized to work as follows:

1. Every HTTP GET request to any URL was intercepted by an inline device and inspected to determine whether a cookie from the bot mitigation tool was present.
- If a cookie was present and still valid, the request was forwarded upstream to the origin server.
- If a cookie was not present (or no longer valid), the system generated a response with the JavaScript agent embedded in it to collect end-user device telemetry.
2. The response to the first HTTP GET request delivered the JavaScript to the end-user device, where it collected the device telemetry and sent it back in a 2nd HTTP GET request.
3. The 2nd HTTP GET request was intercepted by the 1st gen bot mitigation solution, and the end-user device telemetry was analyzed in real time to determine whether the requester was a bot.
- If the request was deemed to be coming from a bot, a response was sent based on custom-configured policies.
- If the request was deemed to be coming from a legitimate user, the response contained a newly generated HTTP cookie for subsequent requests. This cookie was valid for 24 hours.
4. Upon receiving the HTTP cookie in the response to the 2nd HTTP GET request, the client generated a 3rd HTTP request containing that cookie, which allowed the 1st gen bot mitigation solution to validate that the request was coming from a legitimate user.
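The workflow above can be sketched as a single decision function. This is an illustration under stated assumptions, not the vendor's implementation: the HMAC-signed cookie format, the `SECRET` key, the `js_executed` telemetry field, and every function name are hypothetical. Only the 24-hour cookie lifetime comes from the description above.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"      # hypothetical signing key
COOKIE_TTL = 24 * 60 * 60    # cookie lifetime in seconds (24 hours)


def _sign(expires):
    return hmac.new(SECRET, expires.encode(), hashlib.sha256).hexdigest()


def issue_cookie(now):
    """Create a signed cookie of the form '<expiry>.<signature>'."""
    expires = str(int(now) + COOKIE_TTL)
    return f"{expires}.{_sign(expires)}"


def cookie_is_valid(cookie, now):
    """Check signature and expiry only; nothing ties the cookie to a device."""
    if not cookie or "." not in cookie:
        return False
    expires, sig = cookie.split(".", 1)
    if not expires.isdigit():
        return False
    if not hmac.compare_digest(sig.encode(), _sign(expires).encode()):
        return False
    return int(expires) > now


def looks_like_bot(telemetry):
    """Placeholder classifier: treats a missing 'js_executed' flag as a bot."""
    return not telemetry.get("js_executed", False)


def handle_get(cookie=None, telemetry=None, now=None):
    """Return (action, new_cookie) for one intercepted GET request."""
    now = time.time() if now is None else now
    if cookie_is_valid(cookie, now):
        return ("forward_to_origin", None)     # step 1, valid cookie
    if telemetry is None:
        return ("serve_js_challenge", None)    # step 1, no valid cookie
    if looks_like_bot(telemetry):
        return ("apply_bot_policy", None)      # step 3, bot verdict
    return ("set_cookie", issue_cookie(now))   # step 3, legitimate verdict


if __name__ == "__main__":
    print(handle_get(now=0))
    action, cookie = handle_get(telemetry={"js_executed": True}, now=0)
    print(action, handle_get(cookie=cookie, now=100))
```

In this sketch a first-time visitor needs three round trips before reaching the origin, and `cookie_is_valid` accepts the cookie from whichever client presents it until the expiry passes.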

The Results from the 1st Generation Bot Mitigation Test

Notwithstanding the need for a diagram to fully understand the complex workflow put into place by the 1st gen bot mitigation solution, the first result was an undesired one. The solution had a significant impact on customer experience. The page load time for a first-time visitor to the website increased more than 7X to 1.6 seconds on average. For retailers, this is an unacceptable user delay as it may result in lost shoppers. In addition, it is a significant financial penalty considering the page optimization investments in CDNs and other tools to improve page load time by mere milliseconds. The bottom line – in an attempt to prevent bot attacks, all page load time improvements were lost.

Despite the significant page load impact, there was still a hope that the 1st gen bot mitigation solution could successfully stop scraping. There was an immediate and clear drop in scraping attacks at the outset of the solution deployment, but it was very short-lived. The attackers quickly figured out how they were being blocked and they retooled their methods.

Could the 1st Gen Tool Stop the Retooled Attack?

Within a week, the attackers came back with a vengeance. It seemed as if there were multiple actors involved in the follow-on attacks, and they utilized different techniques at different times.

The first technique – leverage cookies: To avoid adding page load time to every page for every user, the 1st gen bot mitigator issued cookies that stayed valid for 24 hours, and the attackers took advantage of that. They used real browsers to generate a batch of cookies and then built those cookies into their automated scraping scripts (aka bad bots). These cookie-bearing bots ran very aggressively without being blocked for 24 hours at a time, and the attackers repeated the process at will.
The second technique – use the mobile endpoints: The second set of bad actors took advantage of the fact that there was no protection on the mobile endpoints. Even after putting in some basic rate-limits and geo-fencing, the attackers used residential proxy networks and spread their attacks over millions of IPs to circumvent those controls.
The third technique – go direct to the APIs: The last observed technique attacked the APIs, the most vulnerable attack vector because it had almost no protection. Very soon, 57% of the site's API traffic came from scrapers. It was easy for the bots to enumerate the item IDs and scrape all the relevant content associated with those items.
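To illustrate why the basic rate limits in the second technique were easy to circumvent, here is a minimal sketch of a fixed-window, per-IP rate limiter. The limit of 60 requests per minute, the synthetic IP addresses, and all names are assumptions chosen for illustration; the controls actually deployed in the test are not described in that detail.

```python
from collections import defaultdict


class PerIpRateLimiter:
    """Fixed-window limiter: at most `limit` requests per IP per window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)

    def allow(self, ip, now):
        # Count requests per (IP, window index) bucket.
        bucket = (ip, int(now // self.window))
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit


def count_allowed(limiter, requests):
    """Number of (ip, timestamp) requests the limiter lets through."""
    return sum(1 for ip, now in requests if limiter.allow(ip, now))


def single_ip_burst(n):
    """n requests from one address inside the same window."""
    return [("203.0.113.7", 0.0) for _ in range(n)]


def distributed_burst(n):
    """n requests, each from a different synthetic address (n < 2**24)."""
    return [
        (f"10.{(i >> 16) & 255}.{(i >> 8) & 255}.{i & 255}", 0.0)
        for i in range(n)
    ]


if __name__ == "__main__":
    print(count_allowed(PerIpRateLimiter(60, 60), single_ip_burst(10_000)))
    print(count_allowed(PerIpRateLimiter(60, 60), distributed_burst(10_000)))
```

With these assumed numbers, the limiter passes only 60 of 10,000 requests from a single address but all 10,000 when each request arrives from a different address, which is the effect a large residential proxy pool has on per-IP controls.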

The most devastating result from the test with the 1st generation bot mitigation tool was that it blocked 43% of requests from search engine bots, which negatively impacted SEO and resulted in a decline in user activity and overall demand.

The Wrap-up and the Alternative

In summary, 1st generation bot mitigation tools that rely heavily on signals from end-user devices can’t defend against web content scraping. It’s like fitting a square peg in a round hole. You run the risk of compromising user experience, potentially breaking your SEO, and still not solving the scraping problem – all while burning precious cycles deploying a heavy-handed solution.

The alternative approach is to use a solution designed for dynamic, automated bot attacks against web and mobile applications and APIs. Faced with the same techniques outlined above in the second wave attack, Cequence API Spartan was able to stop the malicious bots in their tracks (while letting the good bots through), with no impact on customer experience or SEO, and without requiring JavaScript instrumentation or SDK deployment.

You can read more about how API Spartan stops content scraping here.
