
The Dangers of Web Scraping & How to Prevent It

March 20, 2023 | 6 MIN READ

by Ameya Talwalkar


Many of today’s hyper-connected organizations face the challenge of detecting and preventing web scraping attacks efficiently and at scale. In this blog, we’ll share how a comprehensive approach to API security can help mitigate this problem: one that leverages behavioral fingerprinting to continuously track sophisticated attacks and is supported by an API threat intelligence database of over 100 million records.

The Impacts of Web Scraping Attacks

The impact of web scraping attacks can be wide-ranging, from overspending on infrastructure to devastating data extraction and loss of intellectual property. Of all the automated business logic abuse attacks, content scraping is the most difficult to prevent. Here are three reasons why:

They Can Happen Anywhere Within the Domain

Whereas other automated forms of business logic abuse target specific applications and their related endpoints, scraping can be directed at any application or endpoint within the domain. For example, account takeover and credential stuffing attacks target applications that rely on user credentials, and denial of inventory attacks focus on checkout applications and their API requests; scraping is far more wide-reaching in its end goal. The challenge of preventing a web scraper attack therefore becomes one of breadth: can your detection and mitigation approach encompass all your public-facing applications, even the application endpoints that have dynamically generated URIs? If you are trying to prevent scraping with a bot mitigation tool that requires application instrumentation, you are forced to inject an agent into every web application and endpoint within your domain. The impacts of this approach are twofold:

  • If the URI is dynamically generated, page load time constraints may limit the ability to add an agent and absorb its associated processing burden.
  • Injecting an agent into the page adds delay and complexity to the application development and deployment workflow.

They’re Primarily HTTP GET-Based

Automated web scraper attacks work by sending simple HTTP GET requests to the targeted URIs. On a typical domain, HTTP GET requests represent 99% of all transactions, which means that your bot mitigation approach must have the capacity to process all HTTP GET transactions. This requirement has both scalability and efficacy implications.

  • Scale: Most bot mitigation approaches cannot scale, or require significant oversizing, to handle all site/domain traffic, especially for medium to large sites.
  • Efficacy: Because these tools emphasize HTTP POST as the channel for their device fingerprinting logic, they will miss most of the attack signals emanating from HTTP GET requests.
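To make the GET-versus-POST point concrete, here is a minimal sketch of how one might measure the method mix in access logs and see how many distinct pages each client fetches with GET. The `(client_id, method, path)` tuple format, the `summarize_methods` name, and the sample data are assumptions made for illustration; they do not describe any particular product.

```python
from collections import Counter, defaultdict


def summarize_methods(requests):
    """Summarize an iterable of (client_id, method, path) tuples.

    The tuple layout is a hypothetical log format chosen for this sketch.
    Returns the share of requests that are GETs and, per client, the
    number of distinct paths fetched with GET.
    """
    method_counts = Counter()
    get_paths = defaultdict(set)
    for client_id, method, path in requests:
        method = method.upper()
        method_counts[method] += 1
        if method == "GET":
            get_paths[client_id].add(path)

    total = sum(method_counts.values())
    get_share = method_counts["GET"] / total if total else 0.0
    breadth = {client: len(paths) for client, paths in get_paths.items()}
    return get_share, breadth


# Illustrative data only: client-a walks product pages using nothing but GET.
sample = [
    ("client-a", "GET", "/products/1"),
    ("client-a", "GET", "/products/2"),
    ("client-a", "GET", "/products/3"),
    ("client-b", "GET", "/products/1"),
    ("client-b", "POST", "/login"),
]
get_share, breadth = summarize_methods(sample)
```

In this sample, a tool that only inspected POST traffic would never see `client-a` at all, even though it is the client touching the most pages.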

They Leverage Application APIs & Endpoints

The use of API endpoints is rapidly becoming a critical element in the move towards a more rapid, iterative application development workflow. The same information that mobile customers, partners, and aggregators consume through a rich web-based interface is also available via the API endpoints. When attackers scraping web content or data meet resistance from web applications, they simply switch to the API endpoints to achieve their goal. The challenge facing first-generation bot mitigation tools in preventing web scraper attacks that target API endpoints is that there is no page or SDK in which to install an agent. The API consumers are themselves bots, so it’s almost impossible to integrate JavaScript or a Mobile SDK.

How Cequence Security Prevents Web Scraping

Cequence Unified API Protection (UAP) keeps business logic abuses from striking at your web apps, mobile apps, and their underlying API infrastructure.

Part of Cequence UAP, API Spartan leverages behavioral fingerprinting to continuously track sophisticated attacks, even as they retool to avoid detection. It is supported by the largest API threat database in the world, whose millions of behavioral and malicious infrastructure records are translated into out-of-the-box policies that can deliver high-efficacy protection on day one.

By employing analysis powered by artificial intelligence and machine learning, API Spartan can analyze incoming traffic to detect even hard-to-spot business logic abuses hitting your web, mobile, and API-based applications. This architectural approach eliminates the need for application instrumentation and provides you with the insight and intelligence to detect and prevent automated bot attacks and application vulnerability exploits targeting your public-facing applications.

Cequence Web Scraping Prevention Deployment

API Spartan can be enabled to protect your APIs and web applications in as little as 15 minutes and can immediately begin reducing the operational burden associated with preventing attacks that can result in fraud, data loss and business disruption.

Alternatively, the modular architecture allows API Spartan to be deployed in your data center, your cloud environment, or a hybrid infrastructure.

  • Deployed in front of your public-facing applications, typically in the DMZ, API Spartan analyzes all transactions for all application paths being used by clients. This allows us to correlate information across the entire application tier, tracking user and device access and behavior across the entire site. This means we have complete visibility into all the potential scraping targets (web, mobile, and API-based) within your domain.
  • The Cequence UAP software-based approach enables deployments to be sized for your environment and to scale easily to address any spikes in transaction volume. This means we can analyze both the HTTP GET and HTTP POST methods across the entire application tier, detecting clusters of common or repeated behavior indicative of scraping, independent of the geo-location, IP, and device information presented by the clients. Since our approach does not require application instrumentation, scraping detection has zero performance impact on the actual application.
  • In some cases, scraping is both allowed and encouraged, while in others, it is viewed as malicious. Financial institutions must allow users to access and aggregate their information; in some countries, there are laws requiring this. Travel and hospitality sites allow aggregate shopping sites to scrape information to promote sales. In contrast, competitive scraping, where the entire site is copied in order to compete more aggressively, should be stopped. Our mitigation mechanisms allow you to select the response that makes the most sense: slowing down aggregators using rate limits, sending fake information to competitors, or outright blocking if the scraping happens to be too volumetric.
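The choice of response described in the last bullet can be pictured as a small policy function. The following is a rough sketch only: the `ScraperProfile` fields, the category labels, and the `VOLUMETRIC_THRESHOLD` value are hypothetical and are not taken from API Spartan’s actual policy model.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    RATE_LIMIT = "rate_limit"
    DECEIVE = "deceive"  # serve fake information
    BLOCK = "block"


@dataclass(frozen=True)
class ScraperProfile:
    category: str  # hypothetical labels: "aggregator", "competitor", "unknown"
    requests_per_minute: int


# Assumed cut-off for "too volumetric"; a real deployment would tune this.
VOLUMETRIC_THRESHOLD = 1000


def choose_action(profile: ScraperProfile) -> Action:
    """Pick a mitigation for traffic already classified as scraping."""
    if profile.requests_per_minute >= VOLUMETRIC_THRESHOLD:
        return Action.BLOCK        # volume alone justifies blocking
    if profile.category == "aggregator":
        return Action.RATE_LIMIT   # permitted, but slowed down
    if profile.category == "competitor":
        return Action.DECEIVE      # answered with fake information
    return Action.ALLOW            # no scraping verdict, no action
```

For example, `choose_action(ScraperProfile("aggregator", 50))` returns `Action.RATE_LIMIT`, while the same aggregator at 5,000 requests per minute would be blocked because the volume check runs first.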

Cequence Web Scraping Prevention in Action

One of our social media customers processes over one billion transactions per day to prevent account takeovers (ATO), fake account creation, and reputation bombing involving fake likes and fake comments. This customer had always suspected that they had a scraping problem but had no visibility into it. Like most websites and social media sites, they allowed search engine and social network crawlers to crawl their site. Using API Spartan, the customer was able to uncover competitive scraping from China hiding among legitimate crawlers. These competitive scraping bots were using the same toolkits and common libraries as the legitimate crawlers, down to the same User-Agent strings in their requests, so they appeared to be legitimate crawlers. With API Spartan, the customer was able to separate the (legitimate) crawlers from the (competitive) scrapers and generate a unique bot fingerprint that allowed them to block the attack, even as the bad actors changed IP addresses and other request parameters.
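As a rough illustration of why a behavior-based fingerprint can survive IP rotation, the sketch below hashes traits that a client tends to keep across requests (the order of its header names and the shape of the paths it walks) and deliberately ignores the IP address and the User-Agent value. This is a simplified stand-in and not the fingerprinting technique API Spartan actually uses; the session dictionary format and all function names are assumptions.

```python
import hashlib
import re
from collections import defaultdict


def path_shape(path):
    """Collapse purely numeric path segments so /user/1 and /user/2 share a shape."""
    return re.sub(r"/\d+(?=/|$)", "/{n}", path)


def behavior_fingerprint(header_names, paths):
    """Hash header-name order plus the sequence of path shapes.

    IP address and User-Agent value are intentionally left out, since
    the scrapers in the story above could change or copy both.
    """
    header_part = "|".join(name.lower() for name in header_names)
    path_part = ">".join(path_shape(p) for p in paths)
    material = header_part + "#" + path_part
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


def group_sessions(sessions):
    """Group session IPs by fingerprint; each session is a dict (assumed format)."""
    groups = defaultdict(list)
    for session in sessions:
        key = behavior_fingerprint(session["headers"], session["paths"])
        groups[key].append(session["ip"])
    return dict(groups)


# Illustrative sessions using documentation-only IP ranges.
sessions = [
    {"ip": "203.0.113.5", "headers": ["Host", "User-Agent", "Accept"],
     "paths": ["/user/1", "/user/2"]},
    {"ip": "198.51.100.9", "headers": ["Host", "User-Agent", "Accept"],
     "paths": ["/user/7", "/user/8"]},
    {"ip": "192.0.2.44", "headers": ["User-Agent", "Host", "Accept", "Accept-Language"],
     "paths": ["/", "/about"]},
]
groups = group_sessions(sessions)
```

The first two sessions come from different IP addresses but land in the same group because their header order and crawl pattern match. A production system would draw on far more signals than these two.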

API Spartan prevents all attacks natively inline and requires no JavaScript, Mobile SDK, or web application firewall (WAF) integration to detect and block cyber-attacks.


Author

Ameya Talwalkar

President, Chief Executive Officer & Founder

Ameya Talwalkar, founder and CEO at Cequence, has built strong engineering teams in Silicon Valley, Los Angeles, Madrid, Pune, and Chengdu over the past decade. Previously at Symantec, he led the development of advanced anti-malware technology. Ameya holds a Bachelor of Engineering from the University of Mumbai’s SPCE.
