Crawling large volumes of web pages

Crawling and scraping data from the web is a funny thing. It’s fairly easy to get started and it gives you immediate results. However, scaling from a basic crawler (a quick Python script, for example) to a full-speed, large-volume crawler is hard. I’ll walk through a few typical challenges you face when building such a web crawler.

Concurrency

Concurrency is central to more and more modern applications, and this is especially true of applications that rely heavily on network access, like web crawlers. Since every HTTP request you trigger takes a long time to return, you’d better launch them in parallel rather than sequentially. Basically, if you’re crawling 10 web pages that take 1 second each, the whole batch takes roughly 1 second rather than 10.

So concurrency is critical to web crawlers, but how to achieve it?

The naive approach, which works well for a small application, is to write logic that triggers jobs in parallel, waits for all the results, and processes them. Typically in Python you would spawn several parallel processes, and in Go (which is better suited for this kind of thing) you would create goroutines. But handling this manually can quickly become a hassle: since your RAM and CPU resources are limited, there’s no way you can crawl millions of web pages in parallel. So how do you handle job queues, and how do you handle retries when some jobs fail (and they will for sure) or when your server stops for some reason?
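To make the naive approach concrete, here is a minimal sketch of a bounded pool of goroutines in Go. The names crawlAll and fakeFetch are hypothetical, the fetch function stands in for a real HTTP call, and failed jobs are simply skipped instead of retried, which is exactly the part that becomes painful to handle by hand.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll runs fetch for every URL using at most `workers` goroutines
// at a time and returns the successful results keyed by URL.
func crawlAll(urls []string, workers int, fetch func(string) (string, error)) map[string]string {
	jobs := make(chan string)
	results := make(map[string]string)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body, err := fetch(u)
				if err != nil {
					continue // a real crawler would retry or log here
				}
				mu.Lock()
				results[u] = body
				mu.Unlock()
			}
		}()
	}

	// Feed the workers, then signal that no more jobs are coming.
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// fakeFetch is a hypothetical stand-in for a real HTTP request.
	fakeFetch := func(u string) (string, error) { return "body of " + u, nil }
	res := crawlAll([]string{"https://example.com/a", "https://example.com/b"}, 2, fakeFetch)
	fmt.Println(len(res), "pages crawled")
}
```

The number of workers caps memory and CPU usage, but the queue lives only in the process: if the process stops, pending jobs are lost.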

The most robust approach is to leverage a messaging system like RabbitMQ. Every new URL parsed by your application should now be enqueued in RabbitMQ, and every new page your application needs to crawl should be dequeued from RabbitMQ. The number of concurrent requests you want to reach then becomes a simple setting on the consumer side (typically the number of consumers and their prefetch count).

Of course, even when using a messaging system, the choice of the underlying programming language remains important: triggering 100 parallel jobs in Go will cost you far fewer resources than in Python, for example (which is partly why I really like Go!).

Scalability

At some point, no matter how lightweight and optimized your web crawler is, you’ll be limited by hardware resources.

The first solution is to upgrade your server (which is called “vertical scalability”). It’s easy, but once you reach a certain level of RAM and CPU, it’s cheaper to favor “horizontal scalability”.

Horizontal scalability is about adding several modest servers to your infrastructure, rather than turning one single server into a supercomputer. Achieving this is harder though, because your servers might have to communicate about a shared state, and your application might need refactoring. The good news is that a web crawler can fairly easily become “stateless”: several instances of your application can run in parallel, and shared information will most likely live in your messaging system and/or your database. It’s then easy to increase or decrease the number of servers based on the speed you want to achieve. Each server should handle a certain number of concurrent requests consumed from the messaging server. It’s up to you to define how many concurrent requests each server can handle depending on its RAM/CPU resources.

Container orchestrators like Kubernetes make horizontal scalability easier. It’s easy to scale up to more instances simply by clicking a button, and you can even let Kubernetes auto-scale your instances for you (always set limits though, in order to control the costs).

If you want a deeper understanding of the scalability challenges, you should read this amazing book by Martin Kleppmann: Designing Data-Intensive Applications.

Report Errors Wisely

Tons of ugly things can happen during a crawl: connectivity issues (on the client side and on the server side), network congestion, target page too big, memory limit reached,…

It’s crucial that you handle these errors gracefully and that you report them wisely in order not to get overwhelmed by errors.

A good practice is to centralize all errors in Sentry. Some errors are never sent to Sentry because we don’t consider them critical and we don’t want to be alerted about them. For example, we want to know when an instance runs into memory issues, but we don’t want to know when a URL cannot be downloaded because a website timed out (this kind of error is business as usual for a crawler). It’s up to you to fine-tune which errors are worth reporting urgently and which ones are not.

File Descriptors and RAM Usage

When dealing with web crawlers, it’s worth being familiar with file descriptors. Every HTTP request you launch opens a socket, which is a file descriptor, and every open file descriptor consumes memory.

On Linux systems, the maximum number of file descriptors open at the same time is capped by the OS in order to avoid breaking the system (you can check the per-process limit with ulimit -n). Once this limit is reached, you won’t be able to open any new web page concurrently.

You might want to increase this limit, but proceed carefully as it could lead to excessive RAM usage.
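A crawler can also read this limit at startup and derive its concurrency from it. The sketch below assumes a Unix-like system and uses the standard syscall package; maxConcurrentRequests and the reserve of 100 descriptors are hypothetical choices meant to leave headroom for log files, database connections and so on.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit returns the soft and hard limits on open file descriptors
// for the current process (Unix-like systems only).
func openFileLimit() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

// maxConcurrentRequests is a hypothetical helper: it keeps `reserved`
// descriptors free for everything that is not an outgoing HTTP request.
func maxConcurrentRequests(soft, reserved uint64) uint64 {
	if soft <= reserved {
		return 0
	}
	return soft - reserved
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("could not read the limit:", err)
		return
	}
	fmt.Println("soft limit:", soft, "hard limit:", hard)
	fmt.Println("suggested max concurrent requests:", maxConcurrentRequests(soft, 100))
}
```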

Avoid Some Traps

Here are 2 typical tricks that drastically help improve performance when crawling large volumes of data:

Abort if excessive page size: some pages are too big and should be ignored not only for stability reasons (you don’t want to fill your disk with it) but also for efficiency reasons.
Fine-tune timeouts: a web request might time out for several reasons, and it’s important to understand the underlying concepts in order to set different levels of timeouts. See this great Cloudflare article for more details. In Go you can set a timeout when creating a net/http client, but a more idiomatic (and maybe more modern) approach is to use contexts for that purpose.

DNS

When crawling millions of web pages, the default DNS server you’re using is likely to end up rejecting your requests. It’s then worth switching to a more robust DNS server like the Google or Cloudflare ones, or even rotating resolution requests among several DNS servers.
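In Go, rotating among several DNS servers can be sketched with a custom net.Resolver. The type dnsRotation and its methods are hypothetical names; 8.8.8.8 and 1.1.1.1 are the public resolvers from Google and Cloudflare. The sketch relies on Go’s built-in resolver (PreferGo), since the Dial hook is only honored by it.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync/atomic"
	"time"
)

// dnsRotation hands out DNS servers in round-robin order.
type dnsRotation struct {
	next    uint64 // kept first so atomic access stays 64-bit aligned
	servers []string
}

// pick returns the next DNS server address ("host:port").
func (r *dnsRotation) pick() string {
	n := atomic.AddUint64(&r.next, 1)
	return r.servers[(n-1)%uint64(len(r.servers))]
}

// resolver builds a net.Resolver whose connections go to the rotating
// servers instead of the system default.
func (r *dnsRotation) resolver() *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // use Go's resolver so that Dial is honored
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, r.pick())
		},
	}
}

func main() {
	rot := &dnsRotation{servers: []string{"8.8.8.8:53", "1.1.1.1:53"}}

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// This lookup needs network access; the error is printed if it fails.
	addrs, err := rot.resolver().LookupHost(ctx, "example.com")
	fmt.Println(addrs, err)
}
```

To use it for HTTP requests, the resolver can be set on the net.Dialer used by the client’s transport.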

Refresh Data

Crawling data once is often of little interest. Data should be refreshed asynchronously on a regular basis using periodic tasks, or synchronously upon a user request.

A recent application I worked on refreshed data asynchronously. Every time we crawled a domain, we stored the current date in the database, and then every day a periodic task looked for all the domains in the database that needed to be refreshed. As Cron was too limited for our needs, we used this more advanced Cron-like tool for Go applications: https://github.com/robfig/cron.

Being Fair

Crawling the web should be done respectfully. It basically means 2 things:

don’t crawl a web page if its robots.txt file disallows it
don’t hammer a single web server with tons of requests: set a very low concurrency level when you’re crawling several pages from a single domain, and pause for a moment between 2 requests

Conclusion

Setting up a large volume crawler is a fascinating journey which requires both coding and infrastructure considerations.

In this post I’m only scratching the surface, but hopefully some of these concepts will help you in your next project! If you have comments or questions, please don’t hesitate to reach out to me. I’ll be glad to hear from you.

October 31, 2020
