How to add a delay between requests in Scrapy?

Python, Web Scraping, Scrapy

Python Problem Overview


I don't want to crawl simultaneously and get blocked. I would like to send one request per second.

Python Solutions


Solution 1 - Python

There is a setting for that:

> DOWNLOAD_DELAY

> Default: 0
>
> The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

DOWNLOAD_DELAY = 0.25    # 250 ms of delay

Read the docs: https://doc.scrapy.org/en/latest/index.html
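
For the one-request-per-second case in the question, a minimal settings.py sketch might look like the following. It assumes you also want to cap per-domain concurrency and turn off Scrapy's default delay randomization, which otherwise waits between 0.5 and 1.5 times DOWNLOAD_DELAY. The helper delay_range is a hypothetical illustration, not part of Scrapy, and would not normally live in settings.py.

```python
# settings.py (sketch): aim for roughly one request per second per site.
DOWNLOAD_DELAY = 1.0                  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # assumption: no parallel requests per domain
RANDOMIZE_DOWNLOAD_DELAY = False      # assumption: fixed delay instead of a jittered one


def delay_range(download_delay, randomize):
    """Hypothetical helper: the (min, max) wait implied by the two settings."""
    if randomize:
        # Documented behaviour: a random wait between 0.5x and 1.5x DOWNLOAD_DELAY.
        return (0.5 * download_delay, 1.5 * download_delay)
    return (download_delay, download_delay)


print(delay_range(DOWNLOAD_DELAY, RANDOMIZE_DOWNLOAD_DELAY))
```

Leaving RANDOMIZE_DOWNLOAD_DELAY at its default of True keeps the average near one second but makes the individual waits less regular.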

Solution 2 - Python

You can also set the 'download_delay' attribute on a spider if you don't want a global download delay. See http://doc.scrapy.org/en/latest/faq.html#what-does-the-response-status-code-999-means

Solution 3 - Python

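from scrapy import Spider

class S(Spider):
    name = "s"  # placeholder; Scrapy requires every spider to have a name
    rate = 1    # maximum number of pages per second

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.download_delay = 1 / float(self.rate)  # seconds between requests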

rate sets the maximum number of pages that can be downloaded per second.
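
The rate-to-delay conversion above is just a reciprocal. A small sketch, using a hypothetical helper name delay_for_rate, makes the arithmetic explicit and guards against a zero or negative rate:

```python
def delay_for_rate(rate):
    """Return the delay in seconds that caps throughput at `rate` pages per second."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate


print(delay_for_rate(1))  # 1.0  (one request per second)
print(delay_for_rate(4))  # 0.25 (four requests per second)
```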

Solution 4 - Python

Besides DOWNLOAD_DELAY, you can also use Scrapy's AutoThrottle feature: https://doc.scrapy.org/en/latest/topics/autothrottle.html

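It adjusts the delay between requests dynamically, based on response latency, within the limits given in the settings file. The delay starts at AUTOTHROTTLE_START_DELAY and is never set above AUTOTHROTTLE_MAX_DELAY or below DOWNLOAD_DELAY, so if you set all three to 1 it will wait about 1 second between requests.

Its purpose is to adapt the crawl speed to the load on the site being crawled. As a side effect the delay varies over time, which can make your bot harder to detect.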

You just need to set it in settings.py as follows:

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 3
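
To see how those settings interact, here is a simplified model of a single AutoThrottle update, following the steps described in the AutoThrottle documentation. The function next_delay and its parameter names are hypothetical, not Scrapy's API, and the real extension handles more detail than this sketch does.

```python
def next_delay(prev_delay, latency, status=200,
               target_concurrency=1.0, min_delay=0.0, max_delay=3.0):
    """Simplified model of one AutoThrottle update (not Scrapy's actual code)."""
    # Target delay: response latency divided by the desired concurrency.
    target = latency / target_concurrency
    # Move halfway from the previous delay towards the target.
    new_delay = (prev_delay + target) / 2.0
    # Keep the result between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Non-200 responses are not allowed to lower the delay.
    if status != 200 and new_delay < prev_delay:
        return prev_delay
    return new_delay


delay = 1.0  # AUTOTHROTTLE_START_DELAY from the settings above
for latency in (0.5, 0.5, 4.0):
    delay = next_delay(delay, latency)
    print(delay)
```

In this model a fast server pulls the delay down towards its latency, while a slow one pushes it up until it reaches the maximum.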

Solution 5 - Python

Delays can be set in two ways:

We can specify the delay when running the crawler, e.g. scrapy crawl sample --set DOWNLOAD_DELAY=3 (which means a 3-second delay between two requests).

Or we can specify it globally in settings.py: DOWNLOAD_DELAY = 3

By default, DOWNLOAD_DELAY is 0, so Scrapy does not wait between requests unless you set it.

Solution 6 - Python

If you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY=1 is the way to do it.

But Scrapy also has a feature that sets download delays automatically, called AutoThrottle. It adjusts the delay based on the load of both the Scrapy server and the website you are crawling. This usually works better than setting an arbitrary fixed delay.

Read more about this at http://doc.scrapy.org/en/1.0/topics/autothrottle.html#autothrottle-extension

I've crawled more than 100 domains with AutoThrottle turned on and have not been blocked.
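
If you want AutoThrottle's adaptive behaviour while still never going faster than one request per second, one option is to combine both settings, since AutoThrottle does not lower the delay below DOWNLOAD_DELAY. A sketch, with values chosen purely for illustration:

```python
# settings.py (sketch): AutoThrottle with a one-second floor.
DOWNLOAD_DELAY = 1                      # lower bound: AutoThrottle never goes below this
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1            # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10             # upper bound under high latency (illustrative value)
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = True               # log throttling stats for every response
```

With AUTOTHROTTLE_DEBUG enabled, the log shows how the delay is being adjusted, which helps when tuning the bounds.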

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | nizam.sp | View Question on Stackoverflow
Solution 1 - Python | warvariuc | View Answer on Stackoverflow
Solution 2 - Python | Mikhail Korobov | View Answer on Stackoverflow
Solution 3 - Python | Yan.Zero | View Answer on Stackoverflow
Solution 4 - Python | Mehmet Kurtipek | View Answer on Stackoverflow
Solution 5 - Python | Niranjan Sagar | View Answer on Stackoverflow
Solution 6 - Python | Jeff P Chacko | View Answer on Stackoverflow