Getting "Forbidden by robots.txt" in Scrapy

Python · Scrapy · Web Crawler

Python Problem Overview


While crawling a website such as https://www.netflix.com, I am getting Forbidden by robots.txt: <GET https://www.netflix.com/>

ERROR: No response downloaded for: https://www.netflix.com/

Python Solutions


Solution 1 - Python

In Scrapy 1.1, released 2016-05-11, the crawler first downloads robots.txt before crawling and obeys it by default. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:

ROBOTSTXT_OBEY = False

Here are the release notes
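If you would rather not change the project-wide default, Scrapy also lets a single spider override settings through its custom_settings attribute. A minimal sketch, where the spider name and parse logic are placeholders:

```python
import scrapy


class NetflixSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the override.
    name = "netflix"
    start_urls = ["https://www.netflix.com/"]

    # Per-spider override: skips the robots.txt check for this
    # spider only, leaving the project-wide default untouched.
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def parse(self, response):
        # Placeholder callback: just log the page title.
        self.logger.info("Title: %s", response.css("title::text").get())
```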

Solution 2 - Python

The first thing you need to ensure is that you change the user agent on your requests; otherwise Scrapy's default user agent, which identifies it as a crawler, will almost certainly be blocked.
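As a sketch of what that can look like, you can set the header on each request; the spider name and user-agent string below are placeholders, not recommendations:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider for illustration.
    name = "example"

    def start_requests(self):
        # Override the User-Agent per request; the default Scrapy
        # user agent is widely recognized and often blocked outright.
        yield scrapy.Request(
            "https://www.netflix.com/",
            headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Got %s with status %s", response.url, response.status)
```

To change it project-wide instead, set USER_AGENT in your settings.py.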

Solution 3 - Python

Netflix's Terms of Use state:

> You also agree not to circumvent, remove, alter, deactivate, degrade or thwart any of the content protections in the Netflix service; use any robot, spider, scraper or other automated means to access the Netflix service;

They have their robots.txt set up to block web scrapers. If you set ROBOTSTXT_OBEY = False in settings.py, you are violating their Terms of Use, which can result in a lawsuit.

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stack Overflow |
| --- | --- | --- |
| Question | deepak kumar | View Question on Stack Overflow |
| Solution 1 - Python | Rafael Almeida | View Answer on Stack Overflow |
| Solution 2 - Python | Ketan Patel | View Answer on Stack Overflow |
| Solution 3 - Python | CubeOfCheese | View Answer on Stack Overflow |