How to avoid HTTP error 429 (Too Many Requests) in Python

Tags: Python, Http, Mechanize, Http-Status-Code-429

Python Problem Overview


I am trying to use Python to login to a website and gather information from several webpages and I get the following error:

> Traceback (most recent call last):
>   File "extract_test.py", line 43, in <module>
>     response=br.open(v)
>   File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
>     return self._mech_open(url, data, timeout=timeout)
>   File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
>     raise response
> mechanize._response.httperror_seek_wrapper: HTTP Error 429: Unknown Response Code

I used time.sleep() and it works, but it seems unintelligent and unreliable. Is there any other way to dodge this error?

Here's my code:

import mechanize
import cookielib
import re
first=("example.com/page1")
second=("example.com/page2")
third=("example.com/page3")
fourth=("example.com/page4")
## I have seven URL's I want to open

urls_list=[first,second,third,fourth]

br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options 
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Log in credentials
br.open("example.com")
br.select_form(nr=0)
br["username"] = "username"
br["password"] = "password"
br.submit()

for url in urls_list:
    br.open(url)
    print re.findall("Some String", br.response().read())

Python Solutions


Solution 1 - Python

Receiving a status 429 is not an error; it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.

You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP; you should simply respect the server's answer by not sending too many requests.

If everything is set up properly, you will also have received a "Retry-After" header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and sleep your process for that many seconds.
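A minimal sketch of that approach with the requests library (the helper name and retry count below are illustrative; the asker's mechanize code would apply the same idea by catching the HTTP error and reading its headers):

import time
import requests

def get_with_retry(url, max_retries=5):
    """Hypothetical helper: retry a GET when the server answers 429,
    honouring the Retry-After header (assumed to be given in seconds)."""
    response = requests.get(url)
    for _ in range(max_retries):
        if response.status_code != 429:
            break
        time.sleep(int(response.headers.get("Retry-After", 1)))
        response = requests.get(url)
    return response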

You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3

Solution 2 - Python

Writing this piece of code when making requests fixed my problem:

import requests

response = requests.get(link, headers={'User-agent': 'your bot 0.1'})

This works because sites sometimes return a Too Many Requests (429) error when there isn't a user agent provided. For example, Reddit's API only works when a user agent is applied.

Solution 3 - Python

As MRA said, you shouldn't try to dodge a 429 Too Many Requests but instead handle it accordingly. You have several options depending on your use-case:

  1. Sleep your process. The server usually includes a Retry-After header in the response with the number of seconds you are supposed to wait before retrying. Keep in mind that sleeping a process might cause problems, e.g. in a task queue, where you should instead retry the task at a later time to free up the worker for other things.

  2. Exponential backoff. If the server does not tell you how long to wait, you can retry your request using increasing pauses in between. The popular task queue Celery has this feature built right in.

  3. Token bucket. This technique is useful if you know in advance how many requests you are able to make in a given time. Each time you access the API you first fetch a token from the bucket. The bucket is refilled at a constant rate. If the bucket is empty, you know you'll have to wait before hitting the API again. Token buckets are usually implemented on the other end (the API) but you can also use them as a proxy to avoid ever getting a 429 Too Many Requests. Celery's rate_limit feature uses a token bucket algorithm (a plain-Python sketch of this follows the Celery example below).

Here is an example of a Python/Celery app using exponential backoff and rate-limiting/token bucket:

import requests
from requests.exceptions import ConnectTimeout
from celery import shared_task as task  # or use your Celery app's @app.task decorator


class TooManyRequests(Exception):
    """Too many requests"""


@task(
    rate_limit='10/s',                                # token bucket: at most 10 calls per second
    autoretry_for=(ConnectTimeout, TooManyRequests,),
    retry_backoff=True)                               # exponential backoff between retries
def api(*args, **kwargs):
    r = requests.get('placeholder-external-api')

    if r.status_code == 429:
        raise TooManyRequests()
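If you are not using Celery, the token bucket from point 3 can also be kept on the client side. Here is a minimal plain-Python sketch; the TokenBucket class and the rate/capacity numbers are illustrative, not part of the original answer:

import time

class TokenBucket:
    """Minimal client-side token bucket: `rate` tokens are added per second,
    up to `capacity`; consume() blocks until a token is available."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.time()

    def consume(self):
        now = time.time()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) / self.rate)  # wait until the bucket refills
            self.tokens = 1
        self.tokens -= 1

# bucket = TokenBucket(rate=10, capacity=10)  # roughly ten requests per second
# bucket.consume()                            # call before every request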

Solution 4 - Python

import time

if response.status_code == 429:
    time.sleep(int(response.headers["Retry-After"]))  # assumes Retry-After is given in seconds

Solution 5 - Python

Another workaround would be to spoof your IP using some sort of public VPN or the Tor network. This assumes the server applies its rate-limiting at the IP level.

There is a brief blog post demonstrating a way to use Tor along with urllib2:

http://blog.flip-edesign.com/?p=119
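If you use the requests library rather than urllib2, routing traffic through a local Tor SOCKS proxy could be sketched as follows; the default port 9050 and the requests[socks] (PySocks) dependency are assumptions about your setup:

import requests

# Assumes a local Tor client listening on its default SOCKS port 9050,
# and that requests has SOCKS support installed: pip install requests[socks]
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}
response = requests.get("http://example.com/page1", proxies=proxies)
print(response.status_code)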

Solution 6 - Python

I've found a nice workaround to IP blocking when scraping sites: you can run a scraper indefinitely on Google App Engine and redeploy it automatically when you get a 429.

Check out this article

Solution 7 - Python

In many cases, continuing to scrape data from a website even when the server is requesting you not to is unethical. However, in the cases where it isn't, you can utilize a list of public proxies in order to scrape a website with many different IP addresses.
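For example, a simple rotation over such a list with the requests library could look like the sketch below; the proxy addresses are placeholders and the helper is illustrative, not part of the original answer:

import requests

# Placeholder proxies (documentation address ranges); substitute real ones.
proxy_list = [
    "http://203.0.113.10:8080",
    "http://198.51.100.23:3128",
]

def get_via_proxies(url):
    """Try each proxy in turn until one returns something other than 429."""
    for proxy in proxy_list:
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        except requests.RequestException:
            continue  # unreachable proxy, try the next one
        if response.status_code != 429:
            return response
    return None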

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Aous1000 | View Question on Stackoverflow
Solution 1 - Python | MRA | View Answer on Stackoverflow
Solution 2 - Python | tadm123 | View Answer on Stackoverflow
Solution 3 - Python | psaniko | View Answer on Stackoverflow
Solution 4 - Python | davidbrown | View Answer on Stackoverflow
Solution 5 - Python | Gaurav Agarwal | View Answer on Stackoverflow
Solution 6 - Python | Juan Luis Ruiz-tagle | View Answer on Stackoverflow
Solution 7 - Python | yeah22 | View Answer on Stackoverflow