Scraping a JSON response with Scrapy

Python Problem Overview


How do you use Scrapy to scrape web requests that return JSON? For example, the JSON would look like this:

{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021"
    },
    "phoneNumber": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "fax",
            "number": "646 555-4567"
        }
    ]
}

I would like to scrape specific items (e.g. the name and fax number in the example above) and save them to CSV.

Python Solutions


Solution 1 - Python

It's the same as using Scrapy's HtmlXPathSelector for HTML responses. The only difference is that you use the json module to parse the response:

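import json

import scrapy


class MySpider(scrapy.Spider):  # BaseSpider in older Scrapy releases
    ...

    def parse(self, response):
        # Decode the JSON body into a Python dict
        jsonresponse = json.loads(response.text)

        # MyItem is assumed to be a scrapy.Item subclass defined in your project
        item = MyItem()
        item["firstName"] = jsonresponse["firstName"]

        return item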

Hope that helps.
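
For the specific case in the question (name and fax number saved to CSV), the extraction step can be sketched with the standard library alone. This is a minimal sketch and not part of the original answer: `extract_row` and `rows_to_csv` are hypothetical helper names, and `SAMPLE` is a trimmed copy of the payload from the question. Inside a spider you would more likely yield the row from `parse` and let Scrapy's feed export (for example `scrapy crawl <spider> -o items.csv`) write the file.

```python
import csv
import io
import json

# Trimmed version of the payload shown in the question.
SAMPLE = """
{
    "firstName": "John",
    "lastName": "Smith",
    "phoneNumber": [
        {"type": "home", "number": "212 555-1234"},
        {"type": "fax", "number": "646 555-4567"}
    ]
}
"""


def extract_row(data):
    """Pick the name fields and the first fax number out of the parsed JSON."""
    fax = next(
        (p.get("number", "") for p in data.get("phoneNumber", [])
         if p.get("type") == "fax"),
        "",  # default when no fax entry exists
    )
    return {
        "firstName": data.get("firstName", ""),
        "lastName": data.get("lastName", ""),
        "fax": fax,
    }


def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text, header included."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["firstName", "lastName", "fax"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = extract_row(json.loads(SAMPLE))
csv_text = rows_to_csv([row])
```

Here `row` holds the three selected fields and `csv_text` contains a header line followed by one data line.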

Solution 2 - Python

You don't need the json module to parse the response object. Since Scrapy 2.2, text responses provide a json() method:

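import scrapy


class MySpider(scrapy.Spider):  # BaseSpider in older Scrapy releases
    ...

    def parse(self, response):
        # response.json() decodes the JSON body for you
        jsonresponse = response.json()

        item = MyItem()
        # .get() returns "" instead of raising KeyError if the field is missing
        item["firstName"] = jsonresponse.get("firstName", "")

        return item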
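
The `.get()` call with a default in the snippet above avoids a `KeyError` when a field is missing. For nested fields such as the city inside `address`, the same idea can be wrapped in a small helper. This is an illustrative sketch only: `dig` is a hypothetical helper name, not a Scrapy or standard-library function, and the dict below stands in for what `response.json()` would return.

```python
def dig(data, *keys, default=""):
    """Walk nested dicts/lists by key or index, returning default on any miss."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Stand-in for the dict that response.json() would return.
jsonresponse = {
    "firstName": "John",
    "address": {"city": "New York"},
    "phoneNumber": [{"type": "home", "number": "212 555-1234"}],
}

city = dig(jsonresponse, "address", "city")
second_number = dig(jsonresponse, "phoneNumber", 1, "number")  # index 1 is absent
country = dig(jsonresponse, "address", "country", default="unknown")
```

Missing keys or list indexes fall through to the default, so item fields can be filled without wrapping each lookup in its own try/except.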

Solution 3 - Python

If the JSON fails to load, one possible reason is that the strings in the body are wrapped in single quotes rather than the double quotes JSON requires. Try this:

json.loads(response.text.replace("'", '"'))
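
Note that a blanket `replace("'", '"')` corrupts any value that itself contains an apostrophe. If the body is really a Python-style literal with single-quoted strings, one alternative is to try `json.loads` first and fall back to `ast.literal_eval`. This is a sketch under that assumption: `parse_loose_json` is a hypothetical helper name, and the fallback will not accept JSON-only tokens such as `true` or `null`.

```python
import ast
import json


def parse_loose_json(text):
    """Parse strict JSON, falling back to a Python-literal parse for single quotes."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # ast.literal_eval only evaluates literals, so it does not run arbitrary code.
        return ast.literal_eval(text)


strict = parse_loose_json('{"firstName": "John"}')
loose = parse_loose_json("{'firstName': 'John', 'note': \"it's fine\"}")
```

In a spider, the `text` argument would be `response.text`.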

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Thomas Kingaroy | View Question on Stackoverflow
Solution 1 - Python | alecxe | View Answer on Stackoverflow
Solution 2 - Python | HARVYS 789 | View Answer on Stackoverflow
Solution 3 - Python | Manoj Sahu | View Answer on Stackoverflow