How to run Scrapy from within a Python script

Python, Web Scraping, Web Crawler, Scrapy

Python Problem Overview


I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:

# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script. 
# 
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
# 
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet. 
 
#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports
 
from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue
 
class CrawlerScript():
 
    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)
 
    def _item_passed(self, item):
        self.items.append(item)
 
    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)
 
    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)
 
# Usage
if __name__ == "__main__":
    log.start()
 
    """
    This example runs spider1 and then spider2 three times.
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items
 
# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010

Thank you.

Python Solutions


Solution 1 - Python

All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 demands:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
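For a concrete, self-contained illustration (the spider name, URL, and parse logic below are placeholder assumptions, not part of the original answer), a complete script might look like this:

import scrapy
from scrapy.crawler import CrawlerProcess

class ExampleSpider(scrapy.Spider):
    # Hypothetical minimal spider; replace with your own definition
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Yield one trivial item per crawled page
        yield {'title': response.xpath('//title/text()').extract()}

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(ExampleSpider)
process.start()  # the script will block here until the crawling is finished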

Solution 2 - Python

We can simply use:

from scrapy.crawler import CrawlerProcess
from project.spiders.test_spider import SpiderName

process = CrawlerProcess()
process.crawl(SpiderName, arg1=val1,arg2=val2)
process.start()

These keyword arguments are passed to the spider's __init__ method, so they can be used anywhere inside the spider.
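As a sketch of a spider that consumes such arguments (the spider name, argument name, and URL here are hypothetical placeholders):

import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderName(scrapy.Spider):
    name = 'spider_name'  # hypothetical spider name

    def __init__(self, category=None, *args, **kwargs):
        super(SpiderName, self).__init__(*args, **kwargs)
        # 'category' comes from process.crawl(SpiderName, category='books')
        self.start_urls = ['http://example.com/%s' % category]

    def parse(self, response):
        yield {'url': response.url}

process = CrawlerProcess()
process.crawl(SpiderName, category='books')
process.start()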

Solution 3 - Python

Though I haven't tried it, I think the answer can be found in the Scrapy documentation. To quote directly from it:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here

From what I gather, this is a new development in the library which renders some of the earlier approaches online (such as the one in the question) obsolete.

Solution 4 - Python

In scrapy 0.19.x you should do this:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent

Note these lines

settings = get_project_settings()
crawler = Crawler(settings)

Without them, your spider won't use your project settings and will not save the items. It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the doc example.

Another way to do this is simply to call the command directly from your script:

from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split())  #followall is the spider's name

Copied this answer from my first answer here: https://stackoverflow.com/a/19060485/1402286

Solution 5 - Python

When multiple crawlers need to be run inside one Python script, the reactor stop has to be handled with caution, because the reactor can only be stopped once and cannot be restarted.
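With a current Scrapy version, one way to avoid that problem (a sketch, assuming your project has hypothetical spider classes Spider1 and Spider2) is to queue every spider on a single CrawlerProcess before starting it, so the reactor is started and stopped exactly once:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical imports; adjust to your own project layout
from myproject.spiders.spider1 import Spider1
from myproject.spiders.spider2 import Spider2

process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)  # queue both crawls first...
process.crawl(Spider2)
process.start()         # ...then run the reactor once; it stops when all spiders finish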

However, I found while doing my project that using

os.system("scrapy crawl yourspider")

is the easiest. It saves me from handling all sorts of signals, especially when I have multiple spiders.

If performance is a concern, you can use multiprocessing to run your spiders in parallel, something like this:

import os
from multiprocessing import Pool

def _crawl(spider_name=None):
    # Shell out to the scrapy CLI to run a single spider
    if spider_name:
        os.system('scrapy crawl %s' % spider_name)
    return None

def run_crawler():
    spider_names = ['spider1', 'spider2', 'spider2']

    # One worker process per spider; map() blocks until all crawls finish
    pool = Pool(processes=len(spider_names))
    pool.map(_crawl, spider_names)

Solution 6 - Python

This is an improvement on https://stackoverflow.com/questions/53033791/scrapy-throws-an-error-when-run-using-crawlerprocess

and https://github.com/scrapy/scrapy/issues/1904#issuecomment-205331087

First, create your usual spider and make sure it runs successfully from the command line. It is very important that it runs and exports data, images, or files.

Once that works, do exactly as pasted in my program below: add the part above the spider class definition and the part below __name__ to invoke the settings.

This picks up the necessary settings, which "from scrapy.utils.project import get_project_settings" (recommended by many) failed to do for me.

Both the part above and the part below the spider must be present together; only one of them will not work. Also, the spider has to be run from the folder containing scrapy.cfg, not from any other folder.


#spider.py
import sys
sys.path.append(r'D:\ivana\flow') #folder where scrapy.cfg is located

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from flow import settings as my_settings

#----------------Typical Spider Program starts here-----------------------------

# ... your usual spider class definition goes here, e.g. class FlowSpider(scrapy.Spider): ...

#----------------Typical Spider Program ends here-------------------------------

if __name__ == "__main__":

    crawler_settings = Settings()
    crawler_settings.setmodule(my_settings)

    process = CrawlerProcess(settings=crawler_settings)
    process.crawl(FlowSpider) # it is for class FlowSpider(scrapy.Spider):
    process.start(stop_after_crawl=True)

Solution 7 - Python

# -*- coding: utf-8 -*-
import sys
from scrapy.cmdline import execute


def gen_argv(s):
    sys.argv = s.split()


if __name__ == '__main__':
    gen_argv('scrapy crawl abc_spider')
    execute()

Put this code in a location from which you can run scrapy crawl abc_spider on the command line. (Tested with Scrapy==0.24.6)

Solution 8 - Python

If you want to run a simple crawl, it's easy: just run the command

scrapy crawl <spider_name>

There are also options to export your results in formats like JSON, XML, or CSV:

scrapy crawl <spider_name> -o result.csv (or result.json, or result.xml)

You may want to try it.
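If you are running the crawl from a script rather than the command line, a roughly equivalent way to export results is to set an export feed on the CrawlerProcess. This is only a sketch, assuming a recent Scrapy version (2.1+ for the FEEDS setting) and a hypothetical MySpider class:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Hypothetical minimal spider; replace with your own
    name = 'my_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'title': response.xpath('//title/text()').get()}

process = CrawlerProcess(settings={
    # Roughly equivalent to "scrapy crawl my_spider -o result.json"
    'FEEDS': {'result.json': {'format': 'json'}},
})
process.crawl(MySpider)
process.start()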

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | user47954 | View Question on Stackoverflow
Solution 1 - Python | danielmhanover | View Answer on Stackoverflow
Solution 2 - Python | Arun Augustine | View Answer on Stackoverflow
Solution 3 - Python | mrmagooey | View Answer on Stackoverflow
Solution 4 - Python | Medeiros | View Answer on Stackoverflow
Solution 5 - Python | Fengzmg | View Answer on Stackoverflow
Solution 6 - Python | moorthypnt | View Answer on Stackoverflow
Solution 7 - Python | Kxrr | View Answer on Stackoverflow
Solution 8 - Python | doeun kol | View Answer on Stackoverflow