Selenium versus BeautifulSoup for web scraping

JavascriptPythonSeleniumBeautifulsoup

Javascript Problem Overview


I'm scraping content from a website using Python. First I used BeautifulSoup and Mechanize on Python but I saw that the website had a button that created content via JavaScript so I decided to use Selenium.

Given that I can find elements and get their content using Selenium with methods like driver.find_element_by_xpath, what reason is there to use BeautifulSoup when I could just use Selenium for everything?

And in this particular case, I need to use Selenium to click on the JavaScript button so is it better to use Selenium to parse as well or should I use both Selenium and Beautiful Soup?

Javascript Solutions


Solution 1 - Javascript

Before answering your question directly, it's worth saying as a starting point: if all you need to do is pull content from static HTML pages, you should probably use a HTTP library (like Requests or the built-in urllib.request) with lxml or BeautifulSoup, not Selenium (although Selenium will probably be adequate too). The advantages of not using Selenium needlessly:

  • Bandwidth, and time to run your script. Using Selenium means fetching all the resources that would normally be fetched when you visit a page in a browser - stylesheets, scripts, images, and so on. This is probably unnecessary.
  • Stability and ease of error recovery. Selenium can be a little fragile, in my experience - even with PhantomJS - and creating the architecture to kill a hung Selenium instance and create a new one is a little more irritating than setting up simple retry-on-exception logic when using requests.
  • Potentially, CPU and memory usage - depending upon the site you're crawling, and how many spider threads you're trying to run in parallel, it's conceivable that either DOM layout logic or JavaScript execution could get pretty expensive.

Note that a site requiring cookies to function isn't a reason to break out Selenium - you can easily create a URL-opening function that magically sets and sends cookies with HTTP requests using cookielib/cookiejar.

Okay, so why might you consider using Selenium? Pretty much entirely to handle the case where the content you want to crawl is being added to the page via JavaScript, rather than baked into the HTML. Even then, you might be able to get the data you want without breaking out the heavy machinery. Usually one of these scenarios applies:

  • JavaScript served with the page has the content already baked into it. The JavaScript is just there to do the templating or other DOM manipulation that puts the content into the page. In this case, you might want to see if there's an easy way to pull the content you're interested in straight out of the JavaScript using regex.
  • The JavaScript is hitting a web API to load content. In this case, consider if you can identify the relevant API URLs and just hit them yourself; this may be much simpler and more direct than actually running the JavaScript and scraping content off the web page.

If you do decide your situation merits using Selenium, use it in headless mode, which is supported by (at least) the Firefox and Chrome drivers. Web spidering doesn't ordinarily require actually graphically rendering the page, or using any browser-specific quirks or features, so a headless browser - with its lower CPU and memory cost and fewer moving parts to crash or hang - is ideal.

Solution 2 - Javascript

I would recommend using Selenium for things such as interacting with web pages whether it is in a full blown browser, or a browser in headless mode, such as headless Chrome. I would also like to say that beautiful soup is better for observing and writing statements that rely on if an element is found or WHAT is found, and then using selenium ot execute interactive tasks with the page if the user desires.

Solution 3 - Javascript

I used Selenium for web scraping, but it is not happy solution. In my last project I used https://github.com/chromedp/chromedp . It is more simple solution than Selenium.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionelieView Question on Stackoverflow
Solution 1 - JavascriptMark AmeryView Answer on Stackoverflow
Solution 2 - JavascriptLiamººTView Answer on Stackoverflow
Solution 3 - JavascriptRioView Answer on Stackoverflow