How can I parse a website using Selenium and BeautifulSoup in Python?

Python Problem Overview

I'm new to programming and have figured out how to navigate to where I need to go using Selenium. I'd like to parse the data now, but I'm not sure where to start. Can someone hold my hand for a sec and point me in the right direction?

Any help is appreciated.

Python Solutions

Solution 1 - Python

Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load the page_source into BeautifulSoup as follows:

In [8]: from bs4 import BeautifulSoup

In [9]: from selenium import webdriver

In [10]: driver = webdriver.Firefox()

In [11]: driver.get('http://news.ycombinator.com')

In [12]: html = driver.page_source

In [13]: soup = BeautifulSoup(html, 'html.parser')

In [14]: for tag in soup.find_all('title'):
   ....:     print(tag.text)
   ....:
Hacker News
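
The same flow can be written as a small script. This is only a rough sketch, assuming Python 3 and a recent Selenium with Firefox available; the helper names fetch_html and page_titles are made up for illustration. The Selenium import sits inside fetch_html purely so that the parsing half can be tried on a plain HTML string without a browser installed.

```python
from bs4 import BeautifulSoup


def fetch_html(url):
    """Load url in Firefox and return the rendered source (hypothetical helper)."""
    # Imported here so page_titles() can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


def page_titles(html):
    """Return the text of every <title> tag in html (hypothetical helper)."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("title")]
```

With a browser available, page_titles(fetch_html('http://news.ycombinator.com')) should return the same title as the session above. Keeping the fetch and the parse in separate functions also makes it easy to test the parsing on saved HTML.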

Solution 2 - Python

As your question isn't particularly concrete, here's a simple example. To do something more useful, read the BeautifulSoup docs. You will also find plenty of examples of Selenium (and BeautifulSoup) usage here on SO.

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://webpage.com')

# name the parser explicitly to avoid a bs4 warning
soup = BeautifulSoup(browser.page_source, 'html.parser')

# do something useful
# prints all the links with corresponding text
for link in soup.find_all('a'):
    print(link.get('href', None), link.get_text())

# close the browser when done
browser.quit()
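
Many href values on a real page are relative or missing, so the loop above often needs a little cleanup before the links are usable. Here is a minimal sketch of that, run on a hand-written HTML snippet instead of a live page; the extract_links helper, the sample markup, and the base URL are assumptions for illustration only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for a in soup.find_all("a", href=True):
        links.append((urljoin(base_url, a["href"]), a.get_text(strip=True)))
    return links


# Hand-written sample standing in for browser.page_source.
sample = """
<a href="/about">About us</a>
<a name="top">no href here</a>
<a href="http://example.org/x"> External </a>
"""

for url, text in extract_links(sample, "http://webpage.com/index.html"):
    print(url, text)
# http://webpage.com/about About us
# http://example.org/x External
```

In a Selenium session you would pass browser.page_source and browser.current_url in place of the sample string and base URL.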

Solution 3 - Python

Are you sure you want to use Selenium? For this kind of task I used PyQt4; it's very powerful, and you can do whatever you want with it.

Here is some sample code that I just wrote; just change the URL and you're good to go:

#! /usr/bin/env python2.7
# Python 2 / PyQt4 script: renders the page (including JavaScript) and dumps the HTML.

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView
from bs4 import BeautifulSoup
import sys, signal

class Browser(QWebView):
    def __init__(self):
        QWebView.__init__(self)
        self.loadProgress.connect(self._progress)
        self.loadFinished.connect(self._loadFinished)
        self.frame = self.page().currentFrame()

    def _progress(self, progress):
        print str(progress) + "%"

    def _loadFinished(self):
        # called once the page has finished loading
        print "Load Finished"
        html = unicode(self.frame.toHtml()).encode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        print soup.prettify()
        self.close()

if __name__ == "__main__":
    # restore default Ctrl+C handling so the Qt event loop can be interrupted
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    app = QApplication(sys.argv)
    br = Browser()
    url = QUrl('http://web site that can contain javascript.com')
    br.load(url)
    br.show()
    sys.exit(app.exec_())

Content Type	Original Author	Original Content on Stackoverflow
Question	twitch after coffee	View Question on Stackoverflow
Solution 1 - Python	RocketDonkey	View Answer on Stackoverflow
Solution 2 - Python	root	View Answer on Stackoverflow
Solution 3 - Python	Vor	View Answer on Stackoverflow
