Headless Browser for Python (Javascript support REQUIRED!)

JavascriptPythonScreen ScrapingHeadless Browser

Javascript Problem Overview


I need a headless browser which is fairly easy to use (I am still fairly new to Python and programming in general) which will allow me to navigate to a page, log into a form that requires Javascript, and then scrape the resulting web page by searching for results matching certain criteria, clicking check boxes, and clicking to download files. All of this requires Javascript.

I hear a headless browser is what I want - requirements/preferences are that I be able to run it from Python, and preferably that the resultant script will be compilable by py2exe (I am writing this program for other users).

So far Windmill looks like it MIGHT be what I want, but I am not sure.

Any ideas appreciated!

Javascript Solutions


Solution 1 - Javascript

I use webkit as a headless browser in Python via pyqt / pyside:
http://www.riverbankcomputing.co.uk/software/pyqt/download
http://developer.qt.nokia.com/wiki/Category:LanguageBindings::PySide::Downloads

I particularly like webkit because it is simple to setup. For Ubuntu you just use: sudo apt-get install python-qt4

Here is an example script:
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

Solution 2 - Javascript

The answer to this question was Spynner

Solution 3 - Javascript

I'm in the midst of writing a Python driver for Zombie.js, "a lightweight framework for testing client-side JavaScript code in a simulated environment".

I'm currently at a standstill on a resolution to a bug in Node.js (before I write more tests and more code), but feel free to keep an eye on my project as it progresses:

https://github.com/ryanpetrello/python-zombie

Solution 4 - Javascript

There are not too many headless browsers yet that support Javascript.

You could try Zombie.js or Phantomjs. Those are not Python, but plain Javascript and those really can do the job.

Solution 5 - Javascript

Try using phantomjs, it has great javascript support. Then you could run it as a subprocess of a python script

http://docs.python.org/library/subprocess.html

that could boss it around.

Solution 6 - Javascript

You can use HTQL in combination with IRobotSoft webscraper. Check here for examples: http://htql.net/

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionSteven MatthewsView Question on Stackoverflow
Solution 1 - JavascripthojuView Answer on Stackoverflow
Solution 2 - JavascriptSteven MatthewsView Answer on Stackoverflow
Solution 3 - JavascriptRyanTheDevView Answer on Stackoverflow
Solution 4 - JavascriptEpeliView Answer on Stackoverflow
Solution 5 - JavascriptshelmanView Answer on Stackoverflow
Solution 6 - JavascriptseagulfView Answer on Stackoverflow