Save and render a webpage with PhantomJS and node.js

Javascript, Html, node.js, Web Scraping, Phantomjs

Javascript Problem Overview


I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (JavaScript modifies the DOM), and then grabbing the HTML of the page.

This should be a simple example with an obvious use case for PhantomJS, but I can't find a decent one; the documentation seems to be all about command-line use.

Javascript Solutions


Solution 1 - Javascript

From your comments, I'd guess you have two options:

  1. Try to find a phantomjs node module - https://github.com/amir20/phantomjs-node
  2. Run phantomjs as a child process inside node - http://nodejs.org/api/child_process.html

Edit:

It seems a child process is the approach suggested by PhantomJS itself for interacting with node; see the FAQ - http://code.google.com/p/phantomjs/wiki/FAQ
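
For option 2, a minimal sketch of driving PhantomJS from node via child_process might look like the following; the filename page.js is a placeholder for a PhantomJS script such as the one in the next edit:

// Run a PhantomJS script as a child process and collect its stdout.
// 'page.js' is a hypothetical filename; see the PhantomJS script below.
var execFile = require('child_process').execFile;

execFile('phantomjs', ['page.js'], function (err, stdout, stderr) {
  if (err) {
    console.error('phantomjs failed:', stderr);
    return;
  }
  // stdout is whatever the PhantomJS script printed, e.g. the page HTML
  console.log(stdout);
});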

Edit:

Example PhantomJS script for getting a page's HTML markup:

var page = require('webpage').create();
page.open('http://www.google.com', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // Runs in the page context and returns the rendered markup
        var p = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        console.log(p);
    }
    phantom.exit();
});
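
If you just want to try that script on its own, save it to a file (page.js is a hypothetical name) and run it with the PhantomJS binary rather than node, e.g. phantomjs page.js. Everything it writes with console.log ends up on stdout, which is exactly what the child-process approach above reads.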

Solution 2 - Javascript

With v2 of phantomjs-node, it's pretty easy to print the HTML after it has been processed.

var phantom = require('phantom');

phantom.create().then(function(ph) {
  ph.createPage().then(function(page) {
    page.open('https://stackoverflow.com/').then(function(status) {
      console.log(status);
      page.property('content').then(function(content) {
        console.log(content);
        page.close();
        ph.exit();
      });
    });
  });
});

This will show the output as it would have been rendered with the browser.
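
If the page keeps modifying the DOM after the load event (the "waiting for the JavaScript to render" part of the question), one crude but common workaround is to wait a fixed delay before reading the content. A minimal sketch with phantomjs-node v2; the 2000 ms delay is an arbitrary assumption you would tune for your page:

var phantom = require('phantom');

phantom.create().then(function(ph) {
  ph.createPage().then(function(page) {
    page.open('https://stackoverflow.com/').then(function(status) {
      // Crude wait: give client-side scripts time to modify the DOM.
      // 2000 ms is an arbitrary value; tune it (or poll for a selector) as needed.
      setTimeout(function() {
        page.property('content').then(function(content) {
          console.log(content);
          page.close();
          ph.exit();
        });
      }, 2000);
    });
  });
});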

Edit 2019:

You can use async/await:

const phantom = require('phantom');

(async function() {
  const instance = await phantom.create();
  const page = await instance.createPage();
  await page.on('onResourceRequested', function(requestData) {
    console.info('Requesting', requestData.url);
  });

  const status = await page.open('https://stackoverflow.com/');
  const content = await page.property('content');
  console.log(content);

  await instance.exit();
})();

Or if you just want to test, you can use npx:

npx phantom@latest https://stackoverflow.com/

Solution 3 - Javascript

I've used two different ways in the past, including the page.evaluate() method that queries the DOM, which Declan mentioned. The other way I've passed info from the web page is to write it out with console.log() there, and in the PhantomJS script use:

page.onConsoleMessage = function (msg, line, source) {
  console.log('console [' + source + ':' + line + ']> ' + msg);
};

I might also trap the variable msg in onConsoleMessage and search for some encapsulated data; it depends on how you want to use the output.
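
As a sketch of that trapping idea, the HTMLDUMP: prefix below is a made-up marker that the code running inside the page would have to emit itself (e.g. console.log('HTMLDUMP:' + document.documentElement.outerHTML)):

page.onConsoleMessage = function (msg, line, source) {
  if (msg.indexOf('HTMLDUMP:') === 0) {
    // Strip the marker and keep the encapsulated data (here, the page markup)
    var html = msg.slice('HTMLDUMP:'.length);
    console.log(html);
  } else {
    console.log('console [' + source + ':' + line + ']> ' + msg);
  }
};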

Then in the Node.js script, you would have to scan the output of the PhantomJS script:

var spawn = require('child_process').spawn;

// scriptPath and scriptArgs are placeholders for your PhantomJS script and its arguments
var yourfunc = function(scriptPath, scriptArgs, callback) {
  var phantom = spawn('phantomjs', [scriptPath].concat(scriptArgs || []));
  phantom.stdout.setEncoding('utf8');
  phantom.stdout.on('data', function(data) {
    // parse or echo data
    var str_phantom_output = data.toString();
    // The above will get triggered one or more times, so you'll need to
    // add code to parse for whatever info you're expecting from the browser
  });
  phantom.stderr.on('data', function(data) {
    // do something with error data
  });
  phantom.on('exit', function(code) {
    if (code !== 0) {
      // console.log('phantomjs exited with code ' + code);
    } else {
      // clean exit: do something else, such as invoking the passed-in callback
    }
  });
};

Hope that helps some.

Solution 4 - Javascript

Why not just use this?

var page = require('webpage').create();
page.open("http://example.com", function (status) {
    if (status !== 'success') {
        console.log('FAIL to load the address');
    } else {
        console.log('Success in fetching the page');
        console.log(page.content);
    }
    phantom.exit();
});

Solution 5 - Javascript

Late update in case anyone stumbles on this question:

A project on GitHub, developed by a colleague of mine, aims at helping you do exactly that: https://github.com/vmeurisse/phantomCrawl.

It's still a bit young and certainly missing some documentation, but the example provided should help with basic crawling.

Solution 6 - Javascript

Here's an old version that I use, running Node, Express and PhantomJS, which saves out the page as a .png. You could tweak it fairly quickly to get the HTML.

https://github.com/wehrhaus/sitescrape.git
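
For reference, a rough sketch of what that tweak amounts to in a plain PhantomJS script (the file names and URL are placeholders): page.render() produces the .png, while page.content is the rendered markup.

var page = require('webpage').create();
page.open('http://example.com', function (status) {
    if (status === 'success') {
        page.render('page.png');    // what the linked project does: save a screenshot
        console.log(page.content);  // the tweak: grab the rendered HTML instead
    }
    phantom.exit();
});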

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Harry | View Question on Stackoverflow
Solution 1 - Javascript | Declan Cook | View Answer on Stackoverflow
Solution 2 - Javascript | Amir Raminfar | View Answer on Stackoverflow
Solution 3 - Javascript | ultrageek | View Answer on Stackoverflow
Solution 4 - Javascript | yossi | View Answer on Stackoverflow
Solution 5 - Javascript | Stilltorik | View Answer on Stackoverflow
Solution 6 - Javascript | user2950147 | View Answer on Stackoverflow