Scrape An Entire Website

Tags: Html, Web Scraping

Html Problem Overview


I'm looking for recommendations for a program to scrape and download an entire corporate website.

The site is powered by a CMS that has stopped working; getting it fixed is expensive, so we are going to redevelop the website.

So I would like to just get the entire website as plain html / css / image content and do minor updates to it as needed until the new site comes along.

Any recommendations?

Html Solutions


Solution 1 - Html

wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains www.website.com \
     --no-parent \
         www.website.com

Read more about each of these options in the wget manual.
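If you prefer short switches, roughly the same command condenses to the one-liner below (www.website.com is of course a placeholder for your own domain):

wget -r -nc -p -E -k --restrict-file-names=windows -D www.website.com -np www.website.com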

Solution 2 - Html

I know this is super old and I just wanted to put my 2 cents in.

wget -m -k -K -E -l 7 -t 6 -w 5 http://www.website.com

A little clarification regarding each of the switches:

-m Essentially, this means “mirror the site”, and it recursively grabs pages & images as it spiders through the site. It checks the timestamp, so if you run wget a 2nd time with this switch, it will only update files/pages that are newer than the previous time.

-k This will modify links in the html to point to local files. If instead of using things like page2.html as links throughout your site you were actually using a full http://www.website.com/page2.html you’ll probably need/want this. I turn it on just to be on the safe side – chances are at least 1 link will cause a problem otherwise.

-K The option above (lowercase k) edits the html. If you want the “untouched” version as well, use this switch and it will save both the changed version and the original. It’s just good practice in case something is awry and you want to compare both versions. You can always delete the one you didn’t want later.

-E This saves HTML & CSS with “proper extensions”. Careful with this one – if your site didn’t have .html extensions on every page, this will add it. However, if your site already has every file named with something like “.htm” you’ll now end up with “.htm.html”.

-l 7 By default, the -m we used above will recurse/spider through the entire site. Usually that’s ok. But sometimes your site will have an infinite loop in which case wget will download forever. Think of the typical website.com/products/jellybeans/sort-by-/name/price/name/price/name/price example. It’s somewhat rare nowadays – most sites behave well and won’t do this, but to be on the safe side, figure out the most clicks it should possibly take to get anywhere from the main page to reach any real page on the website, pad it a little (it would suck if you used a value of 7 and found out an hour later that your site was 8 levels deep!) and use that #. Of course, if you know your site has a structure that will behave, there’s nothing wrong with omitting this and having the comfort of knowing that the 1 hidden page on your site that was 50 levels deep was actually found.

-t 6 If trying to access/download a certain page or file fails, this sets the number of retries before it gives up on that file and moves on. You usually do want it to eventually give up (set it to 0 if you want it to try forever), but you also don’t want it to give up if the site was just being wonky for a second or two. I find 6 to be reasonable.

-w 5 This tells wget to wait a few seconds (5 seconds in this case) before grabbing the next file. It’s often critical to use something here (at least 1 second). Let me explain. By default, wget will grab pages as fast as it possibly can. This can easily be multiple requests per second, which has the potential to put huge load on the server (particularly if the site is written in PHP, makes MySQL accesses on each request, and doesn’t utilize a cache). If the website is on shared hosting, that load can get someone kicked off their host. Even on a VPS it can bring some sites to their knees. And even if the site itself survives, being bombarded with an insane number of requests within a few seconds can look like a DoS attack, which could very well get your IP auto-blocked. If you don’t know for certain that the site can handle a massive influx of traffic, use the -w # switch. 5 is usually quite safe. Even 1 is probably ok most of the time. But use something.
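As an aside, -m is just shorthand for a bundle of other switches (per the wget manual it expands to roughly -r -N -l inf --no-remove-listing), and wget also offers --random-wait and --limit-rate if you want to soften the load on the server even further. A gentler variant of the command above might look something like this (the URL being a placeholder, as before):

wget -m -k -K -E -l 7 -t 6 -w 5 --random-wait --limit-rate=200k http://www.website.com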

Solution 3 - Html

None of the above got exactly what I needed (the whole site and all assets). This worked though.

First, get wget installed on OSX.
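If you use Homebrew, a one-liner should do it:

brew install wget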

Then run this

wget --recursive --html-extension --page-requisites --convert-links http://website.com
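If that command wanders off onto other domains, it can be kept on the one site by adding the --no-parent and --domains switches from Solution 1 (website.com is a placeholder here):

wget --recursive --html-extension --page-requisites --convert-links --no-parent --domains website.com http://website.com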

Solution 4 - Html

Consider HTTrack. It's a free and easy-to-use offline browser utility.

>It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.
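HTTrack also comes with a command-line client; a minimal invocation looks roughly like the following (the URL, output directory, and filter pattern are placeholders to adjust for your site):

httrack "http://www.website.com/" -O ./website-mirror "+*.website.com/*" -v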

Solution 5 - Html

The best way is to scrape it with wget as suggested in @Abhijeet Rastogi's answer. If you aren't familiar with it, then BlackWidow is a decent scraper. I've used it in the past. http://www.sbl.net/

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Dale Fraser | View Question on Stackoverflow
Solution 1 - Html | Abhijeet Rastogi | View Answer on Stackoverflow
Solution 2 - Html | T0ny lombardi | View Answer on Stackoverflow
Solution 3 - Html | Tyler McGinnis | View Answer on Stackoverflow
Solution 4 - Html | p.campbell | View Answer on Stackoverflow
Solution 5 - Html | seanbreeden | View Answer on Stackoverflow