How to download all files (but not HTML) from a website using wget?

Ubuntu, Download, Wget

Ubuntu Problem Overview


How can I use wget to get all the files from a website?

I need all files except web page files like HTML, PHP, ASP, etc.

Ubuntu Solutions


Solution 1 - Ubuntu

To filter for specific file extensions:

wget -A pdf,jpg -m -p -E -k -K -np http://site/path/

Or, if you prefer long option names:

wget --accept pdf,jpg --mirror --page-requisites --adjust-extension --convert-links --backup-converted --no-parent http://site/path/

This will mirror the site, but files without a jpg or pdf extension will be automatically removed: wget still downloads the HTML pages so it can follow their links, then deletes anything that does not match the accept list.
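
If you need other file types as well, the accept list is simply a comma-separated set of suffixes; for example (the extensions here are only placeholders for whatever you actually need):

wget -A pdf,jpg,png,zip -m -p -E -k -K -np http://site/path/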

Solution 2 - Ubuntu

This downloaded the entire website for me:

wget --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://site/path/
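
For readability, this should be the same command spelled out with long option names (names as listed in the wget man page):

wget --no-clobber --convert-links --random-wait --recursive --page-requisites --adjust-extension --execute robots=off --user-agent=mozilla http://site/path/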

Solution 3 - Ubuntu

wget -m -p -E -k -K -np http://site/path/

The wget man page explains what each of those options does.

wget only follows links: if there is no link to a file from the index page, wget will not know it exists and hence will not download it. In other words, it helps if all files are linked to from web pages or from directory indexes.
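
If the files you need are not linked from any page, one workaround (a sketch, assuming you can produce the URLs yourself and put them one per line in a file, here hypothetically named urls.txt) is to skip crawling and hand wget an explicit list:

wget -nd -i urls.txt

Here -i (--input-file) reads the URLs from the file and -nd keeps all downloads in the current directory.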

Solution 4 - Ubuntu

I was trying to download the zip files linked from Omeka's themes page, which is a pretty similar task. This worked for me:

wget -A zip -r -l 1 -nd http://omeka.org/add-ons/themes/
  • -A: only accept zip files
  • -r: recurse
  • -l 1: one level deep (i.e., only files directly linked from this page)
  • -nd: don't create a directory structure, just download all the files into this directory.

All the answers with -k, -K, -E etc. options probably haven't really understood the question, since those are for rewriting HTML pages to make a local structure, renaming .php files, and so on. Not relevant.

To literally get all files except .html etc:

wget -R html,htm,php,asp,jsp,js,py,css -r -l 1 -nd http://yoursite.com
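
A quick sanity check after the run (assuming you started it in an otherwise empty directory, since -nd drops everything into the current directory):

ls | grep -E '\.(html?|php|asp|jsp)$' || echo "no web page files left behind"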

Solution 5 - Ubuntu

You may try:

wget --user-agent=Mozilla --content-disposition --mirror --convert-links -E -K -p http://example.com/

You can also add:

-A pdf,ps,djvu,tex,doc,docx,xls,xlsx,gz,ppt,mp4,avi,zip,rar

to accept only those specific extensions, or to reject specific extensions instead:

-R html,htm,asp,php

or to exclude specific directories:

-X "search*,forum*"

If the files are blocked by robots.txt rules (e.g. ones intended for search engine crawlers), you also have to add: -e robots=off
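
Putting those pieces together might look like this (the accept list and excluded directories are only examples, not a recommendation):

wget --user-agent=Mozilla --content-disposition --mirror --convert-links -E -K -p -A pdf,zip -X "search*,forum*" -e robots=off http://example.com/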

Solution 6 - Ubuntu

I know this topic is very old, but I landed here in 2021 looking for a way to download all the Slackware files from a mirror (http://ftp.slackware-brasil.com.br/slackware64-current/).

After reading all the answers, the best option for me was:

wget -m -p -k -np -R '*html*,*htm*,*asp*,*php*,*css*' -X 'www' http://ftp.slackware-brasil.com.br/slackware64-current/

I had to use *html* instead of just html to avoid downloading files like index.html.tmp.

Please forgive me for resurrecting this topic; I thought it might be useful to someone other than me, and my question is very similar to @Aniruddhsinh's.

Solution 7 - Ubuntu

Try this. It always works for me:

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL
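
For example (the target directory and URL here are just placeholders):

wget --mirror -p --convert-links -P ./local-copy http://example.com/docs/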

Solution 8 - Ubuntu

wget -m -A '*' -p -k -e robots=off www.mysite.com/

This will download all types of files locally, make the HTML files point to them, and ignore the robots.txt file.

Attributions

All content on this page is sourced from the original question and answers on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Aniruddhsinh | View Question on Stackoverflow
Solution 1 - Ubuntu | Zsolt Botykai | View Answer on Stackoverflow
Solution 2 - Ubuntu | izilotti | View Answer on Stackoverflow
Solution 3 - Ubuntu | Jesse | View Answer on Stackoverflow
Solution 4 - Ubuntu | Steve Bennett | View Answer on Stackoverflow
Solution 5 - Ubuntu | kenorb | View Answer on Stackoverflow
Solution 6 - Ubuntu | Nerun | View Answer on Stackoverflow
Solution 7 - Ubuntu | Suneel Kumar | View Answer on Stackoverflow
Solution 8 - Ubuntu | Abdalla Mohamed Aly Ibrahim | View Answer on Stackoverflow