Wget

Use Wget to create a static clone of a website

3 minute read

Required expertise level: Beginner / Intermediate

Platform: Gnu/Linux | macOS | MS Windows | Android | BSD


One of Wget’s features is the ability to crawl and index an entire website and download a fully functional static clone of it.

The static clone can later be refreshed and updated with new content published to the original website. There are several ways of performing this task with Wget, and results can vary depending on the original website’s properties: the CMS in use, the web server configuration, any DDoS protection, and any protection applied to online assets such as images and videos.


Install Wget

MS Windows

  • Install using the Chocolatey package manager

    choco install wget

Gnu/Linux

Examples

  • Debian/Ubuntu: apt install wget

  • Fedora: dnf install wget

  • Arch Linux: pacman -S wget

macOS

  • Install using Homebrew package manager

    brew install wget

  • Install using the MacPorts package manager

    port install wget
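Whichever package manager you use, you can confirm that the installation succeeded, and see which version you got, with a quick check:

```shell
# Prints the installed Wget version; the first line is enough.
wget --version | head -n 1
```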


Pulling the website to your local machine

wget --mirror --convert-links --adjust-extension --page-requisites http://example.org

Parameters and options description

--mirror

Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings.

--convert-links

After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

--adjust-extension

Ensure that downloaded files are saved with a filename extension matching their content type. If an HTML page is served from a URL that does not end in .html (for example //foo.com/bar.cgi?xyz), the local copy is saved with .html appended; likewise, downloaded CSS files receive a .css suffix. When combined with --convert-links, links inside the downloaded documents are rewritten to point at the renamed local files; only the filename part of each link is modified, and the rest of the URL is left untouched.
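As a concrete sketch (the URL path here is hypothetical, not part of this guide): a page served from a URL without an .html suffix is saved locally with the suffix appended.

```shell
# Hypothetical path: if http://example.org/about is served as text/html,
# --adjust-extension saves the local copy as about.html rather than about.
wget --adjust-extension http://example.org/about
```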

--page-requisites

This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
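Because --mirror turns on time-stamping, refreshing the clone is simply a matter of re-running the same command from the same directory: Wget re-downloads only the files that have changed on the server. The extra flags below are optional, standard Wget options for a gentler crawl; they are additions for illustration, not something this guide requires.

```shell
# Re-run from the directory that contains the earlier mirror;
# time-stamping (part of --mirror) skips files that have not changed.
wget --mirror --convert-links --adjust-extension --page-requisites \
  --wait=1 \
  --random-wait \
  --limit-rate=500k \
  http://example.org
```

--wait pauses between requests, --random-wait varies that pause, and --limit-rate caps bandwidth, all of which reduce the load on the origin server and the chance of the crawl being rate-limited.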


DDoS protection and asset-protection features (for example, Cloudflare’s) can block or distort the crawl. This can be solved either by whitelisting your IP address on Cloudflare and disabling asset-protection features for the duration of the crawl, or by configuring the origin server to allow direct access on a separate domain or sub-domain with basic authentication enabled; in the latter case, add the --http-user=[HTTP-USER] --http-password=[HTTP-PASSWORD] parameters to your Wget command to authenticate.
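If the origin is reachable on a separate sub-domain behind basic authentication, the mirror command might look like the following sketch; the sub-domain, username, and password are placeholders, not values from this guide.

```shell
# Hypothetical sub-domain and credentials -- substitute your own.
# --http-user / --http-password send HTTP Basic Auth credentials.
wget --mirror --convert-links --adjust-extension --page-requisites \
  --http-user=mirror-user \
  --http-password='s3cret' \
  http://origin.example.org
```

Note that a password on the command line is visible in the process list; Wget also offers --ask-password to prompt for it interactively instead.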

Last modified September 20, 2020: Hello world (17bfe5c)