3 minute read
Required expertise level : Beginner / Intermediate
Platform : GNU/Linux | macOS | MS Windows | Android | BSD
Last tested and confirmed : January 2022
The static clone can later be refreshed and updated with new content published to the original website. There are different ways of performing this task with Wget, and results may vary depending on the properties of the original website, including the CMS in use, the web server configuration, any DDoS protection, and any protection applied to online assets such as images and videos.
Chocolatey package manager for MS Windows
choco install wget
Note: As Wget is GNU software, it is available in the main repositories of most distributions, so installation should be as simple as using your distribution’s package manager.
apt install wget
dnf install wget
pacman -S wget
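If you are unsure which of the commands above applies to your system, a small script can check which package manager is present. The following is a minimal sketch rather than part of the original procedure: the function names install_cmd_for and suggest_install are hypothetical, only the four package managers mentioned in this guide are covered, and the script prints the command instead of running it, since installation usually requires administrator rights.

```shell
#!/bin/sh
# Map a package manager name to the matching Wget install command.
# install_cmd_for is a hypothetical helper used only for this illustration.
install_cmd_for() {
    case "$1" in
        apt)    echo "apt install wget" ;;
        dnf)    echo "dnf install wget" ;;
        pacman) echo "pacman -S wget" ;;
        choco)  echo "choco install wget" ;;
        *)      return 1 ;;
    esac
}

# Print the command for the first package manager found on this system.
suggest_install() {
    for pm in apt dnf pacman choco; do
        if command -v "$pm" >/dev/null 2>&1; then
            install_cmd_for "$pm"
            return 0
        fi
    done
    echo "No supported package manager found" >&2
    return 1
}

# Only suggest a command when Wget is not already available.
if command -v wget >/dev/null 2>&1; then
    echo "Wget is already installed"
else
    suggest_install || true
fi
```

Run the printed command yourself, with sudo or an administrator shell where your system requires it.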
Pulling the website to your local machine
Note: We will be using some basic Wget parameters that should work for the majority of websites, but you may need to refer to the Wget manual pages to tweak the options or to solve an issue with the resulting mirror.
wget --mirror --convert-links --adjust-extension --page-requisites http://example.org
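Because --mirror turns on time-stamping, running the same command again from the same directory is one way to refresh an existing clone. The wrapper below is a sketch under stated assumptions: the function name mirror_site, the default directory $HOME/mirrors and the WGET variable (which lets you substitute another binary or a stub) are all names chosen for this example. How much Wget skips on a later run depends on the server and the site, and pages rewritten by --convert-links may be fetched again.

```shell
#!/bin/sh
# Command used to fetch pages; override WGET to try a different binary.
WGET="${WGET:-wget}"

# mirror_site URL [DIRECTORY]
# Runs the mirror command inside a fixed directory so that later runs
# update the existing copy instead of starting a new one elsewhere.
mirror_site() {
    url="$1"
    dir="${2:-$HOME/mirrors}"
    mkdir -p "$dir" || return 1
    # Use a subshell so the caller's working directory is left unchanged.
    (
        cd "$dir" || exit 1
        "$WGET" --mirror --convert-links --adjust-extension --page-requisites "$url"
    )
}

# Example (not executed here):
#   mirror_site http://example.org "$HOME/mirrors"
```

Calling mirror_site a second time with the same arguments reuses the files already stored under the chosen directory.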
Parameters and options description
--mirror
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings.
--convert-links
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
--adjust-extension
Save downloaded HTML and CSS files with a .html or .css suffix when the URL does not already end in one. The Wget manual gives the following example under the related --convert-file-only option: if some link points to //foo.com/bar.cgi?xyz with --adjust-extension asserted and its local destination is intended to be ./foo.com/bar.cgi?xyz.css, then the link would be converted to //foo.com/bar.cgi?xyz.css. Note that only the filename part has been modified. The rest of the URL has been left untouched, including the net path ("//"), which would otherwise be processed by Wget and converted to the effective scheme.
--page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
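The same four options also have one-letter forms, and the Wget manual documents --mirror as currently equivalent to -r -N -l inf --no-remove-listing. The sketch below writes the same request three ways for comparison. The function names and the WGET variable are hypothetical and exist only so the commands can be inspected without being run.

```shell
#!/bin/sh
# Command used to fetch pages; override WGET to inspect the arguments.
WGET="${WGET:-wget}"

# Long option names, as used in this guide.
mirror_long() {
    "$WGET" --mirror --convert-links --adjust-extension --page-requisites "$1"
}

# One-letter forms: -m (--mirror), -k (--convert-links),
# -E (--adjust-extension), -p (--page-requisites).
mirror_short() {
    "$WGET" -m -k -E -p "$1"
}

# --mirror written out as the options the manual says it stands for.
mirror_expanded() {
    "$WGET" -r -N -l inf --no-remove-listing -k -E -p "$1"
}

# Example (not executed here):
#   mirror_short http://example.org
```

The expanded form is a convenient starting point when you need to change one behaviour, for example replacing -l inf with a fixed recursion depth.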
Tip: Wget2 is currently being developed. While it is not stable yet, it is a full rewrite of the original Wget and is meant to replace it in the near future. Wget2 comes with many new features, such as HTTP/2 support and multi-threaded downloads, which can make pulling large websites much faster.
Note: For websites operating behind Cloudflare, this process can be identified as malicious behaviour, because many requests come from one IP address in short intervals. This can result in partial downloads or in failure to download some assets, such as inline images and CSS files.
This can be solved either by whitelisting your IP address on Cloudflare and disabling asset protection features during the crawl, or by configuring the origin server to allow direct access on a different domain or sub-domain with basic authentication enabled. You can then add the --http-user=[HTTP-USER] and --http-password=[HTTP-PASSWORD] parameters to your Wget command to authenticate.
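For the second approach, the mirror command with HTTP basic authentication could look like the sketch below. It uses --http-user and --http-password, the option names documented in the current Wget manual. The function name mirror_with_auth, the WGET variable and the host origin.example.org are placeholders invented for this example. Keep in mind that a password passed on the command line may be visible to other users of the same machine.

```shell
#!/bin/sh
# Command used to fetch pages; override WGET to inspect the arguments.
WGET="${WGET:-wget}"

# mirror_with_auth URL USER PASSWORD
# Mirrors a site that is protected with HTTP basic authentication.
mirror_with_auth() {
    url="$1"
    user="$2"
    password="$3"
    "$WGET" --mirror --convert-links --adjust-extension --page-requisites \
        --http-user="$user" --http-password="$password" "$url"
}

# Example (not executed here); the host and credentials are placeholders:
#   mirror_with_auth http://origin.example.org "[HTTP-USER]" "[HTTP-PASSWORD]"
```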