4 minute read
Required expertise level : Beginner / Intermediate
Platform : Gnu/Linux | macOS | MS Windows | Android | BSD
HTTrack is free software developed for the specific purpose of downloading fully functional offline copies of any website.
It has many advantages over Wget and offers a graphical user interface. While it hasn't been updated since 2017, it proved to be efficient in most use cases during our testing.
Chocolatey package manager for MS Windows
choco install httrack
Or you can download the installation file here
Debian/Ubuntu based distributions
apt install httrack
Fedora
dnf install httrack
Arch Linux
pacman -S httrack
Note: A version with a graphical user interface also exists for Gnu/Linux, but it is still in beta; you can find the source here
Using Homebrew package manager
brew install httrack
Or using the MacPorts package manager
sudo port install httrack
Pulling the website to your local machine
httrack --mirror --robots=0 --stay-on-same-domain --user-agent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0" --keep-links=0 --path example.org --quiet https://example.org/ -* +example.org/*
Parameters and options description
--robots=0
Follow robots.txt and meta robots tags (0=never, 1=sometimes, 2=always, 3=always, even strict rules) (--robots[=N])
--stay-on-same-domain
Stay on the same principal domain
--user-agent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0"
User-agent field sent in HTTP headers
--keep-links=0
Keep original links (e.g. http://www.adr/link): --keep-links=0 relative links, --keep-links absolute links, --keep-links=3 absolute URI links, --keep-links=4 original links, --keep-links=5 transparent proxy links
--path example.org
Path for mirror/logfiles + cache (--path mirror[,path cache and logfiles])
--quiet
No questions, quiet mode
Replace example.org with the website you want to mirror
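To avoid retyping the long command for every site, the invocation above can be wrapped in a small shell function. This is only a sketch: the function name mirror_site is our own, and everything else mirrors the arguments described above.

```shell
#!/bin/sh
# Sketch: reusable wrapper around the httrack invocation above.
# mirror_site is a hypothetical helper; the flags are the ones
# described in this section.
mirror_site() {
    site="$1"
    ua="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0"
    httrack --mirror --robots=0 --stay-on-same-domain \
        --user-agent "$ua" --keep-links=0 --path "$site" --quiet \
        "https://$site/" "-*" "+$site/*"
}
# Usage: mirror_site example.org
```

Note that the filter arguments ("-*" and "+$site/*") are quoted so the shell doesn't expand them as globs before HTTrack sees them.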
HTTrack will crawl and scan the whole website, render every page, and save it locally to your machine in an offline-browsable form. The suggested combination of arguments converts the inline URLs to relative links, so the copy can be hosted virtually anywhere.
If the download process is interrupted or incomplete, HTTrack will use the cache to resume it and avoid re-downloading the same unchanged assets.
Here we need to consider a very important note about how HTTrack functions.
Normally, a cached version of all downloaded assets is saved in a directory named hts-cache under the main project's directory. This cache is used in every update, presumably to avoid having to crawl and download the whole website each time, which can be very time-consuming, especially with big websites.
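Per the HTTrack FAQ, an interrupted mirror can be resumed by re-running httrack from inside the project directory, where it finds hts-cache on its own; --continue and --update are HTTrack's documented long options for resuming and refreshing. The directory check below is our own guard so the sketch degrades gracefully when no mirror exists:

```shell
#!/bin/sh
# Resume or refresh an existing mirror from its hts-cache, assuming the
# mirror was created with --path example.org as in the command above.
if command -v httrack >/dev/null 2>&1 && [ -d example.org/hts-cache ]; then
    cd example.org && httrack --continue   # resume an interrupted mirror
    # httrack --update                     # or refresh without confirmation
else
    echo "no httrack mirror to resume here"
fi
```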
However, in our testing with websites of different sizes and structures, real-life behavior turned out to differ from what the software documentation describes.
This can be connected to several factors. Among them is the fact that HTTrack is relatively old software that hasn't received any updates since 2017, so its support for recent changes in web structure and technologies isn't the best.
Another factor that strongly affects the behavior of any tool with this functionality is the variety of configurations and installations of web servers, CMSs, and security measures.
We encourage you to dig deeper into the HTTrack documentation to find options and arguments that help you reach the best-suited configuration for your setup.
Note: When using HTTrack (and almost any website downloader) while your website sits behind Cloudflare's proxy and DDoS protection, it's highly important to set a user-agent in the arguments and make sure the chosen user-agent isn't blocked in the security settings of your web server and Cloudflare.
With big websites, many security settings and tools might identify the constant crawling and multiple hits in short time intervals as malicious behavior and block or throttle your IP address’s connection to the website.
In that case you should revise your security settings and find the maximum allowed number of connections; then you can use arguments like --max-pause=N to keep HTTrack's traffic to your website within the allowed limits.
Also, you should consider whitelisting your IP address in your security settings if the option exists.
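As a sketch of the throttling advice above, HTTrack's bandwidth options can be combined with the mirror command. The long option names --sockets (parallel connections) and --max-rate (bytes per second) come from HTTrack's help output, and the numeric values here are placeholders to tune against your server's limits; the snippet only prints the command it would run:

```shell
#!/bin/sh
# Throttled variant of the mirror command: fewer parallel connections
# and a bandwidth cap. The values (2 sockets, 50000 bytes/s) are
# placeholders; tune them to what your security settings allow.
site="example.org"
set -- httrack --mirror --robots=0 --stay-on-same-domain \
    --sockets=2 --max-rate=50000 \
    --path "$site" --quiet "https://$site/" "-*" "+$site/*"
echo "would run: $*"
# Uncomment to actually run the mirror:
# "$@"
```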