4 minute read

HTTrack

Required expertise level : Beginner / Intermediate

Platform : Gnu/Linux | macOS | MS Windows | Android | BSD

Last tested and confirmed : January 2022

HTTrack is a free software developed for the specific purpose of downloading fully functional offline copies of any website.

It has many advantages over Wget, and offers a graphical user interface. While it’s not being updated since 2017, it proofed to be efficient in most use cases in our testing scenarios.

Install HTTrack

Windows

Chocolatey package manager for MS Windows

choco install httrack
Or you can download the installation file here

Gnu/Linux

Since HTTrack is available in most major distributions main repositories, you can use your package manager to directly install the compiled version.

Example

apt install httrack
dnf install httrack
pacman -S httrack

In this page you can find instructions on downloading and building HTTrack from source.

Note

Also, a version with a graphical user interface exists for Gnu/Linux but still in beta, you can find the source here

macOS

Using Homebrew package manager

brew install httrack
Using Macports

sudo port install httrack

Pulling the website to your local machine

httrack --mirror --robots=0 --stay-on-same-domain --user-agent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0"  --keep-links=0 --path example.org --quiet https://example.org/ -* +example.org/*

Parameters and options description

--mirror

Mirror websites

--robots=0

Follow robots.txt and meta robots tags (0=never,1=sometimes,* 2=always, 3=always (even strict rules)) (–robots[=N])

--stay-on-same-domain

Stay on the same principal domain

--user-agent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0"

User-agent field sent in HTTP headers

--keep-links=0

keep original links (e.g. http://www.adr/link) (--keep-links=0 *relative link, --keep-links absolute links, --keep-links=4 original links, --keep-links=3 absolute URI links, --keep-links=5 transparent proxy link)

--path example.org

Path for mirror/logfiles+cache (--path mirror[,path cache and logfiles])

--quiet

no questions - quiet mode

https://example.org/ -*

Replace example.org with the website you want to mirror

+example.org/*

HTTrack will crawl and scan the whole website, renders every and save it locally to your machine in an offline browsable form. The suggested combination of arguments will convert the inline URLs to relative links which can be hosted virtually anywhere.

httrack --continue

In case of intercepted or uncompleted download process HTTrack will use the cache to resume the download process and make sure it won’t include re-downloading the same unchanged assets.

httrack --update

Here we need to consider a very important note about how HTTrack functions.

Normally there will be a cached version of all downloaded assets saved in a directory under the main project’s directory named hts-cache, the cache will be used in every update presumably to avoid having to crawl and download the whole website with every update, which can be a very time-consuming process especially with big websites.

However, in our testing with different websites of different sizes and structures, the real-life scenario turned to be different from what the software documentations provides.

For example with a big WordPress website running relatively recent version of Nginx web server and operating behind a Cloudflare proxy, the cache never helped to reduce the update times needed to refresh the static copy with new published content and HTTrack was almost crawling all over the website with every update.

This can be connected with many elements, among them will that HTTrack is a relatively old software which didn’t receive any updates since 2017 therefore it’s support for the most recent changes in the web structure and technologies isn’t the best.

One other element which will cause great impact and change of behavior in any tools doing the same functionality, will be dealing with different configuration and installations of Web servers, CMS, and security measurements.

We encourage you to dig deeper in HTTrack documentation to find options and arguments which can help to find the best suited configuration for your setup.

Note

When using HTTrack - and almost any website downloader -, while you are using Cloudflare proxy and DDoS protection for your website, it’s highly important to set a user-agent in the arguments and make sure the chosen user-agent isn’t blocked in the security settings in your web server and Cloudflare.

Note

With big websites, many security settings and tools might identify the constant crawling and multiple hits in short time intervals as malicious behavior and block or throttle your IP address’s connection to the website.

In that case you should revise your security settings and find the maximum allowed connections then you can use arguments like --max-rate=N, --connection-per-second=N, --max-pause=N to limit HTTrack traffic hitting your website to the maximum allowed numbers.

Also, you should consider whitelisting your IP address in your security settings if the option exists.