4 minute read

HTTrack

Required expertise level : Beginner / Intermediate

Platform : Gnu/Linux | macOS | MS Windows | Android | BSD

Last tested and confirmed : January 2022


Install HTTrack

Windows

  • Chocolatey package manager for MS Windows

    choco install httrack

  • Or you can download the installation file here

Gnu/Linux

Example

  • apt install httrack
  • dnf install httrack
  • pacman -S httrack

macOS

  • Using Homebrew package manager

    brew install httrack

  • Using Macports

    sudo port install httrack


Pulling the website to your local machine

httrack --mirror --robots=0 --stay-on-same-domain --user-agent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0"  --keep-links=0 --path example.org --quiet https://example.org/ -* +example.org/*
Parameters and options description

--mirror

Mirror websites

--robots=0

Follow robots.txt and meta robots tags (0=never,1=sometimes,* 2=always, 3=always (even strict rules)) (–robots[=N])

--stay-on-same-domain

Stay on the same principal domain

--user-agent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0"

User-agent field sent in HTTP headers

--keep-links=0

keep original links (e.g. http://www.adr/link) (--keep-links=0 *relative link, --keep-links absolute links, --keep-links=4 original links, --keep-links=3 absolute URI links, --keep-links=5 transparent proxy link)

--path example.org

Path for mirror/logfiles+cache (--path mirror[,path cache and logfiles])

--quiet

no questions - quiet mode

https://example.org/ -*

Replace example.org with the website you want to mirror

+example.org/*

HTTrack will crawl and scan the whole website, renders every and save it locally to your machine in an offline browsable form. The suggested combination of arguments will convert the inline URLs to relative links which can be hosted virtually anywhere.

httrack --continue

In case of intercepted or uncompleted download process HTTrack will use the cache to resume the download process and make sure it won’t include re-downloading the same unchanged assets.

httrack --update


Here we need to consider a very important note about how HTTrack functions.

Normally there will be a cached version of all downloaded assets saved in a directory under the main project’s directory named hts-cache, the cache will be used in every update presumably to avoid having to crawl and download the whole website with every update, which can be a very time-consuming process especially with big websites.

However, in our testing with different websites of different sizes and structures, the real-life scenario turned to be different from what the software documentations provides.

This can be connected with many elements, among them will that HTTrack is a relatively old software which didn’t receive any updates since 2017 therefore it’s support for the most recent changes in the web structure and technologies isn’t the best.

One other element which will cause great impact and change of behavior in any tools doing the same functionality, will be dealing with different configuration and installations of Web servers, CMS, and security measurements.

We encourage you to dig deeper in HTTrack documentation to find options and arguments which can help to find the best suited configuration for your setup.