TideLog Archive for the “WinHTTrack/HTTrack” Category

Please follow these common-sense rules to avoid any network abuse when using WinHTTrack. It’s not fair to other users, or to the site’s webmaster, if you hog all the bandwidth. Not all sites have unlimited bandwidth.

I have taken this info from http://www.httrack.com/html/abuse.html. Check back there for any changes.

  • Do not overload the websites!
    • Do not download too large websites: use filters, but use them sensibly
    • Do not use too many simultaneous connections
    • Use bandwidth limits
    • Use connection limits
    • Use size limits
    • Use time limits
    • Only disable robots.txt rules with great care
    • Try not to download during working hours
    • Check your mirror transfer rate/size
    • For large mirrors, first ask the webmaster of the site
  • Downloading a site can overload it, if you have a fast pipe, or if you capture too many simultaneous cgi (dynamically generated pages).

  • Ensure that you can copy the website
    • Are the pages copyrighted?
    • Can you copy them only for private purpose?
    • Do not make online mirrors unless you are authorized to do so
  • Do not overload your network
    • Is your (corporate, private..) network connected through dialup ISP?
    • Is your network bandwidth limited (and expensive)?
    • Are you slowing down the traffic?
  • Do not steal private information
    • Do not grab emails
    • Do not grab private information
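The “use limits” bullets above all come down to one mechanism: throttle your transfer so you never exceed a bytes-per-second budget. Here is a minimal sketch of that idea in Python. It is illustrative only, not HTTrack’s actual code; the clock and sleep hooks are injectable simply so the logic can be tested without real delays.

```python
import time


class Throttle:
    """Cap a transfer at max_rate bytes/second by sleeping between chunks.

    Illustrative sketch of a bandwidth limit, not HTTrack's implementation.
    clock/sleep are injectable so the logic is testable deterministically.
    """

    def __init__(self, max_rate, clock=time.monotonic, sleep=time.sleep):
        self.max_rate = max_rate      # budget, in bytes per second
        self.clock = clock
        self.sleep = sleep
        self.start = clock()          # when the transfer began
        self.sent = 0                 # total bytes transferred so far

    def wait(self, nbytes):
        """Account for nbytes just transferred; sleep if ahead of the cap."""
        self.sent += nbytes
        # Earliest moment we are allowed to continue at this rate:
        earliest = self.start + self.sent / self.max_rate
        delay = earliest - self.clock()
        if delay > 0:
            self.sleep(delay)
        return max(delay, 0.0)
```

Call `wait(len(chunk))` after each chunk you download; at a 50,000 B/s cap the loop self-paces to roughly 50 KB/s regardless of how fast the pipe is.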

Following these general guidelines will ensure fairness for all. I’ve seen sites shut down because the webmaster was fed up with his site repeatedly running out of monthly bandwidth, costing him extra money in server and hosting fees.

Use my article advice, and Xavier’s brilliant software, with GREAT care, and be considerate of others. Make donations to your favourite sites!

Many thanks to Bandit for his pointers about a disclaimer.


I’ve had a lot of people on its forum ask me how to correctly rip a website using WinHTTrack. It isn’t your fault you’re getting errors: the program’s default settings are set incorrectly, causing the copier to get booted off almost straight away in some cases. I’m here to help, because the only “person” on there, William Roeder, is the most arrogant, nasty, clueless individual I and many other HTTrack users have met.

So, to get a good copy, we need to understand several things that may STOP a copy:

Problem 1 – Browser ID (User Agent) set incorrectly by default

Every web browser has a User Agent string. This tells a web server about the browser, and details the operating system it is running on. Mine looks like this:

Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 (.NET CLR 3.5.30729)

Here’s a breakdown of what all that means:

  • Mozilla/5.0 – a compatibility token that almost every browser sends, for historical reasons
  • Windows; U; Windows NT 6.1 – the platform: Windows NT 6.1 is Windows 7, and “U” was Netscape-era shorthand for strong (US-grade) encryption
  • en-GB – the browser locale, British English
  • rv: … Gecko/20091201 – the Gecko rendering engine revision and build date
  • Firefox/3.5.6 – the browser name and version
  • (.NET CLR 3.5.30729) – added by the Microsoft .NET Framework assistant, advertising the installed runtime

This is where one of HTTrack’s problems comes in. Many websites don’t allow website copiers, because they eat bandwidth, so they block them. The Robots.txt file, which resides on the server, is where specific user agents can be listed for blocking. HTTrack’s default User Agent looks like this:

Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)

So, because the string identifies HTTrack, the copier gets kicked before anything is copied. We can change this so that HTTrack pretends to be another browser! I will show you how later in this article.

Problem 2 – Spider – follow Robots.txt enabled

WinHTTrack is a web spider. It works in much the same way as a web browser: it scours a website and copies everything you tell it to. By default, it is set to follow the Robots.txt file as detailed above, and because of this it gets kicked when run with the default settings and User Agent. I don’t know why Xavier Roche (the author) has never changed the defaults, because they cause confusion. I’m here to help!
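Problems 1 and 2 interact, and you can see the mechanism with Python’s standard-library robots.txt parser. The robots.txt content below is made up for illustration: it blocks the HTTrack product token while leaving other agents alone, which is exactly the situation the default settings run into.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt of the kind many sites ship:
# refuse the HTTrack crawler, say nothing about anyone else.
robots_txt = [
    "User-agent: HTTrack",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(robots_txt)

# A polite spider checks its product token against the rules before fetching:
print(rp.can_fetch("HTTrack", "http://example.com/page.html"))  # False: blocked
print(rp.can_fetch("Mozilla", "http://example.com/page.html"))  # True: allowed
```

Note that robots.txt is purely cooperative: it only stops a crawler that chooses to obey it, which is why the Spider setting below makes the difference.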

Step 1 – One thing it WON’T do:

1. Copy files from folders that have Directory Listing disabled.

This is not a bug, it’s deliberate. If the website owner doesn’t want you to view folder contents, a copier won’t see them either! This is set in a .htaccess file on the server, and CANNOT be overridden, unless web pages on the site link to what’s in the folder.
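For reference, on Apache servers disabling directory listings is typically a one-line directive like this (a generic example, not any particular site’s file):

```apacheconf
# .htaccess — refuse to generate folder listings for this directory
Options -Indexes
```

Since this runs entirely server-side, no client setting in HTTrack can override it.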

Step 2 – Setting Options

Once you’ve set your links in the Addresses window, click Set Options. If you’re copying a subscription site, don’t just paste URLs into the text box; use the Add URL button instead. This allows you to enter your credentials. Otherwise, copying will fail.
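The Add URL dialog asks for a username and password because password-protected sites commonly use HTTP Basic authentication, where the crawler must attach an Authorization header to every request. Roughly, that header is built like this (an illustrative sketch of the protocol, not HTTrack’s actual code; the credentials are made up):

```python
import base64


def basic_auth_header(user, password):
    """Build the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


print(basic_auth_header("alice", "s3cret"))  # → Basic YWxpY2U6czNjcmV0
```

Note the credentials are only base64-encoded, not encrypted, which is one more reason to treat subscription-site rips carefully.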

Then, click the Limits tab and set the maximum transfer rate to 50,000. If ripping a lot of high-res pictures, zips, and videos, I enter 250,000–500,000, as this value is in bytes, not kilobytes.
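To sanity-check those numbers, here is the plain arithmetic (assuming, as the setting suggests, that the value is a bytes-per-second rate cap):

```python
def describe_rate(bytes_per_sec):
    """Express a bytes/s cap as KB/s, and the time to pull 100 MB at it."""
    kb_per_sec = bytes_per_sec / 1000
    seconds_per_100mb = 100_000_000 / bytes_per_sec
    return kb_per_sec, seconds_per_100mb


kb, secs = describe_rate(50_000)
print(f"50,000 B/s = {kb:.0f} KB/s; a 100 MB site takes ~{secs / 60:.0f} minutes")
```

So 50,000 means a modest 50 KB/s, which is exactly the point: it keeps you from hogging the server, at the cost of a slower mirror.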

Leave everything else on this tab as it is, then click the Spider tab. Here we’ll stop HTTrack from following the robots file.

Set Spider to “no robots.txt rules”, leave everything else, and go to the Browser ID tab. Here we’re going to “spoof” HTTrack as another browser to fool the webserver into thinking we’re just another internet surfer!

Set “Browser identity” to anything you like, as long as you don’t pick one that’s too old; some sites refuse old or obsolete browsers, since they’re no longer supported and possibly full of holes. I set mine to “Mozilla/4.78 [en] (Windows NT 5.0; U)”, as that’s fairly close to mine. You can add your own browser string if you know how, but that’s beyond the scope of this article.
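The Browser ID setting is nothing magical: the User Agent is just an HTTP request header, and any client can set it to whatever it likes. This sketch shows the same spoof with Python’s standard library (example.com is a placeholder; no request is actually sent):

```python
from urllib.request import Request

# Build a request that presents a mainstream browser identity instead of
# the client's default — the same thing HTTrack's Browser ID option does.
req = Request(
    "http://example.com/",
    headers={"User-Agent": "Mozilla/4.78 [en] (Windows NT 5.0; U)"},
)

print(req.get_header("User-agent"))  # the identity the server will see
```

This is also why the trick works: the server has no way to verify the string, so it simply sees “just another surfer”.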

Also, set the HTML Footer to “None”; this removes the “Mirrored by HTTrack” footer stamped on copied pages, keeping them clean.

That’s it! Click OK, then Next, and your site should rip! If you have any problems, leave me a comment, and if I feel it’s relevant, I’ll do an article on your problem and how to put it right. I’ve been using HTTrack since its first release, and can take the time to explain things nicely, in a polite manner, unlike William Roeder. His second name should be “Ruder!!”
