talkgeektome.us

Internet Radio for Computer Geeks

Httrack And ``Non-commercial Searches''

Deepgeek

27 Mar 2009

Abstract:

This paper reviews httrack, an offline web site copier. After the review, httrack is shown to be a way of creating a limited web search using a common web server account.

This paper is a script for a podcast for ``Talk Geek To Me.'' The material will match the ``feature'' portion of the podcast (but not the ``administrative'' portions.)


Download Audio File, choose format OGG MP3

Download HPR0284: OGG MP3 Speex "Is Google Evil"
Closing music is "Salme Dahlstrom - Cmon YAll Klubjmpers Radio Mix"
Httrack's website


1 What is httrack?

Httrack is a small lightweight utility which mirrors websites. When you run httrack, you create a local copy of a website on your disk.

Httrack has a GUI version called webhttrack, as well as a proxy utility. It comes in Linux/Unix and Windows versions.

2 Why would I want to copy web sites?

There are a number of reasons you may want to copy a web site, the most important being convenience. There are times and places where we just can't be on-line. If you have a few favorite websites stored on your computer or USB drive, you can always refer to them as well as just read them. A common situation that comes to mind is being on an airplane. Frequently, there is no Internet, or perhaps only some limited and expensive connectivity. Having copies of a few favorite web sites is great in this situation.

Of course, we aren't talking about full-blown ``LAMP'' websites. We are talking about your typical small sites, what some people call ``informational'' sites. I love these. They are typically under one gigabyte in size, are mostly text-based HTML, and serve as mini-encyclopedias on a specialized topic. My personal favorites include a techno-shamanism site, a Linux site by a gentleman who lectures on Linux, and a few sites about a favorite hobby, pipe smoking. These sites are small and rich in information, and thus several can fit on a thumb drive.

Another reason to copy websites is to archive material. Web sites are ephemeral in nature, that is to say they can be ``here today, and gone tomorrow,'' literally. By having some archives, we can rest assured that some of our digital heritage is preserved.

Yet another reason is for research reference. Web sites can be changed, moved, or deleted quickly. Perhaps you are doing research and quoting the web. A copy of the site quoted as it appeared that day can be important to prove your own integrity in the future. Of course, you are not limited to the reasons discussed here. You may have reasons of your own for having a copy of web sites or web pages.

For some people, the things you can do with a website once it is captured this way may be the reason. When I first experimented with this process, I ran a small web server on my laptop with a Perl script for searches. I would invoke the search and keep learning about my favorite topics wherever I was. Now, if I wanted to do this again, I would probably use the web search indexing program htdig, which the KDE program ``khelpcenter'' depends on to build its searchable index of documents. If I did not want a full-blown index, httrack allows you to restructure a site as it is copied. Thus, you can put all your HTML in one directory, and even grep it if need be.
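For listeners who would rather script the ``grep it'' idea than set up an index, here is a minimal sketch in Python. It is not the Perl script mentioned above and not htdig; the names html_to_text and search_mirror are made up for illustration. It assumes the mirror is a directory of .html and .htm files that are readable as UTF-8, and it does a plain case-insensitive substring match on the page text.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects the text between tags and throws the tags away."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text outside of a tag.
        self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def search_mirror(mirror_dir, term):
    """Return the HTML files under mirror_dir whose text contains term.

    Hypothetical helper: a case-insensitive substring search, nothing
    smarter. Files are visited in sorted order so results are stable.
    """
    needle = term.lower()
    hits = []
    for path in sorted(Path(mirror_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".html", ".htm"}:
            continue
        text = html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        if needle in text.lower():
            hits.append(path)
    return hits
```

Calling search_mirror("mirrors", "briar") would list every mirrored page that mentions the word, assuming a directory named mirrors exists. Note that this simple extractor also sees the contents of script and style blocks, so a real index such as htdig will give cleaner results.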

3 What is Using httrack like?

I used the GUI version once or twice, and it was what you would expect: you type a URL into a web form and off it goes. Now I prefer to start with a command line.

Typically the first step would be to create a directory just for the mirror. I usually name the directory after the site. Then you kick off a command prompt in the directory, type ``httrack'' and the URL of the web site, and off it goes, downloading pages. With luck, it will give you a good local copy of the website with all the links adjusted. I usually use a switch to group the website by file type, but that's me. Now, if there are problems (think of the phrase ``runaway spider''), you can control-c the program and begin using switches and filters to limit the scope of its actions.
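The steps above can be sketched as a small Python wrapper, for anyone who wants to script their mirrors. Treat it as a sketch under assumptions rather than my exact procedure: the function names are made up, the example URL is a placeholder, and instead of changing into the directory first it passes the directory with httrack's -O option. The -rN depth limit and the +pattern and -pattern scan rules are the kind of ``switches and filters'' meant above. The switch for grouping by file type is not named in the text, so it is left out here.

```python
import subprocess
from urllib.parse import urlparse


def build_httrack_command(url, output_dir=None, max_depth=None, filters=()):
    """Build the argument list for one httrack run (hypothetical helper).

    output_dir defaults to the site's host name, mirroring the habit of
    naming the directory after the site.
    """
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"not an absolute URL: {url!r}")
    if output_dir is None:
        output_dir = host
    cmd = ["httrack", url, "-O", output_dir]
    if max_depth is not None:
        # -rN limits how many links deep the spider will follow.
        cmd.append(f"-r{max_depth}")
    # Scan rules such as "+*.example.com/*" or "-*.zip" go last.
    cmd.extend(filters)
    return cmd


def run_mirror(url, **options):
    """Run httrack and return its exit code. Assumes httrack is on PATH."""
    cmd = build_httrack_command(url, **options)
    return subprocess.run(cmd, check=False).returncode
```

For example, build_httrack_command("http://www.example.com/", max_depth=3, filters=["+*.example.com/*"]) produces a command that stays on one site and stops three links deep, which is one way to rein in a runaway spider.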

4 A real world problem - the ``vanity search''

I don't even know if ``vanity search'' is the right term for this. I also call it the ``non-commercial'' search.

Now in ``Hacker Public Radio Episode # 284'' several other HPR hosts and I discussed whether or not Google was evil. At one point in this discussion, we theorized that a search run by our own posse would be great. This got me thinking about a grassroots alternative to big search engines. That, and my experience with the Perl-driven search engine on a laptop, led me to a tentative solution. I don't know if it would be scalable or not, but I used httrack and my cpanel-driven web host to make a private non-commercial search.

The common web host administration program ``cpanel'' comes with a cgi-script search called ``entropy search.'' I married this with some httrack-generated mirrors to make a small search on my website for my on-line friends (this, BTW, includes all the listeners to my podcasts). You can check it out at Deepgeek.us/search.html. Any feedback will be appreciated. What follows are some details of what I did.

Before starting, a few caveats from the manuals (now you know why this is called ``Talk Geek To Me''). First, if you decide to do something like this, don't use httrack with any password protected sites. Httrack can copy these, but it will store the password in plain text in its logs. Don't do that! Second, the program ``entropy search'' searches ALL files on your web server and indexes them. So if you host any password protected sites on your server, don't do this, because the search indexes will contain the contents of your protected sites. Of course, if you are like me, and you only use your web server account for things you want to make publicly available, you can go ahead and knock yourself out.

The first step was to, you guessed it, mirror some favorite sites to my local computer. Httrack did the hard stuff for me.

Step two was copying those sites to a sub-folder on my website called ``mirrors.'' I also put up a robots.txt file to tell the search engines not to index these mirrors. I figured it would look like I was trying to steal content if I did otherwise.
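For anyone repeating step two, here is a hedged Python sketch of what such a robots.txt might contain, along with a quick check using the standard library's robots.txt parser. My actual file is not reproduced here. The helper names and the example.com URLs are placeholders, and the sketch assumes the mirrors live in a top-level folder named ``mirrors.''

```python
from urllib.robotparser import RobotFileParser


def robots_txt_for(excluded_dirs):
    """Return robots.txt text asking all crawlers to skip the given folders."""
    lines = ["User-agent: *"]
    for folder in excluded_dirs:
        # Normalize "mirrors", "/mirrors" and "mirrors/" to "/mirrors/".
        lines.append("Disallow: /" + folder.strip("/") + "/")
    return "\n".join(lines) + "\n"


def is_crawlable(robots_text, url):
    """Check a URL against robots.txt text the way a polite crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch("*", url)
```

The text from robots_txt_for(["mirrors"]) would be saved as robots.txt in the top folder of the website. Keep in mind that robots.txt is only a request. Well-behaved search engines honor it, but it does not block anyone from reading the mirrors.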

Then I logged onto the control panel of the web host, went to the ``cgi center'' page, and clicked the link to build the index for ``entropy search.'' Then I copied the HTML example into a web page I called ``search.html.'' Now people can search a few thousand of my favorite web pages, as well as all of my own material.

To close, if anybody else decides to try this, I would love to search your favorite sites.
