Diffstat (limited to 'debian/htdig/htdig-3.2.0b6/htdoc/FAQ.html')
-rw-r--r-- | debian/htdig/htdig-3.2.0b6/htdoc/FAQ.html | 2590 |
1 files changed, 2590 insertions, 0 deletions
diff --git a/debian/htdig/htdig-3.2.0b6/htdoc/FAQ.html b/debian/htdig/htdig-3.2.0b6/htdoc/FAQ.html new file mode 100644 index 00000000..9f2db468 --- /dev/null +++ b/debian/htdig/htdig-3.2.0b6/htdoc/FAQ.html @@ -0,0 +1,2590 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> +<html> + <head> + <title>ht://Dig Frequently Asked Questions</title> + <link rel="stylesheet" href="css/htdig.css"> + </head> + <body bgcolor="#eef7ff"> + <h1>Frequently Asked Questions</h1> + <p> + ht://Dig Copyright © 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br> + Please see the file <a href="COPYING">COPYING</a> for + license information. + </p> + <hr noshade size=4> + <p class="main">This FAQ is compiled by the ht://Dig developers and the + most recent version is available at <<a + href="http://www.htdig.org/FAQ.html">http://www.htdig.org/FAQ.html</a>>. + Questions (and answers!) are greatly appreciated. + Please send questions and/or answers to the ht://Dig user + mailing list at: <<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general@lists.sourceforge.net</a>>. + </p> + <h2>Questions</h2> + + <h3>1. General</h3> + 1.1. <a href="#q1.1">Can I search the internet with ht://Dig?</a><br> + 1.2. <a href="#q1.2">Can I index the internet with ht://Dig?</a><br> + 1.3. <a href="#q1.3">What's the difference between htdig and + ht://Dig?</a><br> + 1.4. <a href="#q1.4">I sent mail to Andrew or Geoff or + Gilles, but I never got a response!</a><br> + 1.5. <a href="#q1.5">I sent a question to the mailing list but I + never got a response!</a><br> + 1.6. <a href="#q1.6">I have a great idea/patch for ht://Dig!</a><br> + 1.7. <a href="#q1.7">Is ht://Dig Y2K compliant?</a><br> + 1.8. <a href="#q1.8">I think I found a bug. What should I do?</a><br> + 1.9. <a href="#q1.9">Does ht://Dig support phrase or near + matching?</a><br> + 1.10. 
<a href="#q1.10">What are the practical and/or theoretical + limits of ht://Dig?</a><br> + 1.11. <a href="#q1.11">Do any ISPs offer ht://Dig as part of + their web hosting services?</a><br> + 1.12. <a href="#q1.12">Can I use ht://Dig on a commercial website?</a><br> + 1.13. <a href="#q1.13">Why do you use a non-free product to + index PDF files?</a><br> + 1.14. <a href="#q1.14">Why do you have all those SourceForge + logos on your website?</a><br> + 1.15. <a href="#q1.15">My question isn't answered here. Where should I + go for help?</a><br> + 1.16. <a href="#q1.16">Why do the developers get annoyed when + I e-mail questions directly to them rather than the mailing list?</a><br> + 1.17. <a href="#q1.17">Why do replies to messages on the + mailing list only go to the sender and not to the list?</a><br> + 1.18. <a href="#q1.18">Can I use ht://Dig to index and search + an SQL database?</a><br> + + <hr noshade size=2> + + <h3>2. Getting ht://Dig</h3> + 2.1. <a href="#q2.1">What's the latest version of ht://Dig?</a><br> + 2.2. <a href="#q2.2">Are there binary distributions of ht://Dig?</a><br> + 2.3. <a href="#q2.3">Are there mirror sites for ht://Dig?</a><br> + 2.4. <a href="#q2.4">Is ht://Dig available by ftp?</a><br> + 2.5. <a href="#q2.5">Are patches around to upgrade between + versions?</a><br> + 2.6. <a href="#q2.6">Is there a Windows 95/98/2000/NT + version of ht://Dig?</a><br> + 2.7. <a href="#q2.7">Where can I find the documentation for my + version of ht://Dig?</a><br> + + <hr noshade size=2> + + <h3>3. Compiling</h3> + 3.1. <a href="#q3.1">When I compile ht://Dig I get an error + about libht.a.</a><br> + 3.2. <a href="#q3.2">I get an error about -lg</a><br> + 3.3. <a href="#q3.3">I'm compiling on Digital Unix and I get + messages about "unresolved" and "db_open."</a><br> + 3.4. <a href="#q3.4">I'm compiling on FreeBSD and I get lots + of messages about '___error' being unresolved.</a><br> + 3.5. 
<a href="#q3.5">I'm compiling on HP/UX and I get a complaint about + "Large Files not supported."</a><br> + 3.6. <a href="#q3.6">I'm compiling on Solaris and when I run the + programs I get complaints about not finding libstdc++.</a><br> + 3.7. <a href="#q3.7">I'm compiling on IRIX and I'm having + database problems when I run the program.</a><br> + 3.8. <a href="#q3.8">I'm compiling with gcc 3.2 and getting + all sorts of warnings/errors about ostream and such.</a><br> + + <hr noshade size=2> + + <h3>4. Configuration</h3> + 4.1. <a href="#q4.1">How come I can't index my site?</a><br> + 4.2. <a href="#q4.2">How can I change the output format of + htsearch?</a><br> + 4.3. <a href="#q4.3">How do I index pages that start with '~'?</a><br> + 4.4. <a href="#q4.4">Can I use multiple databases?</a><br> + 4.5. <a href="#q4.5">OK, I can use multiple databases. Can I + merge them into one?</a><br> + 4.6. <a href="#q4.6">Wow, ht://Dig eats up a lot of disk + space. How can I cut down?</a><br> + 4.7. <a href="#q4.7">Can I use SSI or other CGIs in my + htsearch results?</a><br> + 4.8. <a href="#q4.8">How do I index Word, Excel, PowerPoint + or PostScript documents?</a><br> + 4.9. <a href="#q4.9">How do I index PDF files?</a><br> + 4.10. <a href="#q4.10">How do I index documents in other + languages?</a><br> + 4.11. <a href="#q4.11">How do I get rotating banner ads in + search results?</a><br> + 4.12. <a href="#q4.12">How do I index numbers in documents?</a><br> + 4.13. <a href="#q4.13">How can I call htsearch from a hypertext + link, rather than from a search form?</a><br> + 4.14. <a href="#q4.14">How do I restrict a search to only meta + keywords entries in documents?</a><br> + 4.15. <a href="#q4.15">Can I use meta tags to prevent htdig from + indexing certain files?</a><br> + 4.16. <a href="#q4.16">How do I get htsearch to use the star image + in a different directory than the default /htdig?</a><br> + 4.17. 
<a href="#q4.17">How do I get htdig or htsearch to rewrite + URLs in the search results?</a><br> + 4.18. <a href="#q4.18">What are all the options in + htdig.conf, and are there others?</a><br> + 4.19. <a href="#q4.19">How do I get more than 10 pages of + 10 search results from htsearch?</a><br> + 4.20. <a href="#q4.20">How do I restrict a search to only + certain subdirectories or documents?</a><br> + 4.21. <a href="#q4.21">How can I allow people to search + while the index is updating?</a><br> + 4.22. <a href="#q4.22">How can I get htdig to ignore the + robots.txt file or meta robots tags?</a><br> + 4.23. <a href="#q4.23">How can I get htdig not to index + some directories, but still follow links?</a><br> + 4.24. <a href="#q4.24">How can I get rid of duplicates in + search results?</a><br> + 4.25. <a href="#q4.25">How can I change the scores in + search results, and what are the defaults?</a><br> + 4.26. <a href="#q4.26">How can I get htdig not to index + JavaScript code or CSS?</a><br> + + <hr noshade size=2> + + <h3>5. Troubleshooting</h3> + 5.1. <a href="#q5.1">I can't seem to index more than X documents + in a directory.</a><br> + 5.2. <a href="#q5.2">I can't index PDF files.</a><br> + 5.3. <a href="#q5.3">When I run "rundig," I get a message about + "DATABASE_DIR" not being found.</a><br> + 5.4. <a href="#q5.4">When I run htmerge, it stops with an "out + of diskspace" message.</a><br> + 5.5. <a href="#q5.5">I have problems running rundig from cron + under Linux.</a><br> + 5.6. <a href="#q5.6">When I run htmerge, it stops with an + "Unexpected file type" message.</a><br> + 5.7. <a href="#q5.7">When I run htsearch, I get lots of Internal + Server Errors (#500).</a><br> + 5.8. <a href="#q5.8">I'm having problems with indexing words + with accented characters.</a><br> + 5.9. <a href="#q5.9">When I run htmerge, it stops with a + "Word sort failed" message.</a><br> + 5.10. 
<a href="#q5.10">When htsearch has a lot of matches, it runs + extremely slowly.</a><br> + 5.11. <a href="#q5.11">When I run htsearch, it gives me a count of + matches, but doesn't list the matching documents.</a><br> + 5.12. <a href="#q5.12">I can't seem to index documents with names + like left_index.html with htdig.</a><br> + 5.13. <a href="#q5.13">I get Premature End of Script Headers errors + when running htsearch.</a><br> + 5.14. <a href="#q5.14">I get Segmentation faults when running + htdig, htsearch or htfuzzy.</a><br> + 5.15. <a href="#q5.15">Why does htdig 3.1.3 mangle URL parameters + that contain bare "&" characters?</a><br> + 5.16. <a href="#q5.16">When I run htmerge, it stops with an + "Unable to open word list file '.../db.wordlist'" message.</a><br> + 5.17. <a href="#q5.17">When using Netscape, htsearch always returns the + "No match" page.</a><br> + 5.18. <a href="#q5.18">Why doesn't htdig follow links to other + pages in JavaScript code?</a><br> + 5.19. <a href="#q5.19">When I run htsearch from the web server, + it returns a bunch of binary data.</a><br> + 5.20. <a href="#q5.20">Why are the betas of 3.2 so slow at indexing?</a><br> + 5.21. <a href="#q5.21">Why does htsearch use ";" instead of + "&" to separate URL parameters for the page buttons?</a><br> + 5.22. <a href="#q5.22">Why does htsearch show the + "&" character as "&amp;" in search results?</a><br> + 5.23. <a href="#q5.23">I get Internal Server or Unrecognized + character errors when running htsearch.</a><br> + 5.24. <a href="#q5.24">I took some settings out of + my htdig.conf but they're still set.</a><br> + 5.25. <a href="#q5.25">When I run htdig on my site, + it misses entire directories.</a><br> + 5.26. <a href="#q5.26">What do all the numbers and symbols + in the htdig -v output mean?</a><br> + 5.27. <a href="#q5.27">Why is htdig rejecting some of the + links in my documents?</a><br> + 5.28. 
<a href="#q5.28">When I run htdig or htmerge, I get a + "DB2 problem...: missing or empty key value specified" message.</a><br> + 5.29. <a href="#q5.29">When I run htdig on my site, + it seems to go on and on without ending.</a><br> + 5.30. <a href="#q5.30">Why does htsearch no longer recognize + the -c option when run from the web server?</a><br> + 5.31. <a href="#q5.31">I've set a config attribute exactly + as documented but it seems to have no effect.</a><br> + 5.32. <a href="#q5.32">When I run htsearch, it gives a page + with an "Unable to read configuration file" message.</a><br> + 5.33. <a href="#q5.33">How can I find out which version + of ht://Dig I have installed?</a><br> + 5.34. <a href="#q5.34">When running htdig, I get "Error (0): + PDF file is damaged - attempting to reconstruct xref table..."</a><br> + 5.35. <a href="#q5.35">When running htdig on Mandrake Linux, + I get "host not found" and "no server running" errors.</a><br> + 5.36. <a href="#q5.36">When I run htsearch, it gives me the + list of matching documents, but no header or footer.</a><br> + 5.37. <a href="#q5.37">When I index files with doc2html.pl, + it fails with the "UNABLE to convert" error.</a><br> + 5.38. <a href="#q5.38">Why do my searches find search terms + in pathnames, or how do I prevent matching filenames?</a><br> + 5.39. <a href="#q5.39">I set up an external parser but I still + can't index Word/Excel/PowerPoint/PDF documents.</a><br> + + <hr noshade size=4> + <h2>Answers</h2> + + <h3>1. General</h3> + <strong>1.1. <a name="q1.1">Can I search the internet with + ht://Dig?</a></strong><br> + <p>No, ht://Dig is a system for indexing and searching a + finite (not necessarily small) set of sites or intranet. It + is not meant to replace any of the many internet-wide search + engines.</p> + + <strong>1.2. <a name="q1.2">Can I index the internet with + ht://Dig?</a></strong><br> + <p>No, as above, ht://Dig is not meant as an + internet-wide search engine. 
While there is + <em>theoretically</em> nothing to stop you from indexing as + much as you wish, practical considerations (e.g. time, disk + space, memory, etc.) will limit this.</p> + + <strong>1.3. <a name="q1.3">What's the difference between htdig and + ht://Dig?</a></strong><br> + <p>The complete ht://Dig package consists of several programs, one of + which is called "htdig." This program performs the "digging" or + indexing of the web pages. Of course an index doesn't do you much good + without a program to sort it, search through it, etc.</p> + + <strong>1.4. <a name="q1.4">I sent mail to Andrew or Geoff + or Gilles, but I never got a response!</a></strong><br> + <p>Andrew no longer does much work on ht://Dig. He has started a + company, called <a href="http://www.contigo.com/">Contigo + Software</a> and is quite busy with that. To contact any of the + current developers, send mail to <<a + href="mailto:htdig-dev@lists.sourceforge.net">htdig-dev</a>>. + This list is intended primarily for the discussion of current + and future development of the software.</p> + + <p>Geoff and Gilles are currently the maintainers of + ht://Dig, but they are both volunteers. So while they do + read all the e-mail they receive, they may not respond + immediately. Questions about ht://Dig in general, and especially + questions or requests for help in configuring the software, + should be posted to the <<a + href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>> + mailing list. When posting a followup to a message on the + list, you should use the "reply to all" or "group reply" + feature of your mail program, to make sure the mailing list + address is included in the reply, rather than replying only + to the author of the message. + See also question <a href="#q1.16">1.16</a> and the + <a href="http://www.htdig.org/mailarchive.html">mailing list</a> + page.</p> + + <strong>1.5. 
<a name="q1.5">I sent a question to the mailing list but I + never got a response!</a></strong><br> + <p>Development of ht://Dig is done by volunteers. Since we all + have other jobs, it may take a while before someone gets back + to you. Please be patient and don't hound the volunteers with + direct or repeated requests. If you don't get a response after + 3 or 4 days, then a reminder may help. + See also question <a href="#q1.16">1.16</a>.</p> + + <strong>1.6. <a name="q1.6">I have a great idea/patch for + ht://Dig!</a></strong><br> + <p>Great! Development of ht://Dig continues through suggestions + and improvements from users. If you have an idea (or even better, + a patch), please send it to the ht://Dig mailing list so others + can use it. For suggestions on how to submit patches, please check + the <a href="dev/patches.html">Guidelines for + Patch Submissions</a>. If you'd like to make a feature request, + you can do so through the <a href="bugs.html">ht://Dig bug + database</a>.</p> + + <strong>1.7. <a name="q1.7">Is ht://Dig Y2K compliant?</a></strong><br> + <p> + ht://Dig should be Y2K compliant since it never <em>stores</em> dates as + two-digit years. Under ht://Dig's copyright (GPL), there is no warranty + whatsoever as permitted by law. If you would like an iron-clad, + legally-binding guarantee, feel free to check the source code + itself. Versions prior to 3.1.2 did have a problem with the parsing + of the Last-Modified header returned by the HTTP server, which will + cause incorrect dates to be stored for documents modified after + February 28, 2000 (yes, it didn't recognize 2000 as a leap year). + Versions prior to 3.1.5 didn't correctly handle servers that return + two-digit years in the Last-Modified header, for years after 99. + These problems are fixed in the current release. + If you discover something else, please let us know! + </p> + + <strong>1.8. <a name="q1.8">I think I found a bug. 
What should I + do?</a></strong><br> + <p>Well, there are probably bugs out there. You have two options + for bug-reporting. You can either mail the ht://Dig mailing list + at <<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general@lists.sourceforge.net</a>> or + better yet, report it to the <a href="bugs.html">bug + database</a>, which ensures it won't + become lost amongst all of the other mail on the list. + Please try to include as much information as possible, including + the version of ht://Dig (see question <a href="#q5.33">5.33</a>), + the OS, and anything else that might be helpful. + Often, running the programs with one "-v" or more + (e.g. "-vvv") gives useful debugging information. + If you are unsure whether the problem is a bug or a configuration + problem, you should discuss the problem on + <<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>> + (after carefully reading the FAQ and searching the + <a href="http://www.htdig.org/mailarchive.html">mail archive</a> + and <a href="#q2.5">patch archive</a>, + of course) + to sort out what it is. The mailing list has a wider audience, so + you're more likely to get help with configuration problems there + than by reporting them to the bug database. + </p> + + <p>Whether reporting problems to the bug database or mailing + list, we cannot stress enough the importance of + <strong>always</strong> indicating <strong>which version of + ht://Dig you are running</strong>. + See question <a href="#q5.33">5.33</a>. There + are still a lot of users, ISPs and software distributors using + older versions, and there have been a lot of bug fixes and + new features added in recent versions. Knowing which version + you're running is absolutely essential in helping to find a + solution. If you're unsure if your version is current, or what + fixes and features have been added in more recent versions, + please see the <a href="RELEASE.html"> + release notes</a>. 
See also question <a href="#q2.1">2.1</a>.</p> + + <strong>1.9. <a name="q1.9">Does ht://Dig support phrase or near + matching?</a></strong><br> + <p>Phrase searching has been added for the 3.2 release, + which is currently in the beta phase + (<a href="http://www.htdig.org/files/htdig-3.2.0b6.tar.gz">3.2.0b6</a> + as of this writing). Near or proximity matching will probably be added + in a future beta. + </p> + + <strong>1.10. <a name="q1.10">What are the practical and/or theoretical + limits of ht://Dig?</a></strong><br> + <p>The code itself doesn't put any real limit on the number of + pages. There are several sites in the hundreds of thousands + of pages. As for practical limits, it depends a lot on how + many pages you plan on indexing. Some operating systems limit + files to 2 GB in size, which can become a problem with a large + database. There are also slightly different limits to each of + the programs. Right now htmerge performs a sort on the words + indexed. Most sort programs use a fair amount of RAM and + temporary disk space as they assemble the sorted list. The + htdig program stores a fair amount of information about the + URLs it visits, in part to only index a page once. This takes + a fair amount of RAM. With cheap RAM, it never hurts to throw + more memory at indexing larger sites. In a pinch, swap will + work, but it obviously really slows things down.</p> + + <p>The 3.2 development code helps with many of these + limitations. In particular, it generates the databases on the + fly, which means you don't have to sort them before + searching. Additionally, the new databases are compressed + significantly, making them usually around 50% of the size of + those in previous versions.</p> + + <strong>1.11. <a name="q1.11">Do any ISPs offer ht://Dig as part of + their web hosting services?</a></strong><br> + <p>Yes. A list of such ISPs is <a href="isp.html">available + here</a>. + </p> + + <strong>1.12. 
<a name="q1.12">Can I use ht://Dig on a + commercial website?</a></strong><br> + <p>Sure! The <a href="COPYING">GNU Library General Public License (LGPL)</a> has no + restrictions on use. So you are free to use ht://Dig however you + want on your website, personal files, etc. The license only + restricts distribution. So if you're planning on a + commercial software product that includes ht://Dig, you will + have to provide source code including any modifications upon + request. + </p> + + <strong>1.13. <a name="q1.13">Why do you use a non-free + product to index PDF files?</a></strong><br> + <p> + We don't. You <em>can</em> use the "acroread" + program to index PDF files, but this is no longer + recommended. Initially this program was the only reliable + way to extract data from PDF files. However, the <a + href="http://www.foolabs.com/xpdf/">xpdf package</a> is a + reliable, free software package for indexing and viewing PDF + files. See question <a href="#q4.9">4.9</a> for details on + using xpdf to index PDF files. We do not advocate using + acroread any longer because it is a proprietary product. + Additionally it is no longer reliable at extracting data. + </p> + + <strong>1.14. <a name="q1.14">Why do you have all those SourceForge + logos on your website?</a></strong><br> + <p><a href="http://sourceforge.net/">SourceForge</a> is a + new service for open source software. You can host your + project on SourceForge servers and use many of their + services like bug-tracking and the like. The ht://Dig + project currently uses SourceForge for a mirror of the main + website at <a + href="http://htdig.sourceforge.net/">htdig.sourceforge.net</a> + as well as a mirror of ht://Dig releases and contributed + work. + </p> + + <strong>1.15. <a name="q1.15">My question isn't answered here. + Where should I go for help?</a></strong><br> + <p> + Before you go anywhere else, think of other ways of phrasing your + question. 
Many times people have questions that are very similar to + other FAQ and while we try to phrase the queries in the FAQ closely to + the most common questions, we obviously can't get them all! The next + place to check is the documentation itself. In particular, take a + look at the list of configuration attributes, particularly the list <a + href="cf_byname.html">by name</a> and <a + href="cf_byprog.html">by program</a>. There are a + lot of them, but chances are there's something that might fit your needs. + You should also take a close look at all of + <a href="htsearch.html">htsearch</a>'s + documentation, especially the section "HTML form" which describes + all the CGI input parameters available for controlling the search, + including limiting the search to certain subdirectories. + You can find the answer yourself to almost all "how can I..." + questions by exploring what the various configuration attributes + and search form input parameters can do. + Also have a look at our collection of + <a href="http://www.htdig.org/contrib/guides.html">Contributed Guides</a> + for help on things like + <a href="http://www.htdig.org/files/contrib/guides/htmlhelp.html">HTML + forms</a> and CGI, tutorials on installing, configuring, using, and + internationalizing ht://Dig, as well as using PHP with htsearch. + </p> + <p> + Finally, if you've exhausted all the online documentation, there's the + <a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a> mailing list. + There are hundreds of users subscribed and chances are good that someone + has had a similar problem before or can suggest a solution. + </p> + + <strong>1.16. 
<a name="q1.16">Why do the developers get annoyed when + I e-mail questions directly to them rather than the mailing list?</a></strong><br> + <p>The <a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a> + mailing list exists for dealing with questions about the + software, its installation, configuration, and problems with + it. E-mailing the developers directly circumvents this forum + and its benefits. Most annoyingly, it puts the onus on an + individual to answer, even if that individual is not the best or + most qualified person to answer. This is not a one-man show. It + also circumvents the <a href="http://www.htdig.org/mailarchive.html">archiving + mechanism</a> of the mailing list, + so not only do subscribers not see these private messages + and replies, but future users who may run into the exact same + problems won't see them. Remember that the developers are all + volunteers, and they don't work for free for your benefit alone. + They volunteer for the benefit of the whole ht://Dig user + community, so don't expect extra support from them outside of + that community. See also questions <a href="#q1.4">1.4</a> + and <a href="#q1.5">1.5</a>.</p> + + <p>Note also that when you reply to a message on the list, you + should make sure the reply gets on the list as well, provided your + reply is still on-topic. See question <a href="#q1.17">1.17</a> + below.</p> + + <strong>1.17. <a name="q1.17">Why do replies to messages on the + mailing list only go to the sender and not to the list?</a></strong><br> + <p>The simple answer is that, unlike some mailing lists, the + lists on SourceForge don't force replies back on the list. This + is actually a good thing, because you can reply to the sender + directly if you want to, or you can use your mail program's + "reply to all" capability (sometimes called "group reply") + to reply to the mailing list as well. 
It does mean you have to + think before you post a reply, but some would argue that this + is a good thing too. There are some compelling reasons to try to + keep on-topic discussions on the list, though (see questions + <a href="#q1.16">1.16</a> and <a href="#q1.4">1.4</a> above).</p> + + <p>The technical answer is + <a href="http://sourceforge.net/docman/display_doc.php?docid=6693&group_id=1"> + SourceForge's policy on Reply-To: munging</a>, where you'll + find all the gory details about the pros and cons of the two + common ways of setting up a mailing list, and why SourceForge + turns off Reply-To munging. It so happens that the ht://Dig + maintainers agree with SourceForge's policy on this, even if + we did have a say in the matter. So, counterarguments to this + policy are rather moot, and it would be better not to waste + any more mailing list bandwidth debating them. (We've heard + all the arguments anyway.)</p> + + <strong>1.18. <a name="q1.18">Can I use ht://Dig to index and search + an SQL database?</a></strong><br> + <p>You can if your database has a web-based front end that can + be "spidered" by ht://Dig. The requirement is that every search + result must resolve to a unique URL which can be accessed via + HTTP. The htdig program uses these URLs, which you feed it via + the <a href="attrs.html#start_url">start_url</a> attribute, to + fetch and index each page of information. The search results + will then give a list of URLs for all pages that match the + search terms. 
If you don't have such a front end to your + database, or the search results must be given as something + other than URLs, then ht://Dig is probably not the best way of + dealing with this problem: you may be better off using an SQL + query engine that works directly on your own database, rather + than building a separate ht://Dig database for searching.</p> + + <p>Ted Stresen-Reuter had the following tips: "In my case, + because I like htdig's ability to rank results (and that + ranking can be modified), I created an index page that simply + walks through each record and indexes each record (with + <em>next</em> and <em>previous</em> links so the spider can + read all the records). And then I do one other thing: I make + the <code><title></code> tag start with the unique ID + of each record. Then, when I'm parsing the search results, I + do a lookup on the database using the title tag as the key."</p> + + <hr noshade size=2> + + <h3>2. Getting ht://Dig</h3> + <strong>2.1. <a name="q2.1">What's the latest version of ht://Dig?</a></strong><br> + <p>The latest version is 3.1.6 as of this writing. A beta + version of the 3.2 code, + <a href="http://www.htdig.org/files/htdig-3.2.0b6.tar.gz">3.2.0b6</a>, + is also available, for those who wish to test it. + You can find out about the latest version by reading the + <a href="RELEASE.html">release + notes</a>.</p> + + <p><strong>Note</strong> that if you're running any version + older than 3.1.5 (including 3.2.0b1) on a public web site, + you should upgrade immediately, as older versions have a + rather serious security hole which is explained in detail in + this <a + href="http://www.htdig.org/htdig-dev/2000/02/0272.html">advisory</a> + which was sent to the Bugtraq mailing list. + Another slightly less serious, but still troubling security hole + exists in 3.1.5 and older (including 3.2.0b3 and older), so you + should upgrade if you're running one of these. 
You can view details + on this vulnerability from the + <a href="http://www.securityfocus.com/bid/3410">Bugtraq mailing list</a>. + If you're unsure of which version you're running, see question + <a href="#q5.33">5.33</a>.</p> + + <strong>2.2. <a name="q2.2">Are there binary distributions of + ht://Dig?</a></strong><br> + <p>We're trying to get consistent binary distributions for + popular platforms. Contributed binary releases will go in <a + href="http://www.htdig.org/files/contrib/binaries/"> + the contributed binaries section</a> + and contributions should be mentioned to the <a + href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a> + mailing list.</p> + + <p>Anyone who would like to make consistent binary + distributions of ht://Dig should at least sign up to the <a + href="mailing.html">htdig-announce mailing list</a>.</p> + + <strong>2.3. <a name="q2.3">Are there mirror sites for ht://Dig?</a></strong><br> + <p>Yes, see our <a href="mirrors.html">mirrors + listing</a>. If you'd like to mirror the site, please see + the <a href="howto-mirror.html">mirroring guide</a>.</p> + + <strong>2.4. <a name="q2.4">Is ht://Dig available by ftp?</a></strong><br> + <p>Yes. You can find the current versions and several older + versions at various <<a + href="mirrors.html">mirror sites</a>> + as well as the other locations mentioned in the <a + href="where.html">download page</a>.</p> + + <strong>2.5. <a name="q2.5">Are patches around to upgrade between + versions?</a></strong><br> + <p>Most versions are also distributed as a patch to the previous + version's source code. The most recent exception to this was + version 3.1.0b1. Since this version switched from the GDBM + database to DB2, the new database package needed to be shipped + with the distribution. This made the potential patch almost as large + as the regular distribution. Update patches resumed with version + 3.1.0b2. 
You can also find archives of patches submitted to + the htdig mailing lists, to fix specific bugs or add features, + at Joe Jah's <a href="ftp://ftp.ccsf.org/htdig-patches/"> + htdig-patches ftp site</a>.</p> + + <strong>2.6. <a name="q2.6">Is there a Windows 95/98/2000/NT + version of ht://Dig?</a></strong><br> + <p>The ht://Dig package can be built on the Win32 platform when + using the Cygwin package. For details, see the contributed guide, + <a href="http://www.htdig.org/files/contrib/guides/Installing_on_Win32.html"> + <em>Idiot's Guide to Installing ht://Dig on Win32</em></a>. + </p> + <p> + As of the <a href="http://www.htdig.org/files/htdig-3.2.0b5.tar.gz">3.2.0b5</a> + beta release, there is also native Win32 support, thanks to + Neal Richter. (Installation docs will be written soon...) + </p> + + <strong>2.7. <a name="q2.7">Where can I find the documentation for my + version of ht://Dig?</a></strong><br> + <p>The documentation for the most recent stable release is always + posted at <a href="http://www.htdig.org/">www.htdig.org</a>. + The documentation for the latest beta release can be found at + <a href="http://www.htdig.org/dev/htdig-3.2/">http://www.htdig.org/dev/htdig-3.2/</a>. + In all releases, the documentation is included in the + <strong>htdoc</strong> subdirectory of the source distribution, so + you always have access to the documentation for your current version. + </p> + + <hr noshade size=2> + + <h3>3. Compiling</h3> + <strong>3.1. <a name="q3.1">When I compile ht://Dig I get an error about + libht.a</a></strong><br> + <p>This usually indicates that either libstdc++ is not installed or + is installed incorrectly. To get libstdc++ or any other GNU tool, + check + <a + href="ftp://ftp.gnu.org/gnu/">ftp://ftp.gnu.org/gnu/</a>. + Note that the most recent versions of gcc come with + libstdc++ included and are available from <a + href="http://gcc.gnu.org/">http://gcc.gnu.org/</a>.</p> + + <strong>3.2. 
<a name="q3.2">I get an error about -lg</a></strong><br> + <p>This is due to a bug in the Makefile.config.in of version + 3.1.0b1. Remove all flags "-ggdb" in Makefile.config.in. Then + type "./config.status" to rebuild the Makefiles and + recompile. This bug is fixed in version 3.1.0b2.</p> + + <strong>3.3. <a name="q3.3">I'm compiling on Digital Unix and I get + messages about "unresolved" and "db_open."</a></strong><br> + <p>Answer contributed by George Adams + <learningapache@my-dejanews.com></p> + + <p>What you're seeing are problems related to the Berkeley DB + library. htdig needs a fairly modern version of db, which is + why it ships with one that works. (see that -L../db-2.4.14/dist + line? That's where htdig's db library is).<br> + + The solution is to modify the c++ command so it explicitly + references the correct libdb.a. You can do this by replacing + the "-ldb" directive in the c++ command with + "../db-2.4.14/dist/libdb.a". This problem has been resolved as of + version 3.1.0.</p> + + <strong>3.4. <a name="q3.4">I'm compiling on FreeBSD and I get lots + of messages about '___error' being unresolved.</a></strong><br> + <p>Answer contributed by Laura Wingerd <laura@perforce.com><br> + I got a clean build of htdig-3.1.2 on FreeBSD 2.2.8 by taking + -D_THREAD_SAFE out of CPPFLAGS, and setting LIBS to null, in + db/dist/configure.</p> + + <strong>3.5. <a name="q3.5">I'm compiling on HP/UX and I get a complaint about + "Large Files not supported."</a></strong><br> + <p>The db/ package included with ht://Dig seems to be unable to complete + on HP/UX 10.20 in particular. After running the top-level configure + script, cd into db/dist and type:</p> + <code>./configure --disable-bigfile</code> + <p>Then continue with the normal compilation.</p> + + <strong>3.6. 
<a name="q3.6">I'm compiling on Solaris and when I run the + programs I get complaints about not finding libstdc++.</a></strong><br> + <p>Answer contributed by Adam Rice <adam@newsquest.co.uk></p> + <p>The problem is that the Solaris loader can't find the library. The + best thing to do is set the LD_RUN_PATH environment variable <em>during compile</em> + to the directory where libstdc++.so.2.8.1.1 lives. This tells the linker + to search that directory at runtime. + </p> + + <p>Note that LD_RUN_PATH is not to be confused with LD_LIBRARY_PATH. + The latter is parsed at run-time, while LD_RUN_PATH essentially + compiles in a library path into the executable, so that it doesn't + need a LD_LIBRARY_PATH setting to find its libraries. This allows + you to avoid all the complexities of setting an environment + variable for a CGI program run from the server. If all else fails, + you can always run your programs from wrapper shell scripts that + set the LD_LIBRARY_PATH environment variable appropriately.</p> + + <p>Note also that while this answer is specific to Solaris, it may + work for other OSes too, so you may want to give it a try. However, + not all versions of the <code>ld</code> program on all OSes support + the LD_RUN_PATH environment variable, even if these systems support + shared libraries. Try "<code>man ld</code>" on your system to + find out the best way of setting the runtime search path for shared + libraries. If <code>ld</code> doesn't support LD_RUN_PATH, but does + support the <code>-R</code> option, you can add one or more of these + options to LIBDIRS in Makefile.config before running make on a 3.1.x + release. (For a 3.2 beta release, you can add these options to the + LDFLAGS environment variable before you run ./configure.)</p> + + <strong>3.7. 
<a name="q3.7">I'm compiling on IRIX and I'm having + database problems when I run the program.</a></strong><br> + <p> + It is not entirely clear why these problems occur, though + they seem to only happen when older compilers are + used. Several people have reported that the problems go away + when using the latest version of <a href="http://gcc.gnu.org/">gcc</a>. + </p> + + <strong>3.8. <a name="q3.8">I'm compiling with gcc 3.2 and getting + all sorts of warnings/errors about ostream and such.</a></strong><br> + <p> + With versions before 3.2.0b5, + you should use the following command to configure the ht://Dig + package so it can be built with gcc 3.2: +<pre> +CXXFLAGS=-Wno-deprecated CPPFLAGS=-Wno-deprecated ./configure +</pre> + </p> + + <hr noshade size=2> + + <h3>4. Configuration</h3> + <strong>4.1. <a name="q4.1">How come I can't index my site?</a></strong><br> + <p>There are a variety of reasons ht://Dig won't index a + site. To get to the bottom of things, it's advisable to turn on + some debugging output from the htdig program. When running from + the command-line, try "-vvv" in addition to any other + flags. This will add debugging output, including the responses + from the server.</p> + <p>See also questions <a href="#q5.25">5.25</a>, + <a href="#q5.27">5.27</a>, <a href="#q5.16">5.16</a> and + <a href="#q5.18">5.18</a>.</p> + + <strong>4.2. <a name="q4.2">How can I change the output format of htsearch?</a></strong><br> +<p>Answer contributed by: Malki Cymbalista <Malki.Cymbalista@weizmann.ac.il></p> + +<p>You can change the output format of htsearch by creating different +header, footer and result files that specify how you want the output +to look. You then create a configuration file that specifies which +files to use. 
In the html document that links to the search, you +specify which configuration file to use.</p> + +<p>So the configuration file would have the lines:</p> +<pre> +search_results_header: ${common_dir}/ccheader.html +search_results_footer: ${common_dir}/ccfooter.html +template_map: Long long builtin-long \ + Short short builtin-short \ + Default default ${common_dir}/ccresult.html +template_name: Default +</pre> +<p>You would also put into the configuration file any other lines from the +default configuration file that apply to htsearch.</p> + +<p>The files ${common_dir}/ccheader.html and +${common_dir}/ccfooter.html and ${common_dir}/ccresult.html would be +tailored to give the output in the desired format.</p> + +<p>Assuming your configuration file is called cc.conf, the html file that +links to the search has to set the config parameter equal to cc. The +following line would do it:<br> +<code><input type="hidden" name="config" value="cc"></code></p> + + <p><strong>Note:</strong> Don't just add the line above to your + <a href="hts_form.html">search form</a> + without checking if there isn't already a similar + line giving the config attribute a different value. The sample + search.html form that comes with the package includes a line + like this already, giving "config" the default value of "htdig". + If it's there, modify it instead of adding another definition. + The config input parameter doesn't need to be hidden either, and + you may want to define it as a pull-down list to select different + databases (see question <a href="#q4.4">4.4</a>).</p> + + <strong>4.3. <a name="q4.3">How do I index pages that start with '~'?</a></strong><br> + <p> + ht://Dig should index pages starting with '~' as if it was another + web browser. If you are having problems with this, check your server + log files to see what file the server is attempting to return. + </p> + + <strong>4.4. 
<a name="q4.4">Can I use multiple databases?</a></strong><br> + <p>Yes, though you may find it easier to have one larger + database and use restrict or exclude fields on searches. To use + multiple databases, you will need a config file for each + database. Then each file will set the + <a href="attrs.html#database_dir">database_dir</a> or + <a href="attrs.html#database_base">database_base</a> attribute to + change the name of the databases. The config file is selected + by the <strong>config</strong> input field in the search form. + <br>See also questions <a href="#q4.2">4.2</a> and + <a href="#q4.20">4.20</a>.</p> + + <strong>4.5. <a name="q4.5">OK, I can use multiple databases. Can I + merge them into one?</a></strong><br> + <p>As of version 3.1.0, you can do this with the -m option to + <a href="htmerge.html">htmerge</a>.</p> + + <strong>4.6. <a name="q4.6">Wow, ht://Dig eats up a lot of disk + space. How can I cut down?</a></strong><br> + <p>There are several ways to cut down on disk space. One is + not to use the "-a" option, which creates work copies of the + databases. Naturally this essentially doubles the disk + usage. If you don't need to index and search at the same time, you can + ignore this flag.</p> + + <p>If you are running 3.2.0b5 or higher and don't have + <a href="dev/htdig-3.2/attrs.html#wordlist_compress_zlib">compression</a> + turned on, then turning that on will also save considerable space.</p> + + <p>Changing configuration variables can also help cut + down on disk usage. 
Decreasing + <a href="attrs.html#max_head_length">max_head_length</a> and + <a href="attrs.html#max_meta_description_length">max_meta_description_length</a> + will cut down on the size of the excerpts stored (in fact, if you + don't have + <a href="attrs.html#use_meta_description">use_meta_description</a> + set, you can set + max_meta_description_length to 0!).</p> + + <p>If you are running 3.2.0b6 or higher, you can turn off + <a href="dev/htdig-3.2/attrs.html#store_phrases">store_phrases</a>. This cuts the + database size by about 60%, at the expense of severely limiting + the effectiveness of phrase searches. It also reduces digging time + slightly.</p> + + <p>Other techniques include removing the db.wordlist file and adding + more words to the <a href="attrs.html#bad_words">bad_words</a> + file.</p> + + <p>The University of Leipzig has published + <a href="http://wortschatz.uni-leipzig.de/html/wliste.html"> + word lists</a> containing the 100, 1000 and 10000 most often used + words in English, German, French and Dutch. No copyrights or + restrictions seem to be applied to the downloadable files. These + can be very handy when putting together a bad_words file. Thanks + to Peter Asemann for this tip.</p> + + <strong>4.7. <a name="q4.7">Can I use SSI or other CGIs in my + htsearch results?</a></strong><br> + <p>Not really. Apache will not parse CGI output for SSI + statements (See the <a + href="http://www.apache.org/docs/misc/FAQ.html#ssi-part-iii">Apache + FAQ</a>). Thus, the htsearch CGI does not understand SSI + markup and cannot include other + CGIs. However, it is possible to do it the other way round: + you can have the htsearch results included in your dynamic + page. + </p> + <p> + The Apache project has mentioned that this will be a + feature added to the Apache 2.0 version, currently in development. 
+ </p> + + <p>The easiest approach in the meantime is using SSI with + the help of the <a + href="attrs.html#script_name">script_name</a> configuration + file attribute. See the <code>contrib/scriptname</code> + directory for a small example using SSI.</p> + + <p>For CGI and PHP, you need a "wrapper" script to + do that. For perl script examples, see the files in + <code>contrib/ewswrap</code>. The PHP guide (see <a + href="http://www.htdig.org/contrib/guides.html">contributed + guides</a>) not only describes a wrapper script for PHP, but + also offers a step by step tutorial to the basics of + ht://dig and is well worth reading. + For other alternatives, see question <a href="#q4.11">4.11</a>. + </p> + + <strong>4.8. <a name="q4.8">How do I index Word, Excel, PowerPoint + or PostScript documents?</a></strong><br> + <p>This must be done with an + <a href="attrs.html#external_parsers">external parser or converter</a>. + A sample of such an external converter is the + contrib/doc2html/doc2html.pl Perl script. + It will parse Word, PostScript, PDF and other documents, when used + with the appropriate document to text converters. It uses catdoc to + parse Word documents, and ps2ascii to parse PostScript files. The + comments in the Perl script and accompanying documentation + indicate where you can obtain these converters.</p> + + <p>Versions of htdig before 3.1.4 don't support external converters, + so you have to use an external parser script such as + contrib/parse_doc.pl (or better yet, upgrade htdig if you can). + External converter scripts are simpler to write and maintain than a + full external parser, as they just convert input documents to + text/plain or text/html, and pass that back to htdig to be parsed. + Parsing is more consistent across document types with external + converters, because the final work is done by htdig's internal + parsers. 
External parser scripts tend to be hacks that don't + recognize a lot of the parsing attributes in your htdig.conf, so + they have to be hacked some more when you change your attributes.</p> + + <p>The most recent versions of parse_doc.pl, conv_doc.pl and + the doc2html package are available on our <a + href="http://www.htdig.org/files/contrib/parsers/">web site</a>.<br> + See below for an example of doc2html.pl, or see the comments in + conv_doc.pl and parse_doc.pl, or the documentation for doc2html + for examples of their usage. + For help with troubleshooting, see questions + <a href="#q5.37">5.37</a> and <a href="#q5.39">5.39</a>.</p> + + <strong>4.9. <a name="q4.9">How do I index PDF files?</a></strong><br> + <p>This too can be done with an + <a href="attrs.html#external_parsers">external parser or converter</a>, + in combination with the pdftotext program that is part of the + <a href="http://www.foolabs.com/xpdf/">xpdf</a> 0.90 package. A + sample of such a converter is the doc2html.pl Perl + script. It uses pdftotext to parse PDF documents, then processes + the text into external parser records. + The most recent version of doc2html.pl is available on our <a + href="http://www.htdig.org/files/contrib/parsers/">web + site</a>.</p> + + <p>For example, you could put this in your configuration file:</p> +<pre> +<a href="attrs.html#external_parsers">external_parsers</a>: application/msword->text/html /usr/local/bin/doc2html.pl \ + application/postscript->text/html /usr/local/bin/doc2html.pl \ + application/pdf->text/html /usr/local/bin/doc2html.pl +</pre> + <p>You would also need to configure the script to indicate where all + of the document to text converters are installed. See the DETAILS + file that comes with doc2html for more information.</p> + + <p>Versions of htdig before 3.1.4 don't support external converters, + so you have to use an external parser script such as + contrib/parse_doc.pl (or better yet, upgrade htdig if you can). 
+ See question <a href="#q4.8">4.8</a> above.</p> + + <p>Whether you use this external parser or converter, or acroread + with the <a href="attrs.html#pdf_parser">pdf_parser</a> attribute, + to successfully index PDF files be sure to set the <a + href="attrs.html#max_doc_size">max_doc_size</a> attribute to + a value larger than the size of your largest PDF file. PDF + documents can not be parsed if they are truncated.</p> + + <p>This also raises the questions of why two different + methods of indexing PDFs are supported, and which method + is preferred. The built-in PDF support, which uses acroread + to convert the PDF to PostScript, was the first method which + was provided. It had a few problems with it: acroread is not + open source, it is not supported on all systems on which + ht://Dig can run, and for some PDFs, the PostScript that + acroread generated was very difficult to parse into indexable + text. Also, the built-in PDF support expected PDF documents to + use the same character encoding as is defined in your current + <a href="attrs.html#locale">locale</a>, which isn't always the + case. The external converters, which use pdftotext, were developed + to overcome these problems. xpdf 0.90 is free software, and its + pdftotext utility works very well as an indexing tool. + It also converts various PDF encodings to the Latin 1 set. + It is the opinion of the developers that this is the + preferred method. However, some users still prefer to stick + with acroread, as it works well for them, and is a little + easier to set up if you've already installed Acrobat.</p> + + <p>Also, pdftotext still has some difficulty handling text in + landscape orientation, even with its new -raw option in 0.90, + so if you need to index such text in PDFs, you may still get + better results with acroread. 
The pdf_parser attribute has been + removed from the 3.2 beta releases of htdig, so to use acroread + with htdig 3.2.0b5 or other 3.2 betas, use the acroconv.pl + external converter script from our <a + href="http://www.htdig.org/files/contrib/parsers/">web site</a>.</p> + + <p>See also question <a href="#q5.2">5.2</a> below and + question <a href="#q1.13">1.13</a> above. + See questions <a href="#q5.37">5.37</a> and <a href="#q5.39">5.39</a> + for troubleshooting tips.</p> + + <strong>4.10. <a name="q4.10">How do I index documents in other + languages?</a></strong><br> + <p>The first and most important thing you must do, + to allow ht://Dig to properly support international + characters, is to define the correct locale for the + language and country you wish to support. This is done + by setting the <a href="attrs.html#locale">locale</a> + attribute (see question <a href="#q5.8">5.8</a>). The + next step is to configure ht://Dig to use dictionary and + affix files for the language of your choice. These can + be the same dictionary and affix files as are used by the + ispell software. 
A collection of these is available from + Geoff Kuenning's + <a href="http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html"> + International Ispell Dictionaries page</a>, and we're slowly + building a collection of word lists on our <a + href="http://www.htdig.org/files/contrib/wordlists/">web site</a>.</p> + <p>For example, if you install German dictionaries in common/german, + you could use these lines in your configuration file:</p> +<pre> +<a href="attrs.html#locale">locale</a>: de_DE +lang_dir: ${<a href="attrs.html#common_dir">common_dir</a>}/german +<a href="attrs.html#bad_word_list">bad_word_list</a>: ${lang_dir}/bad_words +<a href="attrs.html#endings_affix_file">endings_affix_file</a>: ${lang_dir}/german.aff +<a href="attrs.html#endings_dictionary">endings_dictionary</a>: ${lang_dir}/german.0 +<a href="attrs.html#endings_root2word_db">endings_root2word_db</a>: ${lang_dir}/root2word.db +<a href="attrs.html#endings_word2root_db">endings_word2root_db</a>: ${lang_dir}/word2root.db +</pre> + <p> + You can build the endings database with <code>htfuzzy endings</code>. + (This command may actually take days to complete, for + releases older than 3.1.2. Current releases use faster regular + expression matching, which will speed this up by a few orders + of magnitude.) Note that the "*.0" files are not part of + the ispell dictionary distributions, but are easily made by + concatenating the partial dictionaries and sorting to remove + duplicates (e.g.: "<code>cat * | sort | uniq > lang.0</code>" + in most cases). You will also need to redefine the synonyms + file if you wish to use the synonyms search algorithm. 
This + file is not included with most of the dictionaries, nor is the + <a href="attrs.html#bad_words">bad_words</a> file.</p> + + <p>If you put all the language-specific + dictionaries and configuration files in separate directories, + and set all the attribute definitions accordingly in each + search config file to access the appropriate files, you can + have a multilingual setup where the user selects the language + by selecting the "config" input parameter value. In addition + to the attributes given in the example above, you may also + want custom settings for these language-specific attributes: + <a href="attrs.html#date_format">date_format</a>, + <a href="attrs.html#iso_8601">iso_8601</a>, + <a href="attrs.html#method_names">method_names</a>, + <a href="attrs.html#no_excerpt_text">no_excerpt_text</a>, + <a href="attrs.html#no_next_page_text">no_next_page_text</a>, + <a href="attrs.html#no_prev_page_text">no_prev_page_text</a>, + <a href="attrs.html#nothing_found_file">nothing_found_file</a>, + <a href="attrs.html#page_list_header">page_list_header</a>, + <a href="attrs.html#prev_page_text">prev_page_text</a>, + <a href="attrs.html#search_results_wrapper">search_results_wrapper</a> + (or <a href="attrs.html#search_results_header">search_results_header</a> + and <a href="attrs.html#search_results_footer">search_results_footer</a>), + <a href="attrs.html#sort_names">sort_names</a>, + <a href="attrs.html#synonym_db">synonym_db</a>, + <a href="attrs.html#synonym_dictionary">synonym_dictionary</a>, + <a href="attrs.html#syntax_error_file">syntax_error_file</a>, + <a href="attrs.html#template_map">template_map</a>, and of course + <a href="attrs.html#database_dir">database_dir</a> or + <a href="attrs.html#database_base">database_base</a> if you + maintain multiple databases for sites of different languages. 
You could also change the definition of + <a href="attrs.html#common_dir">common_dir</a>, rather than + making up a lang_dir attribute as above, as many language-specific + files are defined relative to the common_dir setting.</p> + + <p>If you're running version 3.1.6 of ht://Dig, you may also + be interested in the <strong>accents</strong> fuzzy match + algorithm in the + <a href="attrs.html#search_algorithm">search_algorithm</a> + attribute, which lets you treat accented and unaccented letters + as equivalent in words. Note that if you use the accents algorithm, + you need to rebuild the accents database each time you update your + word database, using <code>"htfuzzy accents"</code>. This command + isn't in the default rundig script, so you may want to add it there. + The accents fuzzy match algorithm is also in the 3.2 beta releases. + There are also the + <a href="attrs.html#boolean_keywords">boolean_keywords</a> and + <a href="attrs.html#boolean_syntax_errors">boolean_syntax_errors</a> + attributes in 3.1.6 for changing other language-specific messages + in htsearch.</p> + + <p>Current versions of ht://Dig only support 8-bit + characters, so languages such as Chinese and Japanese, which + require 16-bit characters, are not currently supported.</p> + + <p>Didier Lebrun has written a guide for configuring htdig to + support French, entitled + <a href="http://www.quartier-rural.org/dl/elucu/htdig-vf/lisezmoi.html"> + Comment installer et configurer HtDig pour la langue française</a>. + His "kit de francisation" is also available on + <a + href="http://www.htdig.org/files/contrib/wordlists/">our + web site</a>.</p> + + <p>See also question <a href="#q4.2">4.2</a> for tips on customizing + htsearch, and question <a href="#q4.6">4.6</a> for tips on where to find + bad_words files.</p> + + <strong>4.11. 
<a name="q4.11">How do I get rotating banner ads in + search results?</a></strong><br> + <p>While htsearch doesn't currently provide a means of doing + SSI on its output, or calling other CGI scripts, it does have + the capability of using environment variables in templates.</p> + + <p>The easiest way to get rotating banners in htsearch is + to replace htsearch with a wrapper script that sets an + environment variable to the banner content, or whatever + dynamically generated content you want. Your script can then + call the real htsearch to do the work. The wrapper script can be + written as a shell script, or in Perl, C, C++, or whatever you + like. You'd then need to reference that environment variable + in header.html (or wrapper.html if that's what you're using), + to indicate where the dynamic content should be placed.</p> + + <p>If the dynamic content is generated by a CGI script, your new + wrapper script which calls this CGI would then have to strip out + the parts that you don't want embedded in the output (headers, + some tags) so that only the relevant content gets put into the + environment variable you want. You'd also have to make sure + this CGI script doesn't grab the POST data or get confused by + the QUERY_STRING contents intended for htsearch. Your script + should not take anything out of, or add anything to, the + QUERY_STRING environment variable.</p> + + <p>An alternative approach is to have a cron job that periodically + regenerates a different header.html or wrapper.html with the + new banner ad, or changes a link to a different pre-generated + header.html or wrapper.html file. For other alternatives, see + question <a href="#q4.7">4.7</a>.</p> + + <strong>4.12. <a name="q4.12">How do I index numbers in documents?</a></strong><br> + <p>By default, htdig doesn't treat numbers without letters + as words, so it doesn't index them. 
+ To change this behavior, you must set the + <a href="attrs.html#allow_numbers">allow_numbers</a> + attribute to true, and rebuild your index from scratch using + rundig or htdig with the -i option, so that bare numbers get + added to the index.</p> + + <strong>4.13. <a name="q4.13">How can I call htsearch from a hypertext + link, rather than from a search form?</a></strong><br> + <p>If you change the search.html form to use the GET method + rather than POST, you can see the URLs complete with all the + arguments that htsearch needs for a query. Here is an example:<br> +<code> +http://www.grommetsRus.com/cgi-bin/htsearch?config=htdig&restrict=&exclude=&method=and&format=builtin-long&words=grapple+grommets +</code> + which can actually be simplified to:<br> +<code> +http://www.grommetsRus.com/cgi-bin/htsearch?method=and&words=grapple+grommets +</code> + with the current defaults. The "&" character acts as a + separator for the input parameters, while the "+" character + acts as a space character within an input parameter. + In versions 3.1.5 or 3.2.0b2, or later, you can use a semicolon + character ";" as a parameter separator, rather than "&", for + HTML 4.0 compliance. + Most non-alphanumeric characters should be hex-encoded following + the convention for URL encoding (e.g. "%" becomes "%25", "+" + becomes "%2B", etc). Any htsearch input parameter that you'd + use in a search form can be added to the URL in this way. + This can be embedded into an <a href="..."> tag. + <br>See also question <a href="#q5.21">5.21</a>.</p> + + <strong>4.14. <a name="q4.14">How do I restrict a search to only meta + keywords entries in documents?</a></strong><br> + <p>First of all, you do <strong>not</strong> do this by using the + "keywords" field in the search form. This seems to be a + frequent cause of confusion. The "keywords" input parameter + to htsearch has absolutely nothing to do with searching meta + keywords fields. 
It actually predates the addition of meta + keyword support in 3.1.x. A better choice of name for the + parameter would have been "requiredwords", because that's what + it really means - a list of words that are all required to be + found somewhere in the document, in addition to the words the + user specifies in the search form.</p> + + <p>As of 3.2.0b5, the most direct way to search for a particular + meta keyword is to specify the word as "keyword:<word>". + Similarly, "title:", "heading:", and "author:" restrict searches + to the respective fields. To search for words in the body of the + text, use "text:".</p> + + <p>To restrict all search terms to meta keywords only, you can set all + <a href="attrs.html#heading_factor">factors</a> other than + keywords_factor to 0, and for 3.1.x, you + must then reindex your documents. In the 3.2 betas, you can + change factors at search time without needing to reindex. + As of 3.2.0b5, it is possible to restrict + the search in the query itself. Note that changing the scoring + factors in this way will only alter the scoring of search results, + and shift the low or zero scores to the end of the results when + sorting by score (as is done by default). For versions before + 3.2.0b5, the results with scores + of zero aren't actually removed from the search results.</p> + + <strong>4.15. <a name="q4.15">Can I use meta tags to prevent htdig from + indexing certain files?</a></strong><br> + <p>Yes, in each HTML file you want to exclude, add the following + between the <HEAD> and </HEAD> tags:</p> + <blockquote> + <META NAME="robots" CONTENT="noindex, follow"> + </blockquote> + <p>Doing so will allow htdig to still follow links to other documents, + but will prevent this document from being put into the index itself. + You can also use "nofollow" to prevent following of links. See + the section on <a href="meta.html">Recognized META information</a> + for more details. 
For documents produced automatically by MhonArc, + you can have that line inserted automatically by putting it in the + MhonArc resource file, in the sections IDXPGBEGIN and TIDXPGBEGIN.</p> + + <p>You can also use the + <a href="attrs.html#noindex_start">noindex_start</a> and + <a href="attrs.html#noindex_end">noindex_end</a> attributes to + define one set of tags which will mark sections to be stripped out + of documents, so they don't get indexed, or you can mark sections + with the non-DTD <noindex> and </noindex> tags. + The noindex_start and noindex_end attributes can also be used to + suppress in-line JavaScript code that wasn't properly enclosed in + HTML comment tags (see question <a href="#q4.26">4.26</a>). + In 3.1.6, you can also put a section between <noindex follow> + and </noindex> tags to turn off indexing of text but still + allow htdig to follow links.</p> + + <p>If you require much more elaborate schemes for avoiding indexing + certain parts of your HTML files, especially if you don't have + control over these files and can't add tags to them, you can + set up htdig's + <a href="attrs.html#external_parsers">external_parsers</a> attribute + with an external converter that will preprocess the HTML before + it's parsed and indexed by htdig. Examples of this are the + unhypermail.sh script in our + <a href="http://www.htdig.org/files/contrib/parsers/">contributed parsers</a> + and the ungeoify.sh script in our + <a href="http://www.htdig.org/files/contrib/scripts/">contributed scripts</a>. + By preprocessing the HTML, you can strip out parts you don't want, or + you can add or change tags wherever they're needed, if you're willing + to put in the effort to learn awk/sed/perl enough to do the job.</p> + + <strong>4.16. 
<a name="q4.16">How do I get htsearch to use the star image + in a different directory than the default /htdig?</a></strong><br> + <p>You must set either the + <a href="attrs.html#image_url_prefix">image_url_prefix</a> attribute, + or both <a href="attrs.html#star_blank">star_blank</a> and + <a href="attrs.html#star_image">star_image</a> in your + htdig.conf, to refer to the URL path for these files. You should + also set this URL path similarly in common/header.html and + common/wrapper.html, as they also refer to the star.gif file. + If you want to relocate other graphics, such as the buttons or + the ht://Dig logo, you should change all references to these + in htdig.conf and common/*.html.</p> + + <strong>4.17. <a name="q4.17">How do I get htdig or htsearch to rewrite + URLs in the search results?</a></strong><br> + <p>This can be done by using the <a + href="attrs.html#url_part_aliases">url_part_aliases</a> + configuration file attribute. You have to set up different + configuration files for htdig and htsearch, to define a + different setting of this attribute for each one.</p> + + <p>A large number of users insist on ignoring that last point + and try to make do with just one definition, either for htdig + or htsearch, or sometimes for both. This seems to stem from + a fundamental misunderstanding of how this attribute works, + so perhaps a clarification is needed. The url_part_aliases + attribute uses a two stage process. In the first stage, htdig + encodes the URLs as they go into the database, by using the + pairs in url_part_aliases going from left to right. In the + second stage, htsearch decodes the encoded URLs taken from the + database, by using the pairs in url_part_aliases going from + right to left. If you have the same value for url_part_aliases + in htdig and htsearch, you end up with the same URLs in the + end. 
If you modify the first string (the from string) in + the pairs listed in url_part_aliases for htsearch, then when + htsearch decodes the URLs it ends up rewriting part of them.</p> + + <p>While you might think that if you don't use url_part_aliases + in htdig, then you can use it in htsearch to alter unencoded + URLs, the reality is that if you don't encode parts of URLs + using url_part_aliases, they still get encoded automatically + by the <a href="attrs.html#common_url_parts">common_url_parts</a> + attribute. This helps to reduce the size of your databases. So, + trying to use url_part_aliases only in htsearch doesn't work + because there are no unencoded URLs in the database, so the + right hand strings in the pairs you define won't match anything.</p> + + <p>You also can't put two different definitions of the + url_part_aliases attribute in a single configuration file, as + some users have attempted. When you define an attribute twice, + the second definition merely overrides the first. Pay close + attention to the description and examples for + <a href="attrs.html#url_part_aliases">url_part_aliases</a>. + You must put one definition of this attribute in your + configuration file for htdig, htmerge (or htpurge) and htnotify, + and a different definition of it in your configuration file + for htsearch.</p> + + <strong>4.18. <a name="q4.18">What are all the options in + htdig.conf, and are there others?</a></strong><br> + <p>In ht://Dig's terminology, the settings in its configuration + files are called <a href="attrs.html">configuration attributes</a>, + to distinguish them from <a href="htdig.html">command line + options</a>, <a href="hts_form.html">CGI input parameters</a> + and <a href="hts_templates.html">template variables</a>. There are + many, many attributes that can be set to control almost all + aspects of indexing, searching, customization of output and + internationalization. 
All attributes have a built-in default + setting, and only a subset of these appear in the sample htdig.conf + file. See the documentation for all default values for attributes + not overridden in the configuration file, and for help on using + any of them. + See also question <a href="#q1.15">1.15</a>.</p> + + <strong>4.19. <a name="q4.19">How do I get more than 10 pages of + 10 search results from htsearch?</a></strong><br> + <p>There are two attributes that control the number of matches per + page and the total number of pages. The number of matches per page + can be set in your configuration file, using the + <a href="attrs.html#matches_per_page">matches_per_page</a> attribute, + or in your <a href="hts_form.html">search form</a>, using the + <strong>matchesperpage</strong> input parameter.</p> + + <p>The number of pages is controlled by the + <a href="attrs.html#maximum_pages">maximum_pages</a> attribute in + your search configuration file. + The current default for maximum_pages is 10 because the ht://Dig + package comes with 10 images, with numbers 1 through 10, for + use as page list buttons. If we increased the limit, we'd have + to field a whole lot more questions from users irate because + only the first 10 buttons are graphics, and the rest are text. + If you want more than 10 pages of results, change maximum_pages, + but you may also want to set the + <a href="attrs.html#page_number_text">page_number_text</a> and + <a href="attrs.html#no_page_number_text">no_page_number_text</a> + attributes in your search configuration file to nothing, or + remove them, to use text rather than images for the links to + other pages.</p> + + <p>In versions of htsearch before 3.1.4, maximum_pages + limited only the number of page list buttons, and not the + actual number of pages. This was changed because there was no + means of limiting the total number of pages, but this ended up + frustrating users who wanted the ability to have more pages than + buttons. 
In 3.2.0b3 and 3.1.6 we introduced a + <a href="attrs.html#maximum_page_buttons">maximum_page_buttons</a> + attribute for this purpose.</p> + + <strong>4.20. <a name="q4.20">How do I restrict a search to only + certain subdirectories or documents?</a></strong><br> + <p>That depends on whether you want to protect certain parts of + your site from prying eyes, or just limit the scope of search + results to certain relevant areas. For the latter, you just need + to set the <strong>restrict</strong> or <strong>exclude</strong> + input parameter in the <a href="hts_form.html">search form</a>. + This can be done using hidden input fields containing preset + values, text input fields, select lists, radio buttons or + checkboxes, as you see fit. If you use select lists, you can + propagate the choices to select lists in the follow-up search + forms using the + <a href="attrs.html#build_select_lists">build_select_lists</a> + configuration attribute. + The University at Albany has a good description of how to use + the <strong>restrict</strong> or <strong>exclude</strong> input + parameters: <a href="http://www.albany.edu/its/web/search/"> + Constructing a local search using ht://Dig Search forms</a>. + <br>To include a hex encoded character (such as a %20 for a space) + in a restrict or exclude string, the '%' must again be encoded. + For example, to match a filename containing a space, the URL must + contain %20, and so the CGI parameter passed to htsearch must + contain %2520. The %25 encodes the '%'. (Note that this is only + necessary for CGI input parameters, not for the corresponding + configuration attributes in your htdig.conf file, as attributes + aren't subjected to the same hex decoding step as parameters are.) 
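</p>
<p>As a rough illustration of the double encoding described above, the
following shell sketch re-encodes each '%' in an already hex-encoded
string as %25. The function name double_encode and the sample file name
are made up for this example; they are not part of ht://Dig.</p>

```shell
#!/bin/sh
# Hypothetical helper (not part of ht://Dig): re-encode every '%' in an
# already hex-encoded string as %25, so the value survives the extra
# hex decoding step applied to CGI input parameters.
double_encode() {
    printf '%s\n' "$1" | sed 's/%/%25/g'
}

# "my%20file.html" is an illustrative name containing an encoded space.
double_encode 'my%20file.html'   # prints my%2520file.html
```

<p>The printed value is what would be passed in the
<strong>restrict</strong> or <strong>exclude</strong> input parameter;
the corresponding configuration attributes would use the singly encoded
form.</p>
<p>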
+ <br>See also question <a href="#q4.4">4.4</a>.</p> + + <p>If you wish to keep secure and non-secure areas on + your site separate, and avoid having unauthorized users + see documents from secure areas in their search results, + that takes a bit more effort. You certainly can't rely on + the <strong>restrict</strong> and <strong>exclude</strong> + parameters, or even the <strong>config</strong> parameter, + as any parameter in a search form can also be overridden + by the user in a URL with CGI parameters. The safest + option would be to host the secure and non-secure areas on + separate servers with independent installations of htsearch, + each with its own ht://Dig database, but that is often too + costly or impractical an option. The next best thing is to + host them on the same site, but make sure that everything + is very clearly separated to prevent any leakage of secure + data. You should maintain separate databases for the secure + and public areas of your site, by setting up different htdig + configuration files for each area. Use different settings + of the <a href="attrs.html#start_url">start_url</a>, + <a href="attrs.html#limit_urls_to">limit_urls_to</a> + and <a href="attrs.html#database_dir">database_dir</a> + configuration attributes, and possibly even different + <a href="attrs.html#common_dir">common_dir</a> settings as well. + Make sure your database_dir, and even your common_dir, are not + in any directories accessible from the web server. Run htdig + and htmerge (or rundig) with each separate configuration file, + to build your two databases.</p> + + <p>The tricky part is to make sure your htsearch program is + secure. You don't want to use the same htsearch for the secure + and public sites, because otherwise the public site could + access the configuration for the secure database, making its + data publicly accessible. 
You must either compile two separate + versions of htsearch, with different settings of the CONFIG_DIR + <em>make</em> variable, or you must make a simple wrapper + script for htsearch that overrides the compiled-in CONFIG_DIR + setting by a different setting of the CONFIG_DIR environment + variable. Make sure the CONFIG_DIR for the secure area is + not a subdirectory of the CONFIG_DIR for the public area. + In this way, you can maintain separate directories of config + files for the public and secure sites, so that the secure + config files are not accessible from the public htsearch.</p> + + <p>Put the htsearch binary or wrapper script for the secure site + in a different ScriptAlias'ed cgi-bin directory than the public + one, and protect the secure cgi-bin with a .htaccess file or + in your server configuration. Alternatively, you can put the + secure program, let's call it htssearch, in the same cgi-bin, + but protect that one CGI program in your server configuration, + e.g.:</p> +<pre> +<Location /cgi-bin/htssearch> +AuthType Basic +AuthName .... +AuthUserFile ... +AuthGroupFile ... +<Limit GET POST> +require group foo +</Limit> +</Location> +</pre> + <p>This describes the setup for an Apache server. You'd need to + work out an equivalent configuration for your server if you're + not running Apache.</p> + + <strong>4.21. <a name="q4.21">How can I allow people to search + while the index is updating?</a></strong><br> + <p>Answer contributed by Avi Rappoport <avirr@searchtools.com></p> + <p>If you have enough disk space for two copies of the index + database, use -a with the htdig and htmerge processes. This will + make use of a copy of the index database with the extension + ".work", and update the copy instead of the originals. + This way, htsearch can use those originals while the update is + going on. When it's done, you can move the .work versions to + replace the originals, and htsearch will use them. 
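</p>
<p>The manual procedure above might look something like the following
sketch. The function names and the paths in the example call are
assumptions for illustration, and the rename loop simply moves every
file ending in .work, so check which files your version actually
creates before relying on it. With the 3.2 betas you would run htpurge
where this uses htmerge.</p>

```shell
#!/bin/sh
# Hypothetical helper: rename every "*.work" file in a database
# directory to the same name without the ".work" suffix.
promote_work_files() {
    dbdir=$1
    for f in "$dbdir"/*.work; do
        [ -e "$f" ] || continue      # no .work files: nothing to do
        mv -f "$f" "${f%.work}"
    done
}

# Hypothetical wrapper: update the ".work" copies with -a, then swap
# them in only if both htdig and htmerge succeeded.
update_index() {
    conf=$1
    dbdir=$2
    htdig -a -c "$conf" &&
        htmerge -a -c "$conf" &&
        promote_work_files "$dbdir"
}

# Example call (both paths are assumptions, adjust to your install):
#   update_index /etc/htdig/htdig.conf /var/lib/htdig/db
```

<p>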
The current + rundig script will do this for you if you supply the -a flag + to it. However, rundig builds the database from scratch each + time you run it. If you want to update an alternate copy of + the database, see the + <a href="http://www.htdig.org/files/contrib/scripts/rundig.sh">contributed + rundig.sh script</a>.</p> + + <strong>4.22. <a name="q4.22">How can I get htdig to ignore the + robots.txt file or meta robots tags?</a></strong><br> + <p>You can't, and you shouldn't. The + <a href="http://www.robotstxt.org/wc/norobots.html"> + Standard for Robot Exclusion</a> exists for a very good reason, + and any well behaved indexing engine or spider should conform to it. + If you have a problem with a robots.txt file, you really should + take it up with the site's webmaster. If they don't have a problem + with you indexing their site, they shouldn't mind setting up a + User-agent entry in their robots.txt file with a name you both + agree on. The user agent setting that htdig uses for matching + entries in robots.txt can be changed via the + <a href="attrs.html#robotstxt_name">robotstxt_name</a> attribute in + your config file.</p> + + <p>If you have a problem with a robots meta tag in a document + (see question <a href="#q4.15">4.15</a>) you should take it up + with the author or maintainer of that page. These tags are an + all or nothing deal, as they can't be set up to allow some engines + and disallow others. If htdig encounters them, it has to give the + page's creator the benefit of the doubt and honour them. If + exceptions to the rule are wanted, this should be done with a + robots.txt file rather than a meta tag.</p> + + <strong>4.23. 
<a name="q4.23">How can I get htdig not to index + some directories, but still follow links?</a></strong><br> + <p>You can simply add the directory name to your robots.txt file + or to the <a href="attrs.html#exclude_urls">exclude_urls</a> + attribute in your configuration, but that will exclude all files + under that directory. If you want the files in that directory to + be indexed, you have a couple of options. You can add an index.html + file to the directory, that will include a robots meta tag + (see question <a href="#q4.15">4.15</a>) to prevent indexing, + and will contain links to all your files in this directory. + The drawback of this is that you must maintain the index.html + file yourself, as it won't be automatically updated as new + files are added to the directory.</p> + + <p>The other technique you can use, if you want the directory + index to be made by the web server, is to get the server to + insert the robots meta tag into the index page it generates. + In Apache, this is done using the + <a href="http://httpd.apache.org/docs/mod/mod_autoindex.html#headername">HeaderName</a> + and <a href="http://httpd.apache.org/docs/mod/mod_autoindex.html#indexoptions">IndexOptions</a> + directives in the directory's <strong>.htaccess</strong> file. + For example:</p> +<pre> HeaderName .htrobots + IndexOptions FancyIndexing SuppressHTMLPreamble +</pre> + <p>and in the .htrobots file:</p> +<pre><HTML><head> +<META NAME="robots" CONTENT="noindex, follow"> +<title>Index of /this/dir</title> +</head> +</pre> + + <p>If you don't mind getting just one copy of each directory, + but want to suppress the multiple copies generated by Apache's + FancyIndexing option, you can either turn off FancyIndexing or + you can add "?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D" to + the <a href="attrs.html#bad_querystr">bad_querystr</a> attribute + (without the quotes) to suppress the alternately sorted views of + the directory. 
For Apache 2.x, you'd use "C=D C=M C=N C=S O=A O=D" + instead in your bad_querystr setting.</p> + + <strong>4.24. <a name="q4.24">How can I get rid of duplicates in + search results?</a></strong><br> + <p>This depends on the cause of the duplicate documents. htdig + does keep track of the URLs it visits, so it never puts the + same URL more than once in the database. So, if you have + duplicate documents in your search results, it's because the + same document appears under different URLs. Sometimes the + URLs vary only slightly, and in subtle ways, so you may have + to look hard to find out what the variation is. Here are some + common reasons, each requiring a different solution.</p> + + <ul> + <li>You're indexing a case insensitive web + server (e.g. an NT based server), but the + <a href="attrs.html#case_sensitive">case_sensitive</a> attribute is + still set to true. In this case, if htdig encounters two URLs + pointing to the same document, but the case of the letters in + one is different than the other (even if it's only 1 letter), + it will not treat them as the same URL.<br><br> + <li>You have symbolic links (or hard links) to some of + these documents, so they can be reached by several URLs. + The solution here is to build an exclude list of URLs that + are actually symbolic links, and put these in + <a href="attrs.html#exclude_urls">exclude_urls</a> + (or in your robots.txt file). You can automate this using a + technique similar to the find command in question + <a href="#q5.25">5.25</a> which builds the start_url list, but + adding a -type l to find symbolic links.<br><br> + <li>You have copies of the same documents in different + locations. This is similar to the symbolic link problem above, + but harder to fix automatically.<br><br> + <li>The duplicate URLs result from CGI, SSI or other dynamic pages + that give the same content even though there may be variations in + the query string or other parts of the URL. 
The approach to + fix this is similar to the fix above, but may be less easy + to automate, depending on what the variations are. You can + add patterns to exclude_urls or bad_querystr to get rid of + unwanted variations. These are especially important to bring + under control, because in some cases, if left unchecked, they + can result in an <em>infinite virtual hierarchy</em> which htdig + will never be able to finish indexing. For example, in a CGI-based + calendar, htdig could go on following next month or next + year links to infinity, but this can be stopped by adding a + stop year to <a href="attrs.html#bad_querystr">bad_querystr</a>. + <br><br>Another common example happens when htdig hits a link + to an SSI page and the URL has an extra trailing slash. This + can happen with either .shtml pages or .html pages that use + the XBitHack. The trailing slash causes the URL to be misinterpreted + as a directory URL, and any relative URLs in the document are added + to the URL, creating longer and longer URLs that still lead to the + same SSI document. There are two things you can do:<ol> + <li>hunt down the pages with the incorrect links, i.e. + search for ".shtml/" or ".html/" in URLs in your documents, + and fix these links; or + <li>add .shtml/ and .html/ to your + <a href="attrs.html#exclude_urls">exclude_urls</a> + setting to get htdig to ignore these defective links. + </ol>The second option is easier, but you run the risk that htdig + will miss some SSI pages if the only links to them have the trailing + slash, so you may want to try hunting down the links anyway. + <br><br>See also question <a href="#q5.29">5.29</a>.<br><br> + <li>The duplicates result from session IDs in PHP or other dynamic + pages that give the same content even though the ID changes during + the indexing process. This can lead not only to duplicates, but + also to URLs that become unusable because of expired session IDs. 
+ Session IDs are the bane of search engines, and you should avoid + using them if at all possible. If getting rid of them altogether + isn't an option, then you can at least remove them while indexing, + using the <a href="attrs.html#url_rewrite_rules">url_rewrite_rules</a> + attribute. This will only work if htdig can access the documents + without a session ID, as htdig rewrites the URL before fetching the + document, and htsearch presents the rewritten URL (without session + ID) in search results. + </ul> + + <strong>4.25. <a name="q4.25">How can I change the scores in + search results, and what are the defaults?</a></strong><br> + <p>The scores are calculated mostly by htdig at indexing time, + with some tweaking done by htsearch at search time. There are + a number of <a href="attrs.html">configuration attributes</a>, + all called <em><something></em><strong>_factor</strong>, + which can control the scoring calculations. In addition, the + location of words within the document has an effect on score, + as word scores are also multiplied by a varying location + factor somewhere in between 1000 for words near the start + and 1 for words near the end of the document. As of yet, + there is no way to change this factor. For any of the scoring + factors you can configure, and which are used by htdig, you + will have to reindex your documents so the new factors take + effect. The default values for these scoring factors, as well as + information about whether they're used by htdig or htsearch, + are all listed in the <a href="attrs.html">configuration + attributes documentation</a>. Malcolm Austen has written some + <a href="http://wwwsearch.ox.ac.uk/scores.html">notes on page + scores</a> for 3.1.x which you may find helpful.</p> + + <p>Note that the above applies to the 3.1.x releases, while + in the 3.2 beta releases, all scores are calculated at search + time with no weight being put on the location of words within + the document.</p> + + <strong>4.26. 
<a name="q4.26">How can I get htdig not to index + JavaScript code or CSS?</a></strong><br> + <p>The HTML parser in htdig recognizes and parses only HTML, + which is all there should be within an HTML file. If your HTML + files contain in-line JavaScript code or Cascading Style Sheets + (CSS), these in-line codes, which are clearly not HTML, should + be enclosed within an HTML comment tag so they are hidden + from view from the HTML parser, or for that matter from any + web client that is not JavaScript-aware or CSS-aware. See + <a href="http://www.mcli.dist.maricopa.edu/show/interact/js_b.html"> + Behind the Scenes with JavaScript</a> for a description of the + technique, which applies equally well to in-line style sheets. + If fixing up all non-HTML compliant JavaScript or CSS code in + your HTML files is not an option, then see question + <a href="#q4.15">4.15</a> for an alternate technique.</p> + + <p>The HTML parser in htdig 3.1.6 tries skipping over bare + in-line JavaScript code in HTML, unlike previous versions, + but a small bug in the parser causes it to be thrown off by a + "<" sign in the JavaScript, and it may then miss the closing + </script> tag. This can be fixed by applying this + <a href="ftp://ftp.ccsf.org/htdig-patches/3.1.6/JavaScript.0"> + patch</a>.</p> + + <hr noshade size=2> + + <h3>5. Troubleshooting</h3> + <strong>5.1. <a name="q5.1">I can't seem to index more than X documents + in a directory.</a></strong><br> + <p>This usually has to do with the default document size + limit. If you set <a href="attrs.html#max_doc_size"> + max_doc_size</a> in your config file to + something large enough to read in the directory index (try 100000 for + 100K) this should fix this problem. Of course this will require + more memory to read the larger file. Don't set it to a value + larger than the amount of memory you have, and never more than + about 2 billion, the maximum value of a 32-bit integer. 
+ If htdig is missing entire directories, see question + <a href="#q5.25">5.25</a>.</p> + + <strong>5.2. <a name="q5.2">I can't index PDF files.</a></strong><br> + <p>As above, this usually has to do with the default document + size. What happens is ht://Dig will read in part of a PDF file + and try to index it. This usually fails. Try setting + <a href="attrs.html#max_doc_size">max_doc_size</a> + in your config file to a larger value than the + size of your largest PDF file. Don't go overboard, though, as + you don't want to overflow a 32-bit integer (about 2 billion), + and you don't want to allocate much more memory than you need + to store the largest document.</p> + + <p>There is a bug in Adobe Acrobat Reader version 4, in its + handling of the -pairs option, which causes a segmentation + violation when using it with htdig 3.1.2 or earlier. There is + a workaround for this as of version 3.1.3 - you must remove + the -pairs option from your pdf_parser definition, if it's + there. However, acroread version 4 is still very unstable (on + Linux, anyway) so it is not recommended as a PDF parser. An + alternative is to use an external converter with the xpdf 0.90 + package installed on your system, as described in question <a + href="#q4.9">4.9</a> above.</p> + + <strong>5.3. <a name="q5.3">When I run "rundig," I get a message about + "DATABASE_DIR" not being found.</a></strong><br> + <p>This is due to a bug in the Makefile.in file in version + 3.1.0b1. The easiest fix is to edit the rundig file and change + the line "TMPDIR=@DATABASE_DIR@" to set TMPDIR to a directory + with a large amount of temporary disk space for htmerge. This + bug is fixed in version 3.1.0b2.</p> + + <strong>5.4. <a name="q5.4">When I run htmerge, it stops with an "out + of diskspace" message.</a></strong><br> + <p>This means that htmerge has run out of temporary disk space + for sorting. 
Either in your "rundig" script (if you run htmerge + through that) or before you run htmerge, set the variable TMPDIR + to a temp directory with lots of space.</p> + + <strong>5.5. <a name="q5.5">I have problems running rundig from cron + under Linux.</a></strong><br> + <p>This problem commonly occurs on Red Hat Linux 5.0 and 5.1, + because of a bug in vixie-cron. It causes htmerge to fail with a + "Word sort failed" error. It's fixed in Red Hat 5.2. + You can install vixie-cron-3.0.1-26.{arch}.rpm from a 5.2 + distribution to fix the problem on 5.0 or 5.1. A quick fix for + the problem is to change the first line of rundig to "#!/bin/ash" + which will run the script through the ash shell, but this doesn't + solve the underlying problem.</p> + + <strong>5.6. <a name="q5.6">When I run htmerge, it stops with an + "Unexpected file type" message.</a></strong><br> + <p>Often this is because the databases are corrupt. Try removing + them and rebuilding. If this doesn't work, some have found that + the solution for question <a href="#q3.2">3.2</a> works for this + as well. This should be fixed in versions from 3.1.x on.</p> + + <strong>5.7. <a name="q5.7">When I run htsearch, I get lots of Internal + Server Errors (#500).</a></strong><br> + <p>If you are running under Solaris, see <a href="#q3.6">3.6</a>. + The solution for Solaris may also work for other OSes that use shared + libraries in non-standard locations, so refer to question 3.6 if + you suspect a shared library problem. In any case, check your web + server error logs to see the cause of the internal server errors. + If it's not a problem with shared libraries, there's a good chance + that the error logs will still contain useful error messages that + will help you figure out what the problem is. + <br>See also questions <a href="#q5.13">5.13</a> and + <a href="#q5.23">5.23</a>.</p> + + <strong>5.8. 
<a name="q5.8">I'm having problems with indexing words + with accented characters.</a></strong><br> + <p> + Most of the time, this is caused by either not setting or + incorrectly setting the <a + href="attrs.html#locale">locale</a> attribute. The default locale + for most systems is the "portable" locale, which strips + everything down to standard ASCII. Most systems expect + something like <code>locale: en_US</code> or + <code>locale: fr_FR</code>. Locale files are often found in + <code>/usr/share/locale</code> or the <tt>$LANGUAGE</tt> + environment variable. See also question <a href="#q4.10">4.10</a>. + </p> + + <p>Setting the locale correctly seems to be a frequent source of + frustration for ht://Dig users, so here are a few pointers which + some have found useful. First of all, if you don't have any luck + with the settings of the <a href="attrs.html#locale">locale</a> + attribute that you try, make sure you use a locale that is + defined on your system. As mentioned above, these are usually + installed in <code>/usr/share/locale</code>, so look there + for a directory named for the locale you want to use. If + you don't find it, but find something close, try that locale + name. Note that the locale may not have to be specific to the + language you're indexing, as long as it uses the same character + set. E.g. most western European languages use the ISO-8859-1 + Latin 1 character set, so on most systems the locales for + all these languages define the same character types table + and can be used interchangeably. Some systems, however, + define only the accented letters used for a given language, + so "your mileage may vary." The important thing is that the + directory for your locale definition <strong>must</strong> + have a file named <code>LC_CTYPE</code> in it. 
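</p>
<p>One way to carry out that check is sketched below. The function name
is made up for this example, and the default directory is the
/usr/share/locale location mentioned above, which may differ on your
system.</p>

```shell
#!/bin/sh
# Hypothetical helper: print the names of the locale directories under
# a base directory that contain an LC_CTYPE file.
# Assumes /usr/share/locale when no directory is given.
locales_with_ctype() {
    base=${1:-/usr/share/locale}
    for d in "$base"/*/; do
        [ -f "${d}LC_CTYPE" ] && basename "$d"
    done
    return 0
}

# List the candidate locale names found on this system, if any.
locales_with_ctype
```

<p>Any name it prints is a candidate value for the locale attribute,
since its directory holds an LC_CTYPE file.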
For example, + on many Linux distributions, a language-specific locale like + <code>fr</code> won't contain this file, but country-specific + locales like <code>fr_FR</code> or <code>fr_CA</code> will. If + you don't find any appropriate locales installed on your system, + try obtaining and installing the locale definition files from + your OS distribution. Also, once you've set your locale, you need + to reindex all your documents in order for the locale to take + effect in the word database. This means rerunning the "rundig" + script, or running "htdig -i" and htmerge (or htpurge in the 3.2 + betas).</p> + + <p>Note also that some UNIX systems and libc5-based Linux + systems just don't have a working implementation of locales, + so you may not be able to get locales working at all on certain + systems. The + <a href="http://www.htdig.org/files/contrib/other/testlocale.c">testlocale.c</a> + program on our web site can let you see the LC_CTYPE tables + for any locale, to aid in finding one that works. Carefully + follow the directions in the program's comments to know how to + use it and what to look for in its output.</p> + + <strong>5.9. <a name="q5.9">When I run htmerge, it stops with a + "Word sort failed" message.</a></strong><br> + <p>There are three common causes of this. First of all, the sort + program may be running out of temporary file space. Fix this + by freeing up some space where sort puts its temporary files, + or change the setting of the TMPDIR environment variable to a + directory on a volume with more space. A second common problem + is on systems with a BSD version of the sort program (such as + FreeBSD or NetBSD). This program uses the -T option as a record + separator rather than an alternate temporary directory. On these + systems, you must remove the TMPDIR environment variable from + rundig, or change the code in htmerge/words.cc not to use the + -T option. A third cause is the cron program on Red Hat Linux + 5.0 or 5.1. 
(See question <a href="#q5.5">5.5</a> above.)</p> + + <strong>5.10. <a name="q5.10">When htsearch has a lot of matches, it runs + extremely slowly.</a></strong><br> + <p>When you run htsearch with no customization, on a + large database, and it gets a lot of hits, it tends to + take a long time to process those hits. Some users with + large databases have reported much higher performance, + for searches that yield lots of hits, by setting the <a + href="attrs.html#backlink_factor">backlink_factor</a> attribute + in htdig.conf to 0, and sorting by score. The scores calculated + this way aren't quite as good, but htsearch can process hits + much faster when it doesn't need to look up the db.docdb record + for each hit, just to get the backlink count, date or title, + either for scoring or for sorting. This affects versions + 3.1.0b3 and up. In version 3.2, currently under development, + the databases will be structured differently, so it should + perform searches more quickly.</p> + + <p>In version 3.1.6, the date range selection code also slows + down htsearch for the same reason. Unfortunately, a small bug + crept into the code so that even if you don't set any of the + date range input parameters (startyear, endyear, etc.), and + you set backlink_factor and date_factor to 0, htsearch still + looks at the date in the db.docdb record for each hit. You can + avoid this either by setting startyear to 1969 and endyear to + 2038 in your config file, or by applying this + <a href="ftp://ftp.ccsf.org/htdig-patches/3.1.6/timet_enddate.1"> + patch</a>.</p> + + <strong>5.11. <a name="q5.11">When I run htsearch, it gives me a count of + matches, but doesn't list the matching documents.</a></strong><br> + <p>This most commonly happens when you run htsearch while the + database is currently being rebuilt or updated by htdig. + If htdig and htmerge have run to completion, and the problem still + occurs, this is usually an indication of a corrupted database. 
If + it's finding matches, it's because it found the matching + words in db.words.db. However, it isn't finding the document + records themselves in db.docdb, which would suggest that either + db.docdb, or db.docs.index (which maps document IDs used in + db.words.db to URLs used to look up records in db.docdb), is + incomplete or messed up. You'll likely need to rebuild your + database from scratch if it's corrupted. Older versions of + ht://Dig were susceptible to database corruption of this + sort. Versions 3.1.2 and later are much more stable.</p> + + <p>Another possible cause of this problem is unreadable result + template files. If you define external template files via the + <a href="attrs.html#template_map">template_map</a> attribute, + rather than using the builtin-short or builtin-long templates, + and the file names are incorrect or the files do not have + read permission for the user ID under which htsearch runs, + then htsearch won't be able to display the results. Also, + all directories leading up to these template files must be + searchable (i.e. executable) by htsearch, or it won't be able + to open the files. This is the opposite problem of that described + in question <a href="#q5.36">5.36</a>. If htsearch displays + nothing at all, you may have both problems.</p> + + <strong>5.12. <a name="q5.12">I can't seem to index documents with names + like left_index.html with htdig.</a></strong><br> + <p>There is a bug in the implementation of the <a + href="attrs.html#remove_default_doc">remove_default_doc</a> + attribute in htdig versions 3.1.0, 3.1.1 and 3.1.2, which causes + it to match more than it should. The default value for this + attribute is "index.html", so any URL in which the filename ends + with this string (rather than matches it entirely) will have + the filename stripped off. This is fixed in version 3.1.3.</p> + + <strong>5.13. 
<a name="q5.13">I get Premature End of Script Headers errors + when running htsearch.</a></strong><br> + <p>This happens when htsearch dies before putting out a + "Content-Type" header. If you are running Apache under Solaris, + or another system that may be using shared libraries in non-standard + locations, + first try the solution described in question <a href="#q3.6">3.6</a>. + If that doesn't work, or you're running on another system, try + running "htsearch -vvv" directly from the command line to see where + and why it's failing. It should prompt you for the search words, + as well as the format. + <br>If it works from the command line, but not from the web + server, it's almost certainly a web server configuration problem. + Check your web server's error log for any information related to + htsearch's failure. One increasingly common problem is Apache + configurations which expect all CGI scripts to be Perl, + rather than binary executables or other scripts, so they use + "perl-handler" rather than "cgi-handler". + <br>See also questions <a href="#q5.7">5.7</a>, + <a href="#q5.14">5.14</a> and <a href="#q5.23">5.23</a>.</p> + + <strong>5.14. <a name="q5.14">I get Segmentation faults when running + htdig, htsearch or htfuzzy.</a></strong><br> + <p>Despite a great deal of debugging of these programs, we haven't + been able to completely eliminate all such problems on all platforms. + If you're running htsearch or htfuzzy on a BSDI system, a common + cause of core dumps is a conflict between the GNU regex + code bundled in htdig 3.1.2 and later, and the BSD C or C++ library. + The solution is to use the BSD library's own rx code instead, + using version 3.1.6 or newer as summarized by Joe Jah:</p> + <ul> + <li> ./configure --with-rx + <li> make + </ul> + <p>This solution may work on some other platforms as well (we haven't + heard one way or the other), but will definitely not work on some + platforms. 
For instance, on libc5-based Linux systems, the bundled + regex code works fine by default, but using libc5's regex code + causes core dumps.</p> + + <p>Users of Cobalt Raq or Qube servers have complained of + segmentation faults in htdig. Apparently this is due to problems + in their C++ libraries, which are fixed in their experimental + compiler and libraries. The following commands should install + the packages you need:</p> + <blockquote> + rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/binutils-2.8.1-3C1.mips.rpm<br> + rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-1.0.2-9.mips.rpm<br> + rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-c++-1.0.2-9.mips.rpm<br> + rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-g77-1.0.2-9.mips.rpm<br> + rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-objc-1.0.2-9.mips.rpm<br> + rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-2.8.0-9.mips.rpm<br> + rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-devel-2.8.0-9.mips.rpm<br> + rpm -Uvh --force ftp://ftp.cobaltnet.com/pub/products/current/RPMS/gcc-2.7.2-C2.mips.rpm + </blockquote> + <p>You may have to remove the libg++ package, if you have it installed + before installing libstdc++, because of conflicts in these packages. + Be sure to do a "make clean" before a "make", to remove any object + files compiled with the old compiler and headers.</p> + + <p>For other causes of segmentation faults, or in other programs, + getting a stack backtrace after the fault can be useful in narrowing + down the problem. E.g.: try "gdb /path/to/htsearch /path/to/core", + then enter the command "bt". You can also try running the program + directly under the debugger, rather than attempting a post-mortem + analysis of the core dump. Options to the program can be given on + gdb's "run" command, and after the program is suspended on fault, + you can use the "bt" command. 
This may give you enough information + to find and fix the problem yourself, or at least it may help others + on the htdig mailing list to point out what to do next.</p> + + <strong>5.15. <a name="q5.15">Why does htdig 3.1.3 mangle URL parameters + that contain bare "&" characters?</a></strong><br> + <p>This is a known bug in 3.1.3, and is fixed with this + <a href="ftp://ftp.ccsf.org/htdig-patches/3.1.3/HTML.cc.0"> + patch</a>. You can apply the patch by entering into the main + source directory for htdig-3.1.3, and using the command + "patch -p0 < /path/to/HTML.cc.0". This is + also fixed as of version 3.1.4.</p> + + <strong>5.16. <a name="q5.16">When I run htmerge, it stops with an + "Unable to open word list file '.../db.wordlist'" message.</a></strong><br> + <p>The most common cause of this error is that htdig did not + manage to index any documents, and so it did not create a word + list. You should repeat the htdig or rundig command with the + -vvv option to see where and why it is failing. + See question <a href="#q4.1">4.1</a>.</p> + + <strong>5.17. <a name="q5.17">When using Netscape, htsearch always returns the + "No match" page.</a></strong><br> + <p>Check your search form. Chances are there is a hidden input + field with no value defined. For example, one user had<br> + <code><input type=hidden name=restrict></code> + + in his search form, instead of<br> + + <code><input type=hidden name=restrict value=""></code> + + The problem is that Netscape sets the missing value to a default of " " + (two spaces), rather than an empty string. For the restrict parameter, + this is a problem, because htsearch won't likely find any URLs with two + spaces in them. Other input parameters may similarly pose a problem. + </p> + + <p>Another possibility, if you're running 3.2.0b1 or 3.2.0b2, is + that you need to make the db.words.db_weakcmpr file writeable by + the user ID under which the web server runs. 
This is a bug, and + is fixed in the 3.2.0b5 beta.</p> + + + <strong>5.18. <a name="q5.18">Why doesn't htdig follow links to other + pages in JavaScript code?</a></strong><br> + <p>There probably isn't any indexing tool in existence + that follows JavaScript links, because they don't know how + to initiate JavaScript events. Realistically, it would take a + full JavaScript parser in order to be able to figure out all the + possible URLs that the code could generate, something that's way + beyond the means of any search engine. You have a few options:</p> + <ul> + <li>Add "backup" links using plain HTML <a href=...> tags to + all the pages that could be accessed through JavaScript, + <li>Add <link> tags to point to all these pages (see + <a href="http://www.w3.org/TR/html4/struct/links.html#h-12.3.3">Links + and search engines</a> in W3C's HTML 4.0 Specification - requires + htdig 3.1.3 or greater, but then <em>everyone</em> should be running + 3.1.6 or greater anyway), + <li>Compose a list of all the unreachable documents, or write + a program to do so, and feed that list as part of htdig's + <a href="attrs.html#start_url">start_url</a> attribute. + See also question <a href="#q5.25">5.25</a>. + </ul> + + <strong>5.19. <a name="q5.19">When I run htsearch from the web server, + it returns a bunch of binary data.</a></strong><br> + <p>Your server is returning the contents of the htsearch binary. + Common causes of this are:</p> + <ul> + <li>no execute permission on the htsearch binary, + <li>the binary won't run on this system (it may be compiled + for the wrong system type), or + <li>the web server doesn't recognize the file as a CGI + (for Apache, you must have a ScriptAlias directive for the + program or the directory in which it's installed, or define + a cgi-script handler for some suffix, e.g. .cgi, and add that + suffix to the program file name). 
+ </ul> + <p>By default, Apache is usually configured with one cgi-bin + directory as ScriptAlias, so all your CGI programs must go in + there, or have a .cgi suffix on them. Your configuration may + differ, however.</p> + + <strong>5.20. <a name="q5.20">Why are the betas of 3.2 so + slow at indexing?</a></strong><br> + <p> + As the release notes for these versions suggest, they are + somewhat unoptimized and are made available for testing. + Since the 3.2 code indexes all locations of words to support + phrase searching and other advanced methods, this additional + data slows down the indexer. To compensate, the code has a + cache configured by the + <a href="dev/htdig-3.2/attrs.html#wordlist_cache_size">wordlist_cache_size</a> + attribute. + As of this writing, the word database code will slow down + considerably when the cache fills up. Setting the cache as + large as possible provides considerable performance + improvement. Development is in progress to improve cache + performance. + For 3.2.0b6 and higher, see also the + <a href="dev/htdig-3.2/attrs.html#store_phrases">store_phrases</a> attribute, + which can turn off support for phrase searches, improving the speed. + </p> + + <strong>5.21. <a name="q5.21">Why does htsearch use ";" instead of + "&" to separate URL parameters for the page buttons?</a></strong><br> + <p>In versions 3.1.5 and 3.2.0b2, and later, htsearch was + changed to use a semicolon character ";" as a parameter + separator for page button URLs, rather than "&", for HTML + 4.0 compliance. It now allows both the "&" and the ";" as + separators for input parameters, because the CGI specification + still uses the "&". This change may cause some PHP or CGI + wrapper scripts to stop working, but these scripts should be + similarly changed to recognize both separator characters. 
+ For the definitive reference on this issue, please refer to + section B.2.2 of W3C's HTML 4.0 Specification, + <a href="http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.2"> + Ampersands in URI attribute values</a>. We're all a little + tired of arguing about it. If you don't like the standard, you + can change the Display::createURL() code yourself to ignore it. + <br>See also question <a href="#q4.13">4.13</a>.</p> + + <p>If you want to try working within the new standard, you may + find it helpful to know that recent versions of CGI.pm will + allow either the ampersand or semicolon as a parameter separator, + which should fix any Perl scripts that use this library. + In PHP, you can simply set the following in your php.ini file + to allow either separator:</p> +<pre>arg_separator.input = ";&" +</pre> + + <strong>5.22. <a name="q5.22">Why does htsearch show the + "&" character as "&amp;" in search results?</a></strong><br> + <p>In version 3.1.5, htsearch was fixed to properly + re-encode the characters &, <, >, and " + into SGML entities. However, the default value for the + <a href="attrs.html#translate_amp">translate_amp</a>, + <a href="attrs.html#translate_lt_gt">translate_lt_gt</a> + and <a href="attrs.html#translate_quot">translate_quot</a> + attributes is still false, so these entities don't get converted + by htdig. If you set these three attributes to true in your + htdig.conf and reindex, the problem will go away.</p> + + <p>In the 3.2 betas there was a bug in the HTML parser that + caused it to fail when attempting to translate the "&amp;" + entity. This has been fixed in 3.2.0b3. The translate_* attributes + are gone as of 3.2.0b2.</p> + + <strong>5.23. 
<a name="q5.23">I get Internal Server or Unrecognized + character errors when running htsearch.</a></strong><br> + <p>An increasingly common problem is Apache configurations + which expect all CGI scripts to be Perl, rather than binary + executables or other scripts, so they use "perl-handler" + rather than "cgi-handler". The fix is to create a separate + directory for non-Perl CGI scripts, and define it as such in + your httpd.conf file. You should define it the same way as your + existing cgi-bin directory, but use "cgi-handler" instead of + "perl-handler". In any case, you should check your web server's + error log for any information related to htsearch's failure. + <br>See also questions <a href="#q5.7">5.7</a>, + <a href="#q5.14">5.14</a> and <a href="#q5.13">5.13</a>.</p> + + <strong>5.24. <a name="q5.24">I took some settings out of + my htdig.conf but they're still set.</a></strong><br> + <p>All configuration file attributes have compiled-in, default + values. Taking an attribute out of the file is not the same + thing as setting it to an empty string, a 0, or a value of + false. See question <a href="#q4.18">4.18</a>.</p> + + <strong>5.25. <a name="q5.25">When I run htdig on my site, + it misses entire directories.</a></strong><br> + <p>First of all, htdig doesn't look at directories itself. It + is a spider, and it follows hypertext links in HTML documents. + If htdig seems to be missing some documents or entire directory + sub-trees of your site, it is most likely because there are + no HTML links to these documents or directories. (See also + question <a href="#q5.18">5.18</a>.) If htdig does + not come across at least one hypertext link to a document + or directory, and it's not explicitly listed in the + <a href="attrs.html#start_url">start_url</a> attribute, then + this document or directory is essentially hidden from view + to htdig, or to any web browser or spider for that matter. 
+ You can only get htdig to index directories, without providing + your own files with links to the contents of these directories, + by using your web server's automatic index generation feature. + In Apache, this is done with the mod_autoindex module, which + is usually compiled-in by default, and is enabled with the + "Indexes" option for a given directory hierarchy. For example, + you can put these directives in your Apache configuration:</p> +<pre> +<Directory "/path/to/your/document/root"> + Options Indexes FollowSymLinks Includes ExecCGI +</Directory> +</pre> + <p>This will cause Apache to automatically generate an index + for any directory that does not have an index.html or other + "DirectoryIndex" file in it. Other web servers will have + similar features, which you should look for in your server + documentation.</p> + + <p>As an alternative to relying on the web server's autoindex + feature, you can compose a list of all the unreachable + documents, or write a program to do so, and feed that list as + part of htdig's <a href="attrs.html#start_url">start_url</a> + attribute. Here is an example of a simple shell script to make + a file of URLs you can use with a configuration entry like + <code>start_url: `/path/to/your/file`</code>:</p> +<pre> +find /path/to/your/document/root -type f -name \*.html -print | \ + sed -e 's|/path/to/your/document/root/|http://www.yourdomain.com/|' > \ + /path/to/your/file +</pre> + <p>Other reasons why htdig might be missing portions of your + site might be that they fall out of the bounds specified + by the <a href="attrs.html#limit_urls_to">limit_urls_to</a> + attribute (which takes on the value of start_url by default), + they are explicitly excluded using the + <a href="attrs.html#exclude_urls">exclude_urls</a> attribute, + or they are disallowed by a robots.txt file (see the + <a href="htdig.html">htdig</a> documentation for notes about + robot exclusion) or by a robots meta tag (see question + <a href="#q4.15">4.15</a>). 
If htdig seems to be missing the + last part of a large directory or document, see question + <a href="#q5.1">5.1</a>. For reasons why htdig may be rejecting + some links to parts of your site, see question + <a href="#q5.27">5.27</a>.</p> + + <strong>5.26. <a name="q5.26">What do all the numbers and symbols + in the htdig -v output mean?</a></strong><br> + <p>Output from htdig -v typically looks like this:</p> +<pre> +23000:35506:2:http://xxx.yyy.zz/index.html: ***-+****--++***+ size = 4056 +</pre> + <p>The first number is the number of documents parsed so far, + the second is the DocID for this document, and the third is + the hop count of the document (number of hops from one of the + start_url documents). After the URL, it shows a "*" for a link + in the document that it already visited (or at least queued + for retrieval), a "+" for a new link it just queued, and a + "-" for a link it rejected for any of a number of reasons. + To find out what those reasons are, you need to run htdig + with at least 3 "v" options, i.e. -vvv. If there are no "*", + "+" or "-" symbols after the URL, it doesn't mean the document + was not parsed or was empty, but only that no links to other + documents were found within it.</p> + + <strong>5.27. <a name="q5.27">Why is htdig rejecting some of the + links in my documents?</a></strong><br> + <p>When htdig parses documents and finds hypertext links to + other documents (hrefs), it may reject them for any of several + reasons. To find out what those reasons are, you need to run + htdig with at least 3 "v" options, i.e. -vvv. 
Here are the + meanings of some of the messages you might see at this verbosity + level.</p> + <dl> + <dt>Not an http or relative link!</dt> + <dd>In versions 3.1.5 and earlier, only "http://" URLs, or + URLs relative to those, are allowed.</dd> + <dt>Item in the exclude list: item # <em>n</em></dt> + <dd>A substring of the URL matches one of the items in the + <a href="attrs.html#exclude_urls">exclude_urls</a> + attribute. The given item number will indicate which + pattern matched, starting at 1. The 3.2.0 betas do not + give the item number.</dd> + <dt>Extension is invalid!</dt> + <dd>The file name extension or suffix matches one of those + listed in the + <a href="attrs.html#bad_extensions">bad_extensions</a> + attribute.</dd> + <dt>Extension is not valid!</dt> + <dd>The file name extension or suffix does not match one of those + listed in the + <a href="attrs.html#valid_extensions">valid_extensions</a> + attribute, if any are specified.</dd> + <dt>Invalid Querystring! <em>or</em><br>item in bad query list</dt> + <dd>The URL contains a query string which matches one of those + listed in the + <a href="attrs.html#bad_querystr">bad_querystr</a> + attribute.</dd> + <dt>URL not in the limits!</dt> + <dd>No substring of the URL entirely matches one of the items in the + <a href="attrs.html#limit_urls_to">limit_urls_to</a> + attribute. The purpose of this attribute is to keep htdig + from attempting to index the entire World Wide Web.</dd> + <dt>forbidden by server robots.txt!</dt> + <dd>A substring of the URL matches one of the items disallowed + in the server's robots.txt file. See + <a href="http://www.robotstxt.org/wc/norobots.html"> + A Standard for Robot Exclusion</a>. This message exists + only in the 3.2.0 betas. 
In 3.1.5 and earlier, this condition + is only caught later, resulting in the message + "robots.txt: discarding '<em>URL</em>'" from htdig, and a + later "Deleted: no excerpt" message from htmerge.</dd> + <dt>url rejected: (level 2)</dt> + <dd>No substring of the URL entirely matches one of the items in the + <a href="attrs.html#limit_normalized">limit_normalized</a> + attribute. All the other rejections above will be indicated + as level 1. The 3.2.0 betas give the much more meaningful + message 'not in "limit_normalized" list!'</dd> + </dl> + + <p>Another possibility, if none of the error messages above appear + for some of the links you think htdig should be accepting, is that + htdig isn't even finding the links at all. First, make sure you're + not making false assumptions about how htdig finds these. It only + reads links in HTML code, and not JavaScript, and it doesn't read + directories unless the HTTP server is feeding it directory listings. + You will need to take a close look at the htdig -vvv (or -vvvv) + output to see what htdig is finding, in and around the areas where + the desired links are supposed to be found in your HTML code, to see + if it's actually finding them. + See also question <a href="#q5.25">5.25</a>.</p> + + <strong>5.28. <a name="q5.28">When I run htdig or htmerge, I get a + "DB2 problem...: missing or empty key value specified" message.</a></strong><br> + <p>The most common cause of this error is that htdig or + htmerge rejected any documents that had been put in the + database, leaving an empty database. You need to find out the + reasons for the rejection of these documents. See questions + <a href="#q4.1">4.1</a>, <a href="#q5.25">5.25</a> and + <a href="#q5.27">5.27</a>.</p> + + <strong>5.29. 
<a name="q5.29">When I run htdig on my site, + it seems to go on and on without ending.</a></strong><br> + <p>There are some things that can cause htdig to run on without + ending, especially when indexing dynamic content (ASP, PHP, + SSI or CGI pages). This usually involves htdig getting caught + in an <em>infinite virtual hierarchy</em>. A sure sign of + this is if the current size of your database is much larger + than the total size of the site you are indexing, or if in the + verbose output of htdig (see question <a href="#q4.1">4.1</a>) + you see the same URLs come up again and again with only subtle + variations. In any case, you must figure out the reason htdig + keeps revisiting the same documents using different URLs, as + explained in question <a href="#q4.24">4.24</a>, and set your + <a href="attrs.html#exclude_urls">exclude_urls</a> and + <a href="attrs.html#bad_querystr">bad_querystr</a> attributes + appropriately to stop htdig from going down those paths. + </p> + + <strong>5.30. <a name="q5.30">Why does htsearch no longer recognize + the -c option when run from the web server?</a></strong><br> + <p>This was a security hole in 3.1.5 and older, and 3.2.0b3 and + older releases of ht://Dig. (See question <a href="#q2.1">2.1</a>.) + There's a compile-time macro you can set in htsearch.cc to disable + this security fix, but that's a bad idea because it reopens the hole. + This should only be done as a last recourse, when all other avenues + fail. The -c option was only intended for testing htsearch from the + command line, and not for use when calling htsearch on the web server. + Unfortunately, far too many users have needlessly latched onto this + option for CGI scripts. The preferred ways of specifying the config + file are as follows, in order of preference:</p> + <ol> + <li>use the "config" input parameter in your + <a href="hts_form.html">search form</a> + (see question <a href="#q4.2">4.2</a>). 
+ <li>if you need to get at files outside the default CONFIG_DIR, use a + wrapper script that redefines the CONFIG_DIR environment variable, + then use the config input parameter as above + (see question <a href="#q4.20">4.20</a>). + <li>use a wrapper script to force htsearch to use a specific config + file using the -c option. This is especially for cases where you + want to prevent the user from selecting other config files in your + CONFIG_DIR using the config input parameter. This should + be done by using the GET method to call the wrapper script, and in + this script you must unset the REQUEST_METHOD environment variable + and pass "$QUERY_STRING" as a single argument to htsearch. + (This safely gets around htsearch's test which disables -c.) + <li>configure and compile different htsearch binaries with different + compile-time definitions of CONFIG_DIR, so you can avoid wrapper + scripts altogether. + <li>define ALLOW_INSECURE_CGI_CONFIG in htsearch.cc and recompile + htsearch if all other approaches above fail for you. + </ol> + + <strong>5.31. <a name="q5.31">I've set a config attribute exactly + as documented but it seems to have no effect.</a></strong><br> + <p>There are a few fairly common reasons why this might happen:</p> + <ol> + <li>You may have a typo. Spelling matters, so make sure the attribute + name is spelled exactly as it is in the + <a href="attrs.html">documentation</a>. Misspelled attribute + definitions are silently ignored. This is because you're allowed + to make up your own attribute definitions for use by other attribute + definitions, as <strong>${myownattribute}</strong>. Also remember + to put the colon ("<strong>:</strong>") separator between the + attribute name and value in your definition. + <li>The attribute isn't supported in your version of the software. + The <a href="attrs.html">documented configuration attributes</a> + on the www.htdig.org web site are for the most recent + <strong>stable</strong> release. 
See questions + <a href="#q2.1">2.1</a> and <a href="#q2.7">2.7</a> for details. + If you're running an older version, or even a more recent beta + release, you may not have the same set of attributes to work with. + Consult the appropriate documentation, or upgrade to the current + release. + <li>You're not modifying the right configuration file. The default + configuration file is specified when you first configure ht://Dig + before compiling, but other configuration files can be specified + at run time, using the -c command-line option for most programs, + or the <strong>config</strong> input parameter for htsearch + (see question <a href="#q4.2">4.2</a>). + <li>You've got more than one definition of the attribute. Only the + last occurrence of an attribute in the configuration file is the + definition that's used for that attribute, overriding earlier + definitions. This also applies for nested configuration files that + are loaded in via the <a href="attrs.html#include">include</a> + directive, so check for other definitions in all included files. + Similarly for htsearch, look out for multiple definitions of input + parameters in your search forms, as mentioned in question + <a href="#q4.2">4.2</a> - these don't override each other but they + get combined with a Ctrl-A as separator, which may not be what you + want either. + <li>Your attribute definition is being "swallowed up" by an + incomplete multi-line definition above it. Remember that when a line + of an attribute definition ends with a single backslash + ("<strong>\</strong>") before the end of the line (without any + space after the backslash), then the following line is appended to + it as a continuation of the same attribute definition. For an + attribute definition that spans several lines, all lines but the + last must end with a backslash. If you want a backslash to go into + the attribute definition literally, it must be doubled-up, as + <strong>\\</strong>. 
+ <li>On a similar note, make sure your attribute definitions are all + terminated by a newline character. Beware of text editors that do + word wrapping. It may look like two separate lines on the screen, + when in fact you've got two attribute definitions on the same long + line, so the second is swallowed up as part of the first. + <li>Your attribute definition is being overridden by an htsearch + <a href="hts_form.html">CGI input parameter</a>. For example, + <a href="attrs.html#template_name">template_name</a> is ignored + if the <strong>format</strong> input parameter is defined. The + <a href="attrs.html#allow_in_form">allow_in_form</a> attribute + can define any number of new CGI input parameters that override + the attributes of the same name in your config file. + <li>Your attribute definition is being ignored or overridden + by a related attribute. Watch out for unexpected interactions + between different attributes. For instance, characters in + <a href="attrs.html#valid_punctuation">valid_punctuation</a> + are stripped out of words, so those characters may + not have the effect you want if you've added them to + <a href="attrs.html#extra_word_characters">extra_word_characters</a> + or + <a href="attrs.html#prefix_match_character">prefix_match_character</a>. + Also, + <a href="attrs.html#search_results_wrapper">search_results_wrapper</a> + will override + <a href="attrs.html#search_results_header">search_results_header</a> + and + <a href="attrs.html#search_results_footer">search_results_footer</a>, + but only if you've set up the wrapper file correctly. + <li>Watch out for possible "latent effects" of some attributes. For + example, when you change attributes used by htdig, they won't have + an immediate effect on entries already in the database, so you would + have to reindex your site before they take effect. Similarly, + attributes that affect how htfuzzy builds some of its databases + don't take effect until those databases are rebuilt. 
Another, more + subtle latent effect occurs with releases 3.1.6 and 3.2 betas: + when you interrupt htdig (i.e. with Control-C or a kill command), + it stores the list of currently queued URLs in db.log, in your + database directory, so that the next time you invoke htdig it can + resume the interrupted dig. A side-effect of this file is that if + you change some attributes like limit_urls_to or exclude_urls before + restarting, the URLs in the file are still taken as-is, having been + checked against the old settings of limit_urls_to or exclude_urls + before being queued. This might explain one reason htdig seems to + ignore your new settings of these. + </ol> + + <strong>5.32. <a name="q5.32">When I run htsearch, it gives a page + with an "Unable to read configuration file" message.</a></strong><br> + <p>The most common causes of this error are:</p> + <ul> + <li>Your configuration file name is misspelled in the "config" + input parameter of your search form, or you have two definitions + of this parameter (see question <a href="#q4.2">4.2</a>). + <li>You didn't install your configuration file in the directory + defined by the CONFIG_DIR compile-time Makefile variable + (see also question <a href="#q4.20">4.20</a>). This is where + htsearch will look for the configuration file specified by the + "config" input parameter. + <li>The configuration file is not readable by the user ID under + which your web server, and thus htsearch, runs. Similarly, + if the directories from CONFIG_DIR up to the root directory + are not executable by this same user ID, htsearch won't be + able to access the configuration files. + </ul> + + <strong>5.33. <a name="q5.33">How can I find out which version + of ht://Dig I have installed?</a></strong><br> + <p>You should always check which version of ht://Dig you're + running, before you report any problems, or even if you + suspect a problem. 
You can find out the version number of an + installed ht://Dig package by running the command:</p> + <blockquote> + <code>htdig -\? | head</code> + </blockquote> + <p>(or use "more" if you don't have a "head" command). The + full version number appears on the third line of output, + after "This program is part of ht://Dig", and it should also + include the snapshot date if you're running a pre-release + snapshot. Always include this full version number with any + bug report or problem report on a mailing list. You can save + yourself and others a lot of grief by being certain of which + version you're running, especially if you've installed more than + one. If you're running ht://Dig from an RPM package, you should + also report the package version and release number, which you + can determine with the command "<code>rpm -q htdig</code>", + and mention where you obtained the package. This will alert + us to the idiosyncrasies and/or patches in a particular RPM + package. Also, if you've applied any patches yourself (see + question <a href="#q2.5">2.5</a>) please mention which ones. + See also question <a href="#q1.8">1.8</a>, on reporting bugs + or configuration problems.</p> + + <strong>5.34. <a name="q5.34">When running htdig, I get "Error (0): + PDF file is damaged - attempting to reconstruct xref table..."</a></strong><br> + <p>This message comes from the pdftotext utility, when a PDF file + has been truncated. Find the largest PDF file on the site you're + indexing, and set max_doc_size to at least that size (see question + <a href="#q5.2">5.2</a>). If you need to track down which PDF is + causing the error, try running "htdig -i -v > log.txt 2>&1" so you + can see which URL is being indexed when the error occurs. The output + redirects in that command combine stdout (where htdig's output goes) + and stderr (where pdftotext's error messages go) into one output + stream. 
If you're using acroread to index PDF files, the error + message for a truncated PDF file is simply "Could not repair file." + It's also possible to get errors like this from PDF files that are + smaller than max_doc_size, if they're already truncated or corrupted + on the server.</p> + + <strong>5.35. <a name="q5.35">When running htdig on Mandrake Linux, + I get "host not found" and "no server running" errors.</a></strong><br> + <p>The default htdig.conf configuration in Mandrake's RPM package + of htdig very stupidly enables the + <a href="attrs.html#local_urls_only">local_urls_only</a> attribute + by default, which means you can only index a limited set of files + on the local server. Anything else, where htdig would normally fall + back to using HTTP, will fail. To make matters worse, they put a very + misleading comment above that attribute setting, which throws users + off track. This attribute is useful in certain circumstances where + you never want htdig to fall back to HTTP, but enabling it by default + was a very bad judgement call on Mandrake's part.</p> + + <strong>5.36. <a name="q5.36">When I run htsearch, it gives me the + list of matching documents, but no header or footer.</a></strong><br> + <p>The header and footer typically contain the followup search + form, an indication of the total number of matches, and buttons + to other pages of matches if the results don't fit on one + page. If these don't show up, it could be that in attempting + to customize these (see question <a href="#q4.2">4.2</a>), + you removed them or rendered them unusable. Even if you didn't + customize them, make sure you installed the + <a href="attrs.html#search_results_header">search_results_header</a> + and + <a href="attrs.html#search_results_footer">search_results_footer</a> + files (or the + <a href="attrs.html#search_results_wrapper">search_results_wrapper</a> + file) in the correct location (where you told ht://Dig they'd be + when you configured prior to compiling). 
Also make sure they + have read permission for the user ID under which htsearch runs, + and all directories leading up to these template files are + searchable (i.e. executable) by htsearch, or it won't be able + to open the files.</p> + + <p>This is the opposite problem of that described in question + <a href="#q5.11">5.11</a>. If htsearch displays nothing at + all, you may have both problems, or you may have no matches or + a boolean query syntax error while the + <a href="attrs.html#nothing_found_file">nothing_found_file</a> + or <a href="attrs.html#syntax_error_file">syntax_error_file</a> + is missing or unreadable.</p> + + <strong>5.37. <a name="q5.37">When I index files with doc2html.pl, + it fails with the "UNABLE to convert" error.</a></strong><br> + <p>This is an indication that doc2html.pl wasn't configured + properly. Carefully follow all the directions for installation + in the DETAILS file that comes with the script. In addition to + installing doc2html.pl, you must:</p> + <ul> + <li>Install xpdf and check that pdftotext and pdfinfo work from + the command line, + <li>Configure pdf2html.pl to use pdftotext and pdfinfo and check + that it works from the command line, + <li>Configure doc2html.pl to use pdf2html.pl and check that it + works from the command line: +<pre>doc2html.pl /full/path/to/sample/filename.pdf "application/pdf" url</pre> + </ul> + <p>You should repeat a similar set of steps to configure and test + doc2html.pl for other document types, such as Word, RTF and + Excel. See also questions <a href="#q4.8">4.8</a>, + <a href="#q4.9">4.9</a> and <a href="#q5.39">5.39</a>.</p> + + <strong>5.38. 
<a name="q5.38">Why do my searches find search terms + in pathnames, or how do I prevent matching filenames?</a></strong><br> + <p>htdig doesn't normally add the URL components to the index + itself, but when you index a directory where the filenames are + used as link description text (such as an automatic DirectoryIndex + created by Apache's mod_autoindex) then these link descriptions + get indexed, carrying the weight assigned to them by the + <a href="attrs.html#description_factor">description_factor</a> + attribute. Thus, a search for a filename will match this link + description, and the file will show up in search results. + To avoid that, make sure your DirectoryIndexes don't get indexed + as detailed in question <a href="#q4.23">4.23</a>.</p> + + <p>Conversely, there is no way to force htdig to index URL + components so that a search for a file name will yield a match + on that file, unless you index an HTML file (or several) containing + links to all the files you want, where the link description text + does contain the full URL or the pathname components you want.</p> + + <strong>5.39. <a name="q5.39">I set up an external parser but I still + can't index Word/Excel/PowerPoint/PDF documents.</a></strong><br> + <p>You probably need to carefully re-read and follow questions + <a href="#q4.8">4.8</a>, <a href="#q4.9">4.9</a>, + <a href="#q5.25">5.25</a> and <a href="#q5.27">5.27</a>. + When you can't index documents with an external parser or converter, + there are three main issues, or points of failure, that you need + to resolve. You need to figure out at which of the three stages the + process is failing, and focus on that stage to get to the bottom of + why it's not working. You need to run htdig with + anywhere from 1 to 4 -v options, to get the debugging output you + need to see where it's failing and why. 
This may be an iterative + process, if htdig is failing at more than one stage: you might fix + one problem only to run into another.</p> + + <ol> + <li>Is htdig actually finding links to the PDF, Word, etc. documents + you want to index? Make sure you're not making false assumptions + about how htdig finds these (questions <a href="#q5.25">5.25</a> + and <a href="#q5.18">5.18</a>), and then find out how htdig is + looking at the links in your HTML files to see if it's ignoring + or rejecting links to your externally parsed documents (questions + <a href="#q4.1">4.1</a> and <a href="#q5.27">5.27</a>).<br><br> + <li>If it is finding and accepting the links to these documents, is + it correctly fetching them and passing them on to the appropriate + external converter to be able to index them? Look at htdig -vvv + output, around the time it tries to fetch one of these, and see + what it does next. Does the file size look right? Are there any + error messages around there? If the external converter isn't even + being called, take a close look at your + <a href="attrs.html#external_parsers">external_parsers</a> + attribute setting to make sure it's correct (see question + <a href="#q5.31">5.31</a>).<br><br> + <li>If it is attempting to convert them, is the external converter + doing what it should, to feed some indexable text back into htdig's + parser? You can also try htdig -vvvv (4 -v options) to see if it's + actually parsing individual words from any of these. If this is + too much output to wade through, try setting + <a href="attrs.html#start_url">start_url</a> to the URL + of a single document that you want to test, so you can look in + detail at what htdig does with it. You can also try running the + external converter manually on one of these documents to see + what it spits out. See question <a href="#q5.37">5.37</a>. + Make sure your documents actually contain indexable text. 
Some + PDFs are nothing but scanned images of pages, so they look like + text but are just images with no computer-readable text. + </ol> + + <br> + + <hr noshade size=4> + Last modified: $Date: 2004/05/28 13:15:16 $ +<br> + <a href="http://sourceforge.net/"> + <img src="http://sourceforge.net/sflogo.php?group_id=4593&type=1" width="88" height="31" border="0" alt="SourceForge Logo"></a> + </body> +</html>