Diffstat (limited to 'debian/htdig/htdig-3.2.0b6/htdoc/FAQ.html')
-rw-r--r-- debian/htdig/htdig-3.2.0b6/htdoc/FAQ.html | 2590
1 files changed, 2590 insertions, 0 deletions
diff --git a/debian/htdig/htdig-3.2.0b6/htdoc/FAQ.html b/debian/htdig/htdig-3.2.0b6/htdoc/FAQ.html
new file mode 100644
index 00000000..9f2db468
--- /dev/null
+++ b/debian/htdig/htdig-3.2.0b6/htdoc/FAQ.html
@@ -0,0 +1,2590 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
+<html>
+ <head>
+ <title>ht://Dig Frequently Asked Questions</title>
+ <link rel="stylesheet" href="css/htdig.css">
+ </head>
+ <body bgcolor="#eef7ff">
+ <h1>Frequently Asked Questions</h1>
+ <p>
+ ht://Dig Copyright &copy; 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br>
+ Please see the file <a href="COPYING">COPYING</a> for
+ license information.
+ </p>
+ <hr noshade size=4>
+ <p class="main">This FAQ is compiled by the ht://Dig developers and the
+ most recent version is available at &lt;<a
+ href="http://www.htdig.org/FAQ.html">http://www.htdig.org/FAQ.html</a>&gt;.
+ Questions (and answers!) are greatly appreciated.
+ Please send questions and/or answers to the ht://Dig user
+ mailing list at: &lt;<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general@lists.sourceforge.net</a>&gt;.
+ </p>
+ <h2>Questions</h2>
+
+ <h3>1. General</h3>
+ 1.1. <a href="#q1.1">Can I search the internet with ht://Dig?</a><br>
+ 1.2. <a href="#q1.2">Can I index the internet with ht://Dig?</a><br>
+ 1.3. <a href="#q1.3">What's the difference between htdig and
+ ht://Dig?</a><br>
+ 1.4. <a href="#q1.4">I sent mail to Andrew or Geoff or
+ Gilles, but I never got a response!</a><br>
+ 1.5. <a href="#q1.5">I sent a question to the mailing list but I
+ never got a response!</a><br>
+ 1.6. <a href="#q1.6">I have a great idea/patch for ht://Dig!</a><br>
+ 1.7. <a href="#q1.7">Is ht://Dig Y2K compliant?</a><br>
+ 1.8. <a href="#q1.8">I think I found a bug. What should I do?</a><br>
+ 1.9. <a href="#q1.9">Does ht://Dig support phrase or near
+ matching?</a><br>
+ 1.10. <a href="#q1.10">What are the practical and/or theoretical
+ limits of ht://Dig?</a><br>
+ 1.11. <a href="#q1.11">Do any ISPs offer ht://Dig as part of
+ their web hosting services?</a><br>
+ 1.12. <a href="#q1.12">Can I use ht://Dig on a commercial website?</a><br>
+ 1.13. <a href="#q1.13">Why do you use a non-free product to
+ index PDF files?</a><br>
+ 1.14. <a href="#q1.14">Why do you have all those SourceForge
+ logos on your website?</a><br>
+ 1.15. <a href="#q1.15">My question isn't answered here. Where should I
+ go for help?</a><br>
+ 1.16. <a href="#q1.16">Why do the developers get annoyed when
+ I e-mail questions directly to them rather than the mailing list?</a><br>
+ 1.17. <a href="#q1.17">Why do replies to messages on the
+ mailing list only go to the sender and not to the list?</a><br>
+ 1.18. <a href="#q1.18">Can I use ht://Dig to index and search
+ an SQL database?</a><br>
+
+ <hr noshade size=2>
+
+ <h3>2. Getting ht://Dig</h3>
+ 2.1. <a href="#q2.1">What's the latest version of ht://Dig?</a><br>
+ 2.2. <a href="#q2.2">Are there binary distributions of ht://Dig?</a><br>
+ 2.3. <a href="#q2.3">Are there mirror sites for ht://Dig?</a><br>
+ 2.4. <a href="#q2.4">Is ht://Dig available by ftp?</a><br>
+ 2.5. <a href="#q2.5">Are patches around to upgrade between
+ versions?</a><br>
+ 2.6. <a href="#q2.6">Is there a Windows 95/98/2000/NT
+ version of ht://Dig?</a><br>
+ 2.7. <a href="#q2.7">Where can I find the documentation for my
+ version of ht://Dig?</a><br>
+
+ <hr noshade size=2>
+
+ <h3>3. Compiling</h3>
+ 3.1. <a href="#q3.1">When I compile ht://Dig I get an error
+ about libht.a.</a><br>
+ 3.2. <a href="#q3.2">I get an error about -lg</a><br>
+ 3.3. <a href="#q3.3">I'm compiling on Digital Unix and I get
+ messages about "unresolved" and "db_open."</a><br>
+ 3.4. <a href="#q3.4">I'm compiling on FreeBSD and I get lots
+ of messages about '___error' being unresolved.</a><br>
+ 3.5. <a href="#q3.5">I'm compiling on HP/UX and I get a complaint about
+ "Large Files not supported."</a><br>
+ 3.6. <a href="#q3.6">I'm compiling on Solaris and when I run the
+ programs I get complaints about not finding libstdc++.</a><br>
+ 3.7. <a href="#q3.7">I'm compiling on IRIX and I'm having
+ database problems when I run the program.</a><br>
+ 3.8. <a href="#q3.8">I'm compiling with gcc 3.2 and getting
+ all sorts of warnings/errors about ostream and such.</a><br>
+
+ <hr noshade size=2>
+
+ <h3>4. Configuration</h3>
+ 4.1. <a href="#q4.1">How come I can't index my site?</a><br>
+ 4.2. <a href="#q4.2">How can I change the output format of
+ htsearch?</a><br>
+ 4.3. <a href="#q4.3">How do I index pages that start with '~'?</a><br>
+ 4.4. <a href="#q4.4">Can I use multiple databases?</a><br>
+ 4.5. <a href="#q4.5">OK, I can use multiple databases. Can I
+ merge them into one?</a><br>
+ 4.6. <a href="#q4.6">Wow, ht://Dig eats up a lot of disk
+ space. How can I cut down?</a><br>
+ 4.7. <a href="#q4.7">Can I use SSI or other CGIs in my
+ htsearch results?</a><br>
+ 4.8. <a href="#q4.8">How do I index Word, Excel, PowerPoint
+ or PostScript documents?</a><br>
+ 4.9. <a href="#q4.9">How do I index PDF files?</a><br>
+ 4.10. <a href="#q4.10">How do I index documents in other
+ languages?</a><br>
+ 4.11. <a href="#q4.11">How do I get rotating banner ads in
+ search results?</a><br>
+ 4.12. <a href="#q4.12">How do I index numbers in documents?</a><br>
+ 4.13. <a href="#q4.13">How can I call htsearch from a hypertext
+ link, rather than from a search form?</a><br>
+ 4.14. <a href="#q4.14">How do I restrict a search to only meta
+ keywords entries in documents?</a><br>
+ 4.15. <a href="#q4.15">Can I use meta tags to prevent htdig from
+ indexing certain files?</a><br>
+ 4.16. <a href="#q4.16">How do I get htsearch to use the star image
+ in a different directory than the default /htdig?</a><br>
+ 4.17. <a href="#q4.17">How do I get htdig or htsearch to rewrite
+ URLs in the search results?</a><br>
+ 4.18. <a href="#q4.18">What are all the options in
+ htdig.conf, and are there others?</a><br>
+ 4.19. <a href="#q4.19">How do I get more than 10 pages of
+ 10 search results from htsearch?</a><br>
+ 4.20. <a href="#q4.20">How do I restrict a search to only
+ certain subdirectories or documents?</a><br>
+ 4.21. <a href="#q4.21">How can I allow people to search
+ while the index is updating?</a><br>
+ 4.22. <a href="#q4.22">How can I get htdig to ignore the
+ robots.txt file or meta robots tags?</a><br>
+ 4.23. <a href="#q4.23">How can I get htdig not to index
+ some directories, but still follow links?</a><br>
+ 4.24. <a href="#q4.24">How can I get rid of duplicates in
+ search results?</a><br>
+ 4.25. <a href="#q4.25">How can I change the scores in
+ search results, and what are the defaults?</a><br>
+ 4.26. <a href="#q4.26">How can I get htdig not to index
+ JavaScript code or CSS?</a><br>
+
+ <hr noshade size=2>
+
+ <h3>5. Troubleshooting</h3>
+ 5.1. <a href="#q5.1">I can't seem to index more than X documents
+ in a directory.</a><br>
+ 5.2. <a href="#q5.2">I can't index PDF files.</a><br>
+ 5.3. <a href="#q5.3">When I run "rundig," I get a message about
+ "DATABASE_DIR" not being found.</a><br>
+ 5.4. <a href="#q5.4">When I run htmerge, it stops with an "out
+ of diskspace" message.</a><br>
+ 5.5. <a href="#q5.5">I have problems running rundig from cron
+ under Linux.</a><br>
+ 5.6. <a href="#q5.6">When I run htmerge, it stops with an
+ "Unexpected file type" message.</a><br>
+ 5.7. <a href="#q5.7">When I run htsearch, I get lots of Internal
+ Server Errors (#500).</a><br>
+ 5.8. <a href="#q5.8">I'm having problems with indexing words
+ with accented characters.</a><br>
+ 5.9. <a href="#q5.9">When I run htmerge, it stops with a
+ "Word sort failed" message.</a><br>
+ 5.10. <a href="#q5.10">When htsearch has a lot of matches, it runs
+ extremely slowly.</a><br>
+ 5.11. <a href="#q5.11">When I run htsearch, it gives me a count of
+ matches, but doesn't list the matching documents.</a><br>
+ 5.12. <a href="#q5.12">I can't seem to index documents with names
+ like left_index.html with htdig.</a><br>
+ 5.13. <a href="#q5.13">I get Premature End of Script Headers errors
+ when running htsearch.</a><br>
+ 5.14. <a href="#q5.14">I get Segmentation faults when running
+ htdig, htsearch or htfuzzy.</a><br>
+ 5.15. <a href="#q5.15">Why does htdig 3.1.3 mangle URL parameters
+ that contain bare "&amp;" characters?</a><br>
+ 5.16. <a href="#q5.16">When I run htmerge, it stops with an
+ "Unable to open word list file '.../db.wordlist'" message.</a><br>
+ 5.17. <a href="#q5.17">When using Netscape, htsearch always returns the
+ "No match" page.</a><br>
+ 5.18. <a href="#q5.18">Why doesn't htdig follow links to other
+ pages in JavaScript code?</a><br>
+ 5.19. <a href="#q5.19">When I run htsearch from the web server,
+ it returns a bunch of binary data.</a><br>
+ 5.20. <a href="#q5.20">Why are the betas of 3.2 so slow at indexing?</a><br>
+ 5.21. <a href="#q5.21">Why does htsearch use ";" instead of
+ "&amp;" to separate URL parameters for the page buttons?</a><br>
+ 5.22. <a href="#q5.22">Why does htsearch show the
+ "&amp;" character as "&amp;amp;" in search results?</a><br>
+ 5.23. <a href="#q5.23">I get Internal Server or Unrecognized
+ character errors when running htsearch.</a><br>
+ 5.24. <a href="#q5.24">I took some settings out of
+ my htdig.conf but they're still set.</a><br>
+ 5.25. <a href="#q5.25">When I run htdig on my site,
+ it misses entire directories.</a><br>
+ 5.26. <a href="#q5.26">What do all the numbers and symbols
+ in the htdig -v output mean?</a><br>
+ 5.27. <a href="#q5.27">Why is htdig rejecting some of the
+ links in my documents?</a><br>
+ 5.28. <a href="#q5.28">When I run htdig or htmerge, I get a
+ "DB2 problem...: missing or empty key value specified" message.</a><br>
+ 5.29. <a href="#q5.29">When I run htdig on my site,
+ it seems to go on and on without ending.</a><br>
+ 5.30. <a href="#q5.30">Why does htsearch no longer recognize
+ the -c option when run from the web server?</a><br>
+ 5.31. <a href="#q5.31">I've set a config attribute exactly
+ as documented but it seems to have no effect.</a><br>
+ 5.32. <a href="#q5.32">When I run htsearch, it gives a page
+ with an "Unable to read configuration file" message.</a><br>
+ 5.33. <a href="#q5.33">How can I find out which version
+ of ht://Dig I have installed?</a><br>
+ 5.34. <a href="#q5.34">When running htdig, I get "Error (0):
+ PDF file is damaged - attempting to reconstruct xref table..."</a><br>
+ 5.35. <a href="#q5.35">When running htdig on Mandrake Linux,
+ I get "host not found" and "no server running" errors.</a><br>
+ 5.36. <a href="#q5.36">When I run htsearch, it gives me the
+ list of matching documents, but no header or footer.</a><br>
+ 5.37. <a href="#q5.37">When I index files with doc2html.pl,
+ it fails with the "UNABLE to convert" error.</a><br>
+ 5.38. <a href="#q5.38">Why do my searches find search terms
+ in pathnames, or how do I prevent matching filenames?</a><br>
+ 5.39. <a href="#q5.39">I set up an external parser but I still
+ can't index Word/Excel/PowerPoint/PDF documents.</a><br>
+
+ <hr noshade size=4>
+ <h2>Answers</h2>
+
+ <h3>1. General</h3>
+ <strong>1.1. <a name="q1.1">Can I search the internet with
+ ht://Dig?</a></strong><br>
+ <p>No, ht://Dig is a system for indexing and searching a
+ finite (not necessarily small) set of sites or an intranet. It
+ is not meant to replace any of the many internet-wide search
+ engines.</p>
+
+ <strong>1.2. <a name="q1.2">Can I index the internet with
+ ht://Dig?</a></strong><br>
+ <p>No, as above, ht://Dig is not meant as an
+ internet-wide search engine. While there is
+ <em>theoretically</em> nothing to stop you from indexing as
+ much as you wish, practical considerations (e.g. time, disk
+ space, memory, etc.) will limit this.</p>
+
+ <strong>1.3. <a name="q1.3">What's the difference between htdig and
+ ht://Dig?</a></strong><br>
+ <p>"ht://Dig" is the name of the complete package, which consists of
+ several programs. One of them is called "htdig"; it performs the
+ "digging" or indexing of the web pages. Of course an index doesn't do
+ you much good without programs to sort it, search through it, etc.</p>
+
+ <strong>1.4. <a name="q1.4">I sent mail to Andrew or Geoff
+ or Gilles, but I never got a response!</a></strong><br>
+ <p>Andrew no longer does much work on ht://Dig. He has started a
+ company, called <a href="http://www.contigo.com/">Contigo
+ Software</a> and is quite busy with that. To contact any of the
+ current developers, send mail to &lt;<a
+ href="mailto:htdig-dev@lists.sourceforge.net">htdig-dev</a>&gt;.
+ This list is intended primarily for the discussion of current
+ and future development of the software.</p>
+
+ <p>Geoff and Gilles are currently the maintainers of
+ ht://Dig, but they are both volunteers. So while they do
+ read all the e-mail they receive, they may not respond
+ immediately. Questions about ht://Dig in general, and especially
+ questions or requests for help in configuring the software,
+ should be posted to the &lt;<a
+ href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>&gt;
+ mailing list. When posting a followup to a message on the
+ list, you should use the "reply to all" or "group reply"
+ feature of your mail program, to make sure the mailing list
+ address is included in the reply, rather than replying only
+ to the author of the message.
+ See also question <a href="#q1.16">1.16</a> and the
+ <a href="http://www.htdig.org/mailarchive.html">mailing list</a>
+ page.</p>
+
+ <strong>1.5. <a name="q1.5">I sent a question to the mailing list but I
+ never got a response!</a></strong><br>
+ <p>Development of ht://Dig is done by volunteers. Since we all
+ have other jobs, it may take a while before someone gets back
+ to you. Please be patient and don't hound the volunteers with
+ direct or repeated requests. If you don't get a response after
+ 3 or 4 days, then a reminder may help.
+ See also question <a href="#q1.16">1.16</a>.</p>
+
+ <strong>1.6. <a name="q1.6">I have a great idea/patch for
+ ht://Dig!</a></strong><br>
+ <p>Great! Development of ht://Dig continues through suggestions
+ and improvements from users. If you have an idea (or even better,
+ a patch), please send it to the ht://Dig mailing list so others
+ can use it. For suggestions on how to submit patches, please check
+ the <a href="dev/patches.html">Guidelines for
+ Patch Submissions</a>. If you'd like to make a feature request,
+ you can do so through the <a href="bugs.html">ht://Dig bug
+ database</a>.</p>
+
+ <strong>1.7. <a name="q1.7">Is ht://Dig Y2K compliant?</a></strong><br>
+ <p>
+ ht://Dig should be Y2K compliant since it never <em>stores</em> dates as
+ two-digit years. Under ht://Dig's copyright (GPL), there is no warranty
+ whatsoever as permitted by law. If you would like an iron-clad,
+ legally-binding guarantee, feel free to check the source code
+ itself. Versions prior to 3.1.2 did have a problem with the parsing
+ of the Last-Modified header returned by the HTTP server, which
+ caused incorrect dates to be stored for documents modified after
+ February 28, 2000 (yes, it didn't recognize 2000 as a leap year).
+ Versions prior to 3.1.5 didn't correctly handle servers that return
+ two digit years in the Last-Modified header, for years after 99.
+ These problems are fixed in the current release.
+ If you discover something else, please let us know!
+ </p>
+
+ <strong>1.8. <a name="q1.8">I think I found a bug. What should I
+ do?</a></strong><br>
+ <p>Well, there are probably bugs out there. You have two options
+ for bug-reporting. You can either mail the ht://Dig mailing list
+ at &lt;<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general@lists.sourceforge.net</a>&gt; or
+ better yet, report it to the <a href="bugs.html">bug
+ database</a>, which ensures it won't
+ become lost amongst all of the other mail on the list.
+ Please try to include as much information as possible, including
+ the version of ht://Dig (see question <a href="#q5.33">5.33</a>),
+ the OS, and anything else that might be helpful.
+ Often, running the programs with one "-v" or more
+ (e.g. "-vvv") gives useful debugging information.
+ If you are unsure whether the problem is a bug or a configuration
+ problem, you should discuss the problem on
+ &lt;<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>&gt;
+ (after carefully reading the FAQ and searching the
+ <a href="http://www.htdig.org/mailarchive.html">mail archive</a>
+ and <a href="#q2.5">patch archive</a>,
+ of course)
+ to sort out what it is. The mailing list has a wider audience, so
+ you're more likely to get help with configuration problems there
+ than by reporting them to the bug database.
+ </p>
+
+ <p>Whether reporting problems to the bug database or mailing
+ list, we cannot stress enough the importance of
+ <strong>always</strong> indicating <strong>which version of
+ ht://Dig you are running</strong>.
+ See question <a href="#q5.33">5.33</a>. There
+ are still a lot of users, ISPs and software distributors using
+ older versions, and there have been a lot of bug fixes and
+ new features added in recent versions. Knowing which version
+ you're running is absolutely essential in helping to find a
+ solution. If you're unsure if your version is current, or what
+ fixes and features have been added in more recent versions,
+ please see the <a href="RELEASE.html">
+ release notes</a>. See also question <a href="#q2.1">2.1</a>.</p>
+
+ <strong>1.9. <a name="q1.9">Does ht://Dig support phrase or near
+ matching?</a></strong><br>
+ <p>Phrase searching has been added for the 3.2 release,
+ which is currently in the beta phase
+ (<a href="http://www.htdig.org/files/htdig-3.2.0b6.tar.gz">3.2.0b6</a>
+ as of this writing). Near or proximity matching will probably be added
+ in a future beta.
+ </p>
+
+ <strong>1.10. <a name="q1.10">What are the practical and/or theoretical
+ limits of ht://Dig?</a></strong><br>
+ <p>The code itself doesn't put any real limit on the number of
+ pages. There are several sites in the hundreds of thousands
+ of pages. As for practical limits, it depends a lot on how
+ many pages you plan on indexing. Some operating systems limit
+ files to 2 GB in size, which can become a problem with a large
+ database. There are also slightly different limits to each of
+ the programs. Right now htmerge performs a sort on the words
+ indexed. Most sort programs use a fair amount of RAM and
+ temporary disk space as they assemble the sorted list. The
+ htdig program stores a fair amount of information about the
+ URLs it visits, in part to only index a page once. This takes
+ a fair amount of RAM. With cheap RAM, it never hurts to throw
+ more memory at indexing larger sites. In a pinch, swap will
+ work, but it obviously really slows things down.</p>
+
+ <p>The 3.2 development code helps with many of these
+ limitations. In particular, it generates the databases on the
+ fly, which means you don't have to sort them before
+ searching. Additionally, the new databases are compressed
+ significantly, making them usually around 50% of the size of
+ those in previous versions.</p>
+
+ <strong>1.11. <a name="q1.11">Do any ISPs offer ht://Dig as part of
+ their web hosting services?</a></strong><br>
+ <p>Yes. A list of such ISPs is <a href="isp.html">available
+ here</a>.
+ </p>
+
+ <strong>1.12. <a name="q1.12">Can I use ht://Dig on a
+ commercial website?</a></strong><br>
+ <p>Sure! The <a href="COPYING">GNU Library General Public License (LGPL)</a> has no
+ restrictions on use. So you are free to use ht://Dig however you
+ want on your website, personal files, etc. The license only
+ restricts distribution. So if you're planning on a
+ commercial software product that includes ht://Dig, you will
+ have to provide source code including any modifications upon
+ request.
+ </p>
+
+ <strong>1.13. <a name="q1.13">Why do you use a non-free
+ product to index PDF files?</a></strong><br>
+ <p>
+ We don't. You <em>can</em> use the &quot;acroread&quot;
+ program to index PDF files, but this is no longer
+ recommended. Initially this program was the only reliable
+ way to extract data from PDF files. However, the <a
+ href="http://www.foolabs.com/xpdf/">xpdf package</a> is a
+ reliable, free software package for indexing and viewing PDF
+ files. See question <a href="#q4.9">4.9</a> for details on
+ using xpdf to index PDF files. We do not advocate using
+ acroread any longer because it is a proprietary product.
+ Additionally it is no longer reliable at extracting data.
+ </p>
+
+ <strong>1.14. <a name="q1.14">Why do you have all those SourceForge
+ logos on your website?</a></strong><br>
+ <p><a href="http://sourceforge.net/">SourceForge</a> is a
+ new service for open source software. You can host your
+ project on SourceForge servers and use many of their
+ services like bug-tracking and the like. The ht://Dig
+ project currently uses SourceForge for a mirror of the main
+ website at <a
+ href="http://htdig.sourceforge.net/">htdig.sourceforge.net</a>
+ as well as a mirror of ht://Dig releases and contributed
+ work.
+ </p>
+
+ <strong>1.15. <a name="q1.15">My question isn't answered here.
+ Where should I go for help?</a></strong><br>
+ <p>
+ Before you go anywhere else, think of other ways of phrasing your
+ question. Many times people have questions that are very similar to
+ other FAQ entries, and while we try to phrase the queries in the FAQ closely to
+ the most common questions, we obviously can't get them all! The next
+ place to check is the documentation itself. In particular, take a
+ look at the list of configuration attributes, particularly the list <a
+ href="cf_byname.html">by name</a> and <a
+ href="cf_byprog.html">by program</a>. There are a
+ lot of them, but chances are there's something that might fit your needs.
+ You should also take a close look at all of
+ <a href="htsearch.html">htsearch</a>'s
+ documentation, especially the section "HTML form" which describes
+ all the CGI input parameters available for controlling the search,
+ including limiting the search to certain subdirectories.
+ You can find the answer yourself to almost all "how can I..."
+ questions by exploring what the various configuration attributes
+ and search form input parameters can do.
+ Also have a look at our collection of
+ <a href="http://www.htdig.org/contrib/guides.html">Contributed Guides</a>
+ for help on things like
+ <a href="http://www.htdig.org/files/contrib/guides/htmlhelp.html">HTML
+ forms</a> and CGI, tutorials on installing, configuring, using, and
+ internationalizing ht://Dig, as well as using PHP with htsearch.
+ </p>
+ <p>
+ Finally, if you've exhausted all the online documentation, there's the
+ <a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a> mailing list.
+ There are hundreds of users subscribed and chances are good that someone
+ has had a similar problem before or can suggest a solution.
+ </p>
+
+ <strong>1.16. <a name="q1.16">Why do the developers get annoyed when
+ I e-mail questions directly to them rather than the mailing list?</a></strong><br>
+ <p>The <a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>
+ mailing list exists for dealing with questions about the
+ software, its installation, configuration, and problems with
+ it. E-mailing the developers directly circumvents this forum
+ and its benefits. Most annoyingly, it puts the onus on an
+ individual to answer, even if that individual is not the best or
+ most qualified person to answer. This is not a one-man show. It
+ also circumvents the <a href="http://www.htdig.org/mailarchive.html">archiving
+ mechanism</a> of the mailing list,
+ so not only do subscribers not see these private messages
+ and replies, but future users who may run into the exact same
+ problems won't see them. Remember that the developers are all
+ volunteers, and they don't work for free for your benefit alone.
+ They volunteer for the benefit of the whole ht://Dig user
+ community, so don't expect extra support from them outside of
+ that community. See also questions <a href="#q1.4">1.4</a>
+ and <a href="#q1.5">1.5</a>.</p>
+
+ <p>Note also that when you reply to a message on the list, you
+ should make sure the reply gets on the list as well, provided your
+ reply is still on-topic. See question <a href="#q1.17">1.17</a>
+ below.</p>
+
+ <strong>1.17. <a name="q1.17">Why do replies to messages on the
+ mailing list only go to the sender and not to the list?</a></strong><br>
+ <p>The simple answer is that, unlike some mailing lists, the
+ lists on SourceForge don't force replies back on the list. This
+ is actually a good thing, because you can reply to the sender
+ directly if you want to, or you can use your mail program's
+ "reply to all" capability (sometimes called "group reply")
+ to reply to the mailing list as well. It does mean you have to
+ think before you post a reply, but some would argue that this
+ is a good thing too. There are some compelling reasons to try to
+ keep on-topic discussions on the list, though (see questions
+ <a href="#q1.16">1.16</a> and <a href="#q1.4">1.4</a> above).</p>
+
+ <p>The technical answer is
+ <a href="http://sourceforge.net/docman/display_doc.php?docid=6693&amp;group_id=1">
+ SourceForge's policy on Reply-To: munging</a>, where you'll
+ find all the gory details about the pros and cons of the two
+ common ways of setting up a mailing list, and why SourceForge
+ turns off Reply-To munging. It so happens that the ht://Dig
+ maintainers agree with SourceForge's policy on this, and we would
+ keep it even if we did have a say in the matter. So, counterarguments to this
+ policy are rather moot, and it would be better not to waste
+ any more mailing list bandwidth debating them. (We've heard
+ all the arguments anyway.)</p>
+
+ <strong>1.18. <a name="q1.18">Can I use ht://Dig to index and search
+ an SQL database?</a></strong><br>
+ <p>You can if your database has a web-based front end that can
+ be "spidered" by ht://Dig. The requirement is that every search
+ result must resolve to a unique URL which can be accessed via
+ HTTP. The htdig program uses these URLs, which you feed it via
+ the <a href="attrs.html#start_url">start_url</a> attribute, to
+ fetch and index each page of information. The search results
+ will then give a list of URLs for all pages that match the
+ search terms. If you don't have such a front end to your
+ database, or the search results must be given as something
+ other than URLs, then ht://Dig is probably not the best way of
+ dealing with this problem: you may be better off using an SQL
+ query engine that works directly on your own database, rather
+ than building a separate ht://Dig database for searching.</p>
+
+ <p>Ted Stresen-Reuter had the following tips: "In my case,
+ because I like htdig's ability to rank results (and that
+ ranking can be modified), I created an index page that simply
+ walks through each record and indexes each record (with
+ <em>next</em> and <em>previous</em> links so the spider can
+ read all the records). And then I do one other thing: I make
+ the <code>&lt;title&gt;</code> tag start with the unique ID
+ of each record. Then, when I'm parsing the search results, I
+ do a lookup on the database using the title tag as the key."</p>
+
+ <hr noshade size=2>
+
+ <h3>2. Getting ht://Dig</h3>
+ <strong>2.1. <a name="q2.1">What's the latest version of ht://Dig?</a></strong><br>
+ <p>The latest version is 3.1.6 as of this writing. A beta
+ version of the 3.2 code,
+ <a href="http://www.htdig.org/files/htdig-3.2.0b6.tar.gz">3.2.0b6</a>,
+ is also available, for those who wish to test it.
+ You can find out about the latest version by reading the
+ <a href="RELEASE.html">release
+ notes</a>.</p>
+
+ <p><strong>Note</strong> that if you're running any version
+ older than 3.1.5 (including 3.2.0b1) on a public web site,
+ you should upgrade immediately, as older versions have a
+ rather serious security hole which is explained in detail in
+ this <a
+ href="http://www.htdig.org/htdig-dev/2000/02/0272.html">advisory</a>
+ which was sent to the Bugtraq mailing list.
+ Another slightly less serious, but still troubling security hole
+ exists in 3.1.5 and older (including 3.2.0b3 and older), so you
+ should upgrade if you're running one of these. You can view details
+ on this vulnerability from the
+ <a href="http://www.securityfocus.com/bid/3410">Bugtraq mailing list</a>.
+ If you're unsure of which version you're running, see question
+ <a href="#q5.33">5.33</a>.</p>
+
+ <strong>2.2. <a name="q2.2">Are there binary distributions of
+ ht://Dig?</a></strong><br>
+ <p>We're trying to get consistent binary distributions for
+ popular platforms. Contributed binary releases will go in <a
+ href="http://www.htdig.org/files/contrib/binaries/">
+ the contributed binaries section</a>
+ and contributions should be mentioned to the <a
+ href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>
+ mailing list.</p>
+
+ <p>Anyone who would like to make consistent binary
+ distributions of ht://Dig should at least sign up for the <a
+ href="mailing.html">htdig-announce mailing list</a>.</p>
+
+ <strong>2.3. <a name="q2.3">Are there mirror sites for ht://Dig?</a></strong><br>
+ <p>Yes, see our <a href="mirrors.html">mirrors
+ listing</a>. If you'd like to mirror the site, please see
+ the <a href="howto-mirror.html">mirroring guide</a>.</p>
+
+ <strong>2.4. <a name="q2.4">Is ht://Dig available by ftp?</a></strong><br>
+ <p>Yes. You can find the current versions and several older
+ versions at various &lt;<a
+ href="mirrors.html">mirror sites</a>&gt;
+ as well as the other locations mentioned in the <a
+ href="where.html">download page</a>.</p>
+
+ <strong>2.5. <a name="q2.5">Are patches around to upgrade between
+ versions?</a></strong><br>
+ <p>Most versions are also distributed as a patch to the previous
+ version's source code. The most recent exception to this was
+ version 3.1.0b1. Since this version switched from the GDBM
+ database to Berkeley DB 2 (DB2), the new database package needed to be shipped
+ with the distribution. This made the potential patch almost as large
+ as the regular distribution. Update patches resumed with version
+ 3.1.0b2. You can also find archives of patches submitted to
+ the htdig mailing lists, to fix specific bugs or add features,
+ at Joe Jah's <a href="ftp://ftp.ccsf.org/htdig-patches/">
+ htdig-patches ftp site</a>.</p>
+
+ <strong>2.6. <a name="q2.6">Is there a Windows 95/98/2000/NT
+ version of ht://Dig?</a></strong><br>
+ <p>The ht://Dig package can be built on the Win32 platform when
+ using the Cygwin package. For details, see the contributed guide,
+ <a href="http://www.htdig.org/files/contrib/guides/Installing_on_Win32.html">
+ <em>Idiot's Guide to Installing ht://Dig on Win32</em></a>.
+ </p>
+ <p>
+ As of the <a href="http://www.htdig.org/files/htdig-3.2.0b5.tar.gz">3.2.0b5</a>
+ beta release, there is also native Win32 support, thanks to
+ Neal Richter. (Installation docs will be written soon...)
+ </p>
+
+ <strong>2.7. <a name="q2.7">Where can I find the documentation for my
+ version of ht://Dig?</a></strong><br>
+ <p>The documentation for the most recent stable release is always
+ posted at <a href="http://www.htdig.org/">www.htdig.org</a>.
+ The documentation for the latest beta release can be found at
+ <a href="http://www.htdig.org/dev/htdig-3.2/">http://www.htdig.org/dev/htdig-3.2/</a>.
+ In all releases, the documentation is included in the
+ <strong>htdoc</strong> subdirectory of the source distribution, so
+ you always have access to the documentation for your current version.
+ </p>
+
+ <hr noshade size=2>
+
+ <h3>3. Compiling</h3>
+ <strong>3.1. <a name="q3.1">When I compile ht://Dig I get an error about
+ libht.a</a></strong><br>
+ <p>This usually indicates that either libstdc++ is not installed or
+ is installed incorrectly. To get libstdc++ or any other GNU tool,
+ check
+ <a
+ href="ftp://ftp.gnu.org/gnu/">ftp://ftp.gnu.org/gnu/</a>.
+ Note that the most recent versions of gcc come with
+ libstdc++ included and are available from <a
+ href="http://gcc.gnu.org/">http://gcc.gnu.org/</a>.</p>
+
+ <strong>3.2. <a name="q3.2">I get an error about -lg</a></strong><br>
+ <p>This is due to a bug in the Makefile.config.in of version
+ 3.1.0b1. Remove all flags "-ggdb" in Makefile.config.in. Then
+ type "./config.status" to rebuild the Makefiles and
+ recompile. This bug is fixed in version 3.1.0b2.</p>
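+
+ <p>As a rough sketch, assuming a POSIX shell with sed available and
+ that you are in the top-level source directory, the steps above
+ might look like this (the backup file name is only a suggestion):</p>
+<pre>
+# Strip every -ggdb flag, keeping a backup of the original file
+cp Makefile.config.in Makefile.config.in.orig
+sed 's/-ggdb//g' Makefile.config.in.orig &gt; Makefile.config.in
+
+# Regenerate the Makefiles, then rebuild
+./config.status
+make
+</pre>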
+
+ <strong>3.3. <a name="q3.3">I'm compiling on Digital Unix and I get
+ messages about "unresolved" and "db_open."</a></strong><br>
+ <p>Answer contributed by George Adams
+ &lt;learningapache@my-dejanews.com&gt;</p>
+
+ <p>What you're seeing are problems related to the Berkeley DB
+ library. htdig needs a fairly modern version of db, which is
+ why it ships with one that works. (see that -L../db-2.4.14/dist
+ line? That's where htdig's db library is).<br>
+
+ The solution is to modify the c++ command so it explicitly
+ references the correct libdb.a. You can do this by replacing
+ the "-ldb" directive in the c++ command with
+ "../db-2.4.14/dist/libdb.a". This problem has been resolved as of
+ version 3.1.0.</p>
+
+ <strong>3.4. <a name="q3.4">I'm compiling on FreeBSD and I get lots
+ of messages about '___error' being unresolved.</a></strong><br>
+ <p>Answer contributed by Laura Wingerd &lt;laura@perforce.com&gt;<br>
+ I got a clean build of htdig-3.1.2 on FreeBSD 2.2.8 by taking
+ -D_THREAD_SAFE out of CPPFLAGS, and setting LIBS to null, in
+ db/dist/configure.</p>
+
+ <strong>3.5. <a name="q3.5">I'm compiling on HP/UX and I get a complaint about
+ "Large Files not supported."</a></strong><br>
+ <p>The db/ package included with ht://Dig seems to be unable to complete
+ compilation on HP/UX 10.20 in particular. After running the top-level configure
+ script, cd into db/dist and type:</p>
+ <code>./configure --disable-bigfile</code>
+ <p>Then continue with the normal compilation.</p>
+
+ <strong>3.6. <a name="q3.6">I'm compiling on Solaris and when I run the
+ programs I get complaints about not finding libstdc++.</a></strong><br>
+ <p>Answer contributed by Adam Rice &lt;adam@newsquest.co.uk&gt;</p>
+ <p>The problem is that the Solaris loader can't find the library. The
+ best thing to do is set the LD_RUN_PATH environment variable <em>during compile</em>
+ to the directory where libstdc++.so.2.8.1.1 lives. This tells the linker
+ to search that directory at runtime.
+ </p>
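+
+ <p>For example, assuming a Bourne-style shell and that libstdc++
+ lives in /usr/local/lib (substitute the directory used on your
+ system), you might build with:</p>
+<pre>
+# /usr/local/lib is only an assumed location for libstdc++
+LD_RUN_PATH=/usr/local/lib
+export LD_RUN_PATH
+./configure
+make
+</pre>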
+
+ <p>Note that LD_RUN_PATH is not to be confused with LD_LIBRARY_PATH.
+ The latter is parsed at run-time, while LD_RUN_PATH essentially
+ compiles a library path into the executable, so that it doesn't
+ need a LD_LIBRARY_PATH setting to find its libraries. This allows
+ you to avoid all the complexities of setting an environment
+ variable for a CGI program run from the server. If all else fails,
+ you can always run your programs from wrapper shell scripts that
+ set the LD_LIBRARY_PATH environment variable appropriately.</p>
+
+ <p>Note also that while this answer is specific to Solaris, it may
+ work for other OSes too, so you may want to give it a try. However,
+ not all versions of the <code>ld</code> program on all OSes support
+ the LD_RUN_PATH environment variable, even if these systems support
+ shared libraries. Try "<code>man&nbsp;ld</code>" on your system to
+ find out the best way of setting the runtime search path for shared
+ libraries. If <code>ld</code> doesn't support LD_RUN_PATH, but does
+ support the <code>-R</code> option, you can add one or more of these
+ options to LIBDIRS in Makefile.config before running make on a 3.1.x
+ release. (For a 3.2 beta release, you can add these options to the
+ LDFLAGS environment variable before you run ./configure.)</p>
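+
+ <p>As an illustration only, for a 3.2 beta release on a system whose
+ <code>ld</code> accepts <code>-R</code>, and again assuming the
+ library is in /usr/local/lib, that could look like:</p>
+<pre>
+# -R records /usr/local/lib (an assumed path) as a runtime search path
+LDFLAGS="-R/usr/local/lib" ./configure
+make
+</pre>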
+
+ <strong>3.7. <a name="q3.7">I'm compiling on IRIX and I'm having
+ database problems when I run the program.</a></strong><br>
+ <p>
+ It is not entirely clear why these problems occur, though
+ they seem to only happen when older compilers are
+ used. Several people have reported that the problems go away
+ when using the latest version of <a href="http://gcc.gnu.org/">gcc</a>.
+ </p>
+
+ <strong>3.8. <a name="q3.8">I'm compiling with gcc 3.2 and getting
+ all sorts of warnings/errors about ostream and such.</a></strong><br>
+ <p>
+ With versions before 3.2.0b5,
+ you should use the following command to configure the ht://Dig
+ package so it can be built with gcc 3.2:
+<pre>
+CXXFLAGS=-Wno-deprecated CPPFLAGS=-Wno-deprecated ./configure
+</pre>
+ </p>
+
+ <hr noshade size=2>
+
+ <h3>4. Configuration</h3>
+ <strong>4.1. <a name="q4.1">How come I can't index my site?</a></strong><br>
+ <p>There are a variety of reasons ht://Dig won't index a
+ site. To get to the bottom of things, it's advisable to turn on
+ some debugging output from the htdig program. When running from
+ the command-line, try "-vvv" in addition to any other
+ flags. This will add debugging output, including the responses
+ from the server.</p>
+ <p>See also questions <a href="#q5.25">5.25</a>,
+ <a href="#q5.27">5.27</a>, <a href="#q5.16">5.16</a> and
+ <a href="#q5.18">5.18</a>.</p>
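+
+ <p>For instance, assuming your configuration file is
+ /etc/htdig/htdig.conf (the path varies between installations), the
+ debugging output described above could be captured in a file for
+ later inspection:</p>
+<pre>
+# -c names the configuration file; both paths here are assumptions
+htdig -vvv -c /etc/htdig/htdig.conf &gt; /tmp/htdig-debug.log 2&gt;&amp;1
+</pre>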
+
+ <strong>4.2. <a name="q4.2">How can I change the output format of htsearch?</a></strong><br>
+<p>Answer contributed by: Malki Cymbalista &lt;Malki.Cymbalista@weizmann.ac.il&gt;</p>
+
+<p>You can change the output format of htsearch by creating different
+header, footer and result files that specify how you want the output
+to look. You then create a configuration file that specifies which
+files to use. In the html document that links to the search, you
+specify which configuration file to use.</p>
+
+<p>So the configuration file would have the lines:</p>
+<pre>
+search_results_header: ${common_dir}/ccheader.html
+search_results_footer: ${common_dir}/ccfooter.html
+template_map: Long long builtin-long \
+ Short short builtin-short \
+ Default default ${common_dir}/ccresult.html
+template_name: Default
+</pre>
+<p>You would also put into the configuration file any other lines from the
+default configuration file that apply to htsearch.</p>
+
+<p>The files ${common_dir}/ccheader.html and
+${common_dir}/ccfooter.html and ${common_dir}/ccresult.html would be
+tailored to give the output in the desired format.</p>
+
+<p>Assuming your configuration file is called cc.conf, the html file that
+links to the search has to set the config parameter equal to cc. The
+following line would do it:<br>
+<code>&lt;input type="hidden" name="config" value="cc"&gt;</code></p>
+
+ <p><strong>Note:</strong> Don't just add the line above to your
+ <a href="hts_form.html">search form</a>
+ without first checking whether there is already a similar
+ line giving the config attribute a different value. The sample
+ search.html form that comes with the package includes a line
+ like this already, giving "config" the default value of "htdig".
+ If it's there, modify it instead of adding another definition.
+ The config input parameter doesn't need to be hidden either, and
+ you may want to define it as a pull-down list to select different
+ databases (see question <a href="#q4.4">4.4</a>).</p>
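+
+ <p>A minimal sketch of such a pull-down list, assuming two
+ configuration files named htdig.conf and cc.conf (the option labels
+ are made up for illustration):</p>
+<pre>
+&lt;select name="config"&gt;
+ &lt;option value="htdig" selected&gt;Whole site&lt;/option&gt;
+ &lt;option value="cc"&gt;Computing Center&lt;/option&gt;
+&lt;/select&gt;
+</pre>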
+
+ <strong>4.3. <a name="q4.3">How do I index pages that start with '~'?</a></strong><br>
+ <p>
+ ht://Dig should retrieve pages whose path starts with '~' just as a
+ web browser would. If you are having problems with this, check your server
+ log files to see what file the server is attempting to return.
+ </p>
+
+ <strong>4.4. <a name="q4.4">Can I use multiple databases?</a></strong><br>
+ <p>Yes, though you may find it easier to have one larger
+ database and use restrict or exclude fields on searches. To use
+ multiple databases, you will need a config file for each
+ database. Then each file will set the
+ <a href="attrs.html#database_dir">database_dir</a> or
+ <a href="attrs.html#database_base">database_base</a> attribute to
+ change the name of the databases. The config file is selected
+ by the <strong>config</strong> input field in the search form.
+ <br>See also questions <a href="#q4.2">4.2</a> and
+ <a href="#q4.20">4.20</a>.</p>
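+
+ <p>As a sketch, two hypothetical configuration files, docs.conf and
+ news.conf, might differ only in lines such as these (the directory
+ names and URLs are assumptions, and each file also needs the other
+ attributes from your usual htdig.conf):</p>
+<pre>
+# docs.conf
+database_dir: /var/lib/htdig/docs
+start_url: http://www.example.com/docs/
+</pre>
+<pre>
+# news.conf
+database_dir: /var/lib/htdig/news
+start_url: http://www.example.com/news/
+</pre>
+ <p>A search form would then select one of them by giving the
+ <strong>config</strong> input field the value "docs" or "news".</p>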
+
+ <strong>4.5. <a name="q4.5">OK, I can use multiple databases. Can I
+ merge them into one?</a></strong><br>
+ <p>As of version 3.1.0, you can do this with the -m option to
+ <a href="htmerge.html">htmerge</a>.</p>
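+
+ <p>For example, assuming two hypothetical configuration files,
+ main.conf and extra.conf, a command along these lines would merge
+ the databases described by extra.conf into those described by
+ main.conf:</p>
+<pre>
+# Both paths are assumptions; -c names the target, -m the source
+htmerge -c /etc/htdig/main.conf -m /etc/htdig/extra.conf
+</pre>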
+
+ <strong>4.6. <a name="q4.6">Wow, ht://Dig eats up a lot of disk
+ space. How can I cut down?</a></strong><br>
+ <p>There are several ways to cut down on disk space. One is
+ not to use the "-a" option, which creates work copies of the
+ databases. Naturally this essentially doubles the disk
+ usage. If you don't need to index and search at the same time, you can
+ ignore this flag.</p>
+
+ <p>If you are running 3.2.0b5 or higher and don't have
+ <a href="dev/htdig-3.2/attrs.html#wordlist_compress_zlib">compression</a>
+ turned on, then turning that on will also save considerable space.</p>
+
+ <p>Changing configuration variables can also help cut
+ down on disk usage. Decreasing
+ <a href="attrs.html#max_head_length">max_head_length</a> and
+ <a href="attrs.html#max_meta_description_length">max_meta_description_length</a>
+ will cut down on the size of the excerpts stored (in fact, if you
+ don't have
+ <a href="attrs.html#use_meta_description">use_meta_description</a>
+ set, you can set
+ max_meta_description_length to 0!).</p>
+
+ <p>If you are running 3.2.0b6 or higher, you can turn off
+ <a href="dev/htdig-3.2/attrs.html#store_phrases">store_phrases</a>. This cuts the
+ database size by about 60%, at the expense of severely limiting
+ the effectiveness of phrase searches. It also reduces digging time
+ slightly.</p>
+
+ <p>Other techniques include removing the db.wordlist file and adding
+ more words to the <a href="attrs.html#bad_words">bad_words</a>
+ file.</p>
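+
+ <p>As an illustration, the configuration attributes mentioned above
+ might be set like this. The numbers are arbitrary examples, not
+ recommendations, so adjust them to your own needs:</p>
+<pre>
+# Store shorter excerpts (the value is an arbitrary example)
+max_head_length: 200
+
+# Only sensible if use_meta_description is not set
+max_meta_description_length: 0
+
+# 3.2.0b6 or higher only; severely limits phrase searches
+store_phrases: false
+</pre>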
+
+ <p>The University of Leipzig has published
+ <a href="http://wortschatz.uni-leipzig.de/html/wliste.html">
+ word lists</a> containing the 100, 1000 and 10000 most often used
+ words in English, German, French and Dutch. No copyrights or
+ restrictions seem to be applied to the downloadable files. These
+ can be very handy when putting together a bad_words file. Thanks
+ to Peter Asemann for this tip.</p>
+
+ <strong>4.7. <a name="q4.7">Can I use SSI or other CGIs in my
+ htsearch results?</a></strong><br>
+ <p>Not really. Apache will not parse CGI output for SSI
+ statements (See the <a
+ href="http://www.apache.org/docs/misc/FAQ.html#ssi-part-iii">Apache
+ FAQ</a>). Thus, the htsearch CGI does not understand SSI
+ markup and so cannot include other
+ CGIs. However, it is possible to do it the other way round:
+ you can have the htsearch results included in your dynamic
+ page.
+ </p>
+ <p>
+ The Apache project has mentioned that this will be a
+ feature added to the Apache 2.0 version, currently in development.
+ </p>
+
+ <p>The easiest approach in the meantime is using SSI with
+ the help of the <a
+ href="attrs.html#script_name">script_name</a> configuration
+ file attribute. See the <code>contrib/scriptname</code>
+ directory for a small example using SSI.</p>
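+
+ <p>A minimal sketch of this approach, assuming Apache with SSI
+ enabled, a hypothetical page at /search/results.shtml, and htsearch
+ installed as /cgi-bin/htsearch. The files in
+ <code>contrib/scriptname</code> remain the authoritative example.
+ In the htsearch configuration file:</p>
+<pre>
+# Make the links that htsearch generates point back at the SSI page
+script_name: /search/results.shtml
+</pre>
+ <p>And in results.shtml, at the spot where the results should
+ appear:</p>
+<pre>
+&lt;!--#exec cgi="/cgi-bin/htsearch" --&gt;
+</pre>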
+
+ <p>For CGI and PHP, you need a &quot;wrapper&quot; script to
+ do that. For perl script examples, see the files in
+ <code>contrib/ewswrap</code>. The PHP guide (see <a
+ href="http://www.htdig.org/contrib/guides.html">contributed
+ guides</a>) not only describes a wrapper script for PHP, but
+ also offers a step by step tutorial to the basics of
+ ht://Dig and is well worth reading.
+ For other alternatives, see question <a href="#q4.11">4.11</a>.
+ </p>
+
+ <strong>4.8. <a name="q4.8">How do I index Word, Excel, PowerPoint
+ or PostScript documents?</a></strong><br>
+ <p>This must be done with an
+ <a href="attrs.html#external_parsers">external parser or converter</a>.
+ A sample of such an external converter is the
+ contrib/doc2html/doc2html.pl Perl script.
+ It will parse Word, PostScript, PDF and other documents, when used
+ with the appropriate document to text converters. It uses catdoc to
+ parse Word documents, and ps2ascii to parse PostScript files. The
+ comments in the Perl script and accompanying documentation
+ indicate where you can obtain these converters.</p>
+
+ <p>Versions of htdig before 3.1.4 don't support external converters,
+ so you have to use an external parser script such as
+ contrib/parse_doc.pl (or better yet, upgrade htdig if you can).
+ External converter scripts are simpler to write and maintain than a
+ full external parser, as they just convert input documents to
+ text/plain or text/html, and pass that back to htdig to be parsed.
+ Parsing is more consistent across document types with external
+ converters, because the final work is done by htdig's internal
+ parsers. External parser scripts tend to be hacks that don't
+ recognize a lot of the parsing attributes in your htdig.conf, so
+ they have to be hacked some more when you change your attributes.</p>
+
+ <p>The most recent versions of parse_doc.pl, conv_doc.pl and
+ the doc2html package are available on our <a
+ href="http://www.htdig.org/files/contrib/parsers/">web site</a>.<br>
+ See below for an example of doc2html.pl, or see the comments in
+ conv_doc.pl and parse_doc.pl, or the documentation for doc2html
+ for examples of their usage.
+ For help with troubleshooting, see questions
+ <a href="#q5.37">5.37</a> and <a href="#q5.39">5.39</a>.</p>
+
+ <strong>4.9. <a name="q4.9">How do I index PDF files?</a></strong><br>
+ <p>This too can be done with an
+ <a href="attrs.html#external_parsers">external parser or converter</a>,
+ in combination with the pdftotext program that is part of the
+ <a href="http://www.foolabs.com/xpdf/">xpdf</a> 0.90 package. A
+ sample of such a converter is the doc2html.pl Perl
+ script. It uses pdftotext to parse PDF documents, then processes
+ the text into external parser records.
+ The most recent version of doc2html.pl is available on our <a
+ href="http://www.htdig.org/files/contrib/parsers/">web
+ site</a>.</p>
+
+ <p>For example, you could put this in your configuration file:</p>
+<pre>
+<a href="attrs.html#external_parsers">external_parsers</a>: application/msword-&gt;text/html /usr/local/bin/doc2html.pl \
+ application/postscript-&gt;text/html /usr/local/bin/doc2html.pl \
+ application/pdf-&gt;text/html /usr/local/bin/doc2html.pl
+</pre>
+ <p>You would also need to configure the script to indicate where all
+ of the document to text converters are installed. See the DETAILS
+ file that comes with doc2html for more information.</p>
+
+ <p>Versions of htdig before 3.1.4 don't support external converters,
+ so you have to use an external parser script such as
+ contrib/parse_doc.pl (or better yet, upgrade htdig if you can).
+ See question <a href="#q4.8">4.8</a> above.</p>
+
+ <p>Whether you use this external parser or converter, or acroread
+ with the <a href="attrs.html#pdf_parser">pdf_parser</a> attribute,
+ to successfully index PDF files be sure to set the <a
+ href="attrs.html#max_doc_size">max_doc_size</a> attribute to
+ a value larger than the size of your largest PDF file. PDF
+ documents can not be parsed if they are truncated.</p>
+
+ <p>This also raises the questions of why two different
+ methods of indexing PDFs are supported, and which method
+ is preferred. The built-in PDF support, which uses acroread
+ to convert the PDF to PostScript, was the first method which
+ was provided. It had a few problems with it: acroread is not
+ open source, it is not supported on all systems on which
+ ht://Dig can run, and for some PDFs, the PostScript that
+ acroread generated was very difficult to parse into indexable
+ text. Also, the built-in PDF support expected PDF documents to
+ use the same character encoding as is defined in your current
+ <a href="attrs.html#locale">locale</a>, which isn't always the
+ case. The external converters, which use pdftotext, were developed
+ to overcome these problems. xpdf 0.90 is free software, and its
+ pdftotext utility works very well as an indexing tool.
+ It also converts various PDF encodings to the Latin 1 set.
+ It is the opinion of the developers that this is the
+ preferred method. However, some users still prefer to stick
+ with acroread, as it works well for them, and is a little
+ easier to set up if you've already installed Acrobat.</p>
+
+ <p>Also, pdftotext still has some difficulty handling text in
+ landscape orientation, even with its new -raw option in 0.90,
+ so if you need to index such text in PDFs, you may still get
+ better results with acroread. The pdf_parser attribute has been
+ removed from the 3.2 beta releases of htdig, so to use acroread
+ with htdig 3.2.0b5 or other 3.2 betas, use the acroconv.pl
+ external converter script from our <a
+ href="http://www.htdig.org/files/contrib/parsers/">web site</a>.</p>
+
+ <p>See also question <a href="#q5.2">5.2</a> below and
+ question <a href="#q1.13">1.13</a> above.
+ See questions <a href="#q5.37">5.37</a> and <a href="#q5.39">5.39</a>
+ for troubleshooting tips.</p>
+
+ <strong>4.10. <a name="q4.10">How do I index documents in other
+ languages?</a></strong><br>
+ <p>The first and most important thing you must do,
+ to allow ht://Dig to properly support international
+ characters, is to define the correct locale for the
+ language and country you wish to support. This is done
+ by setting the <a href="attrs.html#locale">locale</a>
+ attribute (see question <a href="#q5.8">5.8</a>). The
+ next step is to configure ht://Dig to use dictionary and
+ affix files for the language of your choice. These can
+ be the same dictionary and affix files as are used by the
+ ispell software. A collection of these is available from
+ Geoff Kuenning's
+ <a href="http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html">
+ International Ispell Dictionaries page</a>, and we're slowly
+ building a collection of word lists on our <a
+ href="http://www.htdig.org/files/contrib/wordlists/">web site</a>.</p>
+ <p>For example, if you install German dictionaries in common/german,
+ you could use these lines in your configuration file:</p>
+<pre>
+<a href="attrs.html#locale">locale</a>: de_DE
+lang_dir: ${<a href="attrs.html#common_dir">common_dir</a>}/german
+<a href="attrs.html#bad_word_list">bad_word_list</a>: ${lang_dir}/bad_words
+<a href="attrs.html#endings_affix_file">endings_affix_file</a>: ${lang_dir}/german.aff
+<a href="attrs.html#endings_dictionary">endings_dictionary</a>: ${lang_dir}/german.0
+<a href="attrs.html#endings_root2word_db">endings_root2word_db</a>: ${lang_dir}/root2word.db
+<a href="attrs.html#endings_word2root_db">endings_word2root_db</a>: ${lang_dir}/word2root.db
+</pre>
+ <p>
+ You can build the endings database with <code>htfuzzy endings</code>.
+ (This command may actually take days to complete, for
+ releases older than 3.1.2. Current releases use faster regular
+ expression matching, which will speed this up by a few orders
+ of magnitude.) Note that the "*.0" files are not part of
+ the ispell dictionary distributions, but are easily made by
+ concatenating the partial dictionaries and sorting to remove
+ duplicates (e.g.: "<code>cat * | sort | uniq &gt; lang.0</code>"
+ in most cases). You will also need to redefine the synonyms
+ file if you wish to use the synonyms search algorithm. This
+ file is not included with most of the dictionaries, nor is the
+ <a href="attrs.html#bad_words">bad_words</a> file.</p>
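+
+ <p>Put together, and assuming the partial German word lists were
+ unpacked alone in a hypothetical directory /tmp/ispell-german, that
+ lang_dir resolves to /usr/local/share/htdig/german, and that the
+ configuration file is /etc/htdig/german.conf, the steps might look
+ like this:</p>
+<pre>
+# Build the combined dictionary from the partial word lists
+cd /tmp/ispell-german
+cat * | sort | uniq &gt; /usr/local/share/htdig/german/german.0
+
+# Build the endings databases named in the configuration file
+htfuzzy -c /etc/htdig/german.conf endings
+</pre>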
+
+ <p>If you put all the language-specific
+ dictionaries and configuration files in separate directories,
+ and set all the attribute definitions accordingly in each
+ search config file to access the appropriate files, you can
+ have a multilingual setup where the user selects the language
+ by selecting the "config" input parameter value. In addition
+ to the attributes given in the example above, you may also
+ want custom settings for these language-specific attributes:
+ <a href="attrs.html#date_format">date_format</a>,
+ <a href="attrs.html#iso_8601">iso_8601</a>,
+ <a href="attrs.html#method_names">method_names</a>,
+ <a href="attrs.html#no_excerpt_text">no_excerpt_text</a>,
+ <a href="attrs.html#no_next_page_text">no_next_page_text</a>,
+ <a href="attrs.html#no_prev_page_text">no_prev_page_text</a>,
+ <a href="attrs.html#nothing_found_file">nothing_found_file</a>,
+ <a href="attrs.html#page_list_header">page_list_header</a>,
+ <a href="attrs.html#prev_page_text">prev_page_text</a>,
+ <a href="attrs.html#search_results_wrapper">search_results_wrapper</a>
+ (or <a href="attrs.html#search_results_header">search_results_header</a>
+ and <a href="attrs.html#search_results_footer">search_results_footer</a>),
+ <a href="attrs.html#sort_names">sort_names</a>,
+ <a href="attrs.html#synonym_db">synonym_db</a>,
+ <a href="attrs.html#synonym_dictionary">synonym_dictionary</a>,
+ <a href="attrs.html#syntax_error_file">syntax_error_file</a>,
+ <a href="attrs.html#template_map">template_map</a>, and of course
+ <a href="attrs.html#database_dir">database_dir</a> or
+ <a href="attrs.html#database_base">database_base</a> if you
+ maintain multiple databases for sites of different languages.
+ You could also change the definition of
+ <a href="attrs.html#common_dir">common_dir</a>, rather than
+ making up a lang_dir attribute as above, as many language-specific
+ files are defined relative to the common_dir setting.</p>
+
+ <p>If you're running version 3.1.6 of ht://Dig, you may also
+ be interested in the <strong>accents</strong> fuzzy match
+ algorithm in the
+ <a href="attrs.html#search_algorithm">search_algorithm</a>
+ attribute, which lets you treat accented and unaccented letters
+ as equivalent in words. Note that if you use the accents algorithm,
+ you need to rebuild the accents database each time you update your
+ word database, using <code>"htfuzzy accents"</code>. This command
+ isn't in the default rundig script, so you may want to add it there.
+ The accents fuzzy match algorithm is also in the 3.2 beta releases.
+ There are also the
+ <a href="attrs.html#boolean_keywords">boolean_keywords</a> and
+ <a href="attrs.html#boolean_syntax_errors">boolean_syntax_errors</a>
+ attributes in 3.1.6 for changing other language-specific messages
+ in htsearch.</p>
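+
+ <p>As a sketch, enabling the accents algorithm could look like the
+ following. The weights are arbitrary examples, and the configuration
+ file path is an assumption:</p>
+<pre>
+search_algorithm: exact:1 accents:0.7 endings:0.1
+</pre>
+ <p>Then, after each update of the word database:</p>
+<pre>
+htfuzzy -c /etc/htdig/htdig.conf accents
+</pre>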
+
+ <p>Current versions of ht://Dig only support 8-bit
+ characters, so languages such as Chinese and Japanese, which
+ require 16-bit characters, are not currently supported.</p>
+
+ <p>Didier Lebrun has written a guide for configuring htdig to
+ support French, entitled
+ <a href="http://www.quartier-rural.org/dl/elucu/htdig-vf/lisezmoi.html">
+ Comment installer et configurer HtDig pour la langue fran&ccedil;aise</a>.
+ His "kit de francisation" is also available on
+ <a
+ href="http://www.htdig.org/files/contrib/wordlists/">our
+ web site</a>.</p>
+
+ <p>See also question <a href="#q4.2">4.2</a> for tips on customizing
+ htsearch, and question <a href="#q4.6">4.6</a> for tips on where to find
+ bad_words files.</p>
+
+ <strong>4.11. <a name="q4.11">How do I get rotating banner ads in
+ search results?</a></strong><br>
+ <p>While htsearch doesn't currently provide a means of doing
+ SSI on its output, or calling other CGI scripts, it does have
+ the capability of using environment variables in templates.</p>
+
+ <p>The easiest way to get rotating banners in htsearch is
+ to replace htsearch with a wrapper script that sets an
+ environment variable to the banner content, or whatever
+ dynamically generated content you want. Your script can then
+ call the real htsearch to do the work. The wrapper script can be
+ written as a shell script, or in Perl, C, C++, or whatever you
+ like. You'd then need to reference that environment variable
+ in header.html (or wrapper.html if that's what you're using),
+ to indicate where the dynamic content should be placed.</p>
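+
+ <p>A minimal sketch of such a wrapper as a shell script. It assumes
+ the real program has been renamed to htsearch.real, that a made-up
+ program called pick-banner prints the HTML for one banner, and that
+ the variable name BANNER_HTML is free for your own use:</p>
+<pre>
+#!/bin/sh
+# Hypothetical wrapper installed in place of cgi-bin/htsearch
+
+# pick-banner is a made-up program; its input is redirected so that
+# it cannot consume POST data meant for htsearch
+BANNER_HTML=`/usr/local/bin/pick-banner &lt; /dev/null`
+export BANNER_HTML
+
+# Hand over to the real htsearch, leaving QUERY_STRING and standard
+# input untouched
+exec /usr/local/apache/cgi-bin/htsearch.real
+</pre>
+ <p>In header.html or wrapper.html, the variable would then be
+ referenced where the banner should appear, assuming your release
+ expands environment variables with the same syntax as other
+ template variables:</p>
+<pre>
+$(BANNER_HTML)
+</pre>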
+
+ <p>If the dynamic content is generated by a CGI script, your new
+ wrapper script which calls this CGI would then have to strip out
+ the parts that you don't want embedded in the output (headers,
+ some tags) so that only the relevant content gets put into the
+ environment variable you want. You'd also have to make sure
+ this CGI script doesn't grab the POST data or get confused by
+ the QUERY_STRING contents intended for htsearch. Your script
+ should not take anything out of, or add anything to, the
+ QUERY_STRING environment variable.</p>
+
+ <p>An alternative approach is to have a cron job that periodically
+ regenerates a different header.html or wrapper.html with the
+ new banner ad, or changes a link to a different pre-generated
+ header.html or wrapper.html file. For other alternatives, see
+ question <a href="#q4.7">4.7</a>.</p>
+
+ <strong>4.12. <a name="q4.12">How do I index numbers in documents?</a></strong><br>
+ <p>By default, htdig doesn't treat numbers without letters
+ as words, so it doesn't index them.
+ To change this behavior, you must set the
+ <a href="attrs.html#allow_numbers">allow_numbers</a>
+ attribute to true, and rebuild your index from scratch using
+ rundig or htdig with the -i option, so that bare numbers get
+ added to the index.</p>
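+
+ <p>For example, add this line to your configuration file:</p>
+<pre>
+allow_numbers: true
+</pre>
+ <p>Then rebuild from scratch, here assuming the rundig script is on
+ your path and uses that configuration file:</p>
+<pre>
+rundig
+</pre>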
+
+ <strong>4.13. <a name="q4.13">How can I call htsearch from a hypertext
+ link, rather than from a search form?</a></strong><br>
+ <p>If you change the search.html form to use the GET method
+ rather than POST, you can see the URLs complete with all the
+ arguments that htsearch needs for a query. Here is an example:<br>
+<code>
+http://www.grommetsRus.com/cgi-bin/htsearch?config=htdig&amp;restrict=&amp;exclude=&amp;method=and&amp;format=builtin-long&amp;words=grapple+grommets
+</code>
+ which can actually be simplified to:<br>
+<code>
+http://www.grommetsRus.com/cgi-bin/htsearch?method=and&amp;words=grapple+grommets
+</code>
+ with the current defaults. The "&amp;" character acts as a
+ separator for the input parameters, while the "+" character
+ acts as a space character within an input parameter.
+ In versions 3.1.5 or 3.2.0b2, or later, you can use a semicolon
+ character ";" as a parameter separator, rather than "&amp;", for
+ HTML 4.0 compliance.
+ Most non-alphanumeric characters should be hex-encoded following
+ the convention for URL encoding (e.g. "%" becomes "%25", "+"
+ becomes "%2B", etc). Any htsearch input parameter that you'd
+ use in a search form can be added to the URL in this way.
+ This can be embedded into an &lt;a href="..."&gt; tag.
+ <br>See also question <a href="#q5.21">5.21</a>.</p>
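+
+ <p>As an illustration, the simplified URL above could be embedded
+ in a page like this (the link text is made up). Note that inside an
+ HTML attribute the "&amp;" separator is itself written as
+ "&amp;amp;":</p>
+<pre>
+&lt;a href="http://www.grommetsRus.com/cgi-bin/htsearch?method=and&amp;amp;words=grapple+grommets"&gt;Search for grapple grommets&lt;/a&gt;
+</pre>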
+
+ <strong>4.14. <a name="q4.14">How do I restrict a search to only meta
+ keywords entries in documents?</a></strong><br>
+ <p>First of all, you do <strong>not</strong> do this by using the
+ "keywords" field in the search form. This seems to be a
+ frequent cause of confusion. The "keywords" input parameter
+ to htsearch has absolutely nothing to do with searching meta
+ keywords fields. It actually predates the addition of meta
+ keyword support in 3.1.x. A better choice of name for the
+ parameter would have been "requiredwords", because that's what
+ it really means - a list of words that are all required to be
+ found somewhere in the document, in addition to the words the
+ user specifies in the search form.</p>
+
+ <p>As of 3.2.0b5, the most direct way to search for a particular
+ meta keyword is to specify the word as "keyword:&lt;word&gt;".
+ Similarly, "title:", "heading:", and "author:" restrict searches
+ to the respective fields. To search for words in the body of the
+ text, use "text:".</p>
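+
+ <p>For example, with 3.2.0b5 or later, queries such as these could
+ be typed into the search form (the word itself is made up), each
+ restricting the match to a single field:</p>
+<pre>
+keyword:grommets
+title:grommets
+text:grommets
+</pre>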
+
+ <p>To restrict all search terms to meta keywords only, you can set all
+ <a href="attrs.html#heading_factor">factors</a> other than
+ keywords_factor to 0, and for 3.1.x, you
+ must then reindex your documents. In the 3.2 betas, you can
+ change factors at search time without needing to reindex.
+ As of 3.2.0b5, it is possible to restrict
+ the search in the query itself. Note that changing the scoring
+ factors in this way will only alter the scoring of search results,
+ and shift the low or zero scores to the end of the results when
+ sorting by score (as is done by default). For versions before
+ 3.2.0b5, the results with scores
+ of zero aren't actually removed from the search results.</p>
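+
+ <p>A sketch of such a configuration follows. The attribute names
+ shown are assumed from the 3.2 documentation and the value 100 is
+ arbitrary. The set of factor attributes differs between releases,
+ so check <a href="attrs.html">attrs.html</a> for your version
+ before copying this:</p>
+<pre>
+# Assumed attribute names; verify them against your release
+title_factor: 0
+heading_factor: 0
+text_factor: 0
+description_factor: 0
+meta_description_factor: 0
+keywords_factor: 100
+</pre>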
+
+ <strong>4.15. <a name="q4.15">Can I use meta tags to prevent htdig from
+ indexing certain files?</a></strong><br>
+ <p>Yes, in each HTML file you want to exclude, add the following
+ between the &lt;HEAD&gt; and &lt;/HEAD&gt; tags:</p>
+ <blockquote>
+ &lt;META NAME="robots" CONTENT="noindex, follow"&gt;
+ </blockquote>
+ <p>Doing so will allow htdig to still follow links to other documents,
+ but will prevent this document from being put into the index itself.
+ You can also use "nofollow" to prevent following of links. See
+ the section on <a href="meta.html">Recognized META information</a>
+ for more details. For documents produced automatically by MhonArc,
+ you can have that line inserted automatically by putting it in the
+ MhonArc resource file, in the sections IDXPGBEGIN and TIDXPGBEGIN.</p>
+
+ <p>You can also use the
+ <a href="attrs.html#noindex_start">noindex_start</a> and
+ <a href="attrs.html#noindex_end">noindex_end</a> attributes to
+ define one set of tags which will mark sections to be stripped out
+ of documents, so they don't get indexed, or you can mark sections
+ with the non-DTD &lt;noindex&gt; and &lt;/noindex&gt; tags.
+ The noindex_start and noindex_end attributes can also be used to
+ suppress in-line JavaScript code that wasn't properly enclosed in
+ HTML comment tags (see question <a href="#q4.26">4.26</a>).
+ In 3.1.6, you can also put a section between &lt;noindex follow&gt;
+ and &lt;/noindex&gt; tags to turn off indexing of text but still
+ allow htdig to follow links.</p>
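+
+ <p>As a sketch, a pair of HTML comments could be chosen as the
+ markers. The marker strings below are just one possible choice:</p>
+<pre>
+noindex_start: &lt;!--htdig_noindex--&gt;
+noindex_end: &lt;!--/htdig_noindex--&gt;
+</pre>
+ <p>A section of a document would then be excluded like this:</p>
+<pre>
+&lt;!--htdig_noindex--&gt;
+&lt;p&gt;Navigation text that should not be indexed&lt;/p&gt;
+&lt;!--/htdig_noindex--&gt;
+</pre>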
+
+ <p>If you require much more elaborate schemes for avoiding indexing
+ certain parts of your HTML files, especially if you don't have
+ control over these files and can't add tags to them, you can
+ set up htdig's
+ <a href="attrs.html#external_parsers">external_parsers</a> attribute
+ with an external converter that will preprocess the HTML before
+ it's parsed and indexed by htdig. Examples of this are the
+ unhypermail.sh script in our
+ <a href="http://www.htdig.org/files/contrib/parsers/">contributed parsers</a>
+ and the ungeoify.sh script in our
+ <a href="http://www.htdig.org/files/contrib/scripts/">contributed scripts</a>.
+ By preprocessing the HTML, you can strip out parts you don't want, or
+ you can add or change tags wherever they're needed, if you're willing
+ to put in the effort to learn awk/sed/perl enough to do the job.</p>
+
+ <strong>4.16. <a name="q4.16">How do I get htsearch to use the star image
+ in a different directory than the default /htdig?</a></strong><br>
+ <p>You must set either the
+ <a href="attrs.html#image_url_prefix">image_url_prefix</a> attribute,
+ or both <a href="attrs.html#star_blank">star_blank</a> and
+ <a href="attrs.html#star_image">star_image</a> in your
+ htdig.conf, to refer to the URL path for these files. You should
+ also set this URL path similarly in common/header.html and
+ common/wrapper.html, as they also refer to the star.gif file.
+ If you want to relocate other graphics, such as the buttons or
+ the ht://Dig logo, you should change all references to these
+ in htdig.conf and common/*.html.</p>
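+
+ <p>For example, assuming the images were copied to a hypothetical
+ /images/htdig URL path on your server, either of these would do
+ (the star_blank.gif file name is an assumption, so use whatever
+ name your installation has):</p>
+<pre>
+# Alternative 1: move the prefix used for all images
+image_url_prefix: /images/htdig
+</pre>
+<pre>
+# Alternative 2: set the two star images individually
+star_image: /images/htdig/star.gif
+star_blank: /images/htdig/star_blank.gif
+</pre>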
+
+ <strong>4.17. <a name="q4.17">How do I get htdig or htsearch to rewrite
+ URLs in the search results?</a></strong><br>
+ <p>This can be done by using the <a
+ href="attrs.html#url_part_aliases">url_part_aliases</a>
+ configuration file attribute. You have to set up different
+ configuration files for htdig and htsearch, to define a
+ different setting of this attribute for each one.</p>
+
+ <p>A large number of users insist on ignoring that last point
+ and try to make do with just one definition, either for htdig
+ or htsearch, or sometimes for both. This seems to stem from
+ a fundamental misunderstanding of how this attribute works,
+ so perhaps a clarification is needed. The url_part_aliases
+ attribute uses a two stage process. In the first stage, htdig
+ encodes the URLs as they go into the database, by using the
+ pairs in url_part_aliases going from left to right. In the
+ second stage, htsearch decodes the encoded URLs taken from the
+ database, by using the pairs in url_part_aliases going from
+ right to left. If you have the same value for url_part_aliases
+ in htdig and htsearch, you end up with the same URLs in the
+ end. If you modify the first string (the from string) in
+ the pairs listed in url_part_aliases for htsearch, then when
+ htsearch decodes the URLs it ends up rewriting part of them.</p>
+
+ <p>While you might think that if you don't use url_part_aliases
+ in htdig, then you can use it in htsearch to alter unencoded
+ URLs, the reality is that if you don't encode parts of URLs
+ using url_part_aliases, they still get encoded automatically
+ by the <a href="attrs.html#common_url_parts">common_url_parts</a>
+ attribute. This helps to reduce the size of your databases. So,
+ trying to use url_part_aliases only in htsearch doesn't work
+ because there are no unencoded URLs in the database, so the
+ right hand strings in the pairs you define won't match anything.</p>
+
+ <p>You also can't put two different definitions of the
+ url_part_aliases attribute in a single configuration file, as
+ some users have attempted. When you define an attribute twice,
+ the second definition merely overrides the first. Pay close
+ attention to the description and examples for
+ <a href="attrs.html#url_part_aliases">url_part_aliases</a>.
+ You must put one definition of this attribute in your
+ configuration file for htdig, htmerge (or htpurge) and htnotify,
+ and a different definition of it in your configuration file
+ for htsearch.</p>
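+
+ <p>As an illustration only, suppose htdig indexes a hypothetical
+ staging host, http://staging.example.com/, while search results
+ should point to http://www.example.com/. The host names and the
+ placeholders *site and *1 are assumptions made up for this sketch.
+ The configuration file used for indexing might contain:</p>
+<pre>
+url_part_aliases: http://staging.example.com/ *site \
+                  .html *1
+</pre>
+ <p>and the separate configuration file used by htsearch:</p>
+<pre>
+url_part_aliases: http://www.example.com/ *site \
+                  .html *1
+</pre>
+ <p>The second string in each pair is the same in both files, so
+ htsearch can match the encoded form. Only the first string of the
+ first pair differs, and that difference is what rewrites the URL.</p>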
+
+ <strong>4.18. <a name="q4.18">What are all the options in
+ htdig.conf, and are there others?</a></strong><br>
+ <p>In ht://Dig's terminology, the settings in its configuration
+ files are called <a href="attrs.html">configuration attributes</a>,
+ to distinguish them from <a href="htdig.html">command line
+ options</a>, <a href="hts_form.html">CGI input parameters</a>
+ and <a href="hts_templates.html">template variables</a>. There are
+ many, many attributes that can be set to control almost all
+ aspects of indexing, searching, customization of output and
+ internationalization. All attributes have a built-in default
+ setting, and only a subset of these appear in the sample htdig.conf
+ file. See the documentation for all default values for attributes
+ not overridden in the configuration file, and for help on using
+ any of them.
+ See also question <a href="#q1.15">1.15</a>.</p>
+
+ <strong>4.19. <a name="q4.19">How do I get more than 10 pages of
+ 10 search results from htsearch?</a></strong><br>
+ <p>There are two attributes that control the number of matches per
+ page and the total number of pages. The number of matches per page
+ can be set in your configuration file, using the
+ <a href="attrs.html#matches_per_page">matches_per_page</a> attribute,
+ or in your <a href="hts_form.html">search form</a>, using the
+ <strong>matchesperpage</strong> input parameter.</p>
+
+ <p>The number of pages is controlled by the
+ <a href="attrs.html#maximum_pages">maximum_pages</a> attribute in
+ your search configuration file.
+ The current default for maximum_pages is 10 because the ht://Dig
+ package comes with 10 images, with numbers 1 through 10, for
+ use as page list buttons. If we increased the limit, we'd have
+ to field a whole lot more questions from users irate because
+ only the first 10 buttons are graphics, and the rest are text.
+ If you want more than 10 pages of results, change maximum_pages,
+ but you may also want to set the
+ <a href="attrs.html#page_number_text">page_number_text</a> and
+ <a href="attrs.html#no_page_number_text">no_page_number_text</a>
+ attributes in your search configuration file to nothing, or
+ remove them, to use text rather than images for the links to
+ other pages.</p>
+
+ <p>In versions of htsearch before 3.1.4, maximum_pages
+ limited only the number of page list buttons, and not the
+ actual number of pages. This was changed because there was no
+ means of limiting the total number of pages, but this ended up
+ frustrating users who wanted the ability to have more pages than
+ buttons. In 3.2.0b3 and 3.1.6 we introduced a
+ <a href="attrs.html#maximum_page_buttons">maximum_page_buttons</a>
+ attribute for this purpose.</p>
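+
+ <p>A possible search configuration for 25 pages of 20 matches
+ each is sketched below. The numbers are arbitrary examples, not
+ recommendations.</p>
+<pre>
+matches_per_page: 20
+maximum_pages: 25
+
+# Either use text links for all 25 pages...
+page_number_text:
+no_page_number_text:
+
+# ...or, in 3.1.6 and 3.2.0b3 or later, keep the 10 image buttons
+# by limiting only the number of buttons (shown commented out).
+#maximum_page_buttons: 10
+</pre>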
+
+ <strong>4.20. <a name="q4.20">How do I restrict a search to only
+ certain subdirectories or documents?</a></strong><br>
+ <p>That depends on whether you want to protect certain parts of
+ your site from prying eyes, or just limit the scope of search
+ results to certain relevant areas. For the latter, you just need
+ to set the <strong>restrict</strong> or <strong>exclude</strong>
+ input parameter in the <a href="hts_form.html">search form</a>.
+ This can be done using hidden input fields containing preset
+ values, text input fields, select lists, radio buttons or
+ checkboxes, as you see fit. If you use select lists, you can
+ propagate the choices to select lists in the follow-up search
+ forms using the
+ <a href="attrs.html#build_select_lists">build_select_lists</a>
+ configuration attribute.
+ The University at Albany has a good description of how to use
+ the <strong>restrict</strong> or <strong>exclude</strong> input
+ parameters: <a href="http://www.albany.edu/its/web/search/">
+ Constructing a local search using ht://Dig Search forms</a>.
+ <br>To include a hex encoded character (such as a %20 for a space)
+ in a restrict or exclude string, the '%' must again be encoded.
+ For example, to match a filename containing a space, the URL must
+ contain %20, and so the CGI parameter passed to htsearch must
+ contain %2520. The %25 encodes the '%'. (Note that this is only
+ necessary for CGI input parameters, not for the corresponding
+ configuration attributes in your htdig.conf file, as attributes
+ aren't subjected to the same hex decoding step as parameters are.)
+ <br>See also question <a href="#q4.4">4.4</a>.</p>
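+
+ <p>For instance, a search form limited to one area of a site
+ might look roughly like this. The URLs and the /cgi-bin/htsearch
+ location are placeholders for whatever your installation uses.</p>
+<pre>
+&lt;form method="get" action="/cgi-bin/htsearch"&gt;
+&lt;input type="hidden" name="restrict" value="http://www.example.com/manuals/"&gt;
+&lt;input type="hidden" name="exclude" value="/manuals/drafts/"&gt;
+Search the manuals: &lt;input type="text" name="words"&gt;
+&lt;input type="submit" value="Search"&gt;
+&lt;/form&gt;
+</pre>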
+
+ <p>If you wish to keep secure and non-secure areas on
+ your site separate, and avoid having unauthorized users
+ seeing documents from secure areas in their search results,
+ that takes a bit more effort. You certainly can't rely on
+ the <strong>restrict</strong> and <strong>exclude</strong>
+ parameters, or even the <strong>config</strong> parameter,
+ as any parameter in a search form can also be overridden
+ by the user in a URL with CGI parameters. The safest
+ option would be to host the secure and non-secure areas on
+ separate servers with independent installations of htsearch,
+ each with its own ht://Dig database, but that is often too
+ costly or impractical an option. The next best thing is to
+ host them on the same site, but make sure that everything
+ is very clearly separated to prevent any leakage of secure
+ data. You should maintain separate databases for the secure
+ and public areas of your site, by setting up different htdig
+ configuration files for each area. Use different settings
+ of the <a href="attrs.html#start_url">start_url</a>,
+ <a href="attrs.html#limit_urls_to">limit_urls_to</a>
+ and <a href="attrs.html#database_dir">database_dir</a>
+ configuration attributes, and possibly even different
+ <a href="attrs.html#common_dir">common_dir</a> settings as well.
+ Make sure your database_dir, and even your common_dir, are not
+ in any directories accessible from the web server. Run htdig
+ and htmerge (or rundig) with each separate configuration file,
+ to build your two databases.</p>
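+
+ <p>A rough sketch of the two configuration files follows. The
+ host name and all directory names are hypothetical, and the
+ exclude_urls line assumes you start from the default value of
+ that attribute.</p>
+<pre>
+# Public configuration, e.g. /etc/htdig/public/htdig.conf
+start_url:      http://www.example.com/
+limit_urls_to:  ${start_url}
+exclude_urls:   /cgi-bin/ .cgi /secure/
+database_dir:   /var/lib/htdig/public
+</pre>
+<pre>
+# Secure configuration, e.g. /etc/htdig/secure/htdig.conf
+start_url:      http://www.example.com/secure/
+limit_urls_to:  ${start_url}
+database_dir:   /var/lib/htdig/secure
+</pre>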
+
+ <p>The tricky part is to make sure your htsearch program is
+ secure. You don't want to use the same htsearch for the secure
+ and public sites, because otherwise the public site could
+ access the configuration for the secure database, making its
+ data publicly accessible. You must either compile two separate
+ versions of htsearch, with different settings of the CONFIG_DIR
+ <em>make</em> variable, or you must make a simple wrapper
+ script for htsearch that overrides the compiled-in CONFIG_DIR
+ setting by a different setting of the CONFIG_DIR environment
+ variable. Make sure the CONFIG_DIR for the secure area is
+ not a subdirectory of the CONFIG_DIR for the public area.
+ In this way, you can maintain separate directories of config
+ files for the public and secure sites, so that the secure
+ config files are not accessible from the public htsearch.</p>
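+
+ <p>If you take the wrapper script route, something along these
+ lines may be enough. The paths are assumptions; adjust them to
+ where your secure config files and the real htsearch binary
+ actually live.</p>
+<pre>
+#!/bin/sh
+# Hypothetical wrapper for the secure site.
+# Point htsearch at the secure configuration directory...
+CONFIG_DIR=/etc/htdig/secure
+export CONFIG_DIR
+# ...and hand the request to the real binary, kept outside the web tree.
+exec /usr/local/libexec/htsearch "$@"
+</pre>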
+
+ <p>Put the htsearch binary or wrapper script for the secure site
+ in a different ScriptAlias'ed cgi-bin directory than the public
+ one, and protect the secure cgi-bin with a .htaccess file or
+ in your server configuration. Alternatively, you can put the
+ secure program, let's call it htssearch, in the same cgi-bin,
+ but protect that one CGI program in your server configuration,
+ e.g.:</p>
+<pre>
+&lt;Location /cgi-bin/htssearch&gt;
+AuthType Basic
+AuthName ....
+AuthUserFile ...
+AuthGroupFile ...
+&lt;Limit GET POST&gt;
+require group foo
+&lt;/Limit&gt;
+&lt;/Location&gt;
+</pre>
+ <p>This describes the setup for an Apache server. You'd need to
+ work out an equivalent configuration for your server if you're
+ not running Apache.</p>
+
+ <strong>4.21. <a name="q4.21">How can I allow people to search
+ while the index is updating?</a></strong><br>
+ <p>Answer contributed by Avi Rappoport &lt;avirr@searchtools.com&gt;</p>
+ <p>If you have enough disk space for two copies of the index
+ database, use -a with the htdig and htmerge processes. This will
+ make use of a copy of the index database with the extension
+ ".work", and update the copy instead of the originals.
+ This way, htsearch can use those originals while the update is
+ going on. When it's done, you can move the .work versions to
+ replace the originals, and htsearch will use them. The current
+ rundig script will do this for you if you supply the -a flag
+ to it. However, rundig builds the database from scratch each
+ time you run it. If you want to update an alternate copy of
+ the database, see the
+ <a href="http://www.htdig.org/files/contrib/scripts/rundig.sh">contributed
+ rundig.sh script</a>.</p>
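+
+ <p>Done by hand, the sequence looks roughly like the following.
+ This is a sketch, not a replacement for rundig: the config file
+ and database directory are assumed names, and like rundig it
+ rebuilds the index from scratch (-i).</p>
+<pre>
+# Build the new index in the .work copies while htsearch keeps
+# serving the existing databases.
+htdig -i -a -c /etc/htdig/htdig.conf
+htmerge -a -c /etc/htdig/htdig.conf
+
+# Then move each .work file over the original it replaces.
+cd /var/lib/htdig
+for f in *.work
+do
+    mv -f "$f" "${f%.work}"
+done
+</pre>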
+
+ <strong>4.22. <a name="q4.22">How can I get htdig to ignore the
+ robots.txt file or meta robots tags?</a></strong><br>
+ <p>You can't, and you shouldn't. The
+ <a href="http://www.robotstxt.org/wc/norobots.html">
+ Standard for Robot Exclusion</a> exists for a very good reason,
+ and any well behaved indexing engine or spider should conform to it.
+ If you have a problem with a robots.txt file, you really should
+ take it up with the site's webmaster. If they don't have a problem
+ with you indexing their site, they shouldn't mind setting up a
+ User-agent entry in their robots.txt file with a name you both
+ agree on. The user agent setting that htdig uses for matching
+ entries in robots.txt can be changed via the
+ <a href="attrs.html#robotstxt_name">robotstxt_name</a> attribute in
+ your config file.</p>
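+
+ <p>For example, if you and the webmaster agree on the name
+ "examplebot" (a made-up name for this sketch), your config file
+ would contain:</p>
+<pre>
+robotstxt_name: examplebot
+</pre>
+ <p>and the site's robots.txt could then admit htdig while still
+ keeping other robots out of a given area:</p>
+<pre>
+User-agent: examplebot
+Disallow:
+
+User-agent: *
+Disallow: /private/
+</pre>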
+
+ <p>If you have a problem with a robots meta tag in a document
+ (see question <a href="#q4.15">4.15</a>) you should take it up
+ with the author or maintainer of that page. These tags are an
+ all or nothing deal, as they can't be set up to allow some engines
+ and disallow others. If htdig encounters them, it has to give the
+ page's creator the benefit of the doubt and honour them. If
+ exceptions to the rule are wanted, this should be done with a
+ robots.txt file rather than a meta tag.</p>
+
+ <strong>4.23. <a name="q4.23">How can I get htdig not to index
+ some directories, but still follow links?</a></strong><br>
+ <p>You can simply add the directory name to your robots.txt file
+ or to the <a href="attrs.html#exclude_urls">exclude_urls</a>
+ attribute in your configuration, but that will exclude all files
+ under that directory. If you want the files in that directory to
+ be indexed, you have a couple of options. You can add an index.html
+ file to the directory, that will include a robots meta tag
+ (see question <a href="#q4.15">4.15</a>) to prevent indexing,
+ and will contain links to all your files in this directory.
+ The drawback of this is that you must maintain the index.html
+ file yourself, as it won't be automatically updated as new
+ files are added to the directory.</p>
+
+ <p>The other technique you can use, if you want the directory
+ index to be made by the web server, is to get the server to
+ insert the robots meta tag into the index page it generates.
+ In Apache, this is done using the
+ <a href="http://httpd.apache.org/docs/mod/mod_autoindex.html#headername">HeaderName</a>
+ and <a href="http://httpd.apache.org/docs/mod/mod_autoindex.html#indexoptions">IndexOptions</a>
+ directives in the directory's <strong>.htaccess</strong> file.
+ For example:</p>
+<pre> HeaderName .htrobots
+ IndexOptions FancyIndexing SuppressHTMLPreamble
+</pre>
+ <p>and in the .htrobots file:</p>
+<pre>&lt;HTML&gt;&lt;head&gt;
+&lt;META NAME="robots" CONTENT="noindex, follow"&gt;
+&lt;title&gt;Index of /this/dir&lt;/title&gt;
+&lt;/head&gt;
+</pre>
+
+ <p>If you don't mind getting just one copy of each directory,
+ but want to suppress the multiple copies generated by Apache's
+ FancyIndexing option, you can either turn off FancyIndexing or
+ you can add "?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D" to
+ the <a href="attrs.html#bad_querystr">bad_querystr</a> attribute
+ (without the quotes) to suppress the alternately sorted views of
+ the directory. For Apache 2.x, you'd use "C=D C=M C=N C=S O=A O=D"
+ instead in your bad_querystr setting.</p>
+
+ <strong>4.24. <a name="q4.24">How can I get rid of duplicates in
+ search results?</a></strong><br>
+ <p>This depends on the cause of the duplicate documents. htdig
+ does keep track of the URLs it visits, so it never puts the
+ same URL more than once in the database. So, if you have
+ duplicate documents in your search results, it's because the
+ same document appears under different URLs. Sometimes the
+ URLs vary only slightly, and in subtle ways, so you may have
+ to look hard to find out what the variation is. Here are some
+ common reasons, each requiring a different solution.</p>
+
+ <ul>
+ <li>You're indexing a case insensitive web
+ server (e.g. an NT based server), but the
+ <a href="attrs.html#case_sensitive">case_sensitive</a> attribute is
+ still set to true. In this case, if htdig encounters two URLs
+ pointing to the same document, but the case of the letters in
+ one is different than the other (even if it's only 1 letter),
+ it will not treat them as the same URL.<br><br>
+ <li>You have symbolic links (or hard links) to some of
+ these documents, so they can be reached by several URLs.
+ The solution here is to build an exclude list of URLs that
+ are actually symbolic links, and put these in
+ <a href="attrs.html#exclude_urls">exclude_urls</a>
+ (or in your robots.txt file). You can automate this using a
+ technique similar to the find command in question
+ <a href="#q5.25">5.25</a> which builds the start_url list, but
+ adding a -type l to find symbolic links.<br><br>
+ <li>You have copies of the same documents in different
+ locations. This is similar to the symbolic link problem above,
+ but harder to fix automatically.<br><br>
+ <li>The duplicate URLs result from CGI, SSI or other dynamic pages
+ that give the same content even though there may be variations in
+ the query string or other parts of the URL. The approach to
+ fix this is similar to the fix above, but may be less easy
+ to automate, depending on what the variations are. You can
+ add patterns to exclude_urls or bad_querystr to get rid of
+ unwanted variations. These are especially important to bring
+ under control, because in some cases, if left unchecked, they
+ can result in an <em>infinite virtual hierarchy</em> which htdig
+ will never be able to finish indexing. For example, in a CGI-based
+ calendar, htdig could go on following next month or next
+ year links to infinity, but this can be stopped by adding a
+ stop year to <a href="attrs.html#bad_querystr">bad_querystr</a>.
+ <br><br>Another common example happens when htdig hits a link
+ to an SSI page and the URL has an extra trailing slash. This
+ can happen with either .shtml pages or .html pages that use
+ the XBitHack. The trailing slash causes the URL to be misinterpreted
+ as a directory URL, and any relative URLs in the document are added
+ to the URL, creating longer and longer URLs that still lead to the
+ same SSI document. There are two things you can do:<ol>
+ <li>hunt down the pages with the incorrect links, i.e.
+ search for ".shtml/" or ".html/" in URLs in your documents,
+ and fix these links; or
+ <li>add .shtml/ and .html/ to your
+ <a href="attrs.html#exclude_urls">exclude_urls</a>
+ setting to get htdig to ignore these defective links.
+ </ol>The second option is easier, but you run the risk that htdig
+ will miss some SSI pages if the only links to them have the trailing
+ slash, so you may want to try hunting down the links anyway.
+ <br><br>See also question <a href="#q5.29">5.29</a>.<br><br>
+ <li>The duplicates result from session IDs in PHP or other dynamic
+ pages that give the same content even though the ID changes during
+ the indexing process. This can lead not only to duplicates, but
+ also to URLs that become unusable because of expired session IDs.
+ Session IDs are the bane of search engines, and you should avoid
+ using them if at all possible. If getting rid of them altogether
+ isn't an option, then you can at least remove them while indexing,
+ using the <a href="attrs.html#url_rewrite_rules">url_rewrite_rules</a>
+ attribute. This will only work if htdig can access the documents
+ without a session ID, as htdig rewrites the URL before fetching the
+ document, and htsearch presents the rewritten URL (without session
+ ID) in search results.
+ </ul>
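+
+ <p>As a sketch of the first and fourth cases above, the relevant
+ htdig.conf settings might look like this. The exclude_urls line
+ assumes you are extending the default value of that attribute.</p>
+<pre>
+# The server ignores case in URLs, so htdig should too
+case_sensitive: false
+
+# Default exclusions plus the SSI links with a trailing slash
+exclude_urls: /cgi-bin/ .cgi .shtml/ .html/
+</pre>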
+
+ <strong>4.25. <a name="q4.25">How can I change the scores in
+ search results, and what are the defaults?</a></strong><br>
+ <p>The scores are calculated mostly by htdig at indexing time,
+ with some tweaking done by htsearch at search time. There are
+ a number of <a href="attrs.html">configuration attributes</a>,
+ all called <em>&lt;something&gt;</em><strong>_factor</strong>,
+ which can control the scoring calculations. In addition, the
+ location of words within the document has an effect on score,
+ as word scores are also multiplied by a varying location
+ factor somewhere in between 1000 for words near the start
+ and 1 for words near the end of the document. As yet,
+ there is no way to change this factor. For any of the scoring
+ factors you can configure, and which are used by htdig, you
+ will have to reindex your documents so the new factors take
+ effect. The default values for these scoring factors, as well as
+ information about whether they're used by htdig or htsearch,
+ are all listed in the <a href="attrs.html">configuration
+ attributes documentation</a>. Malcolm Austen has written some
+ <a href="http://wwwsearch.ox.ac.uk/scores.html">notes on page
+ scores</a> for 3.1.x which you may find helpful.</p>
+
+ <p>Note that the above applies to the 3.1.x releases, while
+ in the 3.2 beta releases, all scores are calculated at search
+ time with no weight being put on the location of words within
+ the document.</p>
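+
+ <p>For example, to give more weight to words in titles and less
+ to meta keywords, you might try settings like the ones below.
+ These values are purely illustrative and are not the defaults.
+ On 3.1.x, remember to reindex after changing any factor that is
+ used by htdig.</p>
+<pre>
+title_factor: 200
+keywords_factor: 50
+text_factor: 1
+</pre>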
+
+ <strong>4.26. <a name="q4.26">How can I get htdig not to index
+ JavaScript code or CSS?</a></strong><br>
+ <p>The HTML parser in htdig recognizes and parses only HTML,
+ which is all there should be within an HTML file. If your HTML
+ files contain in-line JavaScript code or Cascading Style Sheets
+ (CSS), these in-line codes, which are clearly not HTML, should
+ be enclosed within an HTML comment tag so they are hidden
+ from the HTML parser, or for that matter from any
+ web client that is not JavaScript-aware or CSS-aware. See
+ <a href="http://www.mcli.dist.maricopa.edu/show/interact/js_b.html">
+ Behind the Scenes with JavaScript</a> for a description of the
+ technique, which applies equally well to in-line style sheets.
+ If fixing up all non-HTML compliant JavaScript or CSS code in
+ your HTML files is not an option, then see question
+ <a href="#q4.15">4.15</a> for an alternate technique.</p>
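+
+ <p>A minimal example of the comment technique, with placeholder
+ script and style contents:</p>
+<pre>
+&lt;script type="text/javascript"&gt;
+&lt;!--
+function greet() { alert("Hello"); }
+// --&gt;
+&lt;/script&gt;
+
+&lt;style type="text/css"&gt;
+&lt;!--
+h1 { color: navy; }
+--&gt;
+&lt;/style&gt;
+</pre>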
+
+ <p>The HTML parser in htdig 3.1.6 tries skipping over bare
+ in-line JavaScript code in HTML, unlike previous versions,
+ but a small bug in the parser causes it to be thrown off by a
+ "&lt;" sign in the JavaScript, and it may then miss the closing
+ &lt;/script&gt; tag. This can be fixed by applying this
+ <a href="ftp://ftp.ccsf.org/htdig-patches/3.1.6/JavaScript.0">
+ patch</a>.</p>
+
+ <hr noshade size=2>
+
+ <h3>5. Troubleshooting</h3>
+ <strong>5.1. <a name="q5.1">I can't seem to index more than X documents
+ in a directory.</a></strong><br>
+ <p>This usually has to do with the default document size
+ limit. If you set <a href="attrs.html#max_doc_size">
+ max_doc_size</a> in your config file to
+ something large enough to read in the directory index (try 100000 for
+ 100K) this should fix this problem. Of course this will require
+ more memory to read the larger file. Don't set it to a value
+ larger than the amount of memory you have, and never more than
+ about 2 billion, the maximum value of a 32-bit integer.
+ If htdig is missing entire directories, see question
+ <a href="#q5.25">5.25</a>.</p>
+
+ <strong>5.2. <a name="q5.2">I can't index PDF files.</a></strong><br>
+ <p>As above, this usually has to do with the default document
+ size. What happens is ht://Dig will read in part of a PDF file
+ and try to index it. This usually fails. Try setting
+ <a href="attrs.html#max_doc_size">max_doc_size</a>
+ in your config file to a larger value than the
+ size of your largest PDF file. Don't go overboard, though, as
+ you don't want to overflow a 32-bit integer (about 2 billion),
+ and you don't want to allocate much more memory than you need
+ to store the largest document.</p>
+
+ <p>There is a bug in Adobe Acrobat Reader version 4, in its
+ handling of the -pairs option, which causes a segmentation
+ violation when using it with htdig 3.1.2 or earlier. There is
+ a workaround for this as of version 3.1.3 - you must remove
+ the -pairs option from your pdf_parser definition, if it's
+ there. However, acroread version 4 is still very unstable (on
+ Linux, anyway) so it is not recommended as a PDF parser. An
+ alternative is to use an external converter with the xpdf 0.90
+ package installed on your system, as described in question <a
+ href="#q4.9">4.9</a> above.</p>
+
+ <strong>5.3. <a name="q5.3">When I run "rundig," I get a message about
+ "DATABASE_DIR" not being found.</a></strong><br>
+ <p>This is due to a bug in the Makefile.in file in version
+ 3.1.0b1. The easiest fix is to edit the rundig file and change
+ the line "TMPDIR=@DATABASE_DIR@" to set TMPDIR to a directory
+ with a large amount of temporary disk space for htmerge. This
+ bug is fixed in version 3.1.0b2.</p>
+
+ <strong>5.4. <a name="q5.4">When I run htmerge, it stops with an "out
+ of diskspace" message.</a></strong><br>
+ <p>This means that htmerge has run out of temporary disk space
+ for sorting. Either in your "rundig" script (if you run htmerge
+ through that) or before you run htmerge, set the variable TMPDIR
+ to a temp directory with lots of space.</p>
+
+ <strong>5.5. <a name="q5.5">I have problems running rundig from cron
+ under Linux.</a></strong><br>
+ <p>This problem commonly occurs on Red Hat Linux 5.0 and 5.1,
+ because of a bug in vixie-cron. It causes htmerge to fail with a
+ "Word sort failed" error. It's fixed in Red Hat 5.2.
+ You can install vixie-cron-3.0.1-26.{arch}.rpm from a 5.2
+ distribution to fix the problem on 5.0 or 5.1. A quick fix for
+ the problem is to change the first line of rundig to "#!/bin/ash"
+ which will run the script through the ash shell, but this doesn't
+ solve the underlying problem.</p>
+
+ <strong>5.6. <a name="q5.6">When I run htmerge, it stops with an
+ "Unexpected file type" message.</a></strong><br>
+ <p>Often this is because the databases are corrupt. Try removing
+ them and rebuilding. If this doesn't work, some have found that
+ the solution for question <a href="#q3.2">3.2</a> works for this
+ as well. This should be fixed in versions 3.1.x and later.</p>
+
+ <strong>5.7. <a name="q5.7">When I run htsearch, I get lots of Internal
+ Server Errors (#500).</a></strong><br>
+ <p>If you are running under Solaris, see <a href="#q3.6">3.6</a>.
+ The solution for Solaris may also work for other OSes that use shared
+ libraries in non-standard locations, so refer to question 3.6 if
+ you suspect a shared library problem. In any case, check your web
+ server error logs to see the cause of the internal server errors.
+ If it's not a problem with shared libraries, there's a good chance
+ that the error logs will still contain useful error messages that
+ will help you figure out what the problem is.
+ <br>See also questions <a href="#q5.13">5.13</a> and
+ <a href="#q5.23">5.23</a>.</p>
+
+ <strong>5.8. <a name="q5.8">I'm having problems with indexing words
+ with accented characters.</a></strong><br>
+ <p>
+ Most of the time, this is caused by either not setting or
+ incorrectly setting the <a
+ href="attrs.html#locale">locale</a> attribute. The default locale
+ for most systems is the "portable" locale, which strips
+ everything down to standard ASCII. Most systems expect
+ something like <code>locale: en_US</code> or
+ <code>locale: fr_FR</code>. Locale files are often found in
+ <code>/usr/share/locale</code> or the <tt>$LANGUAGE</tt>
+ environment variable. See also question <a href="#q4.10">4.10</a>.
+ </p>
+
+ <p>Setting the locale correctly seems to be a frequent source of
+ frustration for ht://Dig users, so here are a few pointers which
+ some have found useful. First of all, if you don't have any luck
+ with the settings of the <a href="attrs.html#locale">locale</a>
+ attribute that you try, make sure you use a locale that is
+ defined on your system. As mentioned above, these are usually
+ installed in <code>/usr/share/locale</code>, so look there
+ for a directory named for the locale you want to use. If
+ you don't find it, but find something close, try that locale
+ name. Note that the locale may not have to be specific to the
+ language you're indexing, as long as it uses the same character
+ set. E.g. most western European languages use the ISO-8859-1
+ Latin 1 character set, so on most systems the locales for
+ all these languages define the same character types table
+ and can be used interchangeably. Some systems, however,
+ define only the accented letters used for a given language,
+ so "your mileage may vary." The important thing is that the
+ directory for your locale definition <strong>must</strong>
+ have a file named <code>LC_CTYPE</code> in it. For example,
+ on many Linux distributions, a language-specific locale like
+ <code>fr</code> won't contain this file, but country-specific
+ locales like <code>fr_FR</code> or <code>fr_CA</code> will. If
+ you don't find any appropriate locales installed on your system,
+ try obtaining and installing the locale definition files from
+ your OS distribution. Also, once you've set your locale, you need
+ to reindex all your documents in order for the locale to take
+ effect in the word database. This means rerunning the "rundig"
+ script, or running "htdig -i" and htmerge (or htpurge in the 3.2
+ betas).</p>
+
+ <p>Note also that some UNIX systems and libc5-based Linux
+ systems just don't have a working implementation of locales,
+ so you may not be able to get locales working at all on certain
+ systems. The
+ <a href="http://www.htdig.org/files/contrib/other/testlocale.c">testlocale.c</a>
+ program on our web site can let you see the LC_CTYPE tables
+ for any locale, to aid in finding one that works. Carefully
+ follow the directions in the program's comments to know how to
+ use it and what to look for in its output.</p>
+
+ <strong>5.9. <a name="q5.9">When I run htmerge, it stops with a
+ "Word sort failed" message.</a></strong><br>
+ <p>There are three common causes of this. First of all, the sort
+ program may be running out of temporary file space. Fix this
+ by freeing up some space where sort puts its temporary files,
+ or change the setting of the TMPDIR environment variable to a
+ directory on a volume with more space. A second common problem
+ is on systems with a BSD version of the sort program (such as
+ FreeBSD or NetBSD). This program uses the -T option as a record
+ separator rather than an alternate temporary directory. On these
+ systems, you must remove the TMPDIR environment variable from
+ rundig, or change the code in htmerge/words.cc not to use the
+ -T option. A third cause is the cron program on Red Hat Linux
+ 5.0 or 5.1. (See question <a href="#q5.5">5.5</a> above.)</p>
+
+ <strong>5.10. <a name="q5.10">When htsearch has a lot of matches, it runs
+ extremely slowly.</a></strong><br>
+ <p>When you run htsearch with no customization, on a
+ large database, and it gets a lot of hits, it tends to
+ take a long time to process those hits. Some users with
+ large databases have reported much higher performance,
+ for searches that yield lots of hits, by setting the <a
+ href="attrs.html#backlink_factor">backlink_factor</a> attribute
+ in htdig.conf to 0, and sorting by score. The scores calculated
+ this way aren't quite as good, but htsearch can process hits
+ much faster when it doesn't need to look up the db.docdb record
+ for each hit, just to get the backlink count, date or title,
+ either for scoring or for sorting. This affects versions
+ 3.1.0b3 and up. In version 3.2, currently under development,
+ the databases will be structured differently, so it should
+ perform searches more quickly.</p>
+
+ <p>In version 3.1.6, the date range selection code also slows
+ down htsearch for the same reason. Unfortunately, a small bug
+ crept into the code so that even if you don't set any of the
+ date range input parameters (startyear, endyear, etc.), and
+ you set backlink_factor and date_factor to 0, htsearch still
+ looks at the date in the db.docdb record for each hit. You can
+ avoid this either by setting startyear to 1969 and endyear to
+ 2038 in your config file, or by applying this
+ <a href="ftp://ftp.ccsf.org/htdig-patches/3.1.6/timet_enddate.1">
+ patch</a>.</p>
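+
+ <p>Putting the suggestions in this answer together, a search
+ configuration tuned for speed on 3.1.6 without the patch might
+ contain the following. Treat it as a starting point to test
+ against your own database, since it trades some scoring quality
+ for speed.</p>
+<pre>
+backlink_factor: 0
+date_factor: 0
+sort: score
+startyear: 1969
+endyear: 2038
+</pre>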
+
+ <strong>5.11. <a name="q5.11">When I run htsearch, it gives me a count of
+ matches, but doesn't list the matching documents.</a></strong><br>
+ <p>This most commonly happens when you run htsearch while the
+ database is currently being rebuilt or updated by htdig.
+ If htdig and htmerge have run to completion, and the problem still
+ occurs, this is usually an indication of a corrupted database. If
+ it's finding matches, it's because it found the matching
+ words in db.words.db. However, it isn't finding the document
+ records themselves in db.docdb, which would suggest that either
+ db.docdb, or db.docs.index (which maps document IDs used in
+ db.words.db to URLs used to look up records in db.docdb), is
+ incomplete or messed up. You'll likely need to rebuild your
+ database from scratch if it's corrupted. Older versions of
+ ht://Dig were susceptible to database corruption of this
+ sort. Versions 3.1.2 and later are much more stable.</p>
+
+ <p>Another possible cause of this problem is unreadable result
+ template files. If you define external template files via the
+ <a href="attrs.html#template_map">template_map</a> attribute,
+ rather than using the builtin-short or builtin-long templates,
+ and the file names are incorrect or the files do not have
+ read permission for the user ID under which htsearch runs,
+ then htsearch won't be able to display the results. Also,
+ all directories leading up to these template files must be
+ searchable (i.e. executable) by htsearch, or it won't be able
+ to open the files. This is the opposite problem of that described
+ in question <a href="#q5.36">5.36</a>. If htsearch displays
+ nothing at all, you may have both problems.</p>
+
+ <strong>5.12. <a name="q5.12">I can't seem to index documents with names
+ like left_index.html with htdig.</a></strong><br>
+ <p>There is a bug in the implementation of the <a
+ href="attrs.html#remove_default_doc">remove_default_doc</a>
+ attribute in htdig versions 3.1.0, 3.1.1 and 3.1.2, which causes
+ it to match more than it should. The default value for this
+ attribute is "index.html", so any URL in which the filename ends
+ with this string (rather than matches it entirely) will have
+ the filename stripped off. This is fixed in version 3.1.3.</p>
+
+ <strong>5.13. <a name="q5.13">I get Premature End of Script Headers errors
+ when running htsearch.</a></strong><br>
+ <p>This happens when htsearch dies before putting out a
+ "Content-Type" header. If you are running Apache under Solaris,
+ or another system that may be using shared libraries in non-standard
+ locations,
+ first try the solution described in question <a href="#q3.6">3.6</a>.
+ If that doesn't work, or you're running on another system, try
+ running "htsearch -vvv" directly from the command line to see where
+ and why it's failing. It should prompt you for the search words,
+ as well as the format.
+ <br>If it works from the command line, but not from the web
+ server, it's almost certainly a web server configuration problem.
+ Check your web server's error log for any information related to
+ htsearch's failure. One increasingly common problem is Apache
+ configurations which expect all CGI scripts to be Perl,
+ rather than binary executables or other scripts, so they use
+ the "perl-script" handler rather than "cgi-script".
+ <br>See also questions <a href="#q5.7">5.7</a>,
+ <a href="#q5.14">5.14</a> and <a href="#q5.23">5.23</a>.</p>
+
+ <strong>5.14. <a name="q5.14">I get Segmentation faults when running
+ htdig, htsearch or htfuzzy.</a></strong><br>
+ <p>Despite a great deal of debugging of these programs, we haven't
+ been able to completely eliminate all such problems on all platforms.
+ If you're running htsearch or htfuzzy on a BSDI system, a common
+ cause of core dumps is a conflict between the GNU regex
+ code bundled in htdig 3.1.2 and later, and the BSD C or C++ library.
+ The solution is to use the BSD library's own rx code instead,
+ using version 3.1.6 or newer as summarized by Joe Jah:</p>
+ <ul>
+ <li> ./configure --with-rx
+ <li> make
+ </ul>
+ <p>This solution may work on some other platforms as well (we haven't
+ heard one way or the other), but will definitely not work on some
+ platforms. For instance, on libc5-based Linux systems, the bundled
+ regex code works fine by default, but using libc5's regex code
+ causes core dumps.</p>
+
+ <p>Users of Cobalt Raq or Qube servers have complained of
+ segmentation faults in htdig. Apparently this is due to problems
+ in their C++ libraries, which are fixed in their experimental
+ compiler and libraries. The following commands should install
+ the packages you need:</p>
+ <blockquote>
+ rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/binutils-2.8.1-3C1.mips.rpm<br>
+ rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-1.0.2-9.mips.rpm<br>
+ rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-c++-1.0.2-9.mips.rpm<br>
+ rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-g77-1.0.2-9.mips.rpm<br>
+ rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-objc-1.0.2-9.mips.rpm<br>
+ rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-2.8.0-9.mips.rpm<br>
+ rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-devel-2.8.0-9.mips.rpm<br>
+ rpm -Uvh --force ftp://ftp.cobaltnet.com/pub/products/current/RPMS/gcc-2.7.2-C2.mips.rpm
+ </blockquote>
+ <p>If you have the libg++ package installed, you may have to remove it
+ before installing libstdc++, because these packages conflict.
+ Be sure to do a "make clean" before a "make", to remove any object
+ files compiled with the old compiler and headers.</p>
+
+ <p>For other causes of segmentation faults, or in other programs,
+ getting a stack backtrace after the fault can be useful in narrowing
+ down the problem. For example, try "gdb /path/to/htsearch /path/to/core",
+ then enter the command "bt". You can also try running the program
+ directly under the debugger, rather than attempting a post-mortem
+ analysis of the core dump. Options to the program can be given on
+ gdb's "run" command, and after the program is suspended on fault,
+ you can use the "bt" command. This may give you enough information
+ to find and fix the problem yourself, or at least it may help others
+ on the htdig mailing list to point out what to do next.</p>
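+
+ <p>The two approaches above might look roughly like this in practice.
+ This is only a sketch: the paths are placeholders, "(gdb)" is the
+ debugger's prompt, and -vvv is just one example of the options you
+ might pass on the "run" command:</p>
+<pre>
+gdb /path/to/htsearch /path/to/core
+(gdb) bt
+
+gdb /path/to/htsearch
+(gdb) run -vvv
+(gdb) bt
+</pre>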
+
+ <strong>5.15. <a name="q5.15">Why does htdig 3.1.3 mangle URL parameters
+ that contain bare "&amp;" characters?</a></strong><br>
+ <p>This is a known bug in 3.1.3, and is fixed with this
+ <a href="ftp://ftp.ccsf.org/htdig-patches/3.1.3/HTML.cc.0">
+ patch</a>. You can apply the patch by changing to the main
+ source directory for htdig-3.1.3, and using the command
+ "patch -p0 &lt; /path/to/HTML.cc.0". This is
+ also fixed as of version 3.1.4.</p>
+
+ <strong>5.16. <a name="q5.16">When I run htmerge, it stops with an
+ "Unable to open word list file '.../db.wordlist'" message.</a></strong><br>
+ <p>The most common cause of this error is that htdig did not
+ manage to index any documents, and so it did not create a word
+ list. You should repeat the htdig or rundig command with the
+ -vvv option to see where and why it is failing.
+ See question <a href="#q4.1">4.1</a>.</p>
+
+ <strong>5.17. <a name="q5.17">When using Netscape, htsearch always returns the
+ "No match" page.</a></strong><br>
+ <p>Check your search form. Chances are there is a hidden input
+ field with no value defined. For example, one user had<br>
+ <code>&lt;input type=hidden name=restrict&gt;</code>
+
+ in his search form, instead of<br>
+
+ <code>&lt;input type=hidden name=restrict value=""&gt;</code>
+
+ The problem is that Netscape sets the missing value to a default of " "
+ (two spaces), rather than an empty string. For the restrict parameter,
+ this is a problem, because htsearch won't likely find any URLs with two
+ spaces in them. Other input parameters may similarly pose a problem.
+ </p>
+
+ <p>Another possibility, if you're running 3.2.0b1 or 3.2.0b2, is
+ that you need to make the db.words.db_weakcmpr file writeable by
+ the user ID under which the web server runs. This is a bug, and
+ is fixed in the 3.2.0b5 beta.</p>
+
+
+ <strong>5.18. <a name="q5.18">Why doesn't htdig follow links to other
+ pages in JavaScript code?</a></strong><br>
+ <p>There probably isn't any indexing tool in existence
+ that follows JavaScript links, because such tools don't know how
+ to initiate JavaScript events. Realistically, it would take a
+ full JavaScript parser in order to be able to figure out all the
+ possible URLs that the code could generate, something that's way
+ beyond the means of any search engine. You have a few options:</p>
+ <ul>
+ <li>Add "backup" links using plain HTML &lt;a href=...&gt; tags to
+ all the pages that could be accessed through JavaScript,
+ <li>Add &lt;link&gt; tags to point to all these pages (see
+ <a href="http://www.w3.org/TR/html4/struct/links.html#h-12.3.3">Links
+ and search engines</a> in W3C's HTML 4.0 Specification - requires
+ htdig 3.1.3 or greater, but then <em>everyone</em> should be running
+ 3.1.6 or greater anyway),
+ <li>Compose a list of all the unreachable documents, or write
+ a program to do so, and feed that list as part of htdig's
+ <a href="attrs.html#start_url">start_url</a> attribute.
+ See also question <a href="#q5.25">5.25</a>.
+ </ul>
+
+ <strong>5.19. <a name="q5.19">When I run htsearch from the web server,
+ it returns a bunch of binary data.</a></strong><br>
+ <p>Your server is returning the contents of the htsearch binary.
+ Common causes of this are:</p>
+ <ul>
+ <li>no execute permission on the htsearch binary,
+ <li>the binary won't run on this system (it may be compiled
+ for the wrong system type), or
+ <li>the web server doesn't recognize the file as a CGI
+ (for Apache, you must have a ScriptAlias directive for the
+ program or the directory in which it's installed, or define
+ a cgi-script handler for some suffix, e.g. .cgi, and add that
+ suffix to the program file name).
+ </ul>
+ <p>By default, Apache is usually configured with one cgi-bin
+ directory as ScriptAlias, so all your CGI programs must go in
+ there, or have a .cgi suffix on them. Your configuration may
+ differ, however.</p>
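+
+ <p>As a rough illustration of the two Apache approaches mentioned
+ above, the relevant httpd.conf lines might look like the following.
+ The paths are placeholders, and your server's existing configuration
+ may already contain equivalent directives:</p>
+<pre>
+# Treat everything in this directory as a CGI program
+ScriptAlias /cgi-bin/ "/path/to/your/cgi-bin/"
+
+# Or: treat any file with a .cgi suffix as a CGI program
+# (the directory holding it also needs the ExecCGI option)
+AddHandler cgi-script .cgi
+</pre>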
+
+ <strong>5.20. <a name="q5.20">Why are the betas of 3.2 so
+ slow at indexing?</a></strong><br>
+ <p>
+ As the release notes for these versions suggest, they are
+ somewhat unoptimized and are made available for testing.
+ Since the 3.2 code indexes all locations of words to support
+ phrase searching and other advanced methods, this additional
+ data slows down the indexer. To compensate, the code has a
+ cache configured by the
+ <a href="dev/htdig-3.2/attrs.html#wordlist_cache_size">wordlist_cache_size</a>
+ attribute.
+ As of this writing, the word database code will slow down
+ considerably when the cache fills up. Setting the cache as
+ large as possible provides considerable performance
+ improvement. Development is in progress to improve cache
+ performance.
+ For 3.2.0b6 and higher, see also the
+ <a href="dev/htdig-3.2/attrs.html#store_phrases">store_phrases</a> attribute,
+ which can turn off support for phrase searches, improving the speed.
+ </p>
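+
+ <p>For instance, a configuration along these lines enlarges the
+ cache and, on 3.2.0b6 and higher, turns off phrase support. The
+ cache size shown is only an assumed example value, not a
+ recommendation; pick one that fits the memory you have available:</p>
+<pre>
+wordlist_cache_size: 50000000
+store_phrases: false
+</pre>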
+
+ <strong>5.21. <a name="q5.21">Why does htsearch use ";" instead of
+ "&amp;" to separate URL parameters for the page buttons?</a></strong><br>
+ <p>In versions 3.1.5 and 3.2.0b2, and later, htsearch was
+ changed to use a semicolon character ";" as a parameter
+ separator for page button URLs, rather than "&amp;", for HTML
+ 4.0 compliance. It now allows both the "&amp;" and the ";" as
+ separators for input parameters, because the CGI specification
+ still uses the "&amp;". This change may cause some PHP or CGI
+ wrapper scripts to stop working, but these scripts should be
+ similarly changed to recognize both separator characters.
+ For the definitive reference on this issue, please refer to
+ section B.2.2 of W3C's HTML 4.0 Specification,
+ <a href="http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.2">
+ Ampersands in URI attribute values</a>. We're all a little
+ tired of arguing about it. If you don't like the standard, you
+ can change the Display::createURL() code yourself to ignore it.
+ <br>See also question <a href="#q4.13">4.13</a>.</p>
+
+ <p>If you want to try working within the new standard, you may
+ find it helpful to know that recent versions of CGI.pm will
+ allow either the ampersand or semicolon as a parameter separator,
+ which should fix any Perl scripts that use this library.
+ In PHP, you can simply set the following in your php.ini file
+ to allow either separator:</p>
+<pre>arg_separator.input = ";&amp;"
+</pre>
+
+ <strong>5.22. <a name="q5.22">Why does htsearch show the
+ "&amp;" character as "&amp;amp;" in search results?</a></strong><br>
+ <p>In version 3.1.5, htsearch was fixed to properly
+ re-encode the characters &amp;, &lt;, &gt;, and &quot;
+ into SGML entities. However, the default value for the
+ <a href="attrs.html#translate_amp">translate_amp</a>,
+ <a href="attrs.html#translate_lt_gt">translate_lt_gt</a>
+ and <a href="attrs.html#translate_quot">translate_quot</a>
+ attributes is still false, so these entities don't get converted
+ by htdig. If you set these three attributes to true in your
+ htdig.conf and reindex, the problem will go away.</p>
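+
+ <p>In htdig.conf, that would look like this:</p>
+<pre>
+translate_amp: true
+translate_lt_gt: true
+translate_quot: true
+</pre>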
+
+ <p>In the 3.2 betas there was a bug in the HTML parser that
+ caused it to fail when attempting to translate the "&amp;amp;"
+ entity. This has been fixed in 3.2.0b3. The translate_* attributes
+ are gone as of 3.2.0b2.</p>
+
+ <strong>5.23. <a name="q5.23">I get Internal Server or Unrecognized
+ character errors when running htsearch.</a></strong><br>
+ <p>An increasingly common problem is Apache configurations
+ which expect all CGI scripts to be Perl, rather than binary
+ executables or other scripts, so they use the "perl-script" handler
+ rather than "cgi-script". The fix is to create a separate
+ directory for non-Perl CGI scripts, and define it as such in
+ your httpd.conf file. You should define it the same way as your
+ existing cgi-bin directory, but use "cgi-script" instead of
+ "perl-script". In any case, you should check your web server's
+ error log for any information related to htsearch's failure.
+ <br>See also questions <a href="#q5.7">5.7</a>,
+ <a href="#q5.14">5.14</a> and <a href="#q5.13">5.13</a>.</p>
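+
+ <p>A sketch of such a setup in httpd.conf follows. The URL prefix and
+ directory path are hypothetical, and the handler named here is
+ Apache's standard CGI handler, which Apache itself calls "cgi-script";
+ compare this against how your existing cgi-bin directory is
+ defined:</p>
+<pre>
+ScriptAlias /cgi-bin2/ "/path/to/non-perl/cgi-bin/"
+&lt;Directory "/path/to/non-perl/cgi-bin"&gt;
+ SetHandler cgi-script
+ Options +ExecCGI
+&lt;/Directory&gt;
+</pre>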
+
+ <strong>5.24. <a name="q5.24">I took some settings out of
+ my htdig.conf but they're still set.</a></strong><br>
+ <p>All configuration file attributes have compiled-in, default
+ values. Taking an attribute out of the file is not the same
+ thing as setting it to an empty string, a 0, or a value of
+ false. See question <a href="#q4.18">4.18</a>.</p>
+
+ <strong>5.25. <a name="q5.25">When I run htdig on my site,
+ it misses entire directories.</a></strong><br>
+ <p>First of all, htdig doesn't look at directories itself. It
+ is a spider, and it follows hypertext links in HTML documents.
+ If htdig seems to be missing some documents or entire directory
+ sub-trees of your site, it is most likely because there are
+ no HTML links to these documents or directories. (See also
+ question <a href="#q5.18">5.18</a>.) If htdig does
+ not come across at least one hypertext link to a document
+ or directory, and it's not explicitly listed in the
+ <a href="attrs.html#start_url">start_url</a> attribute, then
+ this document or directory is essentially hidden from view
+ to htdig, or to any web browser or spider for that matter.
+ You can only get htdig to index directories, without providing
+ your own files with links to the contents of these directories,
+ by using your web server's automatic index generation feature.
+ In Apache, this is done with the mod_autoindex module, which
+ is usually compiled-in by default, and is enabled with the
+ "Indexes" option for a given directory hierarchy. For example,
+ you can put these directives in your Apache configuration:</p>
+<pre>
+&lt;Directory "/path/to/your/document/root"&gt;
+ Options Indexes FollowSymLinks Includes ExecCGI
+&lt;/Directory&gt;
+</pre>
+ <p>This will cause Apache to automatically generate an index
+ for any directory that does not have an index.html or other
+ "DirectoryIndex" file in it. Other web servers will have
+ similar features, which you should look for in your server
+ documentation.</p>
+
+ <p>As an alternative to relying on the web server's autoindex
+ feature, you can compose a list of all the unreachable
+ documents, or write a program to do so, and feed that list as
+ part of htdig's <a href="attrs.html#start_url">start_url</a>
+ attribute. Here is an example of a simple shell script to make
+ a file of URLs you can use with a configuration entry like
+ <code>start_url: `/path/to/your/file`</code>:</p>
+<pre>
+find /path/to/your/document/root -type f -name \*.html -print | \
+ sed -e 's|/path/to/your/document/root/|http://www.yourdomain.com/|' > \
+ /path/to/your/file
+</pre>
+ <p>Other reasons why htdig might be missing portions of your
+ site might be that they fall out of the bounds specified
+ by the <a href="attrs.html#limit_urls_to">limit_urls_to</a>
+ attribute (which takes on the value of start_url by default),
+ they are explicitly excluded using the
+ <a href="attrs.html#exclude_urls">exclude_urls</a> attribute,
+ or they are disallowed by a robots.txt file (see the
+ <a href="htdig.html">htdig</a> documentation for notes about
+ robot exclusion) or by a robots meta tag (see question
+ <a href="#q4.15">4.15</a>). If htdig seems to be missing the
+ last part of a large directory or document, see question
+ <a href="#q5.1">5.1</a>. For reasons why htdig may be rejecting
+ some links to parts of your site, see question
+ <a href="#q5.27">5.27</a>.</p>
+
+ <strong>5.26. <a name="q5.26">What do all the numbers and symbols
+ in the htdig -v output mean?</a></strong><br>
+ <p>Output from htdig -v typically looks like this:</p>
+<pre>
+23000:35506:2:http://xxx.yyy.zz/index.html: ***-+****--++***+ size = 4056
+</pre>
+ <p>The first number is the number of documents parsed so far,
+ the second is the DocID for this document, and the third is
+ the hop count of the document (number of hops from one of the
+ start_url documents). After the URL, it shows a "*" for a link
+ in the document that it already visited (or at least queued
+ for retrieval), a "+" for a new link it just queued, and a
+ "-" for a link it rejected for any of a number of reasons.
+ To find out what those reasons are, you need to run htdig
+ with at least 3 "v" options, i.e. -vvv. If there are no "*",
+ "+" or "-" symbols after the URL, it doesn't mean the document
+ was not parsed or was empty, but only that no links to other
+ documents were found within it.</p>
+
+ <strong>5.27. <a name="q5.27">Why is htdig rejecting some of the
+ links in my documents?</a></strong><br>
+ <p>When htdig parses documents and finds hypertext links to
+ other documents (hrefs), it may reject them for any of several
+ reasons. To find out what those reasons are, you need to run
+ htdig with at least 3 "v" options, i.e. -vvv. Here are the
+ meanings of some of the messages you might see at this verbosity
+ level.</p>
+ <dl>
+ <dt>Not an http or relative link!</dt>
+ <dd>In versions 3.1.5 and earlier, only "http://" URLs, or
+ URLs relative to those, are allowed.</dd>
+ <dt>Item in the exclude list: item # <em>n</em></dt>
+ <dd>A substring of the URL matches one of the items in the
+ <a href="attrs.html#exclude_urls">exclude_urls</a>
+ attribute. The given item number will indicate which
+ pattern matched, starting at 1. The 3.2.0 betas do not
+ give the item number.</dd>
+ <dt>Extension is invalid!</dt>
+ <dd>The file name extension or suffix matches one of those
+ listed in the
+ <a href="attrs.html#bad_extensions">bad_extensions</a>
+ attribute.</dd>
+ <dt>Extension is not valid!</dt>
+ <dd>The file name extension or suffix does not match one of those
+ listed in the
+ <a href="attrs.html#valid_extensions">valid_extensions</a>
+ attribute, if any are specified.</dd>
+ <dt>Invalid Querystring! <em>or</em><br>item in bad query list</dt>
+ <dd>The URL contains a query string which matches one of those
+ listed in the
+ <a href="attrs.html#bad_querystr">bad_querystr</a>
+ attribute.</dd>
+ <dt>URL not in the limits!</dt>
+ <dd>No substring of the URL entirely matches one of the items in the
+ <a href="attrs.html#limit_urls_to">limit_urls_to</a>
+ attribute. The purpose of this attribute is to keep htdig
+ from attempting to index the entire World Wide Web.</dd>
+ <dt>forbidden by server robots.txt!</dt>
+ <dd>A substring of the URL matches one of the items disallowed
+ in the server's robots.txt file. See
+ <a href="http://www.robotstxt.org/wc/norobots.html">
+ A Standard for Robot Exclusion</a>. This message exists
+ only in the 3.2.0 betas. In 3.1.5 and earlier, this condition
+ is only caught later, resulting in the message
+ "robots.txt: discarding '<em>URL</em>'" from htdig, and a
+ later "Deleted: no excerpt" message from htmerge.</dd>
+ <dt>url rejected: (level 2)</dt>
+ <dd>No substring of the URL entirely matches one of the items in the
+ <a href="attrs.html#limit_normalized">limit_normalized</a>
+ attribute. All the other rejections above will be indicated
+ as level 1. The 3.2.0 betas give the much more meaningful
+ message 'not in "limit_normalized" list!'</dd>
+ </dl>
+
+ <p>Another possibility, if none of the error messages above appear
+ for some of the links you think htdig should be accepting, is that
+ htdig isn't even finding the links at all. First, make sure you're
+ not making false assumptions about how htdig finds these. It only
+ reads links in HTML code, and not JavaScript, and it doesn't read
+ directories unless the HTTP server is feeding it directory listings.
+ You will need to take a close look at the htdig -vvv (or -vvvv)
+ output to see what htdig is finding, in and around the areas where
+ the desired links are supposed to be found in your HTML code, to see
+ if it's actually finding them.
+ See also question <a href="#q5.25">5.25</a>.</p>
+
+ <strong>5.28. <a name="q5.28">When I run htdig or htmerge, I get a
+ "DB2 problem...: missing or empty key value specified" message.</a></strong><br>
+ <p>The most common cause of this error is that htdig or
+ htmerge rejected any documents that had been put in the
+ database, leaving an empty database. You need to find out the
+ reasons for the rejection of these documents. See questions
+ <a href="#q4.1">4.1</a>, <a href="#q5.25">5.25</a> and
+ <a href="#q5.27">5.27</a>.</p>
+
+ <strong>5.29. <a name="q5.29">When I run htdig on my site,
+ it seems to go on and on without ending.</a></strong><br>
+ <p>There are some things that can cause htdig to run on without
+ ending, especially when indexing dynamic content (ASP, PHP,
+ SSI or CGI pages). This usually involves htdig getting caught
+ in an <em>infinite virtual hierarchy</em>. A sure sign of
+ this is if the current size of your database is much larger
+ than the total size of the site you are indexing, or if in the
+ verbose output of htdig (see question <a href="#q4.1">4.1</a>)
+ you see the same URLs come up again and again with only subtle
+ variations. In any case, you must figure out the reason htdig
+ keeps revisiting the same documents using different URLs, as
+ explained in question <a href="#q4.24">4.24</a>, and set your
+ <a href="attrs.html#exclude_urls">exclude_urls</a> and
+ <a href="attrs.html#bad_querystr">bad_querystr</a> attributes
+ appropriately to stop htdig from going down those paths.
+ </p>
+
+ <strong>5.30. <a name="q5.30">Why does htsearch no longer recognize
+ the -c option when run from the web server?</a></strong><br>
+ <p>This was a security hole in 3.1.5 and older, and 3.2.0b3 and
+ older releases of ht://Dig. (See question <a href="#q2.1">2.1</a>.)
+ There's a compile-time macro you can set in htsearch.cc to disable
+ this security fix, but that's a bad idea because it reopens the hole.
+ This should only be done as a last recourse, when all other avenues
+ fail. The -c option was only intended for testing htsearch from the
+ command line, and not for use when calling htsearch on the web server.
+ Unfortunately, far too many users have needlessly latched onto this
+ option for CGI scripts. The preferred ways of specifying the config
+ file are as follows, in order of preference:</p>
+ <ol>
+ <li>use the "config" input parameter in your
+ <a href="hts_form.html">search form</a>
+ (see question <a href="#q4.2">4.2</a>).
+ <li>if you need to get at files outside the default CONFIG_DIR, use a
+ wrapper script that redefines the CONFIG_DIR environment variable,
+ then use the config input parameter as above
+ (see question <a href="#q4.20">4.20</a>).
+ <li>use a wrapper script to force htsearch to use a specific config
+ file using the -c option. This is especially for cases where you
+ want to prevent the user from selecting other config files in your
+ CONFIG_DIR using the config input parameter. This should
+ be done by using the GET method to call the wrapper script, and in
+ this script you must unset the REQUEST_METHOD environment variable
+ and pass "$QUERY_STRING" as a single argument to htsearch.
+ (This safely gets around htsearch's test which disables -c.)
+ <li>configure and compile different htsearch binaries with different
+ compile-time definitions of CONFIG_DIR, so you can avoid wrapper
+ scripts altogether.
+ <li>define ALLOW_INSECURE_CGI_CONFIG in htsearch.cc and recompile
+ htsearch if all other approaches above fail for you.
+ </ol>
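+
+ <p>A wrapper script for the third approach might look like the
+ following sketch. The paths to htsearch and to the config file are
+ placeholders for your own installation:</p>
+<pre>
+#!/bin/sh
+# Called by the web server using the GET method.
+# Unset REQUEST_METHOD so that htsearch's test which disables -c
+# does not trigger.
+unset REQUEST_METHOD
+# Pass the whole query string to htsearch as one quoted argument.
+exec /path/to/cgi-bin/htsearch -c /path/to/your.conf "$QUERY_STRING"
+</pre>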
+
+ <strong>5.31. <a name="q5.31">I've set a config attribute exactly
+ as documented but it seems to have no effect.</a></strong><br>
+ <p>There are a few fairly common reasons why this might happen:</p>
+ <ol>
+ <li>You may have a typo. Spelling matters, so make sure the attribute
+ name is spelled exactly as it is in the
+ <a href="attrs.html">documentation</a>. Misspelled attribute
+ definitions are silently ignored. This is because you're allowed
+ to make up your own attribute definitions for use by other attribute
+ definitions, as <strong>${myownattribute}</strong>. Also remember
+ to put the colon ("<strong>:</strong>") separator between the
+ attribute name and value in your definition.
+ <li>The attribute isn't supported in your version of the software.
+ The <a href="attrs.html">documented configuration attributes</a>
+ on the www.htdig.org web site are for the most recent
+ <strong>stable</strong> release. See questions
+ <a href="#q2.1">2.1</a> and <a href="#q2.7">2.7</a> for details.
+ If you're running an older version, or even a more recent beta
+ release, you may not have the same set of attributes to work with.
+ Consult the appropriate documentation, or upgrade to the current
+ release.
+ <li>You're not modifying the right configuration file. The default
+ configuration file is specified when you first configure ht://Dig
+ before compiling, but other configuration files can be specified
+ at run time, using the -c command-line option for most programs,
+ or the <strong>config</strong> input parameter for htsearch
+ (see question <a href="#q4.2">4.2</a>).
+ <li>You've got more than one definition of the attribute. Only the
+ last occurrence of an attribute in the configuration file is the
+ definition that's used for that attribute, overriding earlier
+ definitions. This also applies for nested configuration files that
+ are loaded in via the <a href="attrs.html#include">include</a>
+ directive, so check for other definitions in all included files.
+ Similarly for htsearch, look out for multiple definitions of input
+ parameters in your search forms, as mentioned in question
+ <a href="#q4.2">4.2</a> - these don't override each other but they
+ get combined with a Ctrl-A as separator, which may not be what you
+ want either.
+ <li>Your attribute definition is being "swallowed up" by an
+ incomplete multi-line definition above it. Remember that when a line
+ of an attribute definition ends with a single backslash
+ ("<strong>\</strong>") before the end of the line (without any
+ space after the backslash), then the following line is appended to
+ it as a continuation of the same attribute definition. For an
+ attribute definition that spans several lines, all lines but the
+ last must end with a backslash. If you want a backslash to go into
+ the attribute definition literally, it must be doubled-up, as
+ <strong>\\</strong>.
+ <li>On a similar note, make sure your attribute definitions are all
+ terminated by a newline character. Beware of text editors that do
+ word wrapping. It may look like two separate lines on the screen,
+ when in fact you've got two attribute definitions on the same long
+ line, so the second is swallowed up as part of the first.
+ <li>Your attribute definition is being overridden by an htsearch
+ <a href="hts_form.html">CGI input parameter</a>. For example,
+ <a href="attrs.html#template_name">template_name</a> is ignored
+ if the <strong>format</strong> input parameter is defined. The
+ <a href="attrs.html#allow_in_form">allow_in_form</a> attribute
+ can define any number of new CGI input parameters that override
+ the attributes of the same name in your config file.
+ <li>Your attribute definition is being ignored or overridden
+ by a related attribute. Watch out for unexpected interactions
+ between different attributes. For instance, characters in
+ <a href="attrs.html#valid_punctuation">valid_punctuation</a>
+ are stripped out of words, so those characters may
+ not have the effect you want if you've added them to
+ <a href="attrs.html#extra_word_characters">extra_word_characters</a>
+ or
+ <a href="attrs.html#prefix_match_character">prefix_match_character</a>.
+ Also,
+ <a href="attrs.html#search_results_wrapper">search_results_wrapper</a>
+ will override
+ <a href="attrs.html#search_results_header">search_results_header</a>
+ and
+ <a href="attrs.html#search_results_footer">search_results_footer</a>,
+ but only if you've set up the wrapper file correctly.
+ <li>Watch out for possible "latent effects" of some attributes. For
+ example, when you change attributes used by htdig, they won't have
+ an immediate effect on entries already in the database, so you would
+ have to reindex your site before they take effect. Similarly,
+ attributes that affect how htfuzzy builds some of its databases
+ don't take effect until those databases are rebuilt. Another, more
+ subtle latent effect occurs with releases 3.1.6 and 3.2 betas:
+ when you interrupt htdig (i.e. with Control-C or a kill command),
+ it stores the list of currently queued URLs in db.log, in your
+ database directory, so that the next time you invoke htdig it can
+ resume the interrupted dig. A side-effect of this file is that if
+ you change some attributes like limit_urls_to or exclude_urls before
+ restarting, the URLs in the file are still taken as-is, having been
+ checked against the old settings of limit_urls_to or exclude_urls
+ before being queued. This might explain one reason htdig seems to
+ ignore your new settings of these.
+ </ol>
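+
+ <p>To illustrate a few of the syntax points above, here is a small
+ made-up config fragment. The attribute my_site is a user-defined
+ attribute invented for this example, and the values are arbitrary:</p>
+<pre>
+# user-defined attribute, expanded below as ${my_site}
+my_site: http://www.yourdomain.com
+start_url: ${my_site}/
+# every line but the last ends with a backslash, with no space after it
+exclude_urls: /cgi-bin/ \
+ .cgi
+</pre>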
+
+ <strong>5.32. <a name="q5.32">When I run htsearch, it gives a page
+ with an "Unable to read configuration file" message.</a></strong><br>
+ <p>The most common causes of this error are:</p>
+ <ul>
+ <li>Your configuration file name is misspelled in the "config"
+ input parameter of your search form, or you have two definitions
+ of this parameter (see question <a href="#q4.2">4.2</a>).
+ <li>You didn't install your configuration file in the directory
+ defined by the CONFIG_DIR compile-time Makefile variable
+ (see also question <a href="#q4.20">4.20</a>). This is where
+ htsearch will look for the configuration file specified by the
+ "config" input parameter.
+ <li>The configuration file is not readable by the user ID under
+ which your web server, and thus htsearch, runs. Similarly,
+ if the directories from CONFIG_DIR up to the root directory
+ are not executable by this same user ID, htsearch won't be
+ able to access the configuration files.
+ </ul>
+
+ <strong>5.33. <a name="q5.33">How can I find out which version
+ of ht://Dig I have installed?</a></strong><br>
+ <p>You should always check which version of ht://Dig you're
+ running, before you report any problems, or even if you
+ suspect a problem. You can find out the version number of an
+ installed ht://Dig package by running the command:</p>
+ <blockquote>
+ <code>htdig -\? | head</code>
+ </blockquote>
+ <p>(or use "more" if you don't have a "head" command). The
+ full version number appears on the third line of output,
+ after "This program is part of ht://Dig", and it should also
+ include the snapshot date if you're running a pre-release
+ snapshot. Always include this full version number with any
+ bug report or problem report on a mailing list. You can save
+ yourself and others a lot of grief by being certain of which
+ version you're running, especially if you've installed more than
+ one. If you're running ht://Dig from an RPM package, you should
+ also report the package version and release number, which you
+ can determine with the command "<code>rpm -q htdig</code>",
+ and mention where you obtained the package. This will alert
+ us to the idiosyncrasies and/or patches in a particular RPM
+ package. Also, if you've applied any patches yourself (see
+ question <a href="#q2.5">2.5</a>) please mention which ones.
+ See also question <a href="#q1.8">1.8</a>, on reporting bugs
+ or configuration problems.</p>
+
+ <strong>5.34. <a name="q5.34">When running htdig, I get "Error (0):
+ PDF file is damaged - attempting to reconstruct xref table..."</a></strong><br>
+ <p>This message comes from the pdftotext utility, when a PDF file
+ has been truncated. Find the largest PDF file on the site you're
+ indexing, and set max_doc_size to at least that size (see question
+ <a href="#q5.2">5.2</a>). If you need to track down which PDF is
+ causing the error, try running "htdig -i -v &gt; log.txt 2&gt;&amp;1" so you
+ can see which URL is being indexed when the error occurs. The output
+ redirects in that command combine stdout (where htdig's output goes)
+ and stderr (where pdftotext's error messages go) into one output
+ stream. If you're using acroread to index PDF files, the error
+ message for a truncated PDF file is simply "Could not repair file."
+ It's also possible to get errors like this from PDF files that are
+ smaller than max_doc_size, if they're already truncated or corrupted
+ on the server.</p>
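+
+ <p>If the documents you're indexing are also available on the local
+ filesystem (an assumption, since htdig normally fetches them over HTTP),
+ a sketch like the following finds the largest PDF so that you can compare
+ its size in bytes with max_doc_size. The function name
+ <code>largest_pdf</code> and the directory in the usage comment are
+ hypothetical.</p>

```shell
# Hypothetical helper, not part of ht://Dig.
# Prints "<size in bytes> <path>" for the largest *.pdf file found
# under the given directory, or nothing if there are no PDF files.
largest_pdf() {
    # wc -c is run once per file so that no "total" line is produced.
    find "$1" -type f -name '*.pdf' -exec wc -c {} \; | sort -n | tail -n 1
}

# Usage (the document root is an assumption; substitute your own):
#   largest_pdf /path/to/document/root
```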
+
+ <strong>5.35. <a name="q5.35">When running htdig on Mandrake Linux,
+ I get "host not found" and "no server running" errors.</a></strong><br>
+ <p>The htdig.conf configuration shipped in Mandrake's RPM package
+ of htdig very stupidly enables the
+ <a href="attrs.html#local_urls_only">local_urls_only</a> attribute
+ by default, which means you can only index a limited set of files
+ on the local server. Anything else, where htdig would normally fall
+ back to using HTTP, will fail. To make matters worse, they put a very
+ misleading comment above that attribute setting, which throws users
+ off track. This attribute is useful in certain circumstances where
+ you never want htdig to fall back to HTTP, but enabling it by default
+ was a very bad judgement call on Mandrake's part. To fix the problem,
+ remove that setting from htdig.conf or set it to false, then
+ rerun htdig.</p>
+
+ <strong>5.36. <a name="q5.36">When I run htsearch, it gives me the
+ list of matching documents, but no header or footer.</a></strong><br>
+ <p>The header and footer typically contain the followup search
+ form, an indication of the total number of matches, and buttons
+ to other pages of matches if the results don't fit on one
+ page. If these don't show up, it could be that in attempting
+ to customize these (see question <a href="#q4.2">4.2</a>),
+ you removed them or rendered them unusable. Even if you didn't
+ customize them, make sure you installed the
+ <a href="attrs.html#search_results_header">search_results_header</a>
+ and
+ <a href="attrs.html#search_results_footer">search_results_footer</a>
+ files (or the
+ <a href="attrs.html#search_results_wrapper">search_results_wrapper</a>
+ file) in the correct location (where you told ht://Dig they'd be
+ when you configured prior to compiling). Also make sure they
+ have read permission for the user ID under which htsearch runs,
+ and all directories leading up to these template files are
+ searchable (i.e. executable) by htsearch, or it won't be able
+ to open the files.</p>
+
+ <p>This is the opposite of the problem described in question
+ <a href="#q5.11">5.11</a>. If htsearch displays nothing at
+ all, you may have both problems, or you may have no matches or
+ a boolean query syntax error while the
+ <a href="attrs.html#nothing_found_file">nothing_found_file</a>
+ or <a href="attrs.html#syntax_error_file">syntax_error_file</a>
+ is missing or unreadable.</p>
+
+ <strong>5.37. <a name="q5.37">When I index files with doc2html.pl,
+ it fails with the "UNABLE to convert" error.</a></strong><br>
+ <p>This is an indication that doc2html.pl wasn't configured
+ properly. Carefully follow all the directions for installation
+ in the DETAILS file that comes with the script. In addition to
+ installing doc2html.pl, you must:</p>
+ <ul>
+ <li>Install xpdf and check that pdftotext and pdfinfo work from
+ the command line,
+ <li>Configure pdf2html.pl to use pdftotext and pdfinfo and check
+ that it works from the command line,
+ <li>Configure doc2html.pl to use pdf2html.pl and check that it
+ works from the command line:
+<pre>doc2html.pl /full/path/to/sample/filename.pdf "application/pdf" url</pre>
+ </ul>
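+ <p>Before testing the scripts themselves, it can help to confirm that
+ the xpdf utilities are on the PATH of the user running htdig. The
+ following is only a sketch; the function name
+ <code>check_tools</code> is an illustrative assumption and is not
+ part of doc2html.pl or ht://Dig.</p>

```shell
# Hypothetical helper, not part of doc2html.pl or ht://Dig.
# Reports which of the named commands can be found on the PATH and
# returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return $missing
}

# Usage:
#   check_tools pdftotext pdfinfo
```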
+ <p>You should repeat a similar set of steps to configure and test
+ doc2html.pl for other document types, such as Word, RTF and
+ Excel. See also questions <a href="#q4.8">4.8</a>,
+ <a href="#q4.9">4.9</a> and <a href="#q5.39">5.39</a>.</p>
+
+ <strong>5.38. <a name="q5.38">Why do my searches find search terms
+ in pathnames, or how do I prevent matching filenames?</a></strong><br>
+ <p>htdig doesn't normally add the URL components to the index
+ itself, but when you index a directory where the filenames are
+ used as link description text (such as an automatic DirectoryIndex
+ created by Apache's mod_autoindex) then these link descriptions
+ get indexed, carrying the weight assigned to them by the
+ <a href="attrs.html#description_factor">description_factor</a>
+ attribute. Thus, a search for a filename will match this link
+ description, and the file will show up in search results.
+ To avoid that, make sure your DirectoryIndexes don't get indexed
+ as detailed in question <a href="#q4.23">4.23</a>.</p>
+
+ <p>Conversely, there is no way to force htdig to index URL
+ components so that a search for a file name will yield a match
+ on that file, unless you index an HTML file (or several) containing
+ links to all the files you want, where the link description text
+ does contain the full URL or the pathname components you want.</p>
+
+ <strong>5.39. <a name="q5.39">I set up an external parser but I still
+ can't index Word/Excel/PowerPoint/PDF documents.</a></strong><br>
+ <p>You probably need to carefully re-read and follow questions
+ <a href="#q4.8">4.8</a>, <a href="#q4.9">4.9</a>,
+ <a href="#q5.25">5.25</a> and <a href="#q5.27">5.27</a>.
+ When you can't index documents with an external parser or converter,
+ there are three main stages, or points of failure, that you need
+ to check. Figure out at which of the three stages the process is
+ failing, and focus on that stage to get to the bottom of why it's
+ not working. Run htdig with anywhere from 1 to 4 -v options to get
+ the debugging output you need to see where it's failing and why.
+ This may be an iterative process, if htdig is failing at more than
+ one stage: you might fix one problem only to run into another.</p>
+
+ <ol>
+ <li>Is htdig actually finding links to the PDF, Word, etc. documents
+ you want to index? Make sure you're not making false assumptions
+ about how htdig finds these (questions <a href="#q5.25">5.25</a>
+ and <a href="#q5.18">5.18</a>), and then find out how htdig is
+ looking at the links in your HTML files to see if it's ignoring
+ or rejecting links to your externally parsed documents (questions
+ <a href="#q4.1">4.1</a> and <a href="#q5.27">5.27</a>).<br><br>
+ <li>If it is finding and accepting the links to these documents, is
+ it correctly fetching them and passing them on to the appropriate
+ external converter to be able to index them? Look at htdig -vvv
+ output, around the time it tries to fetch one of these, and see
+ what it does next. Does the file size look right? Are there any
+ error messages around there? If the external converter isn't even
+ being called, take a close look at your
+ <a href="attrs.html#external_parsers">external_parsers</a>
+ attribute setting to make sure it's correct (see question
+ <a href="#q5.31">5.31</a>).<br><br>
+ <li>If it is attempting to convert them, is the external converter
+ doing what it should, to feed some indexable text back into htdig's
+ parser? You can also try htdig -vvvv (4 -v options) to see if it's
+ actually parsing individual words from any of these. If this is
+ too much output to wade through, try setting
+ <a href="attrs.html#start_url">start_url</a> to the URL
+ of a single document that you want to test, so you can look in
+ detail at what htdig does with it. You can also try running the
+ external converter manually on one of these documents to see
+ what it spits out. See question <a href="#q5.37">5.37</a>.
+ Make sure your documents actually contain indexable text. Some
+ PDFs are nothing but scanned images of pages, so it looks like
+ text but it's just images with no computer-readable text.
+ </ol>
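+
+ <p>Verbose output gets long, so one way to work through the stages
+ above is to save it to a file and look only at the lines around the
+ document you care about. The sketch below makes no assumption about
+ htdig's log format: it simply prints each line mentioning a URL, plus
+ the five lines that follow it. The function name
+ <code>show_url_context</code> and the example URL are hypothetical,
+ and the -A option is an extension found in GNU and BSD grep.</p>

```shell
# Hypothetical helper, not part of ht://Dig.
# Prints every line of a log file that mentions the given URL,
# together with the five lines that follow each match.
show_url_context() {
    log=$1
    url=$2
    # -F treats the URL as a fixed string, so "." and "?" match literally.
    grep -F -n -A 5 -- "$url" "$log"
}

# Usage, after capturing the output of a verbose run:
#   htdig -i -vvv > log.txt 2>&1
#   show_url_context log.txt http://www.example.com/docs/sample.pdf
```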
+
+ <br>
+
+ <hr noshade size=4>
+ Last modified: $Date: 2004/05/28 13:15:16 $
+<br>
+ <a href="http://sourceforge.net/">
+ <img src="http://sourceforge.net/sflogo.php?group_id=4593&amp;type=1" width="88" height="31" border="0" alt="SourceForge Logo"></a>
+ </body>
+</html>