Diffstat (limited to 'debian/htdig/htdig-3.2.0b6/htdoc/require.html')
-rw-r--r--debian/htdig/htdig-3.2.0b6/htdoc/require.html392
1 files changed, 392 insertions, 0 deletions
diff --git a/debian/htdig/htdig-3.2.0b6/htdoc/require.html b/debian/htdig/htdig-3.2.0b6/htdoc/require.html
new file mode 100644
index 00000000..d1975701
--- /dev/null
+++ b/debian/htdig/htdig-3.2.0b6/htdoc/require.html
@@ -0,0 +1,392 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
+<html>
+ <head>
+ <title>
+ ht://Dig: Features and System requirements
+ </title>
+ </head>
+ <body bgcolor="#eef7ff">
+ <h1>
+ Features and System requirements
+ </h1>
+ <p>
+ ht://Dig Copyright &copy; 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br>
+ Please see the file <a href="COPYING">COPYING</a> for
+ license information.
+ </p>
+ <hr noshade>
+ <h2>
+ Features
+ </h2>
+ <p>
+ Here are some of the major features of ht://Dig. They are in
+ no particular order.
+ </p>
+ <blockquote>
+ <dl>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Intranet searching</strong>
+ </dt>
+ <dd>
+ ht://Dig has the ability to search through many servers
+ on a network by acting as a WWW browser.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ It is free</strong>
+ </dt>
+ <dd>
+ The whole system is released under the
+ <a href="COPYING">GNU Library General Public License (LGPL)</a>.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Robot exclusion is supported</strong>
+ </dt>
+ <dd>
+ The <a href="http://www.robotstxt.org/wc/norobots.html">
+ Standard for Robot Exclusion</a> is
+ <a href="meta.html#robots">supported by ht://Dig.</a>
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Boolean expression searching</strong>
+ </dt>
+ <dd>
+ Searches can be arbitrarily complex using boolean
+ expressions.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Phrase searching</strong>
+ </dt>
+ <dd>
+ A phrase can be searched for by enclosing it in quotes.
+ Phrase searches can be combined with word searches, as in
+ <code>Linux and "high quality"</code>.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Configurable search results</strong>
+ </dt>
+ <dd>
+ The output of a search can easily be tailored to your
+ needs by means of providing HTML templates.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Fuzzy searching</strong>
+ </dt>
+ <dd>
+ Searches can be performed using various
+ <a href="attrs.html#search_algorithm">configurable algorithms</a>.
+ Currently the following algorithms are
+ supported (in any combination):
+ <ul>
+ <li>
+ exact
+ </li>
+ <li>
+ soundex
+ </li>
+ <li>
+ metaphone
+ </li>
+ <li>
+ common word endings
+ </li>
+ <li>
+ synonyms
+ </li>
+ <li>
+ accent stripping
+ </li>
+ <li>
+ substring and prefix
+ </li>
+ <li>
+ regular expressions
+ </li>
+ <li>
+ simple spelling corrections
+ </li>
+ </ul>
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Searching of many file formats</strong>
+ </dt>
+ <dd>
+ Both HTML documents and plain text files can be
+ searched directly by ht://Dig itself. There is also a
+ <a href="attrs.html#external_parsers">mechanism
+ to allow external programs ("external parsers")</a> to be used
+ while building the database so that arbitrary file formats
+ can be searched. <br>
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Document retrieval using many transport services</strong>
+ </dt>
+ <dd>
+ Several transport services can be handled by ht://Dig,
+ including http://, ftp:// and file:///.
+ There is also a
+ <a href="attrs.html#external_protocols">mechanism
+ to allow external programs ("external protocols")</a> to be used
+ while building the database so that arbitrary transport
+ services can be used. <br>
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Keywords can be added to HTML documents</strong>
+ </dt>
+ <dd>
+ Any number of <a href="meta.html">keywords</a>
+ can be added to HTML documents
+ which will not show up when the document is viewed.
+ This is used to make a document more likely to be found
+ and also to make it appear higher in the list of
+ matches.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Email notification of expired documents</strong>
+ </dt>
+ <dd>
+ Special meta information can be added to HTML documents
+ which can be used to
+ <a href="notification.html">notify the maintainer</a> of those
+ documents at a certain time. It is handy to get
+ reminded when to remove the "New" images from a certain
+ page, for example.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ A protected server can be indexed</strong>
+ </dt>
+ <dd>
+ ht://Dig can be told to use a specific
+ <a href="attrs.html#authorization">username and password</a>
+ when it retrieves documents. This can be used
+ to index a server or parts of a server that are
+ protected by a username and password.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Searches on subsections of the database</strong>
+ </dt>
+ <dd>
+ It is easy to set up a search which only returns
+ documents whose
+ <a href="hts_form.html#restrict">URL matches a certain pattern.</a>
+ This becomes very useful for people who want to make their
+ own data searchable without having to use a separate
+ search engine or database.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Full source code included</strong>
+ </dt>
+ <dd>
+ The search engine comes with full source code. The
+ whole system is released under the terms and conditions
+ of the <a href="COPYING">GNU Library General Public License (LGPL) version
+ 2.0</a>.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ The depth of the search can be limited</strong>
+ </dt>
+ <dd>
+ Instead of limiting the search to a set of machines, it
+ can also be restricted to documents that are a certain
+ number of <a href="attrs.html#max_hop_count">"mouse-clicks"</a>
+ away from the start document.
+ </dd>
+ <dt>
+ <strong><img src="bdot.gif" width=9 height=9 alt="*">
+ Full support for the ISO-Latin-1 character set</strong>
+ </dt>
+ <dd>
+ Both SGML entities like '&amp;agrave;' and ISO-Latin-1
+ characters can be indexed and searched.
+ </dd>
+ </dl>
+ </blockquote>
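+ <p>
+ As a rough illustration of how a few of the features above map onto
+ configuration attributes, here is a minimal sketch of a configuration
+ file. The <code>search_algorithm</code>, <code>max_hop_count</code> and
+ <code>authorization</code> attributes are the ones linked above;
+ <code>start_url</code> is assumed here as the attribute that names the
+ start document. The host name, the credentials and all values are
+ placeholders chosen for this example, not recommended settings.
+ </p>
+<pre>
+# Hypothetical site; replace with your own start URL.
+start_url:         http://www.example.com/
+
+# Fuzzy searching: algorithms and their weights (illustrative values).
+search_algorithm:  exact:1 synonyms:0.5 endings:0.1
+
+# Follow links at most 4 "mouse-clicks" away from the start document.
+max_hop_count:     4
+
+# Credentials for a protected server (placeholder username and password).
+authorization:     someuser:somepassword
+</pre>
+ <p>
+ See the attribute documentation for the exact syntax and defaults
+ before relying on any of these values.
+ </p>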
+ <hr size="4" noshade>
+ <h1>
+ Requirements to build ht://Dig
+ </h1>
+ <p>
+ ht://Dig was developed under Unix using C++.
+ </p>
+ <p>
+ For this reason, you will need a Unix machine, a C compiler
+ and a C++ compiler. (The C compiler is needed to compile some
+ of the GNU libraries.)
+ </p>
+ <p>
+ Unfortunately, we only have access to a couple of different
+ Unix machines. ht://Dig has been tested on these machines:
+ </p>
+ <ul>
+<!--
+ <li>
+ Sun Solaris 2.5 SPARC (using gcc/g++ 2.7.2)
+ </li>
+ <li>
+ Sun SunOS 4.1.4 SPARC (using gcc/gcc 2.7.0)
+ </li>
+ <li>
+ HP/UX A.09.01 (using gcc/g++ 2.6.0)
+ </li>
+ <li>
+ IRIX 5.3 (SGI C++ compiler. Don't know the version)
+ </li>
+ <li>
+ Debian Linux 2.0 (using egcs 1.1b)
+ </li>
+-->
+ <li>
+ FreeBSD 4.6 (using gcc 2.95.3) <!-- lha -->
+ </li>
+ <li>
+ Mandrake Linux 8.2 (using gcc 3.2) <!-- lha -->
+ </li>
+ <li>
+ Debian, 2.2.19 kernel (using gcc 2.95.4) <!-- lha -->
+ </li>
+ <li>
+ Debian on an Alpha <!-- lha -->
+ </li>
+ <li>
+ RedHat 7.3, 8.0 <!-- Jim Cole -->
+ </li>
+ <li>
+ Sun Solaris 2.8 = SunOS 5.8 (using gcc 3.1) <!-- lha -->
+ </li>
+ <li>
+ Sun Solaris 2.8 = SunOS 5.8 (using Sun's cc / g++ 3.1) <!-- lha -->
+ </li>
+ <li>
+ Mac OS X 10.2 (using gcc) <!-- Jim Cole -->
+ </li>
+
+ </ul>
+ <p>
+ There are reports of ht://Dig working on a number of other platforms.
+ </p>
+ <h3>
+ libstdc++
+ </h3>
+ <p>
+ If you plan on using g++ to compile ht://Dig, you have to make
+ sure that libstdc++ has been installed. Unfortunately, libstdc++ is a
+ separate package from gcc/g++. You can get libstdc++ from the
+ <a href="ftp://ftp.gnu.org/pub/gnu/">GNU software archive</a>.
+ </p>
+
+<!-- The current Makefiles don't use include...
+ <h3>
+ Berkeley 'make'
+ </h3>
+ <p>
+ The building relies heavily on the make program. The problem
+ with this is that not all make programs are the same. The
+ requirement for the make program is that it understands the
+ 'include' statement as in
+ </p>
+ <blockquote>
+ <code>include somefile otherfile</code>
+ </blockquote>
+ <p>
+ The Berkeley 4.4 make program doesn't use this syntax, instead
+ it wants
+ </p>
+ <blockquote>
+ <code>.include "somefile"</code><br>
+ <code>.include "otherfile"</code>
+ </blockquote>
+ <p>
+ and hence it cannot be used to build ht://Dig.
+ </p>
+ <p>
+ If your make program doesn't understand the right 'include'
+ syntax, it is best if you get and install
+ <a href="ftp://ftp.gnu.org/pub/gnu/">gnumake</a> before you try
+ to compile everything. The alternative is to change all the
+ Makefiles.
+ </p>
+-->
+ <hr noshade>
+ <h1>
+ Disk space requirements
+ </h1>
+ <p>
+ The search engine will require lots of disk space to store
+ its databases. Unfortunately, there is no exact formula to
+ compute the space requirements. It depends on the number of
+ documents you are going to index but also on the various
+ options you use.
+ </p>
+ <p>As a temporary measure, 3.2 betas use a very inefficient
+ database structure to enable phrase searching. This will be
+ fixed before the release of 3.2.0. Currently, indexing a site of
+ around 10,000 documents gives a database of around 400MB using the
+ default setting for
+ <a href="attrs.html#max_doc_size">maximum document size</a> and storing the
+ <a href="attrs.html#max_head_length">first 50,000 bytes of each document</a>
+ to enable context to be displayed.
+ </p>
+ <!-- To give you an idea of the space
+ requirements, here is what I have deduced from our own
+ database size at San Diego State University.
+ </p>
+ <p>
+ If you keep around the wordlist database (for update digging
+ instead of initial digging) I found that multiplying the
+ number of documents covered by 12,000 will come pretty close
+ to the space required.
+ </p>
+ <p>
+ We have about 13,000 documents:
+ </p>
+<pre>
+ 13,000
+ 12,000 x
+ ===========
+ 156,000,000
+</pre>
+ or about 150 MB.
+ <p>
+ Without the wordlist database, the factor drops down to about
+ 7500:
+ </p>
+<pre>
+ 13,000
+ 7,500 x
+ ===========
+ 97,500,000
+</pre>
+ or about 93 MB.
+-->
+ <p>
+ Keep in mind that we keep at most 50,000 bytes of each
+ document. This may seem like a lot, but most documents aren't very
+ big and it gives us a big enough chunk to almost always show
+ an excerpt of the matches.
+ </p>
+ <p>
+ You may find that if you store most of each document, the
+ databases are almost the same size as, or even larger than, the
+ documents themselves! Remember that if you're storing a
+ significant portion of each document (say 50,000 bytes as
+ above), you have that requirement, plus the size of the word
+ database and all the additional information about each document
+ (size, URL, date, etc.) required for searching.
+ </p>
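+ <p>
+ As a back-of-the-envelope sketch only: taking the figure quoted above
+ (around 400MB for around 10,000 documents) and assuming, purely for
+ illustration, that space grows linearly with the number of documents,
+ a hypothetical site of 25,000 documents would work out as follows.
+ </p>
+<pre>
+  400,000,000 bytes / 10,000 documents =        40,000 bytes per document
+       25,000 documents x 40,000 bytes = 1,000,000,000 bytes
+</pre>
+ <p>
+ That is about 1 GB. Treat this as an order-of-magnitude guide; the
+ real figure depends on your documents and on the options you use.
+ </p>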
+ <hr size="4" noshade>
+
+ Last modified: $Date: 2004/05/28 13:15:19 $
+
+ </body>
+</html>