summaryrefslogtreecommitdiffstats
path: root/debian/htdig/htdig-3.2.0b6/contrib/doc2html/DETAILS
diff options
context:
space:
mode:
Diffstat (limited to 'debian/htdig/htdig-3.2.0b6/contrib/doc2html/DETAILS')
-rw-r--r--debian/htdig/htdig-3.2.0b6/contrib/doc2html/DETAILS399
1 files changed, 399 insertions, 0 deletions
diff --git a/debian/htdig/htdig-3.2.0b6/contrib/doc2html/DETAILS b/debian/htdig/htdig-3.2.0b6/contrib/doc2html/DETAILS
new file mode 100644
index 00000000..35300c03
--- /dev/null
+++ b/debian/htdig/htdig-3.2.0b6/contrib/doc2html/DETAILS
@@ -0,0 +1,399 @@
+INTRODUCTION
+============
+
+This DETAILS file accompanies doc2html version 3.0.1.
+
+Read this file for instructions on the installation and use of the
+doc2html scripts.
+
+The set of files is:
+
+ DETAILS - this file
+ doc2html.pl - the main Perl script
+ doc2html.cfg - configuration file for use with wp2html
+ doc2html.sty - style file for use with wp2html
+ pdf2html.pl - Perl script for converting PDF files to HTML
+ swf2html.pl - Perl script for extracting links from Shockwave flash files.
+ README - brief description
+
+doc2html.pl is a Perl5 script for use as an external converter with
+htdig 3.1.4 or later. It takes as input the name of a file containing a
+document in a number of possible formats and its MIME type. It uses
+the appropriate conversion utility to convert it to HTML on standard
+output.
+
+doc2html.pl was designed to be easily adapted to use whatever conversion
+utilities are available, and although it has been written around the
+"wp2html" utility, it does not require wp2html to function.
+
+NOTE: version 3.0.1 has only been tested on Unix.
+
+pdf2html.pl is a Perl script which uses a pair of utilities (pdfinfo and
+pdf2text) to extract information and text from an Adobe PDF file and
+write HTML output. It can be called directly from htdig, but you are
+recommended to call it via doc2html.pl.
+
+swf2html.pl is a Perl script which calls a utility (swfparse) and
+outputs HTML containing links to the URL's found in a Shockwave flash
+file. It can be called directly from htdig, but you are recommended to
+call it via doc2html.pl.
+
+
+ABOUT DOC2HTML.PL
+=================
+
+doc2html.pl is essentially a wrapper script, and is itself only capable
+of reading plain text files. It requires the utility programs described
+below to work properly.
+
+doc2html.pl was written by David Adams <d.j.adams@soton.ac.uk>, it is
+based on conv_doc.pl written by Gilles Detillieux <grdetil@scrc.umanitoba.ca>.
+This in turn was based on the parse_word_doc.pl script, written by
+Jesse op den Brouw <MSQL_User@st.hhs.nl>.
+
+doc2html.pl makes up to three attempts to read a file. It first tries
+utilities which convert directly into HTML. If one is not found, or no
+output is produced, it then tries utilities which output plain text. If
+none is found, and the file is not of a type known to be unconvertable,
+then doc2html.pl attempts to read the file itself, stripping out any
+control characters.
+
+doc2html.pl is written to be flexible and easy to adapt to whatever
+conversion utilites are available. New conversion utilities may be
+added simply by making additions to routine 'store_methods', with no
+other changes being necessary. The existing lines in store_methods
+should provide sufficient examples on how to add more converters. Note
+that converters which produce HTML are entered differently to those that
+produce plain text.
+
+htdig provides three arguments which are read by doc2html.pl:
+
+1) the name of a temporary file containing a copy of the
+ document to be converted.
+
+2) the MIME type of the document.
+
+3) the URL of the document (which is used in generating the
+ title in the output).
+
+The test for document type uses both the MIME-type passed as second
+argument and the "Magic number" of the file.
+
+
+INSTALLATION
+============
+
+Installation requires that you acquire, compile and install the utilities
+you need to do the conversions. Those already setup in the Perl scripts are
+described below.
+
+If you don't have Perl module Sys::AlarmCall installed, then consider
+installing it, see section "TIMEOUT" below.
+
+You may need to change the first line of each script to the location of
+Perl on your system.
+
+Edit doc2html.pl to include the full pathname of each utility you have
+installed. For example:
+
+my $WP2HTML = '/opt/local/wp2html-3.2/bin/wp2html';
+
+If you don't have a particular utility then leave its location as a null
+string.
+
+Then place doc2html.pl and the other scripts where htdig can access them.
+
+If you are going to convert PDF files then you will need to edit pdf2html.pl
+and include its full path name in doc2html.pl.
+
+If you are going to extract links from Shockwave flash files then you will
+need to edit swf2html.pl and include its full path name in doc2html.pl.
+
+Edit the htdig.conf configuration file to use the script, as in this example:
+
+external_parsers: application/rtf->text/html /usr/local/scripts/doc2html.pl \
+ text/rtf->text/html /usr/local/scripts/doc2html.pl \
+ application/pdf->text/html /usr/local/scripts/doc2html.pl \
+ application/postscript->text/html /usr/local/scripts/doc2html.pl \
+ application/msword->text/html /usr/local/scripts/doc2html.pl \
+ application/Wordperfect5.1->text/html /usr/local/scripts/doc2html.pl \
+ application/msexcel->text/html /usr/local/scripts/doc2html.pl \
+ application/vnd.ms-excel->text/html /usr/local/scripts/doc2html.pl \
+ application/vnd.ms-powerpoint->text/html /usr/local/scripts/doc2html.pl \
+ application/x-shockwave-flash->text/html /usr/local/scripts/doc2html.pl \
+ application/x-shockwave-flash2-preview->text/html /usr/local/scripts/doc2html.pl
+
+If you are using wp2html then place the files doc2html.cfg and doc2html.sty in the
+wp2html library directory.
+
+
+UTILITY WP2HTML
+===============
+
+Obtain wp2html from http://www.res.bbsrc.ac.uk/wp2html/
+
+Note that wp2html is not free; its author charges a small fee for
+"registration". Various pre-compiled versions and the source code are
+available, together with extensive documentation. Upgrades are
+available at no further charge.
+
+wp2html converts WordPerfect documents (5.1 and later) to HTML.
+Versions 3.2 and later will also convert Word7 and Word97 documents to
+HTML. A feature of wp2html which doc2html.pl exploits is that the -q
+option will result in either good HTML or no output at all.
+
+wp2html is very flexible in the output it creates. The two files,
+doc2html.cfg and doc2html.sty, should be placed in the wp2html library
+directory along with the .cfg and .sty files supplied with wp2html.
+
+Edit the line in doc2html.pl:
+
+my $WP2HTML = '';
+
+to set $WP2HTML to the full pathname of wp2html.
+
+wp2html will look for the title in a document, and if it is found then
+output it in <TITLE>....</TITLE> markup. If a title is not found
+then it defaults to the file name in square brackets.
+
+If wp2html is unable to convert a document, or is not installed,
+then doc2html.pl can use the "catdoc" or "catwpd" utilities instead.
+
+
+UTILITY CATDOC
+==============
+
+Obtain catdoc from http://www.ice.ru/~vitus/catdoc/, it is available
+under the terms of the Gnu Public License.
+
+Edit the line in doc2html.pl:
+
+my $CATDOC = '';
+
+to set the variables to the full pathname of catdoc. You might want
+to use a different version of catdoc for Word2 documents or for MAC Word
+files.
+
+catdoc converts MS Word6, Word7, etc., documents to plain text. The
+latest beta version is also able to convert Word2 documents. catdoc
+also produces a certaint amount of "garbage" as well as the text of the
+document. The -b option improves the likelihood that catdoc will
+extract all the text from the document, but at the expense of increasing
+the garbage as well. doc2html.pl removes some non-printing characters
+to minimise the garbage. If a later version of catdoc than 0.91.4 is
+obtained then the use of the -b option should be reviewed.
+
+
+UTILITY CATWPD
+==============
+
+Obtain catwpd from the contribs section of the Ht://Dig web site where
+you obtained doc2html. It extracts words from some versions of WordPerfect
+files. You won't need it if you buy the superior wp2html.
+
+If you do use it, then edit the line in doc2html.pl:
+
+my $CATWPD = '';
+
+to set the variables to the full pathname of catwpd.
+
+
+UTILITY PPTHTML
+===============
+
+obtain ppthtml from http://www.xlhtml.org, where it is bundled in with
+xlhtml.
+
+In doc2html.pl, edit the line:
+
+my $PPT2HTML = '';
+
+to set $PPT2HTML to the full pathname of ppthtml.
+
+ppthtml converts Microsoft Powerpoint files into HTML. It uses the input
+filename as the title. doc2html.pl replaces this with the original
+filename from the URL in square brackets.
+
+
+UTILITY XLHTML
+==============
+
+Obtain xlhtml from http://www.xlhtml.org
+
+In doc2html.pl, edit the line:
+
+my $XLS2HTML = '';
+
+to set $XLS2HTML to the full pathname of xlhtml.
+
+xlhtml converts Microsoft Excel spreadsheets into HTML. It uses the input
+filename as the title. doc2html.pl replaces this with the original
+filename from the URL in square brackets.
+
+The present version of xlHtml (0.4) writes HTML output, but does not
+mark up hyperlinks in .xls files as links in its output.
+
+An alternative to xlHtml is xls2csv, see below.
+
+
+UTILITY RTF2HTML
+================
+
+Obtain rtf2html from http://www.ice.ru/~vitus/catdoc/
+
+In doc2html.pl, edit the line:
+
+my $RTF2HTML = '';
+
+to set $RTF2HTML to the full pathname of rtf2html.
+
+rtf2html converts Rich Text Font documents into HTML. It uses the input
+filename as the title, doc2html.pl replaces this with the original
+filename from the URL within square brackets.
+
+
+UTILITY PS2ASCII
+================
+
+Ps2ascii is a PostScript to text converter.
+
+In doc2html.pl, edit the line:
+
+my $CATPS = '';
+
+to the correct full pathname of ps2ascii.
+
+ps2ascii comes with ghostscript 3.33 (or later) package, which is
+pre-installed on many Unix systems. Commonly, it is a Bourne-shell
+script which invokes "gs", the Ghostscript binary. doc2html.pl has
+provision for adding the location of gs to the search path.
+
+
+UTILITY PDFTOTEXT
+=================
+
+pdftotext converts Adobe PDF files to text. pdfinfo is a tool which
+displays information about the document, and is used to obtain its
+title, etc. Get them from the xpdf package at
+http://www.foolabs.com/xpdf/
+
+In script pdf2html.pl, change the lines:
+
+my $PDFTOTEXT = "/... .../pdftotext";
+my $PDFINFO = "/... .../pdfinfo";
+
+to the correct full pathnames.
+
+Edit doc2html.pl to include the full pathname of the pdf2html.pl script.
+
+pdf2text may fail to convert PDF documents which have been truncated
+because htdig has max_doc_size set to smaller than the documents full
+size. Some PDF documents do not allow text to be extracted.
+
+
+UTILITY CATXLS
+==============
+
+The Excel to .csv converter, xls2csv, is included with recent versions of
+catdoc. This is an alternative to xlhtml (see above).
+
+Edit the line:
+
+my $CATXLS = '';
+
+to the full pathname of xls2csv.
+
+Xls2csv translates Excel spread sheets into comma-separated data.
+
+
+UTILITY SWFPARSE
+================
+
+swfparse (aka swfdump) extracts information from Shockwave flash files,
+and can be obtained from the contribs section of the Ht://Dig web site,
+where you obtained doc2html.
+
+Perl script swf2html.pl calls swfparse and writes HTML output containing
+links to the URLs found in the Shockwave file. It does NOT extract text
+from the file.
+
+In script swf2html.pl, change the line:
+
+my $SWFPARSE = "/... .../swfdump";
+
+to the full pathname.
+
+Edit doc2html.pl to include the full pathname of the swf2html.pl script.
+
+
+LOGGING
+=======
+
+Output of logging information and error messages is controlled by the
+environmental variable DOC2HTML_LOG, which may be set in the rundig
+script. If it is not set then only error messages output by doc2html.pl
+and by the conversion utilities it calls are returned to htdig and will
+appear in its STDOUT. If DOC2HTML_LOG is set to a filename, then
+doc2html.pl appends logging information and any error messages to the
+file. If DOC2HTML_LOG is set but blank, or the file cannot be opened
+for writing, logging information and error messages are passed back to
+htdig and will appear its STDOUT.
+
+In doc2html.pl, the variables $Emark and $EEmark, set in subroutine init,
+are used to highlight error messages.
+
+The number of lines of STDERR output from a utility which is logged or
+passed back to htdig is controlled by the variable $Maxerr set in
+routine "init" of doc2html.pl. This is provided in order to curb the
+large number of error messages which some utilities can produce from
+processing a single file.
+
+
+TIMEOUT
+=======
+
+If possible, install Perl module Sys::AlarmCall, obtainable from CPAN if
+you don't already have it. This module is used by doc2html.pl to
+terminate a utility if it takes too long to finish. The line in
+doc2html.pl:
+
+ $Time = 60; # allow 60 seconds for external utility to complete
+
+may be altered to suit.
+
+
+LIMITING INPUT AND OUTPUT
+=========================
+
+The environmental variable DOC2HTML_IP_LIMIT may be set in the rundig
+script to limit the size of the file which doc2html.pl will attempt to
+convert. The default value is 20000000. Doc2html.pl will return no
+output to htdig if the file size is equal to or greater than this size.
+
+You are recommended to set DOC2HTML_IP_LIMIT to the same as the
+"max_doc_size" parameter in the htdig configuration file. Then no
+attempt wil be made to extract text from files which have been truncated
+by htdig. It is not possible to extract any text from .PDF files, for
+example, if they have been truncated.
+
+The environmental variable DOC2HTML_OP_LIMIT may be set in the rundig
+script to limit the output sent back to htdig by a single call to
+doc2html.pl. The default value is 10000000. Doc2html.pl will stop
+returning output to htdig once the DOC2HTML_OP_LIMIT has been reached.
+This is precaution against the unlikely event of a conversion utility
+returning disproportionately large amounts of data.
+
+
+CONTACT
+=======
+
+Any queries regarding doc2html are best sent to the mailing list
+htdig-general@lists.sourceforge.net
+
+The author can be emailed at D.J.Adams@soton.ac.uk
+
+David Adams
+Information Systems Services
+University of Southampton
+
+27-November-2002