Helper library to stringify binary document formats
This extension has been extracted from
Foswiki:Extensions/KinoSearchContrib to make it available
for search engines other than KinoSearch.
Supported file formats
- .txt
- .html
- .xml
- .doc
- .docx
- .xls
- .xlsx
- .ppt
- .pptx
- .pdf
- .odt
- .ott
- .odp
- .otp
- .ods
- .ots
- .sxw
- .stw
- .sxc
- .stc
- .sxi
- .sti
Files with any other extension are treated as ASCII files. If needed,
you can add more specialised stringifiers for further document types (see below).
Backend for Word Documents
To index Word 2003 documents (.doc) you will need to install one of the following:
- antiword (recommended)
- abiword
- wvWare
You can then select the tool to use in configure.
Backend for PDF
To index .pdf files you need to install poppler-utils, which provides pdftotext.
Backend for PPT
To index .ppt files you need to install ppthtml.
Backends for DOCX, PPTX, XLSX
To index these file types, you will need to install the following tools from Sourceforge:
Then set the command path to these tools in configure.
Backend for OpenDocument and Staroffice documents
To index these file types you need to install odt2txt.
Installing the Contrib
You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.
Open configure, and open the "Extensions" section. Use "Find More Extensions" to get a list of available extensions. Select "Install".
If you have any problems, or if the extension isn't available in configure, then you can still install manually from the command line. See
http://foswiki.org/Support/ManuallyInstallingExtensions for more help.
Configuration
There are a number of settings that need to be set in configure before you can use the Contrib.
Test of the Installation
Test if the installation was successful:
- Check that the backend you selected for .doc files (antiword, abiword or wvText, which replaced wvHtml in release 2.10) is in place: type its name at the prompt and check that the command exists.
- Check that pdftotext is in place: type pdftotext at the prompt and check that the command exists.
- Check that ppthtml is in place: type ppthtml at the prompt and check that the command exists.
- Stringify some files (see below).
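The checks above can be scripted. The following is a minimal sketch, not part of the extension: the function name check_tools is hypothetical, and the tool list is taken from the backend sections and the dependency table of this topic. Only one of antiword, abiword or wvText is needed for .doc files, so a "missing" line for the other two is harmless.

```shell
#!/bin/sh
# Report which stringifier backends are available on the PATH.
# check_tools is a hypothetical helper, not shipped with the extension.
# Adjust the tool list to the backends you actually selected in configure.

check_tools() {
    missing=0
    for tool in "$@"; do
        # command -v prints the path of a command and fails if it is absent
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # The return status is the number of missing tools (0 means all found)
    return "$missing"
}

# Only one of antiword, abiword or wvText is required for .doc files.
check_tools antiword abiword wvText pdftotext ppthtml odt2txt html2text || true
```

The function returns the number of missing tools, so it can also be used as a condition in a larger installation script.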
Test of Stringification with stringify
Some users report problems with the stringification: the stringifier scripts
fail or take too long on some attachments. Sometimes this results from
installation errors, especially in the installation of the stringification
backends.
stringify gives you the opportunity to test the stringification in advance.
Usage:
stringify file_name
The output shows which stringifier is used and the result of the
stringification.
Example:
stringify /path/to/foswiki/StringifierContrib/test/unit/StringifierContrib/attachement_examples/Simple_example.doc
Simple example
Keyword: dummy
Umlaute: Grober, Uberschall, Anderung
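To find attachments that fail or take too long before indexing a whole site, the same command can be run over a directory of files. This is a sketch under assumptions: stringify_all is a hypothetical helper, stringify is assumed to be on the PATH and to exit with a non-zero status on failure, timeout is the GNU coreutils utility, and the default limit of 30 seconds is arbitrary.

```shell
#!/bin/sh
# Run the stringifier over every file in a directory and flag files that
# fail or exceed a time limit.
# Assumptions: "stringify" is on the PATH and exits non-zero on failure;
# "timeout" is the GNU coreutils utility. stringify_all is a hypothetical
# helper, not shipped with the extension.

stringify_all() {
    dir=$1
    limit=${2:-30}                  # seconds per file, arbitrary default
    cmd=${STRINGIFY:-stringify}     # override to test with another command
    for file in "$dir"/*; do
        [ -f "$file" ] || continue  # skip subdirectories and empty globs
        if timeout "$limit" "$cmd" "$file" >/dev/null 2>&1; then
            echo "ok:     $file"
        else
            # timeout exits with status 124 when the limit was exceeded
            echo "failed: $file (exit status $?)"
        fi
    done
}

# Usage: stringify_all /path/to/attachments 30
```

Files reported as failed can then be run through stringify individually to see the error output of the backend.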
Further Development
In this extension, a plug-in mechanism is implemented, so that additional
stringifiers can be added without changing the existing code. All stringifier
plugins are stored in the directory lib/Foswiki/Contrib/Stringifier/Plugins.
You can add new stringifier plugins by just adding new files there. At a minimum,
a plugin must:
- inherit from Foswiki::Contrib::StringifierContrib::Base
- register itself with __PACKAGE__->register_handler($application, $file_extension);
- implement the method $text = stringForFile($filename)
All the stringifiers have unit tests associated with them, and we would
encourage you to provide unit tests for any you wish to contribute. See
Foswiki:Development/UnitTests for more information on unit testing.
See Foswiki:Tasks/StringifierContrib for currently open tasks.
Contrib Info
| Author(s): | Foswiki:Main.MarkusHesse, Foswiki:Main.SvenDowideit, Foswiki:Main.MichaelDaum & Foswiki:Main.AndrewJones |
| Copyright: | © 2007, Foswiki:Main.MarkusHesse; © 2009-2012, Foswiki Contributors |
| Release: | 2.20 |
| Version: | 14750 (2012-05-07) |
| Change History: | |
| 07 May 2012: | (2.20) added configuration parameter to specify the encoding of the output of each external helper in use |
| 17 Oct 2011: | (2.10) using wvText instead of wvHtml now; encoding stringified files to the site's charset now; fixed unit tests to use utf8 exclusively |
| 05 Sep 2011: | (2.00) added OpenDocument serializer; removed dependency left-over on Text::Iconv; added dependency on odt2txt; fixed defaults for wv serializer |
| 01 Dec 2010: | (1.20) moved core from StringifierContrib to Stringifier not to disturb configure |
| 12 Nov 2010: | (1.14) Foswiki:Main.PadraigLennon - Foswikitask:Item9311 |
| 23 Oct 2010: | (1.12) made system fault-tolerant in case of missing dependencies for a given file type; doc cleanup -- Foswiki:Main.WillNorris |
| 12 Feb 2010: | robust parsing of password protected XLS files |
| 02 Oct 2009: | extracted from Foswiki:Extensions/KinoSearchContrib (MD) |

Dependencies:

| Name | Version | Description |
|---|---|---|
| File::MMagic | >0 | Required |
| Module::Pluggable | >0 | Required |
| Spreadsheet::ParseExcel | >0 | Required for .xls files |
| Spreadsheet::XLSX | >0 | Required for .xlsx files |
| Encode | >0 | Required |
| Error | >0 | Required |
| ppthtml | >0 | Required for indexing .ppt files. Part of xlhtml |
| pdftotext | >0 | Required for indexing .pdf. Part of poppler-utils |
| antiword | >0 | One of antiword, abiword or wvWare is required for .doc files |
| abiword | >0 | One of antiword, abiword or wvWare is required for .doc files |
| wvWare | >0 | One of antiword, abiword or wvWare is required for .doc files |
| html2text | >0 | Required for indexing html files |
| odt2txt | >0 | Required for indexing OpenDocument and StarOffice documents |

| Home: | Foswiki:Extensions/StringifierContrib |
| Support: | Foswiki:Support/StringifierContrib |