StringifierContrib

Helper library to stringify binary document formats

This extension has been extracted from Foswiki:Extensions/KinoSearchContrib to make it available to search engines other than KinoSearch.

Supported file formats

  • .txt
  • .html
  • .xml
  • .doc
  • .docx
  • .xls
  • .xlsx
  • .ppt
  • .pptx
  • .pdf
  • .odt
  • .ott
  • .odp
  • .otp
  • .ods
  • .ots
  • .sxw
  • .stw
  • .sxc
  • .stc
  • .sxi
  • .sti

Any other file extensions you add are treated as ASCII files. If needed, you can add more specialised stringifiers for further document types (see "Further Development" below).

Backend for Indexing Word 2003 Documents

To index Word 2003 Documents (.doc) you will need to install one of the following:

  • antiword (recommended)
  • abiword
  • wvWare

You can then select the tool to use in configure.

Backend for PDF

To index .pdf files you need to install pdftotext, which is provided by xpdf-utils or poppler-utils.

Backend for PPT

To index .ppt files you need to install ppthtml, which is part of xlhtml.

Backends for DOCX, PPTX, XLSX

To index these file types, you will need to install the following tools from Sourceforge:

Then set the command path to these tools in configure.

Backend for OpenDocument and StarOffice documents

To index these file types you need to install odt2txt.

Installing the Contrib

You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.

Open configure, and open the "Extensions" section. Use "Find More Extensions" to get a list of available extensions. Select "Install".

If you have any problems, or if the extension isn't available in configure, then you can still install manually from the command-line. See http://foswiki.org/Support/ManuallyInstallingExtensions for more help.

Configuration

There are a number of settings that need to be set in configure before you can use the Contrib.

Test of the Installation

  • Test whether the installation was successful:
    • Check that antiword, abiword or wvHtml is installed: type the command name at the prompt and check that the command exists.
    • Check that pdftotext is installed: type pdftotext at the prompt and check that the command exists.
    • Check that ppthtml is installed: type ppthtml at the prompt and check that the command exists.
    • Stringify some files (see below).
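
The command checks above can also be scripted. The following is a minimal sketch, assuming a POSIX shell. The function name check_backend is made up for illustration and is not part of the Contrib; the tool list simply mirrors the commands named in this topic.

```shell
#!/bin/sh
# Report whether a backend command can be found on the PATH.
# check_backend is a hypothetical helper name, not part of the Contrib.
check_backend() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Commands named in this topic. Only one of antiword, abiword or wvHtml
# is needed for .doc files, so a "missing" line is not always an error.
for tool in antiword abiword wvHtml pdftotext ppthtml odt2txt; do
    check_backend "$tool" || true
done
```

A "missing" line only tells you that the command is not on the PATH of the current user. The web server may run as a different user with a different PATH, so the paths set in configure still need to be checked separately.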

Test of Stringification with stringify

Some users report problems with stringification: the stringifier script fails or takes too long on attachments. Sometimes this results from installation errors, especially in the installation of the stringification backends.

stringify gives you the opportunity to test the stringification in advance.

Usage: stringify file_name

The output shows which stringifier was used and the result of the stringification.

Example:

stringify /path/to/foswiki/StringifierContrib/test/unit/StringifierContrib/attachement_examples/Simple_example.doc

Simple example  

Keyword: dummy  

Umlaute: Grober, Uberschall, Anderung
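
To test many attachments at once, the single call above can be wrapped in a loop. This is only a sketch under stated assumptions: try_stringify and the STRINGIFY variable are hypothetical names, and the script assumes that stringify is on the PATH, prints its result to standard output and exits with a non-zero status on failure. None of this is confirmed by this topic.

```shell
#!/bin/sh
# Command used to stringify a file. Assumption: "stringify" is on the PATH,
# writes its result to standard output and exits non-zero on failure.
STRINGIFY="${STRINGIFY:-stringify}"

# try_stringify DIR (hypothetical helper): run the stringifier on every
# regular file in DIR and print one status line per file, including the
# elapsed time in whole seconds.
try_stringify() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        start=$(date +%s)
        if out=$("$STRINGIFY" "$f" 2>/dev/null) && [ -n "$out" ]; then
            status="ok"
        else
            status="FAILED or empty"
        fi
        end=$(date +%s)
        echo "$status ($((end - start))s): $f"
    done
}

# Usage: try_stringify /path/to/attachments
```

Files reported as "FAILED or empty", or with a long elapsed time, are the ones to retry by hand with stringify to see which backend is involved.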

Further Development

In this extension, a plug-in mechanism is implemented, so that additional stringifiers can be added without changing the existing code. All stringifier plugins are stored in the directory lib/Foswiki/Contrib/Stringifier/Plugins.

You can add a new stringifier plugin simply by adding a new file to this directory. At a minimum:

  • The plugin must inherit from Foswiki::Contrib::Stringifier::Base
  • The plugin must register itself by __PACKAGE__->register_handler($application, $file_extension);
  • The plugin must implement the method $text = stringForFile ($filename)

All the stringifiers have unit tests associated with them, and we would encourage you to provide unit tests for any you wish to contribute. See Foswiki:Development/UnitTests for more information on unit testing.

See Foswiki:Tasks/StringifierContrib for currently open tasks.

Contrib Info

Author(s): Foswiki:Main.MarkusHesse, Foswiki:Main.SvenDowideit, Foswiki:Main.MichaelDaum & Foswiki:Main.AndrewJones
Copyright: © 2007, Foswiki:Main.MarkusHesse; © 2009-2011, Foswiki Contributors
Release: 2.00
Version: 12455 (2011-09-05)
Change History:  
05 Sep 2011: (v2.00) added OpenDocument serializer; removed dependency left-over on Text::Iconv; added dependency on odt2txt; fixed defaults for wv serializer
01 Dec 2010: (v1.20) moved core from StringifierContrib to Stringifier not to disturb configure
12 Nov 2010: (v1.14) Foswiki:Main.PadraigLennon - Foswikitask:Item9311
23 Oct 2010: (v1.12) made system fault-tolerant in case of missing dependencies for a given file type; doc cleanup -- Foswiki:Main.WillNorris
12 Feb 2010: robust parsing of password protected XLS files
02 Oct 2009: extracted from Foswiki:Extensions/KinoSearchContrib (MD)
Dependencies:
Name                     Version  Description
File::MMagic             >0       Required
Module::Pluggable        >0       Required
Spreadsheet::ParseExcel  >0       Required for .xls files
Spreadsheet::XLSX        >0       Required for .xlsx files
Encode                   >0       Required
Error                    >0       Required
ppthtml                  >0       Required for indexing .ppt files. Part of xlhtml
pdftotext                >0       Required for indexing .pdf. Part of poppler-utils
antiword                 >0       One of antiword, abiword or wvWare is required for .doc files
abiword                  >0       One of antiword, abiword or wvWare is required for .doc files
wvWare                   >0       One of antiword, abiword or wvWare is required for .doc files
html2text                >0       Required for indexing html files
odt2txt                  >0       Required for indexing OpenDocument and StarOffice documents
Home: Foswiki:Extensions/StringifierContrib
Support: Foswiki:Support/StringifierContrib
Topic revision: r2 - 22 Sep 2011, AdminUser