[KinoSearch] How do you index ms office (.doc, .xls, .ppt) files with kinosearch
peter at peknet.com
Mon Aug 25 07:00:00 PDT 2008
On 08/25/2008 08:42 AM, Henry wrote:
> On Mon, August 25, 2008 1:12 pm, Ben Aurel wrote:
>> My question is, what would you suggest for indexing office formats ?
>> How do you extract text without OLE and an office installation on
>> the server?
> You use file conversion utilities such as pdftotext, xlhtml, wvHtml etc.
> Most of these are far from perfect, sometimes crashing, etc.
Also, check out SWISH::Filter on CPAN. It uses many of those tools underneath, but
provides a common interface for converting documents to parseable text.
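
To illustrate the general approach discussed above (shelling out to a converter and tolerating its crashes), here is a minimal sketch in Python. It is not how SWISH::Filter or KinoSearch does it. The function name extract_text, the converter table, and the "{path}" placeholder are all hypothetical. The only command line assumed is "pdftotext FILE -", which writes text to stdout; entries for other tools would need to be checked against their own documentation.

```python
import subprocess
from pathlib import Path

# Hypothetical table: file extension -> argv template.
# "{path}" is replaced with the file name before the command runs.
DEFAULT_CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],  # "-" sends the text to stdout
}


def extract_text(path, converters=DEFAULT_CONVERTERS, timeout=30):
    """Return the converter's output as text, or None if nothing usable came back."""
    template = converters.get(Path(path).suffix.lower())
    if template is None:
        return None  # no converter registered for this extension

    argv = [arg.replace("{path}", str(path)) for arg in template]
    try:
        # check=True turns a non-zero exit (e.g. a crash) into an exception;
        # timeout guards against a converter that hangs on a bad file.
        result = subprocess.run(
            argv, capture_output=True, timeout=timeout, check=True
        )
    except (OSError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None  # missing binary, crash, or hang: skip this document

    return result.stdout.decode("utf-8", errors="replace")
```

An indexing loop could call extract_text() on each file and skip the ones that return None, so one bad document does not abort the whole run.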
Peter Karman . peter at peknet.com . http://peknet.com/
KinoSearch mailing list
KinoSearch at rectangular.com