[KinoSearch] How do you index ms office (.doc, .xls, .ppt) files with kinosearch

Peter Karman peter at peknet.com
Mon Aug 25 07:00:00 PDT 2008

On 08/25/2008 08:42 AM, Henry wrote:
> On Mon, August 25, 2008 1:12 pm, Ben Aurel wrote:
>> My question is, what would you suggest for indexing office formats ?
>> How do you extract text without OLE and an office installation on
>> the server?
> 
> You use file conversion utilities such as pdftotext, xlhtml, or wvHtml.
> Most of these are far from perfect and sometimes crash.
> 

Also, check out SWISH::Filter on CPAN. It uses many of those tools underneath, but it
provides a common interface for converting documents to parseable text.
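
To make the "conversion utility behind a common interface" idea concrete, here is a rough
sketch in Python rather than Perl. It is not SWISH::Filter's API and not anything KinoSearch
ships. The names CONVERTERS, converter_command and extract_text are made up for
illustration, and the command lines are my assumptions about how pdftotext and xlhtml are
usually invoked, so check the man pages. wvHtml is left out because, as far as I know, it
wants an output file name instead of writing to stdout. Since the tools do crash, any
failure just returns None and the caller can skip that document.

```python
import os
import subprocess

# Hypothetical dispatch table: file extension -> function building the command.
# The argument layouts are assumptions; verify them against each tool's man page.
CONVERTERS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],  # "-" is assumed to mean stdout
    ".xls": lambda path: ["xlhtml", path],          # assumed to write HTML to stdout
}


def converter_command(path):
    """Return the argv list for this file's extension, or None if unsupported."""
    ext = os.path.splitext(path)[1].lower()
    build = CONVERTERS.get(ext)
    return build(path) if build else None


def extract_text(path, timeout=30):
    """Run the external converter and return its output, or None on any failure."""
    cmd = converter_command(path)
    if cmd is None:
        return None
    try:
        result = subprocess.run(
            cmd, capture_output=True, timeout=timeout, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # Covers a missing tool, a crash (non-zero exit), and a timeout.
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string (plain text for pdftotext, HTML for xlhtml) is what you would then hand
to your indexer, stripping markup first where needed.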
-- 
Peter Karman  .  peter at peknet.com  .  http://peknet.com/


_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



