[KinoSearch] Invalid UTF-8

Peter Karman peter at peknet.com
Mon Jan 25 22:09:20 PST 2010


Peter Karman wrote on 1/25/10 9:12 PM:

> I'll try and create a test case. I suspect it's going to be because I'm 
> using a lot of fields of various FieldType combinations.
> 

Here's the test case.

First, you need to create a corpus to test with. I use this script:

http://svn.swish-e.org/libswish3/trunk/perl/docmaker.pl

like this:

  perl docmaker.pl \
     --utf_factor=0 \
     --write_files \
     --tmp_dir path/to/my/testdocs/ \
     --max_files 33000 \
     --max_words 3 \
     --tmp_dir_segments 2

You could also make fewer files with more words in them, or use a different
corpus altogether. But there appears to be something magical in the
*total number* of terms parsed.
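
To compare corpora by total size, a rough sketch like the one below could help. It is only an approximation: it counts whitespace-separated tokens after the same crude tag stripping the test script uses, which is not necessarily what PolyAnalyzer would emit. The names count_tokens and count_corpus are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: strip tags the same way the test script does,
# then count whitespace-separated tokens. Not the analyzer's real count.
sub count_tokens {
    my $buf = shift;
    $buf =~ s/<.+?>//sg;
    my @tokens = split ' ', $buf;
    return scalar @tokens;
}

# Hypothetical helper: walk the given directories and total up the
# files and tokens, using the same filename filter as the test script.
sub count_corpus {
    my @dirs = @_;
    my ( $files, $tokens ) = ( 0, 0 );
    find(
        {   no_chdir => 1,
            wanted   => sub {
                return unless -f $_ && $_ =~ m/\.xml/;
                open my $fh, '<', $_ or die "can't read $_: $!";
                local $/;    # slurp the whole file
                my $buf = <$fh>;
                close $fh;
                $tokens += count_tokens( defined $buf ? $buf : '' );
                $files++;
            },
        },
        @dirs
    );
    return ( $files, $tokens );
}

if (@ARGV) {
    my ( $files, $tokens ) = count_corpus(@ARGV);
    print "$files files, $tokens tokens\n";
}
```

Running it against path/to/my/testdocs/ before and after changing --max_files or --max_words would show whether two corpora have a comparable total.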

Second, here's the test script:

--------------------8<------------------------
#!/usr/bin/env perl
use strict;
use warnings;

use File::Find;
use File::Slurp;
use Data::Dump qw( dump );
use KinoSearch::Indexer;
use KinoSearch::Schema;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::FieldType::FullTextType;
use KinoSearch::FieldType::StringType;

my $usage = "$0 path/to/files\n";
die $usage unless @ARGV;

my $path_to_index = 'test-ks-utf8';
my $lang          = 'en';
my $schema        = KinoSearch::Schema->new();
my $analyzer  = KinoSearch::Analysis::PolyAnalyzer->new( language => $lang, );
my $fieldtype = KinoSearch::FieldType::FullTextType->new(
     analyzer      => $analyzer,
     highlightable => 1,
     sortable      => 1,
);
my $stringtype = KinoSearch::FieldType::StringType->new( sortable => 1, );
$schema->spec_field(
     name => 'swishtitle',
     type => $fieldtype,
);
$schema->spec_field(
     name => 'swishdefault',
     type => $fieldtype,
);

for my $property_name (
     qw(
     swishdescription
     swishdocpath
     swishdocsize
     swishencoding
     swishlastmodified
     swishmime
     swishparser
     swishwordnum
     )
     )
{
     $schema->spec_field(
         name => $property_name,
         type => $stringtype,
     );
}

my $indexer = KinoSearch::Indexer->new(
     schema => $schema,
     index  => $path_to_index,
     create => 1,
);

my $count = 0;

find( { wanted => \&wanted, no_chdir => 1 }, @ARGV );
print "Crawled $count documents\n";
$indexer->commit();

sub wanted {
     my $filename = $File::Find::name;
     return unless $filename =~ m/\.xml/;
     my $doc = parse_file($filename);

     #warn dump $doc;

     $indexer->add_doc($doc);
     $count++;
}

sub parse_file {
     my $file = shift;
     my $buf  = read_file($file);
     $buf =~ s/<.+?>//sg;
     return {
         swishtitle        => "",  # yes, empty
         swishdescription  => "",  # yes, empty
         swishdefault      => $buf,
         swishlastmodified => ( stat($file) )[9],
         swishdocsize      => ( stat($file) )[7],
         swishparser       => 'XML',
         swishmime         => 'application/xml',
         swishencoding     => 'utf-8',
         swishdocpath      => $file,
         swishwordnum      => 0,   # yes, zero
     };
}
--------------------8<------------------------

Here are some things I notice.

1) If I comment out both swishwordnum and swishdescription in parse_file(),
it works.

2) If I comment out swishdescription alone, it fails.

3) If I comment out swishwordnum alone, it fails.
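
Rather than editing parse_file() by hand for each case, the fields could be dropped at run time. This is a minimal sketch, assuming a made-up SKIP_FIELDS environment variable and a made-up helper named without_fields; neither is part of KinoSearch.

```perl
use strict;
use warnings;

# Hypothetical helper: return a shallow copy of the doc hashref
# with the named fields removed. The original hash is left untouched.
sub without_fields {
    my ( $doc, @skip ) = @_;
    my %copy = %$doc;
    delete @copy{@skip};
    return \%copy;
}

# SKIP_FIELDS is an assumed variable name, e.g.
#   SKIP_FIELDS=swishwordnum,swishdescription perl test.pl path/to/files
my @skip = split /,/, ( defined $ENV{SKIP_FIELDS} ? $ENV{SKIP_FIELDS} : '' );
```

In wanted(), the call would then become $indexer->add_doc( without_fields( $doc, @skip ) ), so each of the three cases can be run without touching the script.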

I'm all in for tonight, but hopefully this helps expose what's going on,
either with my code or in KS.
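
One way to rule out the input files themselves would be to check each buffer before it goes to add_doc(). The sketch below assumes the core Encode module and a made-up helper name, is_valid_utf8. It says nothing about what KinoSearch does internally.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the byte string decodes as strict UTF-8.
# Encode::decode may modify its argument when a CHECK value is given,
# so work on a copy.
sub is_valid_utf8 {
    my $bytes = shift;
    my $copy  = $bytes;
    my $ok    = eval {
        Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK );
        1;
    };
    return $ok ? 1 : 0;
}
```

Calling it on $buf in parse_file() and warning with the file name when it returns false would show whether any document in the corpus is malformed on disk.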

cheers,
pek
-- 
Peter Karman  .  http://peknet.com/  .  peter at peknet.com


