[KinoSearch] utf8 (unicode) any progress on TokenBatch?

Marvin Humphrey marvin at rectangular.com
Tue Aug 15 22:40:42 PDT 2006




On Aug 15, 2006, at 1:11 AM, Marc Elser wrote:

> First of all thanks a lot for the fast patch. I Just installed it  
> from svn and stumbled across the following problems:

Thanks for the detailed reports.

> 1.) UTF-8 QueryStrings are still split by QueryParser at UTF-8  
> special characters for example at a-umlaut (or in german ä). This  
> still leads to the described problem that a word like
> "anlässlich" is split into "anl" and "sslich" which produces false
> matches for example a document which contains "Anleitung" and
> "verlässlich" which is something completely different would match.

This is an important result.  Even though the mis-tokenization  
happens to be due to a bug (see below), it illustrates why moving to  
UTF-8 is not backwards compatible.

If you have an index based on, say, Latin 1, and it uses characters  
above 127, they will have been indexed verbatim -- but now, as you're  
searching, the Query string will get passed through a UTF-8  
converter, changing it into a different sequence of bytes -- and  
either producing no results, or incorrect results.
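
To make the byte mismatch concrete, here is a minimal sketch, assuming the old index held Latin 1 bytes verbatim. The helper name hex_bytes is made up for illustration and is not part of KinoSearch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( encode );

# "anlässlich" as a character string; \x{E4} is a-umlaut.
my $word = "anl\x{E4}sslich";

# The byte sequences that would be stored under each encoding.
my $latin1 = encode( 'iso-8859-1', $word );
my $utf8   = encode( 'UTF-8',      $word );

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf( '%02X', $_ ) } unpack( 'C*', $bytes );
}

print "Latin 1: ", hex_bytes($latin1), "\n";
print "UTF-8:   ", hex_bytes($utf8),   "\n";
print "same bytes? ", ( $latin1 eq $utf8 ? "yes" : "no" ), "\n";
```

The a-umlaut is the single byte E4 in Latin 1 but the pair C3 A4 in UTF-8, so a term stored one way cannot be matched byte-for-byte by a query encoded the other way.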

Indexes which were produced under the current version WILL NOT WORK  
PROPERLY after we make the transition.  I intend to make KinoSearch  
refuse to read them, so that you'll know you need to revert if you  
can't regenerate right away.

It would be difficult, maybe impossible to make a translator.  I  
think I'm going to have to invoke the KinoSearch alpha clause.

We may as well make some planned changes to the file format at the  
same time (this is for the rich positions stuff mentioned a few  
months ago), consolidating all the disruptive stuff into one release.

> I don't know exactly where the problem is because your regex  
> \b\w+(?:'\w+)?\b should still work if the string it is used against  
> has the utf-8 flag on.

It's because the locale pragma was in effect (removing it fixed the  
bug):

slothbear:~/perltest marvin$ cat locale_regex.plx
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( _utf8_on );

my $m = "Mot\xC3\xB6rhead";
_utf8_on($m);

$m =~ /(\w*)/;
print "$1\n";

use locale;
$m =~ /(\w*)/;
print "$1\n";

slothbear:~/perltest marvin$ perl locale_regex.plx
Motörhead
Mot
slothbear:~/perltest marvin$

> Is It possible that the TokenBatch does not set the utf-8 flag  
> correctly in gettext or does it somehow corrupt the string it returns?

I don't believe so.  There are a few ways of testing a scalar to see  
if the flag is on.  For future reference, if you want to verify that  
for yourself, my favorite is Devel::Peek::Dump().  Look for "UTF8" in  
the "FLAGS" field, and for the bracketed [UTF8 "..."] rendering of the  
string in the "PV" field.

slothbear:~/perltest marvin$ cat peek_dump.plx
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( _utf8_on );
use Devel::Peek;

my $m = "Mot\xC3\xB6rhead";
_utf8_on($m);
Dump($m);

slothbear:~/perltest marvin$ perl peek_dump.plx
SV = PV(0x1801660) at 0x180b59c
   REFCNT = 1
   FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
   PV = 0x300bd0 "Mot\303\266rhead"\0 [UTF8 "Mot\x{f6}rhead"]
   CUR = 10
   LEN = 11
slothbear:~/perltest marvin$
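
Another of those ways, sketched here as a lighter-weight check than a full Dump: utf8::is_utf8() reports the flag directly, and comparing length() with bytes::length() shows whether Perl is counting characters or bytes. The describe() helper is hypothetical, written only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use bytes ();                 # load bytes::length without enabling the pragma
use Encode qw( _utf8_on );

my $m = "Mot\xC3\xB6rhead";

# Hypothetical helper: summarize the flag plus character and byte counts.
sub describe {
    my ( $label, $str ) = @_;
    return sprintf "%s: flag=%s chars=%d bytes=%d",
        $label,
        ( utf8::is_utf8($str) ? 'on' : 'off' ),
        length($str),
        bytes::length($str);
}

print describe( 'before', $m ), "\n";
_utf8_on($m);
print describe( 'after', $m ), "\n";
```

With the flag off the scalar is 10 characters and 10 bytes; with the flag on it is 9 characters and still 10 bytes, which agrees with CUR = 10 in the Dump output.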


> Cause I was also playing around with the "utf8::upgrade" function  
> to upgrading the text returned from TokenBatch to utf8 before  
> feeding it through the regex before trying your patched version,  
> but it somehow did an additional utf8 encoding to the string  
> causing the special character 'ä' being encoded twice resulting in  
> 2 strange characters instead of 'ä'. Maybe the same happens now  
> with the modified TokenBatch class.

If you can reproduce the problem, can you please provide me with a  
Devel::Peek Dump of before and after?
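
One possible explanation for the double encoding, offered as a sketch and not as a diagnosis: if the text already holds UTF-8 encoded bytes with the flag off, utf8::upgrade() treats every byte as a Latin 1 character and encodes each one again, while utf8::decode() interprets the same bytes as UTF-8. The internal_bytes() helper is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# UTF-8 encoded bytes for a-umlaut, with the UTF8 flag off.
my $bytes = "\xC3\xA4";

# Hypothetical helper: hex dump of the scalar's internal buffer.
sub internal_bytes {
    my ($str) = @_;
    utf8::encode($str) if utf8::is_utf8($str);    # works on the copy only
    return join ' ', map { sprintf( '%02X', $_ ) } unpack( 'C*', $str );
}

# upgrade: each byte is taken as one Latin 1 character and re-encoded.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# decode: the two bytes are read as a single UTF-8 character.
my $decoded = $bytes;
utf8::decode($decoded);

printf "upgraded: %d chars, bytes %s\n", length($upgraded), internal_bytes($upgraded);
printf "decoded:  %d chars, bytes %s\n", length($decoded),  internal_bytes($decoded);
```

The upgraded copy ends up as two characters (C3 83 C2 A4 internally), which would display as two strange characters in place of one a-umlaut; the decoded copy is one character (C3 A4).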

>
> 2.) The Highlighter does not like the UTF-8 Sequences. It looks  
> like it doesn't count UTF-8 special characters when computing the  
> insertion points for the <strong>...</strong> tags. It looks like  
> they are shift left by number of UTF-8 special characters. Here's  
> an example where the search term was "klammerte":
>
> === start example ===
> verfilzten Pelz beiden Händen nach Ansatzpunkten durchwühlend,  
> schwang sich Ford Rücken mächtigen Tie<strong>res klamm</strong>erte  
> sich, endlich sicher oben saß, beiden Händen braune  
> Gestrüpp ...
> === end example ===


I believe that this was actually due to a problem in Tokenizer/ 
TokenBatch.  Last night's patch measured Token starts in bytes from  
the beginning of the field, but Token ends in Unicode code points  
from the beginning of the field.  Highlighter uses this information  
(which has been stored in the index) to place the tags.  It expects  
byte counts -- the code-point count was bogus.

I've replaced the faulty algorithm with something slightly slower but  
less tricky.
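
The effect of mixing the two units can be reproduced in isolation. The following is a sketch only, not the Highlighter's code: the tag() helper and the sample text are assumptions, and for simplicity it uses code-point counts for both ends of the token.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( encode );

# The field as characters, and as the UTF-8 bytes that get tagged.
my $field_chars = "H\x{E4}nden klammerte sich";
my $field_bytes = encode( 'UTF-8', $field_chars );

# Token boundaries counted in code points...
my $token      = "klammerte";
my $char_start = index( $field_chars, $token );
my $char_end   = $char_start + length($token);

# ...and in bytes: the UTF-8 length of everything before each boundary.
my $byte_start = length( encode( 'UTF-8', substr( $field_chars, 0, $char_start ) ) );
my $byte_end   = length( encode( 'UTF-8', substr( $field_chars, 0, $char_end ) ) );

# Hypothetical helper: insert tags into a byte string at the given offsets.
sub tag {
    my ( $bytes, $start, $end ) = @_;
    my $out = $bytes;
    substr( $out, $end,   0, '</strong>' );    # end first, so $start stays valid
    substr( $out, $start, 0, '<strong>' );
    return $out;
}

print tag( $field_bytes, $byte_start, $byte_end ), "\n";    # byte offsets
print tag( $field_bytes, $char_start, $char_end ), "\n";    # code-point offsets
```

With byte offsets the tags wrap "klammerte" exactly. With code-point offsets they land one byte to the left for each two-byte character that precedes the token, which is the same kind of shift as in the reported example.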

> Please, let me know if you fixed these problems. Especially the  
> splitting of query terms at the wrong position is a big one. But I  
> gladly play beta tester for utf-8 text in KinoSearch if it makes  
> Kinosearch UTF-8 compatible :-)

I appreciate the offer.   We're going to need a few more tests.  It's  
clear that the current test suite is not adequate, since it did not  
reveal these problems.

> P.S. Did you ever think of wildcard searches like lucene offers. I  
> would very much like to search for "business*" and also find  
> "businesspartner"   or search for "b*ess" and find "business" but  
> also "boldness". I know that wildcards in front of the search terms  
> are not supported by clucene and I think it slows things down but I  
> really wonder if wildcards in between characters or at the end of  
> the search term could be implemented in Kinosearch with good search  
> speed.

Wildcard search in Lucene basically turns bus* into a giant OR'd  
query, iterating through all the terms in the index that begin with  
"bus".   It's highly inefficient, producing unacceptably poor  
performance much sooner than any of the other Query types.  It also  
raises an exception when more than 1024 terms are matched -- you can  
set that number higher, but then you may well run out of memory.   
People write all the time to the Lucene user list griping about  
either the poor performance or the exceptions.
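
A toy sketch of that expansion strategy, for illustration only. It is not Lucene's code: the term list, the expand_prefix() helper, and the error message are all made up, and a real implementation would walk the term dictionary on disk instead of an in-memory array.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the index's sorted term dictionary.
my @terms = sort qw( bus business businesspartner busy butter cat );

# Cap on expanded clauses, mirroring the 1024 limit described above.
my $max_clauses = 1024;

# Expand "prefix*" into one big OR over every term starting with the prefix.
sub expand_prefix {
    my ( $prefix, $terms, $max ) = @_;
    my @matches = grep { index( $_, $prefix ) == 0 } @$terms;
    die "too many clauses: " . scalar(@matches) . " > $max\n"
        if @matches > $max;
    return join ' OR ', @matches;
}

print expand_prefix( 'bus', \@terms, $max_clauses ), "\n";
```

Every matching term becomes its own clause, so the cost grows with the number of terms sharing the prefix, not with the length of the query the user typed.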

I'm not a big fan of that implementation.  But then, the  
alternatives, such as indexing all bigrams and trigrams etc, aren't  
really that much better.  Wildcard search is just inherently more  
resource intensive than keyword search, because wildcards typically  
match SO many more documents.  Looking at it from a user's  
perspective, though -- e.g. browsing the Lucene docs -- you wouldn't  
know that.

There are many, many opportunities for expanding KinoSearch's  
capabilities which make better use of the inverted index data  
structure design.  Personally I don't imagine I'll ever add Wildcard  
queries to KinoSearch, and I'll be casting a vote for Lucy to avoid  
them as well -- at least in core.  They probably belong in a  
"contrib" or "experimental" section somewhere with a "WARNING: VERY  
SLOW" label at the top of the docs.

Do you know if anyone has tried a dictionary-based tokenizer for  
German?  I understand that with all the compound words, German needs  
substring search more than other Indo-European languages.  Stealing a  
page from the CJK playbook and splitting on words would cost a lot at  
index time, but be much faster than wildcards at search-time and  
maybe address the same need.  Would that help, at least in theory?
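
To show what such a tokenizer might do, here is a minimal sketch assuming a plain word list is available. The dictionary contents and the split_compound() helper are hypothetical; nothing like this exists in KinoSearch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical toy dictionary; a real one would hold many thousands of words.
my %dict = map { $_ => 1 } qw( business partner haus boot );

# Split a word into dictionary words, trying the longest prefix first and
# backtracking on failure.  Returns an array ref, or undef if no split fits.
sub split_compound {
    my ( $word, $dict ) = @_;
    return [] if $word eq '';
    for my $len ( reverse 1 .. length($word) ) {
        my $head = substr( $word, 0, $len );
        next unless $dict->{$head};
        my $rest = split_compound( substr( $word, $len ), $dict );
        return [ $head, @$rest ] if $rest;
    }
    return;
}

# Index the whole word plus its parts, so either form can be searched.
for my $word (qw( businesspartner hausboot xyzzy )) {
    my $parts  = split_compound( $word, \%dict );
    my @tokens = ( $word, $parts ? @$parts : () );
    print join( ' ', @tokens ), "\n";
}
```

The extra work happens once per word at index time; at search time "partner" is then an ordinary term lookup with no expansion step.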

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

slothbear:~/projects/ks marvin$ svn diff -r 1026
Index: t/601-queryparser.t
===================================================================
--- t/601-queryparser.t (revision 1026)
+++ t/601-queryparser.t (working copy)
@@ -4,17 +4,20 @@
use lib 't';
use KinoSearch qw( kdump );
-use Test::More tests => 205;
+use Test::More tests => 207;
use File::Spec::Functions qw( catfile );
BEGIN { use_ok('KinoSearch::QueryParser::QueryParser') }
+use KinoSearchTestInvIndex qw( create_invindex );
+
use KinoSearch::InvIndexer;
use KinoSearch::Searcher;
use KinoSearch::Store::RAMInvIndex;
use KinoSearch::Analysis::Tokenizer;
use KinoSearch::Analysis::Stopalizer;
use KinoSearch::Analysis::PolyAnalyzer;
+use KinoSearch::Util::StringHelper qw( utf8_flag_on );
my $whitespace_tokenizer
      = KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );
@@ -175,3 +178,16 @@
      #exit;
}
+my $motorhead = "Mot\xC3\xB6rhead";
+utf8_flag_on($motorhead);
+$invindex = create_invindex($motorhead);
+my $tokenizer = KinoSearch::Analysis::Tokenizer->new;
+$searcher = KinoSearch::Searcher->new(
+    analyzer => $tokenizer,
+    invindex => $invindex,
+);
+
+my $hits = $searcher->search('Mot');
+is( $hits->total_hits, 0, "Pre-test - indexing worked properly" );
+$hits = $searcher->search($motorhead);
+is( $hits->total_hits, 1, "QueryParser parses UTF-8 strings correctly" );
Index: lib/KinoSearch/Analysis/Tokenizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Tokenizer.pm        (revision 1026)
+++ lib/KinoSearch/Analysis/Tokenizer.pm        (working copy)
@@ -3,7 +3,6 @@
use warnings;
use KinoSearch::Util::ToolSet;
use base qw( KinoSearch::Analysis::Analyzer );
-use locale;
BEGIN {
      __PACKAGE__->init_instance_vars(
@@ -50,20 +49,24 @@
      # alias input to $_
      while ( $batch->next ) {
          local $_ = $batch->get_text;
+        my $copy = $_;
-        # ensure that pos is set to 0 for this scalar
-        pos = 0;
-
          # accumulate token start_offsets and end_offsets
          my ( @starts, @ends );
-        1 while ( m/$separator_re/g and push @starts,
-            pos and m/$token_re/g and push @ends, pos );
+        my $orig_length = bytes::length($_);
+        while (1) {
+            s/$separator_re//;
+            push @starts, $orig_length - bytes::length($_);
+            last unless s/$token_re//;
+            push @ends, $orig_length - bytes::length($_);
+        }
+
          # correct for overshoot
          $#starts = $#ends;
          # add the new tokens to the batch
-        $new_batch->add_many_tokens( $_, \@starts, \@ends );
+        $new_batch->add_many_tokens( $copy, \@starts, \@ends );
      }
      return $new_batch;
Index: lib/KinoSearch/Analysis/TokenBatch.pm
===================================================================
--- lib/KinoSearch/Analysis/TokenBatch.pm       (revision 1026)
+++ lib/KinoSearch/Analysis/TokenBatch.pm       (working copy)
@@ -69,7 +69,6 @@
      char *string_start = SvPV(string_sv, len);
      I32 i;
      const I32 max = av_len(starts_av);
-    STRLEN unicount = 0;
      for (i = 0; i <= max; i++) {
          STRLEN start_offset, end_offset;
@@ -93,24 +92,9 @@
              Kino_confess("end_offset > len (%d > %"UVuf")",
                  end_offset, (UV)len);
-        /* advance the pointer past as many unicode characters as required */
-        while (1) {
-            if (unicount == start_offset)
-                break;
-
-            /* header byte */
-            string_start++;
-
-            /* continutation bytes */
-            while ((*string_start & 0xC0) == 0xC0)
-                string_start++;
-
-            unicount++;
-        }
-
          /* calculate the start of the substring and add the token */
          token = Kino_Token_new(
-            string_start,
+            string_start + start_offset,
              (end_offset - start_offset),
              start_offset,
              end_offset,
Index: lib/KinoSearch/Index/Term.pm
===================================================================
--- lib/KinoSearch/Index/Term.pm        (revision 1026)
+++ lib/KinoSearch/Index/Term.pm        (working copy)
@@ -12,6 +12,8 @@
      __PACKAGE__->ready_get_set(qw( field text ));
}
+use KinoSearch::Util::StringHelper qw( utf8_flag_on utf8_flag_off );
+
sub new {
      croak("usage: KinoSearch::Index::Term->new( field, text )")
          unless @_ == 3;
@@ -26,6 +28,7 @@
sub new_from_string {
      my ( $class, $termstring, $finfos ) = @_;
      my $field_num = unpack( 'n', bytes::substr( $termstring, 0, 2, '' ) );
+    utf8_flag_on($termstring);
      my $field_name = $finfos->field_name($field_num);
      return __PACKAGE__->new( $field_name, $termstring );
}
@@ -37,7 +40,9 @@
      my ( $self, $finfos ) = @_;
      my $field_num = $finfos->get_field_num( $self->{field} );
      return unless defined $field_num;
-    return pack( 'n', $field_num ) . $self->{text};
+    my $termtext = $self->{text};
+    utf8_flag_off($termtext);
+    return pack( 'n', $field_num ) . $termtext;
}
sub to_string {
Index: lib/KinoSearch/Util/StringHelper.pm
===================================================================
--- lib/KinoSearch/Util/StringHelper.pm (revision 1026)
+++ lib/KinoSearch/Util/StringHelper.pm (working copy)
@@ -3,7 +3,7 @@
use warnings;
use base qw( Exporter );
-our @EXPORT_OK = qw( utf8_flag_on );
+our @EXPORT_OK = qw( utf8_flag_on utf8_flag_off );
1;
@@ -19,6 +19,12 @@
PPCODE:
      SvUTF8_on(sv);
+void
+utf8_flag_off(sv)
+    SV *sv;
+PPCODE:
+    SvUTF8_off(sv);
+
__H__
#ifndef H_KINO_STRING_HELPER



