[KinoSearch] Fwd: [rt.cpan.org #18899] KinoSearch and locale

Marvin Humphrey marvin at rectangular.com
Mon Apr 24 23:54:05 PDT 2006




Hi folks,

This nice bug report came in today.  I'll cc my reply to the list
once I've worked one up.  The patch looks very simple, but I don't know
all the implications of turning on the locale pragma -- I've never had
to use it myself.

KinoSearch aspires to be completely agnostic about encoding, and to
just treat everything as bytestrings.  In this regard, it differs
from Lucene, which uses Sun's "Modified UTF-8" for the index files and,
of course, Java chars for all strings.  The goal is to allow the user
to choose an arbitrary encoding -- it would be especially good if CJK
users could have KinoSearch keep all data in UTF-16, both in the
index and throughout all KinoSearch processes.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

---------------------------------------------------------------------
Begin forwarded message:

From: "Guest via RT" <bug-KinoSearch at rt.cpan.org>
Date: April 24, 2006 8:25:36 AM PDT
To: undisclosed-recipients:;
Subject: [rt.cpan.org #18899] KinoSearch and locale
Reply-To: bug-KinoSearch at rt.cpan.org


Mon Apr 24 11:25:35 2006: Request 18899 was acted upon.
Transaction: Ticket created by guest
        Queue: KinoSearch
      Subject: KinoSearch and locale
        Owner: Nobody
   Requestors: aver at pvk.org.ru
       Status: new
  Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=18899 >


Hello.

I've tried to use KinoSearch version 0.10_01 with English and Russian
(KOI8-R charset) texts and had some problems. For English text it works
as expected, but it doesn't work for texts in the KOI8-R charset, because
the 'locale' Perl pragma is not used in the Tokenizer and LCNormalizer
classes (they use lc() and regular expressions).
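
A minimal sketch of the symptom described above, not taken from KinoSearch itself. It assumes the text is held as raw KOI8-R bytes (the sample bytes and variable names are illustrative only) and that neither 'use locale' nor any Unicode-aware feature is in effect. Under those assumptions Perl applies ASCII-only rules to lc() and \w, so Cyrillic bytes pass through unchanged and never count as word characters:

```perl
use strict;
use warnings;

# Illustrative sample: four capital Cyrillic letters as raw KOI8-R bytes
# (assumption: the text arrives as a byte string, not a decoded string).
my $koi8_upper = "\xF0\xE5\xF2\xEC";

# No 'use locale' in scope, so lc() only maps ASCII A-Z.
my $lowered = lc($koi8_upper);

# Likewise, \w only matches ASCII word characters in a byte string.
my $word_char_count = () = $koi8_upper =~ /\w/g;

# ASCII text is handled as expected, which is why English works.
my $ascii_lowered = lc("PERL");

printf "KOI8-R before lc: %s\n", unpack( 'H*', $koi8_upper );
printf "KOI8-R after lc:  %s\n", unpack( 'H*', $lowered );
printf "\\w matches:       %d\n", $word_char_count;
printf "ASCII after lc:   %s\n", $ascii_lowered;
```

The hex dumps before and after lc() should be identical, and the \w match count should be zero, which is consistent with both the normalizer and the tokenizer failing on KOI8-R input.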

I've created LocalizedLCNormalizer (with 'use locale;') and
LocalizedPolyAnalyzer (it calls Tokenizer with token_re & separator_re
parameters defined in a scope with 'use locale;') and they work well.
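
The effect of the pragma's lexical scope can be sketched as follows. This is a hedged illustration, not KinoSearch code: the helper names lc_plain and lc_localized are hypothetical, and the locale name 'ru_RU.KOI8-R' is an assumption that may be spelled differently, or be missing entirely, on a given system. The script therefore checks the result of setlocale() before relying on it:

```perl
use strict;
use warnings;
use POSIX qw( setlocale LC_CTYPE );

# Four Cyrillic letters as raw KOI8-R bytes, in capitals and in lowercase.
my $koi8_upper = "\xF0\xE5\xF2\xEC";
my $koi8_lower = "\xD0\xC5\xD2\xCC";

# Hypothetical helper: lc() outside any 'use locale' scope.
sub lc_plain { return lc( $_[0] ) }

# Hypothetical helper: 'use locale' is lexically scoped, so it only
# affects the lc() call inside this sub body.
sub lc_localized {
    use locale;
    return lc( $_[0] );
}

# Assumption: the system provides a locale under this exact name.
my $locale_ok = setlocale( LC_CTYPE, 'ru_RU.KOI8-R' );

if ($locale_ok) {
    printf "plain:     %s\n", unpack( 'H*', lc_plain($koi8_upper) );
    printf "localized: %s\n", unpack( 'H*', lc_localized($koi8_upper) );
}
else {
    print "ru_RU.KOI8-R locale not available; skipping comparison\n";
}
```

In normal use the locale would usually come from the environment (LANG or LC_ALL) rather than from an explicit setlocale() call; the call is made here only so the sketch does not depend on the caller's environment.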

The KinoSearch::Highlight::Highlighter class has the same problem: it also
uses regular expressions in a scope without 'use locale;', and doesn't
work with the KOI8-R charset. Of course I can create a LocalizedHighlighter
whose only difference is the locale pragma, but that isn't a good solution.

Why not use the locale pragma in the Perl part of KinoSearch?

package LocalizedPolyAnalyzer;

use strict;
use warnings;
use Carp;    # provides croak(), which init_instance calls
use locale;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::Analysis::Tokenizer;
use KinoSearch::Analysis::Stemmer;

use base qw( KinoSearch::Analysis::PolyAnalyzer);

our %instance_vars = __PACKAGE__->init_instance_vars();

sub init_instance {
	my $self     = shift;
	my $language = $self->{language} = lc($self->{language});

	croak("Must specify 'language'") unless $language;

	$self->{analyzers} = [
		#KinoSearch::Analysis::LCNormalizer->new(language => $language),
		LocalizedLCNormalizer->new(language => $language),
		#KinoSearch::Analysis::Tokenizer->new(language => $language),
		KinoSearch::Analysis::Tokenizer->new(
			language     => $language,
			token_re     => qr/\b\w+(?:'\w+)?\b/,
			separator_re => qr/\W*/
		),
		KinoSearch::Analysis::Stemmer->new(language => $language),
	];
}

package LocalizedLCNormalizer;
use strict;
use warnings;
use locale;
use base qw( KinoSearch::Analysis::LCNormalizer );

our %instance_vars = __PACKAGE__->init_instance_vars();

sub analyze {
	my ($self, $token_batch) = @_;

	# lc all of the terms, one by one
	while ($token_batch->next) {
		$token_batch->set_text(lc($token_batch->get_text));
	}

	return $token_batch;
}

1;




_______________________________________________
KinoSearch mailing list
KinoSearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch



