[KinoSearch] Question) Unicode AND Sorting

Marvin Humphrey marvin at rectangular.com
Thu Aug 3 13:16:08 PDT 2006

On Aug 3, 2006, at 6:49 AM, 골빈해커 wrote:

> 1) How can I index Unicode (UTF-8) text?

I was going to say, "the same way you handle regular text", but I've  
just realized that the TokenBatch class is not preserving the UTF-8  
flag of the scalars that it's derived from -- and therefore, all of  
KinoSearch's Analyzers function in a non-UTF-8 context.  :(  So right  
this moment the only way to do it is to write your own Tokenizer class.

I'm slammed putting out fires for my main client right now and can't  
work on this today, but fixing this behavior is a high priority.  The  
fix will be to have the TokenBatch absorb the UTF-8 flag of the latest  
scalar that gets assigned to it.  After that, the regular expressions  
in KinoSearch's Tokenizer will adapt themselves and function either  
in a UTF-8 context or not depending on the input.
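
To illustrate what "UTF-8 context" means for a regex, here is a minimal
sketch in plain Perl. It does not use KinoSearch at all: the tokenize()
helper and its \w+ pattern are hypothetical stand-ins, not the actual
Tokenizer code. It assumes no "use utf8", "use locale", or
"unicode_strings" feature is in effect, so the same pattern matches
differently depending only on whether the input scalar carries the flag.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "café bar"; the "é" is the two bytes 0xC3 0xA9.
my $bytes = "caf\xC3\xA9 bar";

# Decoding produces a character string with the UTF-8 flag turned on.
my $chars = decode('UTF-8', $bytes);

# Hypothetical stand-in for a tokenizing regex (not KinoSearch's own code).
sub tokenize {
    my ($text) = @_;
    return $text =~ /(\w+)/g;
}

# Flag off: the high bytes are not word characters, so "café" is cut short.
my @byte_tokens = tokenize($bytes);

# Flag on: "é" is a single word character, so "café" survives intact.
my @char_tokens = tokenize($chars);

printf "flag off (%d): %s\n", utf8::is_utf8($bytes) ? 1 : 0,
    join('|', @byte_tokens);
printf "flag on  (%d): %s\n", utf8::is_utf8($chars) ? 1 : 0,
    join('|', map { encode('UTF-8', $_) } @char_tokens);
```

If the flag is lost somewhere between the input scalar and the regex,
the second case silently degrades into the first, which is the kind of
breakage described above.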

> 2) How can I sort by field value?

This is only possible at present using a somewhat inefficient hack  
that violates KinoSearch's public API.


