[KinoSearch] utf8 (unicode) any progress on TokenBatch?

David Wheeler david at kineticode.com
Mon Aug 14 12:24:52 PDT 2006



On Aug 14, 2006, at 12:10, Marvin Humphrey wrote:

> My inclination now is to go with the radical solution: force non- 
> UTF-8 source material into UTF-8 when it gets imported into a  
> TokenBatch (we'd guess the source encoding based on locale), make  
> all of KinoSearch's internals expect Unicode, and always output  
> Unicode.

Make it possible for the encoding to be supplied to TokenBatch so  
that you don't always have to guess. And guessing will be wrong, at  
least sometimes.
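
A minimal sketch of that idea, in Python rather than Perl purely for
illustration. The helper name `to_utf8` and its `encoding` parameter
are hypothetical and not part of KinoSearch's API; the locale-based
fallback stands in for the guessing described in the quoted text.

```python
import locale
from typing import Optional


def to_utf8(raw: bytes, encoding: Optional[str] = None) -> bytes:
    """Hypothetical helper: normalise source bytes to UTF-8.

    If the caller knows the source encoding, it is used as-is.
    Otherwise we fall back to a locale-based guess, which may be wrong.
    """
    if encoding is None:
        # Assumption: guess from the process locale when nothing is supplied.
        encoding = locale.getpreferredencoding(False)
    # Decoding raises UnicodeDecodeError if the bytes do not match the encoding.
    text = raw.decode(encoding)
    return text.encode("utf-8")
```

Passing the encoding explicitly, as in `to_utf8(data, encoding="latin-1")`,
skips the guess entirely; omitting it leaves the result dependent on
the machine's locale.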

We've taken a similar approach for Bricolage: Everything is required  
to be UTF8, and if it's not, we have to know what it is so that we  
can convert it to UTF8. This makes things a hell of a lot simpler in  
the long run, because the rules are so straightforward. I'll admit,  
though, that it took a bit of doing to find all those places that  
weren't setting the utf8 flag...

I have to admit, Unicode is the one thing that Java got right and  
better than any other language. I hope Perl 6 does the same. :-)

Best,

David
