[KinoSearch] utf8 (unicode) any progress on TokenBatch?
David Wheeler
david at kineticode.com
Mon Aug 14 12:24:52 PDT 2006
On Aug 14, 2006, at 12:10, Marvin Humphrey wrote:
> My inclination now is to go with the radical solution: force non-
> UTF-8 source material into UTF-8 when it gets imported into a
> TokenBatch (we'd guess the source encoding based on locale), make
> all of KinoSearch's internals expect Unicode, and always output
> Unicode.
Make it possible for the encoding to be supplied to TokenBatch so
that you don't always have to guess. And guessing will be wrong, at
least sometimes.
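A minimal sketch of that idea, in Python rather than Perl purely for
illustration. The names to_unicode and to_utf8 and the encoding
parameter are hypothetical and are not part of KinoSearch or
TokenBatch. It assumes the caller may pass the source encoding and
that the locale is consulted only as a fallback guess:

```python
import locale


def to_unicode(raw, encoding=None):
    """Decode raw bytes to text, preferring a caller-supplied encoding.

    Hypothetical helper: `encoding` stands in for an optional argument
    that an importer could accept instead of always guessing.
    """
    if isinstance(raw, str):
        # Already Unicode text; nothing to convert.
        return raw
    if encoding is None:
        # Fallback guess based on the locale. This can be wrong for the
        # data at hand, which is why an explicit encoding is preferred.
        encoding = locale.getpreferredencoding(False)
    # Raises UnicodeDecodeError if the bytes are invalid for `encoding`.
    return raw.decode(encoding)


def to_utf8(raw, encoding=None):
    """Return the input re-encoded as UTF-8 bytes."""
    return to_unicode(raw, encoding).encode("utf-8")
```

With an explicit encoding, to_utf8(b"caf\xe9", encoding="latin-1")
decodes the Latin-1 byte 0xE9 and re-encodes it as the two-byte UTF-8
sequence 0xC3 0xA9. Without the argument, the result depends on
whatever the locale happens to report.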
We've taken a similar approach for Bricolage: Everything is required
to be UTF-8, and if it's not, we have to know what it is so that we
can convert it to UTF-8. This makes things a hell of a lot simpler in
the long run, because the rules are so straightforward. I'll admit,
though, that it took a bit of doing to find all those places that
weren't setting the utf8 flag...
I have to admit, Unicode is the one thing that Java got right and
better than any other language. I hope Perl 6 does the same. :-)
Best,
David