[KinoSearch] Re: Analyzer API mods (was API request for KS::InvIndexer...)
Marvin Humphrey
marvin at rectangular.com
Fri May 4 14:59:56 PDT 2007
On May 4, 2007, at 11:52 AM, Peter Karman wrote:
> In Swish XS, I do something like this (not real code):
>
> swish_utf8ify(self, str)
>     SV* self;
>     SV* str;
>
>   CODE:
>
>     char * buf = SvPV(str, PL_na);
>
>     if (!SvUTF8(str))
>     {
>         if (swish_is_ascii(buf))
>             SvUTF8_on(str); /* flags original SV */
>         else
>             croak("%s is not flagged as a UTF-8 string and is not ASCII", buf);
>     }
>
> where swish_is_ascii() just makes sure there is no byte > 127. I'm sure
> there's a native Perl equivalent.
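
For reference, the check you describe is only a few lines of plain C. This is a rough sketch rather than Swish's actual code: the name is_ascii and the explicit length argument are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for swish_is_ascii(): returns 1 if no byte in
 * buf[0..len) is greater than 127, 0 otherwise. */
int
is_ascii(const char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        /* Cast first: plain char may be signed on this platform. */
        if ((unsigned char)buf[i] > 127)
            return 0;
    }
    return 1;
}
```

Passing the length explicitly means embedded NUL bytes don't cut the scan short.
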
The crucial perlapi functions are:

  * sv_utf8_upgrade -- Converts an SV's string to utf8.  The SV's
    UTF8 flag will end up set no matter what.  There are 3
    possible outcomes.
      o Source SV has UTF8 flag set: no-op.
      o Source SV is pure ASCII: sets UTF8 flag, but no effect on
        string.
      o Source SV does not have UTF8 flag set, has some bytes > 127:
        Converts string to utf8 assuming source encoding of Latin1,
        reallocating as necessary.
  * SvPVutf8 -- like SvPV, but converts the SV to utf8 first if
    necessary.
  * is_utf8_string -- Tests if a char* sequence of a length you
    specify is valid utf8.  Use this when you don't have access to or
    don't want to trust a scalar's UTF8 flag.
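
To make the third sv_utf8_upgrade outcome concrete, here's a standalone sketch of what a Latin1-to-UTF-8 conversion involves. It is not Perl's implementation, and latin1_to_utf8 is a hypothetical name rather than a perlapi function. It assumes the caller supplies an output buffer of at least 2 * len bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of perlapi: re-encode len bytes of
 * Latin1 from src as UTF-8 into dst.  dst must have room for 2 * len
 * bytes (the worst case).  Returns the number of bytes written. */
size_t
latin1_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t i, out = 0;

    assert(dst != NULL || len == 0);
    for (i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            /* ASCII bytes are identical in Latin1 and UTF-8. */
            dst[out++] = c;
        }
        else {
            /* Bytes 0x80-0xFF become a two-byte sequence:
             * lead byte 0xC2 or 0xC3, then one continuation byte. */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Pure ASCII input comes out byte-for-byte the same, which is why the second outcome in the list only needs the flag flipped.
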
None of them are directly equivalent to what you're doing. However,
using those (and a few others), I believe I've gotten KS to the point
where it handles all Perl character data transparently.
All output from KS is valid UTF-8 and has the UTF8 flag set, but you
wouldn't know that. Perl transparently downgrades everything to
Latin1 when it needs to. If there's a code point > 255, you might
see a "Wide character in print" warning when printing to a filehandle
which thinks it's Latin1 (as STDOUT does by default), but you'd only
see that if you were supplying KS with something other than Latin1 to
begin with.
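
The reverse direction can be sketched the same way. The function below is a hypothetical illustration, not Perl's internal downgrade code: it assumes valid UTF-8 input and gives up as soon as it meets a code point > 255, which is roughly the situation where Perl has no Latin1 representation to fall back on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of perlapi.  Assumes src holds valid
 * UTF-8.  Writes Latin1 bytes to dst (which needs room for len bytes)
 * and stores the output length in *out_len.  Returns 0 on success, or
 * -1 if the input can't be represented in Latin1 (a code point > 255,
 * or a byte sequence this sketch doesn't recognize). */
int
utf8_to_latin1(const unsigned char *src, size_t len,
               unsigned char *dst, size_t *out_len)
{
    size_t i = 0, out = 0;

    assert(out_len != NULL);
    while (i < len) {
        unsigned char c = src[i];
        if (c < 0x80) {
            /* ASCII: copy through unchanged. */
            dst[out++] = c;
            i += 1;
        }
        else if ((c == 0xC2 || c == 0xC3) && i + 1 < len) {
            /* Two-byte sequence encoding U+0080 through U+00FF. */
            dst[out++] = (unsigned char)(((c & 0x03) << 6)
                                         | (src[i + 1] & 0x3F));
            i += 2;
        }
        else {
            /* Any other lead byte means a code point above 255. */
            return -1;
        }
    }
    *out_len = out;
    return 0;
}
```
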
Here's XS code for StringHelper::utf8ify:

    void
    utf8ify(sv)
        SV *sv;
    PPCODE:
        sv_utf8_upgrade(sv);

> You'll notice I don't handle Latin1 or EBCDIC as the utf8::upgrade()
> claims to.
EBCDIC isn't worth worrying about, IMO -- too few machines use it.
Handling Latin1 might be easier than you think -- you just have to be
consistent about using SvPVutf8 instead of SvPV so that all entry
points into your own string handling code convert if necessary. You
don't have to convert back -- just give Perl SVs with valid UTF-8
strings and the UTF8 flag on, and Perl will behave correctly (modulo
esoteric weirdness in pack/unpack and the regex engine).
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/