[KinoSearch] Re: Analyzer API mods (was API request for KS::InvIndexer...)

Marvin Humphrey marvin at rectangular.com
Fri May 4 14:59:56 PDT 2007




On May 4, 2007, at 11:52 AM, Peter Karman wrote:
> In Swish XS, I do something like this (not real code):
>
> swish_utf8ify(self, str)
>   SV* self;
>   SV* str;
>
>   CODE:
>
>   char * buf = SvPV(str, PL_na);
>
>   if (!SvUTF8(str))
>   {
>       if (swish_is_ascii(buf))
>          SvUTF8_on(str);     /* flags original SV */
>       else
>          croak("%s is not flagged as a UTF-8 string and is not ASCII", buf);
>   }
>
> where swish_is_ascii() just makes sure there is no byte > 127. I'm sure
> there's a native Perl equivalent.
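
For anyone following along, the byte check Peter describes could be sketched in plain C roughly as below. The name bytes_are_ascii and its signature are made up for illustration; this is not the actual swish_is_ascii() from Swish, whose real interface isn't shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the check described above: returns 1 if
 * no byte in buf[0..len) is greater than 127, and 0 otherwise. */
int bytes_are_ascii(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 127) {
            return 0;   /* high bit set: not ASCII */
        }
    }
    return 1;           /* includes the empty string */
}
```

Taking an explicit length rather than relying on a terminating NUL means embedded NUL bytes, which Perl strings may contain, are scanned like any other byte.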

The crucial perlapi functions are:

   * sv_utf8_upgrade -- Converts an SV's string to utf8.  The SV's
     UTF8 flag will end up set no matter what.  There are 3
     possible outcomes.
       o Source SV has UTF8 flag set: no-op.
       o Source SV is pure ASCII: sets UTF8 flag, but no effect on
         string.
       o Source SV does not have UTF8 flag set, has some bytes > 127:
         Converts string to utf8 assuming source encoding of Latin1,
         reallocating as necessary.

   * SvPVutf8 -- like SvPV, but converts the SV to utf8 first if
     necessary.

   * is_utf8_string -- Tests if a char* sequence of a length you
     specify is valid utf8.  Use this when you don't have access to or
     don't want to trust a scalar's UTF8 flag.
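
To make the third sv_utf8_upgrade outcome concrete, here is a rough standalone sketch of a Latin1-to-UTF-8 conversion. It is not Perl's implementation; latin1_to_utf8 is a hypothetical helper that works on a raw buffer rather than an SV, and it allocates with malloc where Perl would grow the SV's own buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: treat src[0..len) as Latin1 and return a freshly
 * malloc'd, NUL-terminated UTF-8 copy.  The UTF-8 length is stored in
 * *out_len.  Returns NULL if allocation fails. */
unsigned char *latin1_to_utf8(const unsigned char *src, size_t len,
                              size_t *out_len)
{
    assert(out_len != NULL);
    /* Worst case: every byte is > 127 and grows to two bytes. */
    unsigned char *dst = malloc(2 * len + 1);
    size_t j = 0;

    if (dst == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < len; i++) {
        unsigned char b = src[i];
        if (b < 0x80) {
            dst[j++] = b;                                   /* ASCII: unchanged */
        } else {
            dst[j++] = (unsigned char)(0xC0 | (b >> 6));    /* lead byte */
            dst[j++] = (unsigned char)(0x80 | (b & 0x3F));  /* continuation */
        }
    }
    dst[j] = '\0';
    *out_len = j;
    return dst;
}
```

Pure ASCII input comes back byte-for-byte identical, which matches the second outcome in the list: only the flag would change, not the string.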

None of them are directly equivalent to what you're doing.  However,  
using those (and a few others), I believe I've gotten KS to the point  
where it handles all Perl character data transparently.

All output from KS is valid UTF-8 and has the UTF8 flag set, but you  
wouldn't know that.  Perl transparently downgrades everything to  
Latin1 when it needs to.  If there's a code point > 255, you might  
see a "wide character in print" warning when printing to a filehandle  
which thinks it's Latin1 (as STDOUT does by default), but you'd only  
see that if you were supplying KS with something other than Latin1 to  
begin with.
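
The downgrade step can be pictured with a sketch like the following. It is a simplification under stated assumptions: utf8_to_latin1 is a hypothetical helper, and it just reports failure when it meets a code point > 255, whereas Perl itself emits the "wide character" warning and writes the UTF-8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert UTF-8 in src[0..len) to Latin1 in dst,
 * which must have room for at least len bytes.  Returns the number of
 * bytes written, or -1 if the input holds a code point > 255 (or a
 * sequence this sketch does not recognise). */
long utf8_to_latin1(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t i = 0, j = 0;

    assert(dst != NULL);
    while (i < len) {
        unsigned char b = src[i];
        if (b < 0x80) {
            dst[j++] = b;                       /* ASCII passes through */
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < len
                   && (src[i + 1] & 0xC0) == 0x80) {
            /* Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF. */
            dst[j++] = (unsigned char)(((b & 0x03) << 6) | (src[i + 1] & 0x3F));
            i += 2;
        } else {
            return -1;                          /* does not fit in Latin1 */
        }
    }
    return (long)j;
}
```

Text that started out as Latin1 always round-trips through this path, which is why the warning only shows up when something outside Latin1 went in.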

Here's XS code for StringHelper::utf8ify:

     void
     utf8ify(sv)
         SV *sv;
     PPCODE:
         /* Upgrade in place: the caller's SV ends up UTF8-flagged. */
         sv_utf8_upgrade(sv);

> You'll notice I don't handle Latin1 or EBCDIC as the utf8::upgrade()
> claims to.

EBCDIC isn't worth worrying about, IMO -- too few machines use it.

Handling Latin1 might be easier than you think -- you just have to be  
consistent about using SvPVutf8 instead of SvPV so that all entry  
points into your own string handling code convert if necessary.  You  
don't have to convert back -- just give Perl SVs with valid UTF-8  
strings and the UTF8 flag on, and Perl will behave correctly (modulo  
esoteric weirdness in pack/unpack and the regex engine).
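
If the bytes come from somewhere other than Perl, it may be worth checking that they really are valid UTF-8 before turning the flag on. Below is a minimal sketch of such a check in plain C. utf8_is_well_formed is a hypothetical name; it is not Perl's is_utf8_string and may accept or reject slightly different input (this version refuses overlong forms, surrogates, and anything above U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical checker: returns 1 if buf[0..len) is well-formed UTF-8
 * (1 to 4 byte sequences, no overlong forms, no surrogates, nothing
 * above U+10FFFF), and 0 otherwise. */
int utf8_is_well_formed(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;          /* stray continuation or invalid lead byte */

        if (len - i - 1 < need) return 0;           /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return 0;       /* not a continuation */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return 0;                     /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate */
        i += need + 1;
    }
    return 1;
}
```

A lone Latin1 byte such as 0xE9 fails this check, so Latin1 input that was never converted gets caught instead of being flagged as UTF-8 by mistake.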

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
