[KinoSearch] index dump

Brian Phillips bpphillips+ml at gmail.com
Tue Apr 25 19:46:24 PDT 2006


Here's a rough port of Plucene's dump_index script.  I had to guess a little
bit about how stuff worked, but I think the output is pretty much the same as
what Plucene provided: it lists the fields, then lists each term and the
corresponding document IDs that contain that term.

use strict;
use warnings;

use KinoSearch::Index::IndexReader;
use KinoSearch::Index::Term;

my $where = shift @ARGV;
if( !$where || ! -e $where ){
    die "please specify an index location at the command line\n";
}

my $r = KinoSearch::Index::IndexReader->new( invindex => $where );

my @readers =
  ref $r->{sub_readers} eq 'ARRAY' ? @{ $r->{sub_readers} } : $r;
print "We have " . @readers . " readers\n";

print "\n\nDocuments:\n";
for my $reader (@readers) {
    print "Segment "
      . $reader->{seg_name} . " has "
      . $reader->max_doc
      . " docs\n";
    my @terms = $reader->terms;
    print "Fields:\n";
    my %fields;
    for my $field ( $reader->{finfos}->get_infos ) {
        $fields{ $field->get_field_num } = $field->get_name;
        print "\t" . $field->get_field_num . ": " . $field->get_name;
        my @info;
        foreach my $i (
            qw(indexed stored analyzed vectorized binary compressed) )
        {
            my $method = "get_$i";
            push @info, $i if ( $field->$method );
        }
        print " [" . join( ',', map { substr( $_, 0, 1 ) } sort @info ) . "]"
          if (@info);
        print "\n";
    }
    print "Terms:\n";
    my $td = $reader->term_docs;
    for my $t (@terms) {
        while ( $t->next ) {
            my $term =
              KinoSearch::Index::Term->new_from_string( $t->get_termstring,
                $t->_get_finfos );
            print $term->get_field . ": " . $term->get_text . "\n";
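            # Position the TermDocs iterator on this term, then pull its
            # postings in one batch as packed strings of native unsigned ints.
            $td->seek($term);
            my ( $docs, $freqs );
            my $num_got = $td->read( $docs, $freqs, $r->max_doc );
            my @docs = unpack( 'I*', $docs );
            my @tf_ds = unpack( 'I*', $freqs );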
            for my $i ( 0 .. $#docs ) {
                print "\t Doc "
                  . $docs[$i] . " ("
                  . $tf_ds[$i]
                  . " occurrences)\n";
            }
        }
    }
}
print "Total documents: " . $r->max_doc . " in " . @readers . " segments\n";
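
In case the unpack calls in the script look opaque: the script treats the two
buffers filled by read() as packed strings of native unsigned ints, one entry
per matching document.  Here is a minimal standalone sketch of just that
decoding step.  The buffer contents are made-up stand-ins built with pack(),
not real KinoSearch output, and the packed layout is only what the script
above assumes.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the buffers that $td->read() fills in.
# Assumption: each buffer is a string of native unsigned ints ('I*').
my $docs  = pack( 'I*', 3, 7, 42 );    # document numbers
my $freqs = pack( 'I*', 1, 2, 5 );     # term frequency in each document

# Decode both buffers into ordinary Perl lists.
my @doc_nums = unpack( 'I*', $docs );
my @counts   = unpack( 'I*', $freqs );

# Pair each document number with its frequency, as the dump script does.
my @lines =
  map { "Doc $doc_nums[$_] ($counts[$_] occurrences)" } 0 .. $#doc_nums;
print "\t $_\n" for @lines;
```

The two lists line up by position, so index $i in one belongs with index $i
in the other.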


On 4/17/06, Marvin Humphrey <marvin at rectangular.com> wrote:
>
>
> On Apr 17, 2006, at 7:05 PM, Brian Phillips wrote:
>
> > Is there any way (for debugging purposes) to dump the contents of
> > the inverted index in a semi-readable form?
>
> I haven't got anything written.
>
> > I'm trying to determine why certain queries return no results and I
> > seem to recall Plucene had something like this that allowed you to
> > dump out what you had indexed (in its analyzed form).
>
> You are presumably referring to this:
>
>      http://search.cpan.org/~tmtm/Plucene-1.24/bin/dump_index
>      http://search.cpan.org/src/TMTM/Plucene-1.24/bin/dump_index
>
> A KinoSearch version would work very similarly, though things are
> named differently.  I'll try to get to this soon, unless someone
> volunteers to port it.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>