[KinoSearch] index dump
Brian Phillips
bpphillips+ml at gmail.com
Tue Apr 25 19:46:24 PDT 2006
Here's a rough port of Plucene's dump_index script. I had to guess a little
about how things work internally, but I think the output is pretty much the
same as what Plucene provided: it lists the fields, then lists each term
along with the IDs of the documents that contain that term:
use strict;
use warnings;
use KinoSearch::Index::IndexReader;
use KinoSearch::Index::Term;    # used below; loaded explicitly to be safe

my $where = shift @ARGV;
if ( !$where || !-e $where ) {
    die "please specify an index location at the command line\n";
}

my $r = KinoSearch::Index::IndexReader->new( invindex => $where );

# Peek at internals: use the per-segment sub_readers if there are any,
# otherwise treat $r itself as the only reader.
my @readers
    = ref( $r->{sub_readers} ) eq 'ARRAY' ? @{ $r->{sub_readers} } : ($r);
print "We have " . @readers . " readers\n";
print "\n\nDocuments:\n";

for my $reader (@readers) {
    print "Segment "
        . $reader->{seg_name} . " has "
        . $reader->max_doc
        . " docs\n";
    my @terms = $reader->terms;

    # List each field, followed by the first letter of every flag it has set.
    print "Fields:\n";
    my %fields;
    for my $field ( $reader->{finfos}->get_infos ) {
        $fields{ $field->get_field_num } = $field->get_name;
        print "\t" . $field->get_field_num . ": " . $field->get_name;
        my @info;
        foreach my $i (qw(indexed stored analyzed vectorized binary compressed)) {
            my $method = "get_$i";
            push @info, $i if ( $field->$method );
        }
        print " [" . join( ',', map { substr( $_, 0, 1 ) } sort @info ) . "]"
            if (@info);
        print "\n";
    }

    # List each term and the documents that contain it.
    print "Terms:\n";
    my $td = $reader->term_docs;
    for my $t (@terms) {
        while ( $t->next ) {
            my $term = KinoSearch::Index::Term->new_from_string(
                $t->get_termstring, $t->_get_finfos );
            print $term->get_field . ": " . $term->get_text . "\n";
            $td->seek($term);

            # read() appears to fill $docs and $freqs with packed integers.
            my ( $docs, $freqs );
            my $num_got = $td->read( $docs, $freqs, $r->max_doc );
            my @docs  = unpack( 'I*', $docs );
            my @tf_ds = unpack( 'I*', $freqs );
            for my $i ( 0 .. $#docs ) {
                print "\t Doc "
                    . $docs[$i] . " ("
                    . $tf_ds[$i]
                    . " occurrences)\n";
            }
        }
    }
}
print "Total documents: " . $r->max_doc . " in " . @readers . " segments\n";
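
The unpack step is the part I'm least sure about. The script assumes that
read() fills $docs and $freqs with packed native unsigned integers, which is
why it decodes both with unpack('I*', ...). Here is a minimal standalone
sketch of that decoding, using hand-packed buffers in place of a real index.
Note that decode_postings is a hypothetical helper made up for illustration,
not part of KinoSearch, and the buffer layout is my assumption:

```perl
use strict;
use warnings;

# Hypothetical helper (not a KinoSearch function): turn two packed
# buffers of native unsigned ints into a list of [doc_id, freq] pairs.
sub decode_postings {
    my ( $docs_buf, $freqs_buf ) = @_;
    my @docs  = unpack( 'I*', $docs_buf );
    my @freqs = unpack( 'I*', $freqs_buf );
    die "buffer length mismatch\n" unless @docs == @freqs;
    return map { [ $docs[$_], $freqs[$_] ] } 0 .. $#docs;
}

# Stand-in data: documents 3 and 7, with 1 and 4 occurrences.
my $docs_buf  = pack( 'I*', 3, 7 );
my $freqs_buf = pack( 'I*', 1, 4 );

for my $posting ( decode_postings( $docs_buf, $freqs_buf ) ) {
    my ( $doc, $freq ) = @$posting;
    print "\t Doc $doc ($freq occurrences)\n";
}
```

If the real buffers use a different integer width or byte order, the 'I*'
template would need to change to match.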
On 4/17/06, Marvin Humphrey <marvin at rectangular.com> wrote:
>
>
> On Apr 17, 2006, at 7:05 PM, Brian Phillips wrote:
>
> > Is there any way (for debugging purposes) to dump the contents of
> > the inverted index in a semi-readable form?
>
> I haven't got anything written.
>
> > I'm trying to determine why certain queries return no results and I
> > seem to recall Plucene had something like this that allowed you to
> > dump out what you had indexed (in its analyzed form).
>
> You are presumably referring to this:
>
> http://search.cpan.org/~tmtm/Plucene-1.24/bin/dump_index
> http://search.cpan.org/src/TMTM/Plucene-1.24/bin/dump_index
>
> A KinoSearch version would work very similarly, though things are
> named differently. I'll try to get to this soon, unless someone
> volunteers to port it.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>