[Kinosearch] Sort::External presort and postsort feature
Ken Clarke
perlprogrammer at shaw.ca
Thu Jul 21 18:16:23 PDT 2005
Hi Folks,
Marvin suggested I join this list to continue a discussion we are having
regarding feature suggestions I have.
I have customized Sort::External 0.05 by adding support for providing
"presort" and "postsort" subroutine references. Here's why:
Each line of the file I am sorting is actually a record of multiple
fields. I want to be able to sort and subsort individual fields. I also
need to decode some individual fields so that sort compares actual field
values rather than there encoded representations. Currently, it is possible
to accomplish this with:
# typical record line: 123|text field 1|text field 2|456
my @A = my @B = (); # mod_perl safe
my $sortsub = sub {
@A = split(/\|/, $Sort::External::a, -1);
@B = split(/\|/, $Sort::External::b, -1);
map { $_ = &Decode($_) } @A[1..2], @B[1..2]; # only text fields need to
be decoded
return $A->[1] cmp $B->[1] || # case-sensitive asciibetical
$B->[2] cmp $A->[2] || # case-sensitive reverse asciibetical
$A-> [3] <=> $B->[3] || # numberic ascending
$B->[4] <=> $A->[4] # numberic descending
}
Since sort compares a given line many times, the split and decode
operations will also be called many times for each line. My solutions
significantly reduces this overhead. For example, on a test file with over
300,000 records (which required 3,200,000 calls to the sort subroutine for
both the method shown above and my modified method below), the number of
calls to the decode subroutine were reduced from 11 million (an average of
35 times per line) to only 600,000 (2 times per line, which means that each
field is only decoded once). It reduces the runtime from almost 15 minutes
to less than 1 minute! Certainly there are other factors which need to be
considered, such as -cache_size and the number of records in the test
datafile, but this isn't ment as an exhaustive examination of efficiency
gains. It simply illustrates that considerable gains are possible.
Here's my implementation strategy:
my $line = '';
my @record = ();
my $presortsub = sub {
$line = $_[0];
chomp($line);
@record = split(/\|/, $line, -1);
map { $_ = &Decode($_) } @record[1..2];
$_[0] = [@record];
};
my $record_ref = '';
my $postsortsub = sub {
$record_ref = $_[0];
map { $_ = &Encode($_) } @{$record_ref}[1..2];
$line = join('|', @{$record_ref});
$_[0] = "$line\n";
};
# changes to Sort::External
# add presort and postsort to defaults
@init_defaults{qw{-presortsub -postsortsub}} = (undef, undef);
# this is only one example of wraping the sort call from sub
_write_input_cache_to_tempfile - I added it to all of the calls
# at the beginning of the sub
my $presortsub = $self->{-presortsub};
my $sortsub = $self->{-sortsub}; # this was already there
my $postsortsub = $self->{-postsortsub};
# here is the wrap
map { &{$self->{-presortsub}}($_) } @$input_cache if defined($presortsub);
@$input_cache = defined $sortsub ?
(sort $sortsub @$input_cache) :
(sort @$input_cache);
map { &{$self->{-postsortsub}}($_) } @$input_cache if defined($postsortsub);
# the call to the new method then becomes
my %Sort_External_args = (
-presortsub => $presortsub,
-sortsub => $sortsub,
-postsortsub => $postsortsub
);
my Sort::External->new(%Sort_External_args);
I would like to discuss pros and cons of this implementation strategy in
more detail, but this should be enough to wet your appetite. Please let me
know your thoughts on this suggestion.
>> Ken Clarke
>> Contract Web Programmer / E-commerce Technologist
>> www.Data-Maestro.com
_______________________________________________
Kinosearch mailing list
Kinosearch at rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
More information about the kinosearch
mailing list