[KinoSearch] Aggregates/Grouping possible?
forum at s05.tinita.de
Fri Feb 19 08:38:07 PST 2010
On Fri, 19 Feb 2010, Peter Karman wrote:
> Tina Müller wrote on 02/19/2010 08:16 AM:
>> So I get back a great many rows, and I was wondering if KinoSearch allows
>> something like:
>> select sum(number_field) ... where ... group by bar
>> I searched for "aggregates" but didn't find anything.
>> Does the KinoSearch index structure allow such grouping and summing up?
>> I also searched for this feature in Xapian and only found an entry that
>> grouping (or "collapsing") is possible, but only a count for the rows,
>> not a summing up of a field.
>> Do you have any hints if this is possible in KinoSearch or maybe in
> Sounds like you're trying to provide what is typically called "faceted
> search" where search results are accompanied by statistical (count)
> information about certain fields (facets).
Yes, indeed, except that I need not only counts but also sums (and possibly
other aggregates such as "average").
> Xapian does have this feature, via what is called the MatchSpy.
As far as I can see, it can only count, not sum.
My data basically represents something like access-log statistics
from a web server: for example, I have values such as hits and page views
per day and user agent.
So in MySQL I do something like
select sum(hits), sum(pageviews) ... where date between ... group by useragent
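For illustration, here is a minimal sketch of what that query amounts to when it has to be done client-side over the rows returned by a search. It is written in Python rather than Perl purely as an example; the function name `sum_by` and the sample rows are hypothetical, and nothing here is KinoSearch or Xapian API.

```python
from collections import defaultdict

def sum_by(rows, key, fields):
    """Group rows by `key` and sum each of `fields` (like GROUP BY + SUM)."""
    totals = defaultdict(lambda: dict.fromkeys(fields, 0))
    for row in rows:
        bucket = totals[row[key]]
        for field in fields:
            bucket[field] += row[field]
    return dict(totals)

# Hypothetical rows, shaped like stored fields fetched from search hits.
rows = [
    {"useragent": "Firefox", "hits": 10, "pageviews": 4},
    {"useragent": "Opera", "hits": 3, "pageviews": 1},
    {"useragent": "Firefox", "hits": 5, "pageviews": 2},
]
totals = sum_by(rows, "useragent", ["hits", "pageviews"])
# totals["Firefox"] is {"hits": 15, "pageviews": 6}
```

An average would only need a per-group row count kept alongside the sums.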
> KS does not (yet) have a feature like this, AFAIK. I have implemented it using
> KS though, in the same way I do for Swish-e, by iterating over all the
> results for a query (or where result sets are huge and accuracy not as
> important, up to a pre-defined max to extract a suitable sample size)
> and then caching the facet counts. So the first time a query is run, the
> performance hit is paid, but after that, my code checks the cache for
> the query and uses those numbers instead. Set the cache ttl based on
> your business needs.
I'm doing something similar. I cache the result of each specific query
in memcached. That's a great help, but the user can select so many
different parameters (by useragent, by time of day, by url prefix, ...)
and timeframes that the cache is rarely hit.
(Of course, there is a default timeframe so the start page of the
frontend is usually very fast when already in cache.)
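The caching pattern described here can be sketched roughly as follows. This is an assumption-laden illustration in Python: `TTLCache` is a hypothetical in-memory stand-in for memcached, and `cache_key` and `cached_aggregate` are made-up helper names, not part of any library mentioned in this thread.

```python
import hashlib
import json
import time

class TTLCache:
    """In-memory stand-in for memcached: entries expire after `ttl` seconds."""
    def __init__(self, ttl, clock=time.time):
        self.ttl = ttl
        self.clock = clock
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires, value = entry
        if expires <= self.clock():
            del self.store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self.store[key] = (self.clock() + self.ttl, value)

def cache_key(params):
    """Serialize with sorted keys so parameter order does not change the key."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def cached_aggregate(cache, params, compute):
    """Return the cached result for `params`, computing it once on a miss."""
    key = cache_key(params)
    result = cache.get(key)
    if result is None:
        result = compute(params)
        cache.set(key, result)
    return result
```

Normalizing the parameters before hashing at least ensures that the same selection always maps to the same key; it does not help with the underlying problem that many distinct parameter combinations each need their own entry.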
Since I really need accurate data, I have to fetch all rows; that's
why I thought about adding additional indices for longer timeframes.
Advantage: fewer rows, less memory, less calculating in Perl.
Disadvantage: more indices to generate.
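As a rough sketch of what generating such a coarser index could involve, the following collapses per-day rows into per-month rows before they are indexed. It is Python for illustration only; the function name `roll_up_monthly`, the field names, and the assumption that dates are stored as ISO `YYYY-MM-DD` strings are all hypothetical.

```python
from collections import defaultdict

def roll_up_monthly(daily_rows):
    """Collapse per-day rows into per-month rows, keeping the useragent."""
    totals = defaultdict(lambda: {"hits": 0, "pageviews": 0})
    for row in daily_rows:
        month = row["date"][:7]  # assumes ISO dates: "2010-02-19" -> "2010-02"
        bucket = totals[(month, row["useragent"])]
        bucket["hits"] += row["hits"]
        bucket["pageviews"] += row["pageviews"]
    # One output row per (month, useragent) pair, in sorted order.
    return [
        {"month": month, "useragent": agent, **sums}
        for (month, agent), sums in sorted(totals.items())
    ]
```

A query whose timeframe does not start and end on month boundaries would still need the daily rows for the partial months at either end, so the coarse index would supplement the daily one rather than replace it.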
> Search::OpenSearch (on CPAN) will be including this facet+cache feature
> as soon as I have time to write it.
Thanks, I'll have a look at it.