On 1/4/07, Luis Neves <[EMAIL PROTECTED]> wrote:
> Yonik Seeley wrote:
> > Off the top of my head, one could use a priority queue that can change
> > its size dynamically. One could increment a group count for each hit
> > (like faceted search with the FieldCache), and if the group count
> > exceeds "n", then you increment the size of the priority queue to
> > allow an additional item to be collected to compensate.
> >
> > -Yonik
> You might as well say that I have to change the dilithium crystals in the flux
> capacitor :-)
Heh...
When someone asks for the top 10 documents, we create a priority queue
of size 10 and put all of the hits through it (with a performance
shortcut if the only sort is by score). After we are all done, the
queue contains the top 10 documents by the sort criteria.
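To make the mechanics concrete, here is a minimal sketch in plain Java (no Lucene types; the class and method names are made up for illustration) of collecting the top n hits with a fixed-size min-heap:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of fixed-size top-n collection: a min-heap whose head is always
// the weakest of the current top n, so a new hit only has to beat the
// head to get in.
class TopNCollector {
    static List<Double> topN(double[] scores, int n) {
        PriorityQueue<Double> pq = new PriorityQueue<>(n);
        for (double s : scores) {
            if (pq.size() < n) {
                pq.add(s);          // queue not full yet
            } else if (s > pq.peek()) {
                pq.poll();          // evict the weakest of the current top n
                pq.add(s);
            }
        }
        // Drain and reverse to get best-first order for display.
        List<Double> out = new ArrayList<>(pq);
        out.sort(Collections.reverseOrder());
        return out;
    }
}
```

Most hits are rejected with a single comparison against the head, which is why this stays cheap even for large result sets.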
Now let's say we are limiting the number of results from any "site" to 2.
If adding another document to the priority queue would make it the
3rd from a specific site, there are two things we could do:
1) remove the lowest-ranking of the 3 documents matching that site
2) increase the size of the priority queue to 11, since we will be
throwing one of the documents away later.
At first blush, option (2) seemed easier to me, with the added step of
discarding the extra documents as you pull them from the queue.
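A rough sketch of option (2), again in plain Java with invented names (`CollapsingCollector`, `Hit`): the queue's logical capacity grows each time a site exceeds the collapse limit, and the surplus docs are discarded while draining. A real implementation would also shrink the capacity when a doc from an over-limit site is later evicted; that bookkeeping is noted but omitted here.

```java
import java.util.*;

class CollapsingCollector {
    record Hit(int doc, String site, double score) {}

    // Option (2): keep one unbounded min-heap but cap its size manually,
    // letting the cap grow whenever a site exceeds `collapseNum` entries.
    static List<Integer> collect(List<Hit> hits, int topN, int collapseNum) {
        int capacity = topN;
        Map<String, Integer> perSite = new HashMap<>(); // docs per site in queue
        PriorityQueue<Hit> pq =
            new PriorityQueue<>(Comparator.comparingDouble(Hit::score));
        for (Hit h : hits) {
            int count = perSite.merge(h.site(), 1, Integer::sum);
            if (count > collapseNum) {
                capacity++; // one queued doc will be thrown away later
            }
            pq.add(h);
            if (pq.size() > capacity) {
                Hit evicted = pq.poll();
                perSite.merge(evicted.site(), -1, Integer::sum);
                // a real impl would shrink `capacity` here if the eviction
                // made an earlier capacity bump unnecessary
            }
        }
        // Drain best-first, keeping at most collapseNum docs per site.
        List<Hit> drained = new ArrayList<>(pq);
        drained.sort(Comparator.comparingDouble(Hit::score).reversed());
        Map<String, Integer> kept = new HashMap<>();
        List<Integer> out = new ArrayList<>();
        for (Hit h : drained) {
            if (kept.merge(h.site(), 1, Integer::sum) <= collapseNum
                    && out.size() < topN) {
                out.add(h.doc());
            }
        }
        return out;
    }
}
```

Note this particular sketch keeps pure score order and never moves a doc up into its group, which is only one of the possible semantics discussed below.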
> One of the reasons I like Solr so much is that I get impressive results
> without having to know Lucene, which is something that will have to change
> because I also need this feature.
> Not knowing much about the internals of Solr/Lucene, I had a look at the Facet
> code in search of ideas, but from what I could see the facet counts are
> calculated after the documents are added to the response. It seems to me that
> any kind of grouping has to be done before that... right?
Right.
> Could you explain in more detail where I should look?
> Can the TopFieldDocCollector/TopFieldDocs classes be used to this end?
That's currently how the top docs are collected in Lucene (these
separate classes were added later, and Solr doesn't currently use
them).
SolrIndexSearcher.getDocListNC() is the lowest level of doc collection
that would need to be modified or duplicated.
> Side note: over here, besides Solr, we also use the "FAST" search platform,
> and they call this feature "Field collapsing":
> <http://www.fastsearch.com/glossary.aspx?m=48&amid=299>
> I like the syntax they use:
> "&collapseon=<fieldname>&collapsenum=N" -> collapse, but keep N collapsed
> documents
> For some reason they can only collapse on numeric fields (int32).
Cool, thanks for the reference.
There are still some things underspecified though.
Let's take an example of collapseon=site, collapsenum=2
The list of un-collapsed matches and their relevancy scores (sort order) is:
doc=51, site=A, score=100
doc=52, site=B, score=90
doc=53, site=C, score=80
doc=54, site=B, score=70
doc=55, site=D, score=60
doc=56, site=E, score=50
doc=57, site=B, score=40
doc=58, site=A, score=30
1) If I ask for the top 4 docs, should I get [51,52,53,54] or
[51,52,54,53]? Are lower-ranking docs moved up in the rankings to be
in their higher-ranking "group"?
2) If I ask for the top 3 docs, should I get [51,52,53] because those
are the top 3 scoring docs, or should I get [51,58,52] because
documents were first grouped and then ranked (and 51 and 58 go
together)? Another way of asking this is related to (1): should docs
outside the "window" be moved up in the rankings to be in their higher
ranking "group"?
3) Should the number of documents in a "group" change the relevancy?
Should site=B rank higher than site=A?
4) Is the collapsing applied across the full result set, or just within a
page of results? If I ask for docs 4 through 7, should doc 57 be in
that list or not?
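Under one possible reading (collapse the fully sorted result list, keep pure score order, and globally drop the (N+1)-th and later docs of each site), the questions above resolve mechanically. A tiny sketch with invented names:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One interpretation of collapseon/collapsenum: walk the score-sorted
// hit list once, keeping at most `collapseNum` docs per site and
// preserving the original order. Dropped docs vanish from every page.
class Collapse {
    static List<Integer> collapse(int[] docs, String[] sites, int collapseNum) {
        Map<String, Integer> seen = new HashMap<>();
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            if (seen.merge(sites[i], 1, Integer::sum) <= collapseNum) {
                out.add(docs[i]);
            }
        }
        return out;
    }
}
```

On the example data with collapsenum=2, this yields [51, 52, 53, 54, 55, 56, 58]: doc 57 (the third site=B hit) is dropped everywhere, so paging stays consistent, but lower-ranking docs are never moved up into their group. Whether that is the desired semantics is exactly what the questions above leave open.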
Defining things to make sense while retaining the ability to page
through the results seems to be the challenge.
-Yonik