On 1/5/07, Luis Neves <[EMAIL PROTECTED]> wrote:
Yonik Seeley wrote:
> There are still some things underspecified though.
>
> Let's take an example of collapseon=site, collapsenum=2
>
> The list of un-collapsed matches and their relevancy scores (sort order)
> is:
> doc=51, site=A, score=100
> doc=52, site=B, score=90
> doc=53, site=C, score=80
> doc=54, site=B, score=70
> doc=55, site=D, score=60
> doc=56, site=E, score=50
> doc=57, site=B, score=40
> doc=58, site=A, score=30
>
> 1) If I ask for the top 4 docs, should I get [51,52,53,54] or
> [51,52,54,53]? Are lower ranking docs moved up in the rankings to be
> in their higher ranking "group"?
The docs move up the ranking.
After thinking about this a little further (since someone submitted a
patch), I think this makes things significantly more expensive.
The issue is that even if you are only interested in the top 10 docs,
you can't use the normal priority queue method to discard low scores,
because the last document you score could be very high scoring, and be
in the same group as the lower previously-discarded scores.
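To make the problem concrete, here is a minimal sketch in Python (not Solr code). The function name bounded_top_n and the four sample docs are made up for illustration, and it assumes docs are collected in doc id order rather than score order.

```python
import heapq

def bounded_top_n(docs, n):
    """Keep only the n highest-scoring docs using a bounded min-heap.

    docs is an iterable of (doc, site, score) in collection order.
    Anything that falls out of the heap is gone for good.
    """
    heap = []  # min-heap of (score, doc, site); heap[0] is the weakest kept entry
    for doc, site, score in docs:
        entry = (score, doc, site)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # evict the weakest entry
    return sorted(heap, reverse=True)

# Hypothetical data: doc 1 (site X) is seen first with a low score,
# and doc 4 (also site X) is seen last with the highest score.
docs = [
    (1, "X", 10),
    (2, "Y", 50),
    (3, "Z", 40),
    (4, "X", 100),
]

kept = bounded_top_n(docs, 2)
kept_docs = [doc for _score, doc, _site in kept]
print(kept_docs)
```

With n=2 the queue ends up holding docs 4 and 2. Doc 1 was evicted before doc 4 was scored, so there is nothing left to move up next to doc 4, even though both are in site X.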
One way is to keep a priority queue per field value (very expensive if
there are many field values).
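A rough sketch of the queue-per-field-value idea, again in Python rather than Solr code. The name collapse is hypothetical, and it assumes "move up" means each group is ordered by its best score and a group's docs are listed together, at most collapsenum per group.

```python
import heapq
from collections import defaultdict

def collapse(docs, collapsenum):
    """Collapse (doc, site, score) tuples using one bounded min-heap per site."""
    groups = defaultdict(list)  # site -> min-heap of (score, doc)
    for doc, site, score in docs:
        heap = groups[site]
        entry = (score, doc)
        if len(heap) < collapsenum:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # drop the weakest doc of this site
    # Sort each group best-first, then order the groups by their best entry.
    ranked = sorted(
        (sorted(heap, reverse=True) for heap in groups.values()),
        key=lambda group: group[0],
        reverse=True,
    )
    return [doc for group in ranked for _score, doc in group]

# The example list from earlier in this thread.
DOCS = [
    (51, "A", 100), (52, "B", 90), (53, "C", 80), (54, "B", 70),
    (55, "D", 60), (56, "E", 50), (57, "B", 40), (58, "A", 30),
]

print(collapse(DOCS, 2))
```

The cost shows up in the groups dict: it holds one heap per distinct site for the whole collection pass, regardless of how few docs were asked for. Under the assumption above, doc 58 also moves up next to doc 51 on the example data, since both are site A.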
Another way is to use two phases... the first collects the top n
documents, and the second grabs
Another issue is how to implement start + offset.
-Yonik