Yonik Seeley wrote:

There are still some things underspecified though.

Let's take an example of collapseon=site, collapsenum=2

The list of un-collapsed matches and their relevancy scores (sort order) is:
doc=51, site=A, score=100
doc=52, site=B, score=90
doc=53, site=C, score=80
doc=54, site=B, score=70
doc=55, site=D, score=60
doc=56, site=E, score=50
doc=57, site=B, score=40
doc=58, site=A, score=30

1)  If I ask for the top 4 docs, should I get [51,52,53,54] or
[51,52,54,53].  Are lower ranking docs moved up in the rankings to be
in their higher ranking "group"?

The docs move up the ranking.
You should get [51,58,52,54]. Alternatively, one could make the case that you should get
[51,58,52,54,53,55], which is roughly the behaviour of a SQL "quota query": in that case the "top 4" would refer not to the number of documents but to the number of distinct values of the field you are collapsing on.
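
To make the two readings concrete, here is a rough sketch in Python. It is not Lucene or FAST code; the collapse() function and the variable names are made up for illustration. It assumes the matches are already sorted by score and that a group is ranked by its best document.

```python
# (doc id, site, score), already sorted by descending score
MATCHES = [
    (51, "A", 100), (52, "B", 90), (53, "C", 80), (54, "B", 70),
    (55, "D", 60), (56, "E", 50), (57, "B", 40), (58, "A", 30),
]

def collapse(matches, collapse_num):
    """Group docs by site, keeping at most collapse_num docs per site.

    Groups appear in the order of their best-scoring doc, so lower-ranked
    docs of a site move up next to that site's top doc.
    """
    groups = {}    # site -> kept doc ids; dicts keep insertion order (3.7+)
    dropped = []   # docs beyond the per-site limit
    for doc, site, _score in matches:
        kept = groups.setdefault(site, [])
        if len(kept) < collapse_num:
            kept.append(doc)
        else:
            dropped.append(doc)
    return groups, dropped

groups, dropped = collapse(MATCHES, collapse_num=2)

# Reading 1: "top 4" counts documents.
flat = [doc for docs in groups.values() for doc in docs]
top_4_docs = flat[:4]

# Reading 2 ("quota query"): "top 4" counts distinct sites.
top_4_sites = [doc for docs in list(groups.values())[:4] for doc in docs]

print(top_4_docs)   # [51, 58, 52, 54]
print(top_4_sites)  # [51, 58, 52, 54, 53, 55]
```

With the same assumptions, asking for the top 3 docs gives flat[:3], that is [51, 58, 52].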


2)  If I ask for the top 3 docs, should I get [51,52,53] because those
are the top 3 scoring docs, or should I get [51,58,52] because
documents were first grouped and then ranked (and 51 and 58 go
together)?  Another way of asking this is related to (1): should docs
outside the "window" be moved up in the rankings to be in their higher
ranking "group"?

See above.



3) Should the number of documents in a "group" change the relevancy?
Should site=B rank higher than site=A?

I don't think so... don't know if that is what *should* be done, but that's not what FAST does.


4) Is the collapsing only in the returned results, or just within a
page of results.  If I ask for docs 4 through 7, should doc 57 be in
that list or not?

With FAST that is an option. The default behaviour is to remove the collapsed documents from the result set, so doc 57 would not be in the list. You can also choose not to remove them, in which case they are presented last.
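
A small sketch of those two modes applied to paging, again in Python and again only an illustration: page() and keep_collapsed are hypothetical names, not FAST or Lucene options, and positions are taken to be 1-based.

```python
# (doc id, site, score), already sorted by descending score
MATCHES = [
    (51, "A", 100), (52, "B", 90), (53, "C", 80), (54, "B", 70),
    (55, "D", 60), (56, "E", 50), (57, "B", 40), (58, "A", 30),
]

def page(matches, collapse_num, first, last, keep_collapsed=False):
    """Return docs at 1-based positions first..last after collapsing by site.

    keep_collapsed=False drops docs beyond the per-site limit;
    keep_collapsed=True appends them after all collapsed results.
    """
    groups, dropped = {}, []
    for doc, site, _score in matches:
        kept = groups.setdefault(site, [])
        (kept if len(kept) < collapse_num else dropped).append(doc)
    ranked = [doc for docs in groups.values() for doc in docs]
    if keep_collapsed:
        ranked += dropped
    return ranked[first - 1:last]

removed = page(MATCHES, 2, 4, 7)                        # [54, 53, 55, 56]
appended = page(MATCHES, 2, 4, 7, keep_collapsed=True)  # [54, 53, 55, 56]
last_one = page(MATCHES, 2, 8, 8, keep_collapsed=True)  # [57]
```

Under these assumptions doc 57 is absent from positions 4 through 7 in both modes. When the collapsed docs are kept, it only shows up at position 8, after everything else.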

Defining things to make sense while retaining the ability to page
through the results seems to be the challenge.


I'm beginning to think that this is a little too complex for a first project with Lucene. In my particular case all I want is to group results by category (from a small, predetermined category list), so I think I will just make one request per category and accept the latency.

--
Luis Neves
