On 13/11/2007, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> can you elaborate on your use case ... the only time i've ever seen people
> ask about something like this it was because true facet counts were too
> expensive to compute, so they were doing "sampling" of the first N
> results.
>
> In Solr, Sampling like this would likely be just as expensive as getting
> the full count.


It's not really a performance-related issue; the primary goal is to use the
facet information to determine the most relevant product category for the
particular search being performed.

Generally the facets returned by simple, generic queries are fine for this
purpose (e.g. a search for "nokia" will correctly return "Mobile / Cell
Phone" as the most frequent facet). However, facet data for more specific
searches is not as clear-cut (e.g. "samsung tv", where TVs will appear at
the top of the search results but the query will also match other "samsung"
products such as mobile phones and MP3 players). Obviously I could tweak the
'mm' parameter to fix this particular case, but it wouldn't really solve my
problem.

The theory is that facet information generated from the first 'x' (let's say
100) matches to a query (ordered by score / relevance) will be more accurate
(for the above purpose) than facets obtained over the entire result set.  So
ideally, it would be useful to be able to constrain the size of the DocSet
somehow (as you mention below).
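To make the theory concrete, here is a minimal toy sketch of the idea (not Solr code): the corpus, field name, and scores are all invented for illustration, but they mirror the "samsung tv" case where a weakly matching category dominates the full result set while the top-scoring matches point at the right category.

```python
from collections import Counter

# Toy corpus for a hypothetical "samsung tv" search: many mobile phones and
# MP3 players match weakly, while a smaller set of TVs scores highest.
# All field names and numbers here are invented for illustration.
docs = (
    [{"category": "TV", "score": 9.0 - i * 0.01} for i in range(60)]
    + [{"category": "Mobile / Cell Phone", "score": 4.0 - i * 0.01} for i in range(200)]
    + [{"category": "MP3 Player", "score": 3.0 - i * 0.01} for i in range(150)]
)

def facet_counts(result_set, field="category"):
    """Count facet values over an iterable of matching documents."""
    return Counter(doc[field] for doc in result_set)

# Facets over the entire result set: dominated by weak matches.
full = facet_counts(docs)

# Facets over only the first 100 matches ordered by score.
ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
top_n = facet_counts(ranked[:100])

print(full.most_common(1))   # [('Mobile / Cell Phone', 200)]
print(top_n.most_common(1))  # [('TV', 60)]
```

With this synthetic data the full-set facet picks the wrong category, while the top-100 facet picks "TV" — which is exactly the behaviour being argued for above.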


> matching occurs in increasing order of docid, so even if there was a hook
> to say "stop matching after N docs" those N wouldn't be a good
> representative sample, they would be biased towards "older" documents
> (based on when they were indexed, not on any particular date field)
>
> if what you are interested in is stats on the first N docs according to a
> specific sort (score or otherwise) then you could write a custom request
> handler that executed a search with a limit of N, got the DocList,
> iterated over it to build a DocSet, and then used that DocSet to do
> faceting ... but that would probably take even longer than just using the
> full DocSet matching the entire query.
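The flow suggested in the quote above — search with a limit of N, take the ordered DocList, build an unordered DocSet from it, then facet by intersecting that set with each facet value's doc set — can be sketched outside Solr like this. Everything here (ids, scores, categories) is a stand-in simulation, not the actual Solr API.

```python
from collections import Counter

# Hypothetical index: doc id -> category, plus per-query scores.
# Names and values are invented; this only mirrors the
# DocList -> DocSet -> facet flow described above.
categories = {i: ("TV" if i < 60 else "Mobile / Cell Phone") for i in range(300)}
scores = {i: 9.0 - i * 0.01 for i in range(300)}  # pretend relevance scores

N = 100

# 1. Execute the search with a limit of N: an ordered "DocList" of doc ids.
doc_list = sorted(scores, key=scores.get, reverse=True)[:N]

# 2. Iterate over the DocList to build an unordered "DocSet".
doc_set = set(doc_list)

# 3. Facet over the DocSet: intersect it with each facet value's doc set,
#    roughly analogous to how set-based facet counting works.
facet_sets = {}
for doc_id, cat in categories.items():
    facet_sets.setdefault(cat, set()).add(doc_id)

counts = Counter({cat: len(s & doc_set) for cat, s in facet_sets.items()})
print(counts)  # Counter({'TV': 60, 'Mobile / Cell Phone': 40})
```

As the quoted reply notes, doing this inside a custom request handler adds the cost of materialising the top-N set on top of the normal search, so it may well be slower than faceting the full DocSet directly.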



I was hoping to avoid having to write a custom request handler but your
suggestion above sounds like it would do the trick.  I'm also debating
whether to extract my own facet info from a result set on the client side,
but this would be even slower.

Thanks for your suggestions so far,
Piete
