Hi Tavi,

As I understand it, the scoring formula Lucene (and therefore Solr) uses is
based on a mathematical model (TF-IDF combined with the vector space model)
that has proven effective for general-purpose full-text search.
The real challenge, as you mention, comes when you need to achieve
high-quality scoring for the specific domain you are working in. For
example, a general search portal for songs might need to score songs purely
on search relevance, while a search application for a music publisher might
need to score songs first by relevance, with matched documents boosted
according to the revenue they have generated. The ranking produced by that
second strategy could differ widely from the first.
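
To make that second strategy concrete, here is a minimal sketch (the field
names are hypothetical; I am assuming a numeric "revenue" field in the
index) using the DisMax parser's bf (boost function) parameter:

  q=apple
  &defType=dismax
  &qf=title body
  &bf=log(sum(revenue,1))

Wrapping revenue in log(sum(revenue,1)) damps the boost, so it reorders
documents of comparable relevance without letting revenue swamp the text
score.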

Personally, I can't think of a generic scoring strategy that Solr could
ship out of the box that would cover all of these widely different use
cases. I also don't really agree that tuning Solr, and experimenting for
better scoring quality in general, is fragile or awkward. As the name
suggests, it is a "tuning" process targeted at your specific
environment. :)

On the technical side, in our case we were able to significantly improve
scoring quality (as judged by our domain experts) by using the DisMax
search handler, experimenting with different boost values, function queries
and the mm (minimum-should-match) parameter, and setting omitNorms="true"
on the fields we were having problems with.
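
For illustration, a DisMax handler along these lines (all field names and
values here are made up; the right numbers are exactly what the tuning
process has to discover for your data) would look roughly like this in
solrconfig.xml:

  <requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <!-- per-field boosts: title matches weigh three times as much -->
      <str name="qf">title^3 body</str>
      <!-- require 75% of the query terms to match -->
      <str name="mm">75%</str>
      <!-- mild additive boost from a numeric field -->
      <str name="bf">log(sum(popularity,1))</str>
    </lst>
  </requestHandler>

omitNorms is then set per field in schema.xml, which drops the length
normalization factor for that field:

  <field name="body" type="text" indexed="true" stored="true"
         omitNorms="true"/>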

Regards,
- Savvas


On 8 February 2011 16:23, Tavi Nathanson <tavi.nathan...@gmail.com> wrote:

> Hey everyone,
>
> I have a question about Lucene/Solr scoring in general. There are many
> factors at play in the final score for each document, and very often one
> factor will completely dominate everything else when that may not be the
> intention.
>
> ** The question: might there be a way to enforce strict requirements that
> certain factors take priority over others, and/or that certain factors
> never overtake others? Perhaps a set of rules where one factor is
> considered before another is even examined? Tweaking boost numbers and
> hoping for the best seems imprecise and very fragile. **
>
> To make this more concrete, an example:
>
> We previously added the scores of multi-field matches together via an OR,
> so: score(query "apple") = score(field1:apple) + score(field2:apple). I
> changed that to be more in line with DisMaxParser, namely a max:
> score(query "apple") = max(score(field1:apple), score(field2:apple)). I
> also modified coord so that it only considers actual unique terms ("apple"
> vs. "orange"), rather than terms across multiple fields (field1:apple vs.
> field2:apple).
>
> This seemed like a good idea, but it actually introduced a bug that was
> previously hidden. Suddenly, documents matching "apple" in the title and
> *nothing* in the body were being boosted over documents matching "apple" in
> the title and "apple" in the body! I investigated, and it was due to
> lengthNorm: previously, documents matching "apple" in both title and body
> were getting very high scores and completely overwhelming lengthNorm. Now
> that they were no longer getting *such* high scores, which was beneficial
> in
> many respects, they were also no longer overwhelming lengthNorm. This
> allowed lengthNorm to dominate everything else.
>
> I'd love to hear your thoughts :)
>
> Tavi
>
