Hi Walter,

May I ask a tangential question? I'm curious about the following line you wrote:

> Solr is a vector-space engine. Some early engines (Verity VDK) were probabilistic engines. Those do give an absolute estimate of the relevance of each hit. Unfortunately, the relevance of results is just not as good as vector-space engines. So, probabilistic engines are mostly dead.

Can you elaborate this?

I thought Okapi BM25, which is the default Similarity in Solr, is based on
the probabilistic model. Did you mean that Lucene/Solr is still based on the
vector space model, but BM25Similarity was built on top of it, and therefore
BM25Similarity is not a pure probabilistic scoring system? Or that Okapi
BM25 is not originally probabilistic?

As for me, I prefer the idea of the vector space model over the
probabilistic one for information retrieval, and I stick with
ClassicSimilarity for my projects.

Thanks,

Koji


On 2017/04/13 4:08, Walter Underwood wrote:
Fine. It can’t be done. If it were easy, Solr/Lucene would already have the
feature, right?

Solr is a vector-space engine. Some early engines (Verity VDK) were 
probabilistic engines. Those do give an absolute estimate of the relevance of 
each hit. Unfortunately, the relevance of results is just not as good as 
vector-space engines. So, probabilistic engines are mostly dead.

But, “you don’t want to do it” is very good advice. Instead of trying to reduce 
bad hits, work on increasing good hits. It is really hard, sometimes not 
possible, to optimize both. Increasing the good hits makes your customers 
happy. Reducing the bad hits makes your UX team happy.

Here is a process. Start collecting the clicks on the search results page (SRP) 
with each query. Look at queries that have below average clickthrough. See if 
those can be combined into categories, then address each category.
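
As a rough illustration of that triage step (not Walter's tooling; the log
format and field names here are invented), per-query clickthrough could be
computed like this:

    # Hypothetical click-log analysis: surface queries with below-average
    # clickthrough. Assumes one CSV row per SRP impression, with a "query"
    # column and a "clicked" column holding 0 or 1.
    import csv
    from collections import defaultdict

    impressions = defaultdict(int)
    clicks = defaultdict(int)
    with open("srp_log.csv") as f:
        for row in csv.DictReader(f):
            impressions[row["query"]] += 1
            clicks[row["query"]] += int(row["clicked"])

    overall_ctr = sum(clicks.values()) / sum(impressions.values())

    # Report high-traffic queries with below-average clickthrough first.
    for q, n in sorted(impressions.items(), key=lambda kv: -kv[1]):
        if clicks[q] / n < overall_ctr:
            print(f"{q}\t{clicks[q]}/{n}")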

Some categories that I have used (a combined schema and query sketch follows the list):

* One word or two? “babysitter”, “baby-sitter”, and “baby sitter” are all 
valid. Use synonyms or shingles (and maybe the word delimiter filter) to match 
these.

* Misspellings. These should be about 10% of queries. Use fuzzy matching. I 
recommend the patch in SOLR-629.

* Alternate vocabulary. You sell a “laptop”, but people call it a “notebook”. 
People search for “kids movies”, but your movie genre is “Children and Family”. 
Use synonyms.

* Missing content. People can’t find anything about beach parking because there 
isn’t a page about that. Instead, there are scraps of info about beach parking 
in multiple other pages. Fix the content.
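
To make the first three categories concrete, here is a minimal, untested
sketch. The field type and file names are illustrative, and it assumes Solr
6.4+ for the graph filters:

    <!-- schema.xml: make "babysitter", "baby-sitter", "baby sitter" match -->
    <fieldType name="text_general" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterGraphFilterFactory"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
      </analyzer>
    </fieldType>

    # synonyms.txt: alternate vocabulary
    laptop, notebook
    kids movies => children and family

    # misspellings: fuzzy matching in Lucene query syntax
    # (edit distance of up to 2)
    q=title:babysiter~2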

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Apr 12, 2017, at 11:44 AM, David Kramer <david.kra...@shoebuy.com> wrote:

The idea is to not return poorly matching results, not to limit the number of 
results returned.  One query may have hundreds of excellent matches and another 
query may have 7. So cutting off by the number of results is trivial but not 
useful.

Again, we are not doing this for performance reasons. We’re doing this because 
we don’t want to show products that are not very relevant to the search terms 
specified by the user for UX reasons.

I had hoped that the responses would have been more focused on “it can’t be
done” or “here’s how to do it” than on “you don’t want to do it”. I’m still
left not knowing if it’s even possible. The one concrete answer, using
frange, doesn’t help, as referencing score in either the q or the fq
produces an “undefined field” error.
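
The form I have seen suggested, wrapping the main query in the query()
function so that score is never referenced as a field, would look something
like this (untested here, threshold value arbitrary):

    q=blue running shoes&fq={!frange l=0.5}query($q)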

Thanks.

On 4/11/17, 8:59 AM, "Dorian Hoxha" <dorian.ho...@gmail.com> wrote:

    Can't the filter be used when you're paginating in a sharded scenario?
    If you do limit=10, offset=10, each shard will return 20 docs, whereas
    if you do limit=10, _score<=last_page.min_score, each shard will return
    only 10 docs? (They will still score all docs, but merging will be
    faster.)

    Makes sense?
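
    In Solr syntax, that per-shard cutoff could reuse the frange idea, with
    the previous page's minimum score as an exclusive upper bound. A sketch,
    values made up:

        q=running shoes&rows=10&fq={!frange u=12.37 incu=false}query($q)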

    On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti <a.benede...@sease.io> wrote:

Can I ask what the final requirement here is?
What are you trying to do?
- Just display fewer results?
You can easily do that at search-client time, cutting after a certain amount.
- Make search faster by returning fewer results?
This is not going to work, as you need to score all of them, as Erick
explained.

A function query (as Mikhail specified) will run on a per-document basis
(if I am correct), so if your idea was to speed things up, this is not
going to work.

It makes much more sense to refine your system to improve relevancy if your
concern is to have more relevant docs.
If your concern is just to not show that many pages, you can limit that
client side (see the sketch below).
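
For the client-side option, a minimal sketch in Python (untested; the URL,
collection name, and the 30%-of-top-score cutoff are all illustrative):

    # Hypothetical client-side trim: fetch a page of results and drop hits
    # scoring far below the best one.
    import requests  # third-party HTTP library

    params = {"q": "running shoes", "rows": 100, "fl": "id,score", "wt": "json"}
    resp = requests.get("http://localhost:8983/solr/products/select",
                        params=params)
    docs = resp.json()["response"]["docs"]

    if docs:
        top_score = docs[0]["score"]
        # Keep only hits scoring at least 30% of the top hit.
        docs = [d for d in docs if d["score"] >= 0.3 * top_score]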






-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io




