Hi Walter,
May I ask a tangential question? I'm curious about the following lines you wrote:
> Solr is a vector-space engine. Some early engines (Verity VDK) were probabilistic engines. Those
> do give an absolute estimate of the relevance of each hit. Unfortunately, the relevance of results
> is just not as good as vector-space engines. So, probabilistic engines are mostly dead.
Can you elaborate on this?
I thought Okapi BM25, which is the default Similarity in Solr, is based on the
probabilistic model. Did you mean that Lucene/Solr is still based on the vector
space model, but BM25Similarity is built on top of it, and therefore
BM25Similarity is not a pure probabilistic scoring system? Or that Okapi BM25
is not originally probabilistic?
As for me, I prefer the vector space model to the probabilistic model for
information retrieval, and I stick with ClassicSimilarity for my projects.
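(For context, pinning the classic TF-IDF scoring is a one-line schema change; a minimal sketch, assuming a managed-schema and Solr 6+, where BM25 became the default:)

```xml
<!-- managed-schema: override the default BM25 similarity with classic TF-IDF -->
<similarity class="solr.ClassicSimilarityFactory"/>
```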
Thanks,
Koji
On 2017/04/13 4:08, Walter Underwood wrote:
Fine. It can’t be done. If it was easy, Solr/Lucene would already have the
feature, right?
Solr is a vector-space engine. Some early engines (Verity VDK) were
probabilistic engines. Those do give an absolute estimate of the relevance of
each hit. Unfortunately, the relevance of results is just not as good as
vector-space engines. So, probabilistic engines are mostly dead.
But, “you don’t want to do it” is very good advice. Instead of trying to reduce
bad hits, work on increasing good hits. It is really hard, sometimes not
possible, to optimize both. Increasing the good hits makes your customers
happy. Reducing the bad hits makes your UX team happy.
Here is a process. Start collecting the clicks on the search results page (SRP)
with each query. Look at queries that have below-average clickthrough. See if
those can be combined into categories, then address each category.
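The first step of that process can be sketched as a small script; the log format here (query, clicked) and the strict below-average cutoff are assumptions for illustration, not something prescribed in the thread:

```python
from collections import defaultdict

def low_ctr_queries(events):
    """events: iterable of (query, clicked) pairs taken from SRP logs.
    Returns queries whose clickthrough rate is below the overall average."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for query, clicked in events:
        shown[query] += 1
        if clicked:
            clicks[query] += 1
    total_shown = sum(shown.values())
    total_clicks = sum(clicks.values())
    avg_ctr = total_clicks / total_shown if total_shown else 0.0
    # Flag queries underperforming the site-wide average for manual triage
    return sorted(q for q in shown if clicks[q] / shown[q] < avg_ctr)

events = [("babysitter", True), ("babysitter", True),
          ("beach parking", False), ("beach parking", False),
          ("kids movies", True), ("kids movies", False)]
print(low_ctr_queries(events))  # → ['beach parking']
```

The flagged queries are the ones worth grouping into the categories below.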
Some categories that I have used:
* One word or two? “babysitter”, “baby-sitter”, and “baby sitter” are all
valid. Use synonyms or shingles (and maybe the word delimiter filter) to match
these.
* Misspellings. These should be about 10% of queries. Use fuzzy matching. I
recommend the patch in SOLR-629.
* Alternate vocabulary. You sell a “laptop”, but people call it a “notebook”.
People search for “kids movies”, but your movie genre is “Children and Family”.
Use synonyms.
* Missing content. People can’t find anything about beach parking because there
isn’t a page about that. Instead, there are scraps of info about beach parking
in multiple other pages. Fix the content.
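The "one word or two" category above can be sketched as an analysis chain; this is only a sketch, and the fieldType name and the synonyms.txt contents are assumptions (in practice the synonym filter is usually applied at query time only):

```xml
<!-- schema sketch: match "babysitter", "baby-sitter", and "baby sitter" -->
<fieldType name="text_syn" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- splits "baby-sitter" into parts and also catenates them back
         into "babysitter" so both forms are indexed -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" catenateWords="1"/>
    <!-- synonyms.txt would contain a line like: babysitter, baby sitter -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```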
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
On Apr 12, 2017, at 11:44 AM, David Kramer <david.kra...@shoebuy.com> wrote:
The idea is to not return poorly matching results, not to limit the number of
results returned. One query may have hundreds of excellent matches and another
query may have 7. So cutting off by the number of results is trivial but not
useful.
Again, we are not doing this for performance reasons. We’re doing this because
we don’t want to show products that are not very relevant to the search terms
specified by the user for UX reasons.
I had hoped that the responses would have been more focused on “it can’t be
done” or “here’s how to do it” than “you don’t want to do it”. I’m still left
not knowing if it’s even possible. The one concrete answer of using frange
doesn’t help as referencing score in either the q or the fq produces an
“undefined field” error.
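(Aside: the “undefined field” error comes from referencing score as if it were a field in q or fq. frange only operates on a function, so the suggestion presumably meant the query() function form, roughly like this untested sketch, where the 0.5 threshold is arbitrary:)

```
q=shoes&fq={!frange l=0.5}query($q)
```

Even then, as noted elsewhere in the thread, raw scores are not absolute across queries, so any fixed threshold is fragile.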
Thanks.
On 4/11/17, 8:59 AM, "Dorian Hoxha" <dorian.ho...@gmail.com> wrote:
Can't the filter be used when you're paginating in a sharded scenario?
So if you do limit=10, offset=10, each shard will return 20 docs?
While if you do limit=10, _score<=last_page.min_score, then each shard will
return only 10 docs? (They will still score all docs, but merging will be
faster.) Makes sense?
On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti <a.benede...@sease.io
wrote:
Can I ask what the final requirement is here?
What are you trying to do?
- Just display fewer results?
You can easily do that at search-client time, cutting off after a certain amount.
- Make search faster by returning fewer results?
This is not going to work, as you need to score all of them, as Erick
explained.
Function queries (as Mikhail specified) run on a per-document basis (if I am
correct), so if your idea was to speed things up, this is not going to work.
It makes much more sense to refine your system to improve relevancy if your
concern is to have more relevant docs.
If your concern is just to not show that many pages, you can limit that
client side.
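Sketching that client-side cut: the relative-to-top-score heuristic below is my own assumption, not something prescribed in this thread, and it leans on the fact that scores are only comparable within a single query:

```python
def cut_results(docs, max_docs=10, min_ratio=0.3):
    """docs: list of (doc_id, score) pairs sorted by descending score.
    Keep at most max_docs, and drop docs scoring below min_ratio of the
    top score for this query (scores are not absolute across queries)."""
    if not docs:
        return []
    top_score = docs[0][1]
    return [(d, s) for d, s in docs[:max_docs] if s >= min_ratio * top_score]

hits = [("a", 9.1), ("b", 8.7), ("c", 2.0), ("d", 0.5)]
print(cut_results(hits))  # → [('a', 9.1), ('b', 8.7)]
```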
-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-results-by-minimum-relevancy-score-tp4329180p4329295.html
Sent from the Solr - User mailing list archive at Nabble.com.