Re: E-Commerce Search: tf-idf, tie-break and boolean model

Walter Underwood Fri, 20 Oct 2017 08:42:48 -0700

Setting mm to 100% means that a single misspelled word in a query produces 
zero results. That is not a good experience. Typically, about 10% of queries 
contain a misspelling.


Set mm to 1.
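
To make the effect of mm concrete, here is a minimal sketch of a "minimum should match" check. It is a toy model, not Solr's implementation: the function `matches`, the two sample documents, and the misspelled query are all made up for illustration, and real Solr would also apply analysis and scoring.

```python
def matches(query_terms, doc_terms, mm):
    """Return True if at least `mm` of the query terms occur in the document."""
    hits = sum(1 for term in query_terms if term in doc_terms)
    return hits >= mm


# Hypothetical two-document index, each document reduced to a set of terms.
docs = {
    "d1": {"iphone", "charger"},
    "d2": {"samsung", "charger"},
}

# "chargr" is a misspelling, so it matches no document.
query = ["iphone", "chargr"]

# Equivalent of mm=100%: every query term must match.
strict = [doc_id for doc_id, terms in docs.items()
          if matches(query, terms, len(query))]

# Equivalent of mm=1: a single matching term is enough.
loose = [doc_id for doc_id, terms in docs.items()
         if matches(query, terms, 1)]

print(strict)
print(loose)
```

In this toy index the strict variant finds nothing because of the one misspelled term, while the loose variant still finds the iPhone charger.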

The F-measure is not a good choice for this because recall is not very 
important in e-commerce. Use precision-oriented measures. P@3 is a good start. 
If there is usually exactly one correct answer (this was true when I did search 
at Netflix), MRR is a better choice. That measures the position of the first 
relevant result.

https://techblog.chegg.com/2012/12/12/measuring-search-relevance-with-mrr/
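
As a rough sketch of the two measures mentioned above, assuming each golden-set entry is a ranked result list plus a set of relevant document ids (the function names and the sample data are made up for illustration):

```python
def precision_at_k(ranked, relevant, k=3):
    """Fraction of the top-k results that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical golden set: three queries with their ranked results.
runs = [
    (["a", "b", "c"], {"a"}),  # first relevant result at position 1
    (["x", "y", "z"], {"y"}),  # first relevant result at position 2
    (["p", "q", "r"], {"s"}),  # no relevant result returned
]

mrr = mean_reciprocal_rank(runs)                       # (1 + 0.5 + 0) / 3
p3 = precision_at_k(["a", "b", "c"], {"a", "c"}, k=3)  # 2 of the top 3

print(mrr, p3)
```

Both measures look only at the top of the result list, which is why they fit e-commerce better than a recall-weighted measure like F1.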

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 20, 2017, at 1:05 AM, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> 
> Thanks for all the info, I really appreciate your help. I'm working on the
> configuration and following your suggestions.
> 
> We already had a golden set of query-results pairs (~1000) used to tune and
> check how my application (and Solr configuration) performs.
> But I have to double-check whether this set is still relevant.
> The results of each query are used to calculate F1.
> 
> Nevertheless, having this base of tests lets me try a few rounds of adding
> and removing the custom similarity, changing the tie configuration, and so
> on.
> 
> Now I want to share my results with you:
> 
> - I've just set mm=100%
> 
> - TF - set as constant 1.0 - slight improvement in search results.
> It seems to perform better when there are a few products that are
> almost identical, but some of them have the same keyword repeated many
> times. For example, a product "iphone charger for iphone 5, iphone
> 5s, iphone 6" versus a product "iphone charge"
> 
> - IDF - set as constant 1.0 - the results were not catastrophic but
> definitely worse than with the default similarity. So I've rolled back this
> change; it seems to me the results are flattened too much.
> 
> - tie - I've only tried 0.1 and 1.0; at the moment 1.0 seems to perform
> better, but I'm not sure why.
> 
> I want to try adding some relevant fields (tags, categories) in order to
> have more chances to match the correct results.
> 
> Best regards,
> Vincenzo
> 
> On Tue, Oct 17, 2017 at 11:38 PM, Walter Underwood <wun...@wunderwood.org>
> wrote:
> 
>> That page from Stanford is not about e-commerce search. Westlaw is
>> professional librarian search.
>> 
>> I agree with Emir’s advice. Start with edismax. Use a small value for the
>> tie-breaker. It is one of the least important configuration values. I use
>> the default from the sample configs:
>> 
>>       <str name="tie">0.1</str>
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 16, 2017, at 1:53 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
>>> 
>>> Hi Vincenzo,
>>> Unless you have really specific ranking requirements, I would not
>>> suggest that you start with your own similarity implementation. In
>>> most cases edismax will be good enough to cover your requirements. It is
>>> not an easy task to tune edismax since it has a lot of knobs that you can
>>> use.
>>> In general there are two approaches that you can use. The first is to
>>> create a golden set of query-results pairs, use it with some metric (e.g.
>>> you can start with a simple F-measure), and tune parameters to maximize
>>> that metric. The alternative approach (which complements the first one) is
>>> to let users use your search, track clicks, monitor search metrics like
>>> mean reciprocal rank, zero-result queries, page depth etc., and tune
>>> queries to get better results. If you can do A/B testing, you can use
>>> that as well to see which changes are better.
>>> In most cases, this is an iterative process, and you should not expect to
>>> get it right the first time or to be able to tune it to cover all cases.
>>> 
>>> Good luck!
>>> 
>>> HTH,
>>> Emir
>>> 
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>> 
>>> 
>>> 
>>>> On 16 Oct 2017, at 10:30, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I'm trying to figure out how to tune Solr for an e-commerce search.
>>>> 
>>>> I want to share with you what I did, in the hope of understanding whether
>>>> I was right and whether I could improve my configuration.
>>>> 
>>>> I also read that the boolean model has to be preferred in this case.
>>>> 
>>>> https://nlp.stanford.edu/IR-book/html/htmledition/the-extended-boolean-model-versus-ranked-retrieval-1.html
>>>> 
>>>> 
>>>> So, I first wrote my own implementation of DefaultSimilarity that returns
>>>> a constant 1.0 for TF and IDF.
>>>> 
>>>> Now I'm struggling to understand how to configure the tie-break
>>>> parameter. My opinion was to set it to 0.1 or 0.0 because, if I
>>>> understood correctly, the boolean model should then be preferred, since
>>>> only the maximum-scoring subquery contributes to the final score.
>>>> 
>>>> https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thetie_TieBreaker_Parameter
>>>> 
>>>> 
>>>> I'm not sure if this is enough or if you need more information. Thanks
>>>> in advance to anyone who can add something to this discussion.
>>>> 
>>>> Best regards,
>>>> Vincenzo
>>>> 
>> 
>>
