Hi Chantal,

Please see https://issues.apache.org/jira/browse/LUCENE-7148


ahmet



On Saturday, July 16, 2016 3:48 PM, CA <c...@it-agenten.com> wrote:
Hello all,

our index contains product offers from online shops. The fields we index all 
have rather short values: the product name, the brand, the price, the 
category, and some fields containing identifiers such as ASIN, GTIN, etc., if 
available. We do not index the description texts.

The regular user search uses the „edismax“ query parser and queries the 
above-mentioned fields, which works fine for short inputs like „iphone 6s“.

Now we have to support a different kind of query. Its input will not be typed 
by a user; instead it will be a complete product name like those we store 
ourselves, though not necessarily a name that is actually part of our data 
set. This means that the input query can be relatively long. The output is 
planned to be a More Like This list. So, in effect, the query should return at 
least one hit that is hopefully close enough, and the actual result will be a 
More Like This list seeded by that one hit.

I have tried to get this to work based on the „edismax“ setup for the regular 
user search, but this does not work well when the input is longer than the 
name we have stored for a similar product. Here is an example:


## Step 1: Input (not stored in our index):
"Braun Series 9 9095CC Men's Electric Shaver Wet/Dry with Clean and Renew 
Charger" (input to edismax without quotes)

(a) This input does not produce any results with our current edismax config 
(details at the end of the e-mail).
(b) When I relax the „mm“ parameter to "2<-1 5<-30% 8<10%", I get one hit with 
the following name:
=> "Braun Series Clean&Renew CCR2 Cleansing Dock Cartridges Lemonfresh Formula 
Cartrige (Compatible with Series 7,5,3) 2 pc"
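
As an illustration of what the relaxed „mm“ value means for inputs of different lengths, here is a minimal sketch in Python (not part of our setup) of how such a conditional spec is evaluated according to the documented semantics: below the first threshold every clause is required, a negative value means "all but this many", and percentages are rounded down. The function name is made up for this example, and the number of clauses Solr actually counts for a given input is an assumption here, since it depends on the analysis chain and on any operators in the input.

```python
import math


def min_should_match(optional_clauses: int, spec: str) -> int:
    """Evaluate an mm spec such as "2<-1 5<-30% 8<10%" for a clause count."""

    def simple(value: str) -> int:
        # Plain integer or percentage; negative values mean "all but this many".
        if value.endswith("%"):
            calc = math.trunc(optional_clauses * int(value[:-1]) / 100)
        else:
            calc = int(value)
        return optional_clauses + calc if calc < 0 else calc

    result = optional_clauses  # below the first threshold all clauses are required
    for part in spec.split():
        if "<" in part:
            upper, value = part.split("<", 1)
            if optional_clauses <= int(upper):
                break  # this and all later conditions do not apply
            result = simple(value)
        else:
            result = simple(part)
    return max(0, min(result, optional_clauses))


if __name__ == "__main__":
    for count in (2, 4, 7, 11):
        print(count, min_should_match(count, "2<-1 5<-30% 8<10%"))
```

With this spec, 2 clauses require 2 matches, 4 require 3, 7 require 5, and 11 require only 1, so for an input as long as the shaver name above almost any single matching term is enough.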


## Step 2: When I reduce the input manually to the following:
"Braun Series 9 9095CC Men's Electric Shaver"

The above shortened input returns a very good hit with the name:
=> "Braun 9095cc Series 9 Electric Shaver"


My Question:

Is it possible, and if so, how, to have the query input
"Braun Series 9 9095CC Men's Electric Shaver Wet/Dry with Clean and Renew 
Charger" (input to edismax without quotes)
return (also or only) the hit with the name
=> "Braun 9095cc Series 9 Electric Shaver"
and maybe even give it a high score?

I have tried to use „explainOther“ (see the output at the end of this e-mail), 
but I have a really hard time reading it. In some cases, I’m not even able to 
tell where one clause ends and the next one starts (is it possible to have it 
returned on several lines?). Maybe someone can give me a hint on how to use 
that output, or knows of some documentation on the internet that explains how 
to make good use of it?
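
For the flattened explain text itself, a small client-side helper can at least put every scoring node on its own line. The following is a minimal sketch in Python, assuming that each node starts with a number followed by " = " and preceded by whitespace; the function name is made up for this example. The nesting depth cannot be recovered reliably once the original indentation is lost, so the sketch only splits and does not re-indent. Solr also documents a debug.explain.structured=true request parameter that returns the explanation as a nested structure instead of one string; whether it behaves as expected depends on the Solr version in use.

```python
import re
from typing import List

# Assumption: every explain node starts with "<number> = " preceded by
# whitespace, so "freq=1.0 = termFreq=1.0" stays together on one line.
NODE_START = re.compile(r"\s+(?=\d+(?:\.\d+)?(?:E-?\d+)? = )")


def split_explain(flat: str) -> List[str]:
    """Return the flat explain text with one scoring node per list entry."""
    text = " ".join(flat.split())  # undo hard line wraps from mail or browser
    return NODE_START.sub("\n", text).split("\n")


if __name__ == "__main__":
    sample = (
        "6.80276 = weight(name:braun in 477314) [], result of: 6.80276 = "
        "score(doc=477314,freq=1.0 = termFreq=1.0 ), product of: 8.171213 = "
        "idf(docFreq=324, docCount=1147961) 0.8325276 = tfNorm, computed "
        "from: 1.0 = termFreq=1.0 1.2 = parameter k1"
    )
    for node in split_explain(sample):
        print(node)
```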


Looking at the input string, I was wondering:

(A) Is relaxing the „mm“ parameter really the way to go?
(B) Should I create another name field in schema.xml that has a different 
query analyzer chain, one that discards the last words of a query input if it 
is too long? Or is it possible to make the tokens in the first part of the 
input more „important“ (though I’m not sure the first tokens are generally the 
most relevant)? Should I remove some of the filters from the query chain (like 
the ShingleFilter)?
(C) Can I configure something else or should I not use edismax for this?
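
One variant that can be tried from the client side without touching the schema: in the parsed query at the end of this e-mail, the clauses for "clean" and "renew" carry a "+" and there is no clause for "and", which suggests that the lowercase "and" in "Clean and Renew" is being treated as a boolean operator. edismax has a „lowercaseOperators“ parameter that controls this. Below is a minimal sketch in Python that builds such a request; the host and core name are hypothetical, and whether this setting alone brings back the desired hit has not been verified here.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; "/mlt" is the handler from solrconfig.xml.
SOLR_BASE = "http://localhost:8983/solr/offers"


def build_query_url(product_name: str) -> str:
    """Build a request URL that overrides selected edismax defaults."""
    params = {
        "q": product_name,
        # Treat a lowercase "and"/"or" in the input as ordinary words.
        "lowercaseOperators": "false",
        "mm": "2<-1 5<-30% 8<10%",
        "debugQuery": "true",
        "explainOther": "id:2d617cee76f5ed8598cf7db1b44a40de6f3c8c9b",
    }
    return SOLR_BASE + "/mlt?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url(
        "Braun Series 9 9095CC Men's Electric Shaver Wet/Dry "
        "with Clean and Renew Charger"
    ))
```

Comparing the „parsedquery“ of this request with the one shown below should reveal whether "clean" and "renew" are still marked as required.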


Thank you for reading this,
any insight is highly appreciated!

Chantal


***

Below are the field configuration for the name field, the configuration of 
the edismax handler, and the output of „explainOther“ for the above example.



SCHEMA.XML — "name" field:

<field name="name" type="name_split" indexed="true" stored="true"
       required="true" multiValued="false"/>

<fieldType name="name_split" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
                outputUnigrams="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.LengthFilterFactory" min="2" max="255"/>
        <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>



SOLRCONFIG.XML — MLT/EDISMAX

<requestHandler name="/mlt" class="solr.SearchHandler">
     <lst name="defaults">
         <str name="echoParams">all</str>
         <str name="defType">edismax</str>

         <str name="q.alt">*:*</str>
         <str name="fl">id,brand,name,price,score,popularity</str>
         <str name="tie">0.1</str>
         <str name="qf">brand_split^6 name</str>
         <str name="pf">brand_split^10 name^10</str>
         <str name="mm">2&lt;-1 5&lt;-30% 8&lt;10%</str>
         <int name="qs">10</int>
         <int name="ps">20</int>

         <str name="wt">xml</str>

         <str name="mlt">false</str>
         <str name="mlt.qf">brand_split^6 name price</str>
         <str name="mlt.fl">brand_split name price</str>
         <str name="mlt.interestingTerms">details</str>
     </lst>
</requestHandler>



DEBUG — EXPLAIN OTHER

The „other“ document with id:2d617cee76f5ed8598cf7db1b44a40de6f3c8c9b has the 
title "Braun 9095cc Series 9 Electric Shaver"

<response>
    <lst name="responseHeader">
        <lst name="params"><!-- shortened for a better overview -->
            <str name="defType">edismax</str>
            <str name="qf">brand_split^6 name</str>
            <str name="pf">brand_split^10 name^10</str>
            <str name="mm">2&lt;-1 5&lt;-30% 8&lt;10%</str>
            <str name="qs">10</str>
            <str name="ps">20</str>
            <str name="tie">0.1</str>
            <str name="q">
                Braun Series 9 9095CC Men's Electric Shaver Wet/Dry with Clean
                and Renew Charger
            </str>
            <str name="explainOther">id:2d617cee76f5ed8598cf7db1b44a40de6f3c8c9b</str>
        </lst>
    </lst>
    <result name="response" numFound="1" start="0" maxScore="97.122955">
        <doc>
            <str name="name">
                Braun Series Clean&amp;Renew CCR2 Cleansing Dock Cartridges
                Lemonfresh Formula Cartrige (Compatible with Series 7,5,3) 2 pc
            </str>
            <str name="id">773d4bdb341c4dc438c481ac80de5abde08d85bf</str>
            <str name="brand">Braun</str>
            <float name="score">97.122955</float>
        </doc>
    </result>
    <lst name="debug">
        <str name="rawquerystring">
            Braun Series 9 9095CC Men's Electric Shaver Wet/Dry with Clean and 
Renew Charger
        </str>
        <str name="querystring">
            Braun Series 9 9095CC Men's Electric Shaver Wet/Dry with Clean and 
Renew Charger
        </str>
        <str name="parsedquery">
            (+(DisjunctionMaxQuery((name:braun | (brand_split:braun)^6.0)~0.1) 
DisjunctionMaxQuery((name:series |
            (brand_split:series)^6.0)~0.1) DisjunctionMaxQuery((name:"(9095cc 
9095) cc"~10 | (brand_split:"(9095cc 9095)
            cc"~10)^6.0)~0.1) DisjunctionMaxQuery((Synonym(name:men name:men's) 
| (Synonym(brand_split:men
            brand_split:men's))^6.0)~0.1) DisjunctionMaxQuery((name:electric | 
(brand_split:electric)^6.0)~0.1)
            DisjunctionMaxQuery((name:shaver | (brand_split:shaver)^6.0)~0.1) 
DisjunctionMaxQuery((name:"(wet/dry wet
            wetdry) dry"~10 | (brand_split:"(wet/dry wet wetdry) 
dry"~10)^6.0)~0.1) DisjunctionMaxQuery((name:with |
            (brand_split:with)^6.0)~0.1) +DisjunctionMaxQuery((name:clean | 
(brand_split:clean)^6.0)~0.1)
            +DisjunctionMaxQuery((name:renew | (brand_split:renew)^6.0)~0.1) 
DisjunctionMaxQuery((name:charger |
            (brand_split:charger)^6.0)~0.1)) 
DisjunctionMaxQuery(((brand_split:"(braun braun series braunseries) series
            (series series 9 series9) ? (9 9095cc 99095 99095cc) 9095 cc 
(9095cc 9095) (cc 9095cc men's 9095 9095ccmen)
            (cc ccmen) men (men's men men's electric menelectric) electric 
(electric electric shaver electricshaver)
            shaver (shaver shaver wet/dry shaverwetdry) wet dry (wet/dry wet 
wetdry) (dry wet/dry with wet wetdrywith)
            dry with (with with clean withclean) clean (clean clean and 
cleanand) and (and and renew andrenew) renew
            (renew renew charger renewcharger) charger charger"~20)^10.0 | 
(name:"(braun braun series braunseries)
            series (series series 9 series9) ? (9 9095cc 99095 99095cc) 9095 cc 
(9095cc 9095) (cc 9095cc men's 9095
            9095ccmen) (cc ccmen) men (men's men men's electric menelectric) 
electric (electric electric shaver
            electricshaver) shaver (shaver shaver wet/dry shaverwetdry) wet dry 
(wet/dry wet wetdry) (dry wet/dry with
            wet wetdrywith) dry with (with with clean withclean) clean (clean 
clean and cleanand) and (and and renew
            andrenew) renew (renew renew charger renewcharger) charger 
charger"~20)^10.0)~0.1))/no_coord
        </str>
        <str name="parsedquery_toString">
            +((name:braun | (brand_split:braun)^6.0)~0.1 (name:series | 
(brand_split:series)^6.0)~0.1 (name:"(9095cc
            9095) cc"~10 | (brand_split:"(9095cc 9095) cc"~10)^6.0)~0.1 
(Synonym(name:men name:men's) |
            (Synonym(brand_split:men brand_split:men's))^6.0)~0.1 
(name:electric | (brand_split:electric)^6.0)~0.1
            (name:shaver | (brand_split:shaver)^6.0)~0.1 (name:"(wet/dry wet 
wetdry) dry"~10 | (brand_split:"(wet/dry
            wet wetdry) dry"~10)^6.0)~0.1 (name:with | 
(brand_split:with)^6.0)~0.1 +(name:clean |
            (brand_split:clean)^6.0)~0.1 +(name:renew | 
(brand_split:renew)^6.0)~0.1 (name:charger |
            (brand_split:charger)^6.0)~0.1) ((brand_split:"(braun braun series 
braunseries) series (series series 9
            series9) ? (9 9095cc 99095 99095cc) 9095 cc (9095cc 9095) (cc 
9095cc men's 9095 9095ccmen) (cc ccmen) men
            (men's men men's electric menelectric) electric (electric electric 
shaver electricshaver) shaver (shaver
            shaver wet/dry shaverwetdry) wet dry (wet/dry wet wetdry) (dry 
wet/dry with wet wetdrywith) dry with (with
            with clean withclean) clean (clean clean and cleanand) and (and and 
renew andrenew) renew (renew renew
            charger renewcharger) charger charger"~20)^10.0 | (name:"(braun 
braun series braunseries) series (series
            series 9 series9) ? (9 9095cc 99095 99095cc) 9095 cc (9095cc 9095) 
(cc 9095cc men's 9095 9095ccmen) (cc
            ccmen) men (men's men men's electric menelectric) electric 
(electric electric shaver electricshaver) shaver
            (shaver shaver wet/dry shaverwetdry) wet dry (wet/dry wet wetdry) 
(dry wet/dry with wet wetdrywith) dry with
            (with with clean withclean) clean (clean clean and cleanand) and 
(and and renew andrenew) renew (renew renew
            charger renewcharger) charger charger"~20)^10.0)~0.1
        </str>
        <lst name="explain">
            <str name="773d4bdb341c4dc438c481ac80de5abde08d85bf">
                97.122955 = sum of: 97.122955 = sum of: 61.102264 = max plus 
0.1 times others of: 6.80276 =
                weight(name:braun in 477314) [], result of: 6.80276 = 
score(doc=477314,freq=1.0 = termFreq=1.0 ),
                product of: 8.171213 = idf(docFreq=324, docCount=1147961) 
0.8325276 = tfNorm, computed from: 1.0 =
                termFreq=1.0 1.2 = parameter k1 0.75 = parameter b 27.458092 = 
avgFieldLength 40.96 = fieldLength
                60.42199 = weight(brand_split:braun in 477314) [], result of: 
60.42199 = score(doc=477314,freq=1.0 =
                termFreq=1.0 ), product of: 6.0 = boost 8.11682 = 
idf(docFreq=305, docCount=1023531) 1.2406745 = tfNorm,
                computed from: 1.0 = termFreq=1.0 1.2 = parameter k1 0.75 = 
parameter b 1.9018271 = avgFieldLength 1.0 =
                fieldLength 8.663414 = max plus 0.1 times others of: 8.663414 = 
weight(name:series in 477314) [], result
                of: 8.663414 = score(doc=477314,freq=4.0 = termFreq=4.0 ), 
product of: 5.5549765 = idf(docFreq=4440,
                docCount=1147961) 1.5595771 = tfNorm, computed from: 4.0 = 
termFreq=4.0 1.2 = parameter k1 0.75 =
                parameter b 27.458092 = avgFieldLength 40.96 = fieldLength 
4.0527744 = max plus 0.1 times others of:
                4.0527744 = weight(name:with in 477314) [], result of: 
4.0527744 = score(doc=477314,freq=2.0 =
                termFreq=2.0 ), product of: 3.355103 = idf(docFreq=40070, 
docCount=1147961) 1.2079433 = tfNorm, computed
                from: 2.0 = termFreq=2.0 1.2 = parameter k1 0.75 = parameter b 
27.458092 = avgFieldLength 40.96 =
                fieldLength 8.542337 = max plus 0.1 times others of: 8.542337 = 
weight(name:clean in 477314) [], result
                of: 8.542337 = score(doc=477314,freq=3.0 = termFreq=3.0 ), 
product of: 6.008829 = idf(docFreq=2820,
                docCount=1147961) 1.421631 = tfNorm, computed from: 3.0 = 
termFreq=3.0 1.2 = parameter k1 0.75 =
                parameter b 27.458092 = avgFieldLength 40.96 = fieldLength 
14.762168 = max plus 0.1 times others of:
                14.762168 = weight(name:renew in 477314) [], result of: 
14.762168 = score(doc=477314,freq=3.0 =
                termFreq=3.0 ), product of: 10.383966 = idf(docFreq=35, 
docCount=1147961) 1.421631 = tfNorm, computed
                from: 3.0 = termFreq=3.0 1.2 = parameter k1 0.75 = parameter b 
27.458092 = avgFieldLength 40.96 =
                fieldLength
            </str>
        </lst>
        <str name="otherQuery">id:2d617cee76f5ed8598cf7db1b44a40de6f3c8c9b</str>
        <lst name="explainOther">
            <str name="2d617cee76f5ed8598cf7db1b44a40de6f3c8c9b">
                0.0 = Failure to meet condition(s) of required/prohibited 
clause(s) 0.0 = no match on required clause
                ((name:braun | (brand_split:braun)^6.0)~0.1 (name:series | 
(brand_split:series)^6.0)~0.1 (name:"(9095cc
                9095) cc"~10 | (brand_split:"(9095cc 9095) cc"~10)^6.0)~0.1 
(Synonym(name:men name:men's) |
                (Synonym(brand_split:men brand_split:men's))^6.0)~0.1 
(name:electric | (brand_split:electric)^6.0)~0.1
                (name:shaver | (brand_split:shaver)^6.0)~0.1 (name:"(wet/dry 
wet wetdry) dry"~10 |
                (brand_split:"(wet/dry wet wetdry) dry"~10)^6.0)~0.1 (name:with 
| (brand_split:with)^6.0)~0.1
                +(name:clean | (brand_split:clean)^6.0)~0.1 +(name:renew | 
(brand_split:renew)^6.0)~0.1 (name:charger |
                (brand_split:charger)^6.0)~0.1) 0.0 = Failure to meet 
condition(s) of required/prohibited clause(s)
                61.40732 = max plus 0.1 times others of: 9.853278 = 
weight(name:braun in 113560) [], result of: 9.853278
                = score(doc=113560,freq=1.0 = termFreq=1.0 ), product of: 
8.171213 = idf(docFreq=324, docCount=1147961)
                1.2058525 = tfNorm, computed from: 1.0 = termFreq=1.0 1.2 = 
parameter k1 0.75 = parameter b 27.458092 =
                avgFieldLength 16.0 = fieldLength 60.42199 = 
weight(brand_split:braun in 113560) [], result of: 60.42199
                = score(doc=113560,freq=1.0 = termFreq=1.0 ), product of: 6.0 = 
boost 8.11682 = idf(docFreq=305,
                docCount=1023531) 1.2406745 = tfNorm, computed from: 1.0 = 
termFreq=1.0 1.2 = parameter k1 0.75 =
                parameter b 1.9018271 = avgFieldLength 1.0 = fieldLength 
8.6537285 = max plus 0.1 times others of:
                8.6537285 = weight(name:series in 113560) [], result of: 
8.6537285 = score(doc=113560,freq=2.0 =
                termFreq=2.0 ), product of: 5.5549765 = idf(docFreq=4440, 
docCount=1147961) 1.5578334 = tfNorm, computed
                from: 2.0 = termFreq=2.0 1.2 = parameter k1 0.75 = parameter b 
27.458092 = avgFieldLength 16.0 =
                fieldLength 52.67099 = max plus 0.1 times others of: 52.67099 = 
weight(name:"(9095cc 9095) cc"~10 in
                113560) [], result of: 52.67099 = score(doc=113560,freq=3.0 = 
phraseFreq=3.0 ), product of: 30.520727 =
                idf(), sum of: 13.037208 = idf(docFreq=2, docCount=1147961) 
10.796498 = idf(docFreq=23,
                docCount=1147961) 6.687021 = idf(docFreq=1431, 
docCount=1147961) 1.725745 = tfNorm, computed from: 3.0 =
                phraseFreq=3.0 1.2 = parameter k1 0.75 = parameter b 27.458092 
= avgFieldLength 16.0 = fieldLength
                8.592838 = max plus 0.1 times others of: 8.592838 = 
weight(name:electric in 113560) [], result of:
                8.592838 = score(doc=113560,freq=2.0 = termFreq=2.0 ), product 
of: 5.51589 = idf(docFreq=4617,
                docCount=1147961) 1.5578334 = tfNorm, computed from: 2.0 = 
termFreq=2.0 1.2 = parameter k1 0.75 =
                parameter b 27.458092 = avgFieldLength 16.0 = fieldLength 
13.669254 = max plus 0.1 times others of:
                13.669254 = weight(name:shaver in 113560) [], result of: 
13.669254 = score(doc=113560,freq=2.0 =
                termFreq=2.0 ), product of: 8.7745285 = idf(docFreq=177, 
docCount=1147961) 1.5578334 = tfNorm, computed
                from: 2.0 = termFreq=2.0 1.2 = parameter k1 0.75 = parameter b 
27.458092 = avgFieldLength 16.0 =
                fieldLength 0.0 = no match on required clause ((name:clean | 
(brand_split:clean)^6.0)~0.1) 0.0 = No
                matching clause 0.0 = no match on required clause ((name:renew 
| (brand_split:renew)^6.0)~0.1) 0.0 = No
                matching clause
            </str>
        </lst>
    </lst>
</response>
