Re: SOLR 7.1 ClassicSimilarityFactory Problem

Erick Erickson Fri, 20 Jul 2018 08:56:26 -0700

Why do you think you need to "fix" anything here?

The fieldNorm is significantly different between the two docs. On a
quick scan (and you're right, trying to understand it all at a glance
is daunting) the fieldNorm is lowering the score of the second doc:
0.35355338 versus 1.0 for Cityview. Basically the "two hits" are in a
longer field, so their weight is less, which is part of the basic
function of scoring.
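
To make the arithmetic concrete, here is a small Python sketch that
recomputes both scores from the numbers in the debug output quoted
below. It assumes only the formula that the debug output itself prints
(tf = sqrt(freq), idf = log((docCount+1)/(docFreq+1)) + 1, fieldWeight =
tf * idf * fieldNorm). The function names are made up for illustration
and are not Solr or Lucene APIs.

```python
import math

def idf(doc_freq, doc_count):
    # idf as printed in the debug output: log((docCount+1)/(docFreq+1)) + 1
    return math.log((doc_count + 1) / (doc_freq + 1)) + 1

def field_weight(freq, doc_freq, doc_count, field_norm):
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq)
    return math.sqrt(freq) * idf(doc_freq, doc_count) * field_norm

# docFreq and docCount copied from the debug output
DOC_FREQ, DOC_COUNT = 166407, 433880

# Cityview: fieldNorm = 1.0, termFreq 1 (clutch n-grams) and 4 (city n-grams)
cityview = (field_weight(1, DOC_FREQ, DOC_COUNT, 1.0)
            + field_weight(4, DOC_FREQ, DOC_COUNT, 1.0))

# Clutch City Sports & Entertainment: fieldNorm = 0.35355338, termFreq 8 and 6
clutch_city = (field_weight(8, DOC_FREQ, DOC_COUNT, 0.35355338)
               + field_weight(6, DOC_FREQ, DOC_COUNT, 0.35355338))

print(round(cityview, 4), round(clutch_city, 4))
```

The higher term frequencies in the second doc are more than cancelled
by its smaller fieldNorm, which is why it ends up below Cityview.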


Plus it looks like you've n-grammed the field, which is further
confusing the issue.

I don't see what rows is changing, please point it out. You're getting
the exact same score for the reported documents, it's just that
as you add more rows you get information for more docs as far as
I can tell.
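
A toy Python sketch of that point, under the assumption that rows only
controls how many of the score-sorted hits come back. The `hits` list
and the `search` helper are hypothetical; only the scores are taken
from the output quoted below.

```python
# Hypothetical hit list: (CompanyName, score). Each score is fixed per document.
hits = [
    ("Cityview", 5.874983),
    ("Citadel", 5.3502507),
    ("CivicVentures", 4.7278214),
    ("Clutch City Sports & Entertainment", 3.6542892),
]

def search(rows):
    # Sort by score descending, then return only the first `rows` hits.
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    return ranked[:rows]

small = search(rows=3)
large = search(rows=4)

# The shared prefix is identical; a larger rows value only exposes more of the list.
print(small == large[:3])
```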

You can try omitting norms and/or creating a non-ngrammed field.
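
As a rough sketch of what omitting norms might do to these two docs,
the Python below recomputes the scores with fieldNorm forced to 1.0.
That is an assumption about how a field with norms omitted is scored,
and the idf and termFreq values are simply copied from the debug output
quoted below; the `score` helper is hypothetical.

```python
import math

IDF = 1.9583277  # idf copied from the debug output

def score(freqs, field_norm):
    # Sum of tf * idf * fieldNorm per clause, with tf = sqrt(freq)
    return sum(math.sqrt(f) * IDF * field_norm for f in freqs)

# termFreq values from the debug output: (clutch n-grams, city n-grams)
cityview_freqs = (1, 4)
clutch_city_freqs = (8, 6)

with_norms = {
    "Cityview": score(cityview_freqs, 1.0),
    "Clutch City": score(clutch_city_freqs, 0.35355338),
}
# Assumption: with norms omitted, fieldNorm is effectively 1.0 for every doc.
without_norms = {
    "Cityview": score(cityview_freqs, 1.0),
    "Clutch City": score(clutch_city_freqs, 1.0),
}

print(with_norms)
print(without_norms)
```

Under that assumption the Clutch City docs would outrank Cityview,
since only tf and idf remain. The termFreq values still count n-gram
matches, so a non-ngrammed field would change them as well.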

As for why it's different from 4.x, no clue. Perhaps the Lucene
folks can weigh in.

Best,
Erick

On Fri, Jul 20, 2018 at 8:41 AM, Hodder, Rick <rhod...@navg.com> wrote:

> I am using SOLR 7.1
>
> ClassicSimilarityFactory
>
> I have data in my core in a field called CompanyName, which is copied
> into an indexed field IDX_CompanyName
>
>
>
> <field name="IDX_CompanyName" type="text_general" indexed="true"
> stored="false" multiValued="true" />
>
> <field name="CompanyName" type="string" indexed="true" stored="true"/>
>
> <copyField source="CompanyName" dest="IDX_CompanyName"/>
>
>
>
> Here are a few of the 900,000 rows in the core
>
>
>
> Cityview
>
> Citadel
>
> CivicVentures
>
> Clutch City Sports
>
> Clutch City Sports & Entertainment
>
> Clutch City Sports & Entertainment
>
> Clutch City Sports & Entertainment
>
>
>
>
>
> If I *search* for IDX_CompanyName:(clutch AND city) with fl=*,score and a
> maxrows of 750 (and also at 1500), I get the following results
>
>
>
> *CompanyName                Score*
>
> Cityview                               5.874983
>
> Citadel                                  5.3502507
>
> CivicVentures                    4.7278214
>
> <other rows, but no clutch city>
>
>
>
> If I *search* for IDX_CompanyName:(clutch AND city) with a maxrows of
> 5000, I get the following results
>
>
>
> *CompanyName                            Score*
>
> Cityview                                5.874983
>
> Citadel                                 5.3502507
>
> CivicVentures                           4.7278214
>
> Clutch City Sports & Entertainment      3.6542892
>
> Clutch City Sports & Entertainment      3.6542892
>
> Clutch City Sports & Entertainment      3.6542892
>
>
>
> I’ve tried looking at the debug query to figure out what it’s doing, and
> I’m confused by what it is saying
>
>
>
> The debug info for Cityview is
>
>
>
> <str name="366640">
>
> 5.874983 = sum of:
>
>   1.9583277 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:cl
> IDX_CompanyName:clu IDX_CompanyName:clut IDX_CompanyName:clutc
> IDX_CompanyName:clutch) in 16639) [ClassicSimilarity], result of:
>
>     1.9583277 = fieldWeight in 16639, product of:
>
>       1.0 = tf(freq=1.0), with freq of:
>
>         1.0 = termFreq=1.0
>
>       1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>
>         166407.0 = docFreq
>
>         433880.0 = docCount
>
>       1.0 = fieldNorm(doc=16639)
>
>   3.9166553 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:ci
> IDX_CompanyName:cit IDX_CompanyName:city) in 16639) [ClassicSimilarity],
> result of:
>
>     3.9166553 = fieldWeight in 16639, product of:
>
>       2.0 = tf(freq=4.0), with freq of:
>
>         4.0 = termFreq=4.0
>
>       1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>
>         166407.0 = docFreq
>
>         433880.0 = docCount
>
>       1.0 = fieldNorm(doc=16639)
>
> </str>
>
>
>
> The debug info for Clutch City Sports & Entertainment is
>
>
>
> <str name="409550">
>
> 3.6542892 = sum of:
>
>   1.9583277 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:cl
> IDX_CompanyName:clu IDX_CompanyName:clut IDX_CompanyName:clutc
> IDX_CompanyName:clutch) in 9549) [ClassicSimilarity], result of:
>
>     1.9583277 = fieldWeight in 9549, product of:
>
>       2.828427 = tf(freq=8.0), with freq of:
>
>         8.0 = termFreq=8.0
>
>       1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>
>         166407.0 = docFreq
>
>         433880.0 = docCount
>
>       0.35355338 = fieldNorm(doc=9549)
>
>   1.6959615 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:ci
> IDX_CompanyName:cit IDX_CompanyName:city) in 9549) [ClassicSimilarity],
> result of:
>
>     1.6959615 = fieldWeight in 9549, product of:
>
>       2.4494898 = tf(freq=6.0), with freq of:
>
>         6.0 = termFreq=6.0
>
>       1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>
>         166407.0 = docFreq
>
>         433880.0 = docCount
>
>       0.35355338 = fieldNorm(doc=9549)
>
> </str>
>
>
>
> Why would something with 2 hits score lower? Why does the max rows
> influence this?
>
>
>
> How might I fix this?
>
>
>
> This didn’t use to happen in SOLR 4.10 (I know it’s an older version, but…)
>
>
>
>
>
> Thanks,
>
>
>
> Rick Hodder
>
> Information Technology
>
> Navigators Management Company, Inc.
>
> 83 Wooster Heights Road, 2nd Floor
>
> Danbury, CT  06810
>
> (475) 329-6251
>
