Hi,

Looking at the source code and term frequencies, it looks like:

fieldLength = number of tokens prior to ngram filter processing
avgFieldLength = <total number of terms post ngram filter> / <number of docs>

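Since you are using an edge n-gram filter, the filter emits 11 terms in
total for your document (5 for "apples" and 6 for "oranges"), while
fieldLength stays at 2, the number of tokens before the filter runs.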
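
To check those two numbers, here is a minimal Python sketch that mimics your
analyzer chain. It is not Lucene code: `tokenize` and `edge_ngrams` are
hypothetical helpers of mine, and plain whitespace splitting stands in for
StandardTokenizer, which I assume is equivalent for simple input like yours.

```python
def tokenize(text):
    # Stand-in for solr.StandardTokenizerFactory (assumption: plain
    # whitespace splitting, adequate for "apples oranges").
    return text.split()


def edge_ngrams(token, min_gram=2, max_gram=7):
    # Prefixes of the token from min_gram up to max_gram characters,
    # mirroring minGramSize="2" maxGramSize="7" in the schema.
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


tokens = tokenize("apples oranges")
grams = [g for t in tokens for g in edge_ngrams(t)]

tokens_before_filter = len(tokens)  # what fieldLength appears to reflect
terms_after_filter = len(grams)     # total terms; equals the average with 1 doc

print(tokens_before_filter)  # 2
print(terms_after_filter)    # 11
print(grams)
```

The five grams printed for "apples" are the same five terms listed inside
Synonym(...) in your explain output.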

See:

https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/similarities/BM25Similarity.html#avgFieldLength-org.apache.lucene.search.CollectionStatistics-

https://lucene.apache.org/core/8_0_0//core/org/apache/lucene/search/CollectionStatistics.html#sumTotalTermFreq--
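
Plugging the values from your explain output into the two formulas it prints
reproduces the score, which at least confirms that 2 and 11 are the values the
similarity really uses. A rough Python sketch (the variable names are mine,
taken from the explain labels; this is not the Lucene implementation, and the
results only match up to float rounding):

```python
import math

# Values copied from the debugQuery explain output.
doc_freq = 1.0
doc_count = 1.0
freq = 5.0
k1 = 1.2
b = 0.75
avg_field_length = 11.0
field_length = 2.0

# idf and tfNorm exactly as spelled out in the explain text.
idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
tf_norm = (freq * (k1 + 1)) / (
    freq + k1 * (1 - b + b * field_length / avg_field_length)
)
score = idf * tf_norm

print(round(idf, 4), round(tf_norm, 4), round(score, 4))
# 0.2877 2.0133 0.5792
```

If fieldLength were equal to avgFieldLength, the length term would reduce to 1
and tfNorm would be (5 * 2.2) / (5 + 1.2), roughly 1.774, so the mismatch does
raise the score for this document.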

That said, I would also like to hear what the experts say, as I cannot claim
to be an expert on BM25 or its Lucene implementation.

Best,
Edward


On Thu, Oct 31, 2019 at 5:56 AM Pedro Sousa <pedro.so...@pragsis.com> wrote:
>
> Hello all,
>
> I have a standard Solr 7.4 installation and I have a question regarding how
> BM25 similarity is computed. Here is an example to describe my question:
>
> 1. create core `test_core`
>
> 2. add the following field and fieldType to `test_core/conf/managed-schema`:
>
> <field name="text" type="ngram_text" indexed="true" stored="true" docValues="false" multiValued="false"/>
>
> <fieldType name="ngram_text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="7"/>
>   </analyzer>
> </fieldType>
>
> 3. restart solr and add the following document:
>
> {
>     "id":1,
>     "text":"apples oranges"
> }
>
> 4. perform query: test_core/select?debugQuery=on&fl=*,score&q=text:apples
>
> 5. check bm25 calculation:
>
> 0.57919353 = weight(Synonym(text:ap text:app text:appl text:apple text:apples) in 0) [SchemaSimilarity], result of:
>   0.57919353 = score(doc=0,freq=5.0 = termFreq=5.0), product of:
>     0.2876821 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
>       1.0 = docFreq
>       1.0 = docCount
>     2.0133111 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>       5.0 = termFreq=5.0
>       1.2 = parameter k1
>       0.75 = parameter b
>       11.0 = avgFieldLength
>       2.0 = fieldLength
>
> ------
>
> My question is: since I only have one document, shouldn't fieldLength
> and avgFieldLength have the same value? I notice that avgFieldLength counts
> the terms produced by the NGram filter while fieldLength doesn't.
>
> Shouldn't avgFieldLength be the average of fieldLength?
>
> Thank you
>
