Hi,

Looking at the source code and term frequencies, it looks like:

fieldLength = number of tokens prior to ngram filter processing
avgFieldLength = <total number of terms post ngram filter> / <number of docs>

Since you are using an n-gram filter, 11 is the total number of terms after the filter (5 grams for "apples" plus 6 for "oranges"), while fieldLength is 2.

See:
https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/search/similarities/BM25Similarity.html#avgFieldLength-org.apache.lucene.search.CollectionStatistics-
https://lucene.apache.org/core/8_0_0//core/org/apache/lucene/search/CollectionStatistics.html#sumTotalTermFreq--

That said, I would also like to hear what the experts say, since I cannot claim to be an expert on BM25 or its Lucene implementation.

Best,
Edward

On Thu, Oct 31, 2019 at 5:56 AM Pedro Sousa <pedro.so...@pragsis.com> wrote:
>
> Hello all,
>
> I have a standard Solr 7.4 installation and I have a question regarding how
> BM25 similarity is computed. Here is an example to describe my question:
>
> 1. create core `test_core`
>
> 2. add the following field and fieldType to `test_core/conf/managed-schema`:
>
> <field name="text" type="ngram_text" indexed="true" stored="true"
>        docValues="false" multiValued="false"/>
>
> <fieldType name="ngram_text" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
>             maxGramSize="7"/>
>   </analyzer>
> </fieldType>
>
> 3. restart Solr and add the following document:
>
> {
>   "id":1,
>   "text":"apples oranges"
> }
>
> 4. perform query: test_core/select?debugQuery=on&fl=*,score&q=text:apples
>
> 5. check bm25 calculation:
>
> 0.57919353 = weight(Synonym(text:ap text:app text:appl text:apple text:apples) in 0) [SchemaSimilarity], result of:
>   0.57919353 = score(doc=0,freq=5.0 = termFreq=5.0
> ), product of:
>     0.2876821 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
>       1.0 = docFreq
>       1.0 = docCount
>     2.0133111 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>       5.0 = termFreq=5.0
>       1.2 = parameter k1
>       0.75 = parameter b
>       11.0 = avgFieldLength
>       2.0 = fieldLength
>
> ------
>
> My question is: since I only have one document, shouldn't fieldLength
> and avgFieldLength have the same value? I notice that avgFieldLength uses
> the NGram filter while the field length doesn't.
>
> Shouldn't avgFieldLength be the average of fieldLength?
>
> Thank you
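
As a sanity check on the explain output quoted above, the numbers can be reproduced by hand. The following is a minimal Python sketch, not Lucene code: the helper `edge_ngrams` and the whitespace split are hypothetical stand-ins for EdgeNGramFilterFactory and StandardTokenizerFactory, and it assumes (as suggested in the reply) that fieldLength counts the 2 tokens before the n-gram filter while avgFieldLength counts all grams after it. The two formulas are copied from the explain output itself.

```python
import math


def edge_ngrams(token, min_gram=2, max_gram=7):
    """Hypothetical stand-in for the edge n-gram filter: prefixes of
    `token` with lengths min_gram..max_gram, capped at the token length."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


# Whitespace split is enough to mimic the tokenizer for this one document.
tokens = "apples oranges".split()
grams = [g for t in tokens for g in edge_ngrams(t)]

doc_count = 1
doc_freq = 1
k1, b = 1.2, 0.75

# Assumption: fieldLength is the token count before n-gram expansion.
field_length = len(tokens)
# Assumption: avgFieldLength is (total grams after the filter) / (number of docs).
avg_field_length = len(grams) / doc_count
# The query "apples" expands to its grams, and each one occurs once in the doc.
freq = len(edge_ngrams("apples"))

# Formulas as printed in the explain output.
idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
tf_norm = (freq * (k1 + 1)) / (
    freq + k1 * (1 - b + b * field_length / avg_field_length)
)
score = idf * tf_norm

print("fieldLength    =", field_length)
print("avgFieldLength =", avg_field_length)
print("idf            =", idf)
print("tfNorm         =", tf_norm)
print("score          =", score)
```

Under these assumptions the script gives fieldLength 2 and avgFieldLength 11.0, and idf, tfNorm and score agree with the explain output to about six decimal places (Solr reports single-precision floats, so the last digits can differ). This only shows that the reported values are internally consistent with the formula; it does not settle why the two lengths are counted differently.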