I've just changed the stemming algorithm slightly and am running a few
tests against the old stemmer versus the new stemmer. I did a query for
'hanger', and using the old stemmer I get the following scoring for a
document with the title: Converter Hanger Assembly Replacement

6.4242806 = (MATCH) sum of:
  2.5697122 = (MATCH) max of:
    0.2439919 = (MATCH) weight(markup_t:hanger in 3454), product of:
      0.1963516 = queryWeight(markup_t:hanger), product of:
        6.5593724 = idf(docFreq=6375, numDocs=1655591)
        0.02993451 = queryNorm
      1.2426275 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
        1.7320508 = tf(termFreq(markup_t:hanger)=3)
        6.5593724 = idf(docFreq=6375, numDocs=1655591)
        0.109375 = fieldNorm(field=markup_t, doc=3454)
    2.5697122 = (MATCH) weight(title_t:hanger^2.0 in 3454), product of:
      0.5547002 = queryWeight(title_t:hanger^2.0), product of:
        2.0 = boost
        9.265229 = idf(docFreq=425, numDocs=1655591)
        0.02993451 = queryNorm
      4.6326146 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
        1.0 = tf(termFreq(title_t:hanger)=1)
        9.265229 = idf(docFreq=425, numDocs=1655591)
        0.5 = fieldNorm(field=title_t, doc=3454)
  3.8545685 = (MATCH) max of:
    0.12199595 = (MATCH) weight(markup_t:hanger^0.5 in 3454), product of:
      0.0981758 = queryWeight(markup_t:hanger^0.5), product of:
        0.5 = boost
        6.5593724 = idf(docFreq=6375, numDocs=1655591)
        0.02993451 = queryNorm
      1.2426275 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
        1.7320508 = tf(termFreq(markup_t:hanger)=3)
        6.5593724 = idf(docFreq=6375, numDocs=1655591)
        0.109375 = fieldNorm(field=markup_t, doc=3454)
    3.8545685 = (MATCH) weight(title_t:hanger^3.0 in 3454), product of:
      0.8320503 = queryWeight(title_t:hanger^3.0), product of:
        3.0 = boost
        9.265229 = idf(docFreq=425, numDocs=1655591)
        0.02993451 = queryNorm
      4.6326146 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
        1.0 = tf(termFreq(title_t:hanger)=1)
        9.265229 = idf(docFreq=425, numDocs=1655591)
        0.5 = fieldNorm(field=title_t, doc=3454)

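
For reference, the title_t:hanger^2.0 clause in the explanation above can be re-derived by hand. This is only a sketch: it assumes the classic DefaultSimilarity formulas (tf = sqrt(termFreq), idf = 1 + ln(numDocs / (docFreq + 1))), and the variable names are illustrative rather than anything Lucene exposes.

```python
import math

# Values copied from the explain output (old stemmer, title_t:hanger^2.0).
num_docs = 1655591
doc_freq = 425
query_norm = 0.02993451
boost = 2.0
term_freq = 1
field_norm = 0.5

# Assumed DefaultSimilarity formulas (classic Lucene TF-IDF).
idf = 1.0 + math.log(num_docs / (doc_freq + 1))
tf = math.sqrt(term_freq)

# queryWeight = boost * idf * queryNorm
query_weight = boost * idf * query_norm
# fieldWeight = tf * idf * fieldNorm
field_weight = tf * idf * field_norm
# The clause weight is the product of the two.
weight = query_weight * field_weight

print(f"idf={idf:.6f} queryWeight={query_weight:.7f} "
      f"fieldWeight={field_weight:.7f} weight={weight:.7f}")
```

The printed values should land close to the 9.265229, 0.5547002, 4.6326146 and 2.5697122 reported above, up to single-precision rounding in the explain output.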
Using the new stemmer I get:

5.621245 = (MATCH) sum of:
  2.248498 = (MATCH) max of:
    0.24399184 = (MATCH) weight(markup_t:hanger in 3454), product of:
      0.19635157 = queryWeight(markup_t:hanger), product of:
        6.559371 = idf(docFreq=6375, numDocs=1655589)
        0.029934512 = queryNorm
      1.2426274 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
        1.7320508 = tf(termFreq(markup_t:hanger)=3)
        6.559371 = idf(docFreq=6375, numDocs=1655589)
        0.109375 = fieldNorm(field=markup_t, doc=3454)
    2.248498 = (MATCH) weight(title_t:hanger^2.0 in 3454), product of:
      0.5547002 = queryWeight(title_t:hanger^2.0), product of:
        2.0 = boost
        9.265228 = idf(docFreq=425, numDocs=1655589)
        0.029934512 = queryNorm
      4.0535374 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
        1.0 = tf(termFreq(title_t:hanger)=1)
        9.265228 = idf(docFreq=425, numDocs=1655589)
        0.4375 = fieldNorm(field=title_t, doc=3454)
  3.372747 = (MATCH) max of:
    0.12199592 = (MATCH) weight(markup_t:hanger^0.5 in 3454), product of:
      0.09817579 = queryWeight(markup_t:hanger^0.5), product of:
        0.5 = boost
        6.559371 = idf(docFreq=6375, numDocs=1655589)
        0.029934512 = queryNorm
      1.2426274 = (MATCH) fieldWeight(markup_t:hanger in 3454), product of:
        1.7320508 = tf(termFreq(markup_t:hanger)=3)
        6.559371 = idf(docFreq=6375, numDocs=1655589)
        0.109375 = fieldNorm(field=markup_t, doc=3454)
    3.372747 = (MATCH) weight(title_t:hanger^3.0 in 3454), product of:
      0.83205026 = queryWeight(title_t:hanger^3.0), product of:
        3.0 = boost
        9.265228 = idf(docFreq=425, numDocs=1655589)
        0.029934512 = queryNorm
      4.0535374 = (MATCH) fieldWeight(title_t:hanger in 3454), product of:
        1.0 = tf(termFreq(title_t:hanger)=1)
        9.265228 = idf(docFreq=425, numDocs=1655589)
        0.4375 = fieldNorm(field=title_t, doc=3454)

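
A quick sanity check on the two explanations: in both, each "max of" is won by the title_t clause, and each title_t weight is linear in its fieldNorm. If nothing else changed beyond rounding, the total score should scale by the same ratio as the title_t fieldNorm. The sketch below just compares the two ratios using numbers copied from the output; the variable names are illustrative.

```python
# Top-level scores and title_t fieldNorm values copied from the two explanations.
old_score = 6.4242806
new_score = 5.621245
old_title_norm = 0.5
new_title_norm = 0.4375

# Ratio of the overall scores versus ratio of the title_t fieldNorms.
score_ratio = new_score / old_score
norm_ratio = new_title_norm / old_title_norm

print(f"score ratio = {score_ratio:.6f}")
print(f"norm ratio  = {norm_ratio:.6f}")
```

The two ratios agree to within rounding, so the title_t fieldNorm appears to account for the whole difference in score.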

What perplexes me is that the fieldNorm for the title_t field differs
between the two explanations. With the old stemmer it is
0.5 = fieldNorm(field=title_t, doc=3454); with the new stemmer it is
0.4375 = fieldNorm(field=title_t, doc=3454). I ran the title through
both stemmers and got the same number of tokens from each. I do no
index-time boosting on the title_t field, and I am using
DefaultSimilarity in both instances. So I figured the calculated
fieldNorm would be:

field boost * lengthNorm = 1 * 1/sqrt(4) = 0.5
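
A minimal sketch of that expectation follows. It assumes DefaultSimilarity's lengthNorm of 1/sqrt(numTokens) and, as a further assumption, that the norm is stored in a single byte with roughly three mantissa bits, so the stored value is truncated to a coarse grid. The helper names are hypothetical, and the truncation is an approximation of Lucene's encoding rather than its actual code.

```python
import math

def length_norm(num_tokens):
    """Assumed DefaultSimilarity lengthNorm: 1 / sqrt(number of tokens)."""
    return 1.0 / math.sqrt(num_tokens)

def truncate_to_3_mantissa_bits(x):
    """Approximate a one-byte norm: keep the leading bit plus 3 mantissa bits."""
    # frexp gives x = m * 2**e with m in [0.5, 1).
    m, e = math.frexp(x)
    # Keep four significant binary digits of m and drop the rest (truncate).
    return math.floor(m * 16) / 16 * 2.0 ** e

field_boost = 1.0
for num_tokens in (4, 5):
    raw = field_boost * length_norm(num_tokens)
    stored = truncate_to_3_mantissa_bits(raw)
    print(f"tokens={num_tokens} raw={raw:.6f} stored={stored}")
```

Under these assumptions four tokens give 0.5, while five tokens give about 0.4472, which truncates to 0.4375. That would be consistent with one extra token being counted for the title at index time, but that is an inference from the arithmetic, not something the explain output confirms.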

I wouldn't have thought that changing the stemmer would have any impact
on the fieldNorm in this case. Any insight? Please kick me over to the
Lucene list if you feel this isn't appropriate here.

Regards
Brendan