Hi, I'm seeing some odd behavior when indexing multiple terms into a single
field using a copyField. An example:
For document A
field:contents_1 is a multivalued field containing "cat", "dog" and "duck"
field:contents_2 is a multivalued field containing "cat", "horse", and
"flower"
For document B
field:contents_1 is a multivalued field containing "cat" and "fish"
field:contents_2 is a multivalued field containing "bear" and "turkey"
I have a copyField in my schema:
<copyField source="contents_*" dest="combined"/>
A query like contents_1:cat contents_2:cat returns document A first, and
then document B. I think that is the way it should work.
But a query like combined:cat returns document B first. In my mind, when I
am doing a copyField I am copying each of the terms in the multivalued
fields of contents_1 and contents_2 into combined, so that combined
internally has "cat", "dog", "duck", "cat", "horse", "flower" for document
A.
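To make that mental model concrete, here is a rough Python sketch of what I assume copyField does with the fake documents above. This is not Solr code: `apply_copy_field` is a made-up helper, and it ignores analysis (tokenizing, stemming, stopwords) entirely.

```python
from fnmatch import fnmatch


def apply_copy_field(doc, source_pattern, dest):
    """Hypothetical helper: append every value of each field whose name
    matches source_pattern to a new dest field, leaving the sources intact."""
    copied = []
    for name, values in doc.items():
        if fnmatch(name, source_pattern):
            copied.extend(values)
    result = dict(doc)
    result[dest] = copied
    return result


# The two fake documents from the example above.
doc_a = {
    "contents_1": ["cat", "dog", "duck"],
    "contents_2": ["cat", "horse", "flower"],
}
doc_b = {
    "contents_1": ["cat", "fish"],
    "contents_2": ["bear", "turkey"],
}

indexed_a = apply_copy_field(doc_a, "contents_*", "combined")
indexed_b = apply_copy_field(doc_b, "contents_*", "combined")

# Raw term frequency of "cat" in the combined field of each document.
print(indexed_a["combined"], indexed_a["combined"].count("cat"))
print(indexed_b["combined"], indexed_b["combined"].count("cat"))
```

Under that assumption, "cat" occurs twice in combined for document A and once for document B, which is why I expected A to rank first.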
An explain on the query says something like the following (this is from a
real query, not the fake one above):
<lst name="explain">
<str name="B">
4.0687284 = (MATCH) fieldWeight(combined:cat in 1663089), product of:
  1.0 = tf(termFreq(combined:cat)=1)
  4.0687284 = idf(docFreq=135688, maxDocs=2919285)
  1.0 = fieldNorm(field=combined, doc=1663089)
</str>
<str name="A">
0.8509077 = (MATCH) fieldWeight(combined:cat in 913171), product of:
  2.236068 = tf(termFreq(combined:cat)=5)
  4.0590663 = idf(docFreq=143689, maxDocs=3061697)
  0.09375 = fieldNorm(field=combined, doc=913171)
</str>
If I am reading this right, it is finding the higher TF in A (5 in this
case) but still scoring B higher. Shouldn't idf be exactly the same?
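To double-check my reading of the explain output, here is a small Python sketch that recomputes both scores. It assumes Lucene's classic DefaultSimilarity formulas, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes fieldNorm straight from the explain output rather than deriving it. The function names are my own.

```python
import math


def tf(term_freq):
    # Assumed classic Lucene tf: square root of the raw term frequency.
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    # Assumed classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(term_freq, doc_freq, max_docs, field_norm):
    # fieldWeight is reported as the product tf * idf * fieldNorm.
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm


# Inputs copied from the explain entries for documents B and A.
score_b = field_weight(term_freq=1, doc_freq=135688, max_docs=2919285, field_norm=1.0)
score_a = field_weight(term_freq=5, doc_freq=143689, max_docs=3061697, field_norm=0.09375)

print("B:", score_b)
print("A:", score_a)
```

With these inputs the sketch agrees with the explain values to within rounding. The two entries report different docFreq and maxDocs, which is where the small idf difference comes from, but most of the gap between the scores comes from fieldNorm (1.0 for B versus 0.09375 for A).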
(Both fields use the same solr.TextField type:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
)
Another piece of perhaps relevant information is that this is a query over 16
shards using distributed Solr.