On closer review, I notice that the fieldNorm is what is killing document A. If I reindex with omitNorms=true, will this problem be "solved"?
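For reference, here is a rough sketch of the arithmetic behind the two explain entries quoted below. It only multiplies the three factors that the explain output lists (tf, idf, fieldNorm). The function name `field_weight` is made up for illustration, and the last line assumes that a field indexed with omitNorms="true" contributes a norm of 1.0, which is my understanding rather than something the explain output confirms.

```python
import math

def field_weight(tf: float, idf: float, field_norm: float) -> float:
    """Product shown in the explain output: tf * idf * fieldNorm."""
    return tf * idf * field_norm

# Factors copied from the explain output for combined:cat.
# The explain reports tf as the square root of termFreq (sqrt(5) = 2.236068).
score_b = field_weight(tf=math.sqrt(1), idf=4.0687284, field_norm=1.0)
score_a = field_weight(tf=math.sqrt(5), idf=4.0590663, field_norm=0.09375)

# Assumption: with omitNorms="true" the norm factor becomes 1.0 for every document.
score_a_no_norms = field_weight(tf=math.sqrt(5), idf=4.0590663, field_norm=1.0)

print(f"B: {score_b:.7f}")
print(f"A: {score_a:.7f}")
print(f"A without norms (assumed): {score_a_no_norms:.7f}")
```

If that assumption holds, document A would outrank document B on this query, at the cost of losing length normalization for the combined field everywhere.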
On Wed, Feb 2, 2011 at 4:54 PM, Martin J <martinj.eng...@gmail.com> wrote:
> Hi, I'm having a weirdness with indexing multiple terms to a single field
> using a copyField. An example:
>
> For document A
> field:contents_1 is a multivalued field containing "cat", "dog" and "duck"
> field:contents_2 is a multivalued field containing "cat", "horse", and
> "flower"
>
> For document B
> field:contents_1 is a multivalued field containing "cat" and "fish"
> field:contents_2 is a multivalued field containing "bear" and "turkey"
>
> I have a copyField in my schema:
>
> <copyField source="contents_*" dest="combined"/>
>
> A query like contents_1:cat contents_2:cat returns document A first, and
> then document B. I think that is the way it should work.
>
> But a query like combined:cat returns document B first. In my mind, when I
> am doing a copyField I am copying each of the terms in the multivalued
> fields of contents_1 and contents_2 into combined, so that combined
> internally has "cat", "dog", "duck", "cat", "horse", "flower" for document
> A.
>
> An explain on the query says something like (this is from a real query not
> the fake one above)
>
> <lst name="explain">
> <str name="B">
> 4.0687284 = (MATCH) fieldWeight(combined:cat in 1663089), product of:
>   1.0 = tf(termFreq(combined:cat)=1)
>   4.0687284 = idf(docFreq=135688, maxDocs=2919285)
>   1.0 = fieldNorm(field=combined, doc=1663089)
> </str>
> <str name="A">
> 0.8509077 = (MATCH) fieldWeight(combined:cat in 913171), product of:
>   2.236068 = tf(termFreq(combined:cat)=5)
>   4.0590663 = idf(docFreq=143689, maxDocs=3061697)
>   0.09375 = fieldNorm(field=combined, doc=913171)
> </str>
>
> If I am reading this right, it is finding the higher TF in A (5 in this
> case) but still scoring B higher. Shouldn't idf be exactly the same?
>
> (Both fields are a solr.TextField:
>
> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StandardFilterFactory"/>
>     <filter class="solr.ISOLatin1AccentFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>             ignoreCase="true"/>
>     <filter class="solr.EnglishPorterFilterFactory"
>             protected="protwords.txt"/>
>   </analyzer>
> </fieldtype>
> )
>
> Another piece of perhaps relevant information is that this is a query over 16
> shards using distributed solr.
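On the idf question: the two explain entries report different docFreq and maxDocs values, so the idf cannot come out the same. A small sketch, assuming the classic Lucene formula idf = 1 + ln(maxDocs / (docFreq + 1)), reproduces both reported values. The function name `classic_idf` is hypothetical, and the formula is my assumption about which similarity is in use; the close match with the explain numbers suggests it applies here.

```python
import math

def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Assumed classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# docFreq and maxDocs copied from the explain entries for documents B and A.
idf_b = classic_idf(doc_freq=135688, max_docs=2919285)
idf_a = classic_idf(doc_freq=143689, max_docs=3061697)

print(f"idf for B: {idf_b:.5f} (explain reports 4.0687284)")
print(f"idf for A: {idf_a:.5f} (explain reports 4.0590663)")
```

Different statistics for the same term would be consistent with A and B living on different shards, each computing idf from its own local index. The gap between the two idf values is small, though; the fieldNorm of 0.09375 accounts for nearly all of the difference in score.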