Re: [MASSMAIL]Re: High fieldNorm values causing really odd results

Jorge Luis Betancourt González Thu, 14 May 2015 20:50:13 -0700

Regarding the experiment, sorry if I explained myself poorly: the indexed 
document doesn't have 119669 terms. It has far fewer (less than 1000; I don't 
have the exact number at hand). Instead, 119669 is the number of distinct 
terms reported by Luke (the top-terms total in the admin interface) for the 
title field.


This index was built from scratch using 4.10.3, if I remember correctly. 
Perhaps part of the data was indexed using 4.10.2, but we updated our box 
quite some time ago and this problem didn't appear until recently. The 
strangest part is that this was working fine until a week or so ago. The only 
odd thing I found is that the root partition on our Solr box ran out of 
space. We have Solr deployed in Tomcat, which is installed on the root 
partition, but the cores and all Solr-related data are stored on a separate 
partition mounted at /opt with plenty of room to grow. Could this be the 
cause of this behavior?

We're thinking about rebuilding our index, but would love to avoid it if 
possible and, more importantly, find the root cause of this issue (if that is 
possible at all).

As I said before, I'm very grateful for your responses,

----- Original Message -----
From: "Chris Hostetter" <hossman_luc...@fucit.org>
To: solr-user@lucene.apache.org
Sent: Thursday, May 14, 2015 7:11:08 PM
Subject: Re: [MASSMAIL]Re: High fieldNorm values causing really odd results


: Sorry for leaving the Solr version out in my previous email, I'm using 
: Solr 4.10.3 running on Centos7, with the following JRE: Oracle 
: Corporation OpenJDK 64-Bit Server VM (1.7.0_75 24.75-b04)

I can't reproduce using Solr 4.10.3 (or 4.10.4; I misread your email the 
first time)

Are you certain you didn't *build* this index with a different Similarity 
configured? or did you perhaps build it with an older version of Solr that 
might have had a bug in it?

Here's what i tried...

applied this patch to the example configs based on the fieldType you 
specified...

hossman@tray:~/lucene/lucene_solr_4_10_3_tag$ svn diff
Index: solr/example/solr/collection1/conf/schema.xml
===================================================================
--- solr/example/solr/collection1/conf/schema.xml       (revision 1679472)
+++ solr/example/solr/collection1/conf/schema.xml       (working copy)
@@ -46,6 +46,21 @@
 -->
 
 <schema name="example" version="1.5">
+
+        <fieldType name="hoss_type" class="solr.TextField" sortMissingLast="true">
+            <analyzer>
+                <charFilter class="solr.HTMLStripCharFilterFactory"/>
+                <tokenizer class="solr.StandardTokenizerFactory"/>
+                <filter class="solr.ASCIIFoldingFilterFactory"/>
+                <filter class="solr.StopFilterFactory"
+                    ignoreCase="true" words="stopwords.txt"/>
+                <filter class="solr.LowerCaseFilterFactory"/>
+                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
+            </analyzer>
+        </fieldType>
+
+        <field name="hoss_test" type="hoss_type" stored="true" indexed="true" multiValued="true"/>
+  
   <!-- attribute "name" is the name of this schema and is only used for display purposes.
        version="x.y" is Solr's version number for the schema syntax and 
        semantics.  It should not normally be changed by applications.

...started up "java -jar start.jar" and then wrote & ran this script to 
generate a doc with the number of unique terms in my field that you mentioned & 
indexed it...

hossman@tray:~/tmp$ cat make-big-field.pl
#!/usr/bin/perl

print qq{<add><doc><field name="id">hoss</field><field name="hoss_test">\n};
for (1..119669) {
    print "term${_} ";
}
print qq{</field></doc></add>\n};
hossman@tray:~/tmp$ perl make-big-field.pl > tmp.xml
hossman@tray:~/tmp$ curl -X POST -H 'Content-Type: application/xml' --data-binary @tmp.xml "http://localhost:8983/solr/collection1/update?commit=true"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int 
name="QTime">877</int></lst>
</response>


Then confirmed i got a very small fieldNorm when querying against this 
field...

hossman@tray:~/tmp$ curl 'http://localhost:8983/solr/collection1/select?q=hoss_test:term1&debug=results&wt=json&indent=true&fl=id&omitHeader=true'
{
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"hoss"}]
  },
  "debug":{
    "explain":{
      "hoss":"\n7.491524E-4 = (MATCH) weight(hoss_test:term1 in 0) 
[DefaultSimilarity], result of:\n  7.491524E-4 = fieldWeight in 0, product 
of:\n    1.0 = tf(freq=1.0), with freq of:\n      1.0 = termFreq=1.0\n    
0.30685282 = idf(docFreq=1, maxDocs=1)\n    0.0024414062 = 
fieldNorm(doc=0)\n"}}}
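
For anyone wanting to sanity-check the numbers in that explain output, here 
is a rough Python sketch of how the default similarity appears to arrive at 
them. It assumes fieldNorm is 1/sqrt(number of tokens in the field for that 
document) with an index-time boost of 1.0, squeezed into a single lossy byte, 
and that idf is 1 + ln(maxDocs / (docFreq + 1)). The function names are made 
up for illustration, and the byte encoding is a port from memory of Lucene's 
SmallFloat "315" routines, so treat it as an approximation rather than the 
actual implementation.

```python
import math
import struct

FZERO = (63 - 15) << 3  # offset used by the assumed "315" byte encoding


def float_to_byte315(f):
    """Lossy float -> byte encoding (assumed port of the Lucene routine)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # float32 bit pattern
    small = bits >> 21  # drop the 21 low-order mantissa bits
    if small <= FZERO:
        return 0 if bits <= 0 else 1  # zero/negative -> 0, underflow -> 1
    if small >= FZERO + 0x100:
        return 255  # overflow saturates at the largest byte
    return small - FZERO


def byte315_to_float(b):
    """Decode the byte back to the float that shows up as fieldNorm."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def field_norm(num_terms):
    """fieldNorm for a field with num_terms tokens (boost assumed 1.0)."""
    raw = 1.0 / math.sqrt(num_terms)
    return byte315_to_float(float_to_byte315(raw))


def idf(doc_freq, max_docs):
    """idf as assumed for the default similarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


if __name__ == "__main__":
    norm = field_norm(119669)
    print("fieldNorm:", norm)
    print("idf:", idf(1, 1))
    print("score:", 1.0 * idf(1, 1) * norm)  # tf * idf * fieldNorm
```

Under those assumptions, field_norm(119669) comes out as 0.00244140625, which 
lines up with the 0.0024414062 reported above. The norm depends only on the 
token count of the field in the matching document, not on the number of 
distinct terms across the whole index.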


-Hoss
http://www.lucidworks.com/
