Hi,

I have 22 documents. I index them by posting them with LWP::UserAgent, and every request returns HTTP status 200 OK.

One of my documents (id=44) contains the word "Campeau" in the "ocr" field, but according to Luke this term does not appear in the index. Yet when I clear the index (either delete by query *:*, or delete /index and restart the server) and then index only document id=44, the term from its ocr field does appear in the index according to Luke.
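
For reference, the update messages involved can be sketched as below. This is only an illustration in Python rather than the Perl I actually use: the field names come from the schema further down, while the extern_id value and the ocr text are made-up placeholders, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_message(fields):
    """Build a Solr <add> update message for one document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


def delete_all_message():
    """Build the delete-by-query message that clears the whole index."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = "*:*"
    return ET.tostring(delete, encoding="unicode")


# Changes only become visible to searches (and to Luke) after a commit.
COMMIT_MESSAGE = "<commit/>"

# Placeholder values, not my real data.
doc_xml = add_message(
    {"id": "44", "extern_id": "example-44", "ocr": "Campeau & Sons"}
)
print(doc_xml)
print(delete_all_message())
```

Each string would be sent as the body of a POST to the update handler, followed by the commit message.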

I also notice that numTerms is 5579 for all 22 documents but 2194 for document id=44 alone. It is hard to believe that the other 21 documents add only 3385 terms.

Why/how could this be happening?

Thanks,

Phil

---

My schema.xml:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="extern_id" type="string" indexed="true" stored="true" required="true"/>
<field name="ocr" type="mytext" indexed="true" stored="false" required="true"/>

where "mytext" is

<fieldtype name="mytext" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
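
As a sanity check on what this chain should emit for "Campeau", here is a rough Python approximation of it. It is not Solr's code: the function name is hypothetical, and it ignores details of the real WordDelimiterFilter such as possessive handling, so treat it only as a sketch of the expected terms.

```python
import re


def analyze(text):
    """Rough stand-in for the "mytext" analyzer chain; not Solr's real code."""
    terms = []
    for token in text.split():  # WhitespaceTokenizerFactory
        # WordDelimiterFilterFactory with generateWordParts=1 and
        # generateNumberParts=1: keep runs of letters and runs of digits,
        # dropping punctuation. With splitOnCaseChange=0 a token such as
        # "CamPeau" is not split at the case change.
        parts = re.findall(r"[^\W\d_]+|\d+", token)
        terms.extend(part.lower() for part in parts)  # LowerCaseFilterFactory
    return terms


print(analyze("Robert Campeau, CamPeau wi-fi 1989"))
# → ['robert', 'campeau', 'campeau', 'wi', 'fi', '1989']
```

Under these assumptions "Campeau" should always be indexed as the single term "campeau", which matches what Luke shows when document id=44 is indexed alone.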

Indexing 22 docs:
-----------------

<lst name="index">
<int name="numDocs">22</int>
<int name="maxDoc">22</int>
<int name="numTerms">5579</int>
<long name="version">1196382086904</long>
<bool name="optimized">true</bool>
<bool name="current">true</bool>
<bool name="hasDeletions">false</bool>
<date name="lastModified">2007-11-30T00:22:06Z</date>
</lst>
<lst name="fields">
<lst name="ocr">
<str name="type">mytext</str>
<str name="schema">IT-----------</str>
<str name="index">(unstored field)</str>
<int name="docs">22</int>
<int name="distinct">5513</int>
<lst name="topTerms">
[...]
<int name="cally">22</int>
<int name="cam">22</int>
<int name="cammi">22</int>  ???<<<<<<<<<<<<<<<<
<int name="cams">22</int>
<int name="can">22</int>


Indexing just doc id=44:
------------------------

<lst name="index">
<int name="numDocs">1</int>
<int name="maxDoc">1</int>
<int name="numTerms">2194</int>
<long name="version">1196381821086</long>
<bool name="optimized">true</bool>
<bool name="current">true</bool>
<bool name="hasDeletions">false</bool>
<date name="lastModified">2007-11-30T00:17:21Z</date>
</lst>
<lst name="fields">
<lst name="ocr">
<str name="type">mytext</str>
<str name="schema">IT-----------</str>
<str name="index">(unstored field)</str>
<int name="docs">1</int>
<int name="distinct">2191</int>
<lst name="topTerms">
[...]
<int name="called">1</int>
<int name="came">1</int>
<int name="camerons">1</int>
<int name="campeau">1</int>  <<<<<<<<<<<<<<
<int name="can">1</int>
<int name="canadian">1</int>
<int name="canal">1</int>
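
To make the comparison concrete, the two topTerms excerpts above can be diffed with a short Python sketch. The XML fragments are trimmed copies of the excerpts, closed by hand so they parse; the helper name is hypothetical, and the comparison assumes both excerpts are contiguous alphabetical slices of the term list, as they appear to be.

```python
import xml.etree.ElementTree as ET


def top_terms(luke_xml):
    """Return {term: document frequency} from a Luke topTerms fragment."""
    root = ET.fromstring(luke_xml)
    return {el.get("name"): int(el.text) for el in root.iter("int")}


all_docs = top_terms("""<lst name="topTerms">
  <int name="cally">22</int><int name="cam">22</int>
  <int name="cammi">22</int><int name="cams">22</int>
  <int name="can">22</int>
</lst>""")

one_doc = top_terms("""<lst name="topTerms">
  <int name="called">1</int><int name="came">1</int>
  <int name="camerons">1</int><int name="campeau">1</int>
  <int name="can">1</int><int name="canadian">1</int>
  <int name="canal">1</int>
</lst>""")

# Only compare inside the alphabetical range the 22-document excerpt covers;
# terms outside that range may simply have been cut off by the "[...]".
low, high = min(all_docs), max(all_docs)
missing = sorted(
    term for term in one_doc
    if low <= term <= high and term not in all_docs
)
print(missing)  # → ['came', 'camerons', 'campeau']
```

So within the range "cally" to "can", three terms from document id=44 are absent from the 22-document index, not just "campeau".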


