Hi,
I have 22 documents, which I index by posting them to Solr with
LWP::UserAgent. Every POST returns HTTP status 200 OK.
One of my documents (id=44) contains the word "Campeau" in the "ocr"
field, but according to Luke this term does not appear in the index.
Yet when I delete the index (delete by query *:*, or restart the server
after deleting /index) and index just document id=44, its ocr field data
does appear in the index according to Luke.
I also notice that numTerms is 5579 for all 22 documents but 2194 for
doc id=44 alone. It is hard to believe that the other 21 documents add
only 3385 terms between them.
Why/how could this be happening?
Thanks,
Phil
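
For readers who want to reproduce the indexing step described above, here is a minimal sketch in Python. It is not the Perl/LWP::UserAgent script used for the post. The update URL, the helper names build_add_xml and post_xml, and the extern_id and ocr values are assumptions made for illustration; only id=44 and the word "Campeau" come from the post.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr

# Assumed endpoint of a default Solr install; adjust to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(doc):
    """Build a Solr <add> message for one document (dict of field name -> value)."""
    fields = "".join(
        "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
        for name, value in doc.items()
    )
    return "<add><doc>%s</doc></add>" % fields


def post_xml(xml, url=SOLR_UPDATE_URL):
    """POST an XML update message and return (HTTP status, response body)."""
    request = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


# Hypothetical field values for illustration only.
doc = {"id": "44", "extern_id": "example-44", "ocr": "... Campeau ..."}
add_message = build_add_xml(doc)
# To actually index: post_xml(add_message), then post_xml("<commit/>").
```

A 200 response to the add only means the message was accepted; added documents become visible to searches and to Luke after a commit.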
---
My schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="extern_id" type="string" indexed="true" stored="true" required="true"/>
<field name="ocr" type="mytext" indexed="true" stored="false" required="true"/>
where "mytext" is
<fieldtype name="mytext" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
Luke output after indexing all 22 docs:
---------------------------------------
<lst name="index">
<int name="numDocs">22</int>
<int name="maxDoc">22</int>
<int name="numTerms">5579</int>
<long name="version">1196382086904</long>
<bool name="optimized">true</bool>
<bool name="current">true</bool>
<bool name="hasDeletions">false</bool>
<date name="lastModified">2007-11-30T00:22:06Z</date>
</lst>
<lst name="fields">
<lst name="ocr">
<str name="type">mytext</str>
<str name="schema">IT-----------</str>
<str name="index">(unstored field)</str>
<int name="docs">22</int>
<int name="distinct">5513</int>
<lst name="topTerms">
[...]
<int name="cally">22</int>
<int name="cam">22</int>
<int name="cammi">22</int> ??? <<< "campeau" expected after this entry, but it is missing
<int name="cams">22</int>
<int name="can">22</int>
Luke output after indexing just doc id=44:
------------------------------------------
<lst name="index">
<int name="numDocs">1</int>
<int name="maxDoc">1</int>
<int name="numTerms">2194</int>
<long name="version">1196381821086</long>
<bool name="optimized">true</bool>
<bool name="current">true</bool>
<bool name="hasDeletions">false</bool>
<date name="lastModified">2007-11-30T00:17:21Z</date>
</lst>
<lst name="fields">
<lst name="ocr">
<str name="type">mytext</str>
<str name="schema">IT-----------</str>
<str name="index">(unstored field)</str>
<int name="docs">1</int>
<int name="distinct">2191</int>
<lst name="topTerms">
[...]
<int name="called">1</int>
<int name="came">1</int>
<int name="camerons">1</int>
<int name="campeau">1</int> <<< "campeau" is present here
<int name="can">1</int>
<int name="canadian">1</int>
<int name="canal">1</int>
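
The check described above (is a given term listed for the "ocr" field or not?) can be automated instead of read off by eye. The following is a minimal Python sketch, assuming the Luke response has been saved as a complete, well-formed XML document with the same nesting as the excerpts above (a "fields" list containing a per-field list containing "topTerms"). The function name term_doc_freq and the SAMPLE document are hypothetical; SAMPLE is built from the values quoted for doc id=44. The sketch only inspects terms that are actually listed in the response, so it says nothing about terms the response omits.

```python
import xml.etree.ElementTree as ET


def term_doc_freq(luke_xml, field, term):
    """Return the count listed for `term` under `field`'s topTerms, or None if not listed."""
    root = ET.fromstring(luke_xml)
    top_terms = root.find(
        ".//lst[@name='fields']/lst[@name='%s']/lst[@name='topTerms']" % field
    )
    if top_terms is None:
        return None
    for entry in top_terms.findall("int"):
        if entry.get("name") == term:
            return int(entry.text)
    return None


# Hypothetical, trimmed-down response using the values from the id=44 run.
SAMPLE = """<response>
<lst name="fields">
<lst name="ocr">
<str name="type">mytext</str>
<int name="docs">1</int>
<int name="distinct">2191</int>
<lst name="topTerms">
<int name="called">1</int>
<int name="came">1</int>
<int name="camerons">1</int>
<int name="campeau">1</int>
<int name="can">1</int>
</lst>
</lst>
</lst>
</response>"""

# The schema lowercases tokens, so look up the lowercased form.
print(term_doc_freq(SAMPLE, "ocr", "Campeau".lower()))
```

Running the same function against the saved response for the 22-document index would show directly whether "campeau" is among the listed terms there.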