ShingleFilterFactory not indexing the whole doc, where is the limit ?

Pierre JdlF Tue, 31 Jan 2012 07:01:02 -0800

I'm trying to index word n-grams using solr.ShingleFilterFactory
(storing their positions and offsets):
...
    <fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ShingleFilterFactory" minShingleSize="3"
                maxShingleSize="5" outputUnigrams="false" tokenSeparator="_"/>
      </analyzer>
...
<field name="textengram" type="edge_ngram" indexed="true" stored="true"
       multiValued="false" termVectors="true" termPositions="true"
       termOffsets="true"/>
...
I'm testing it with a (big?) HTML document of about 1,300,000 characters, with lots of tags.
Looking at the index (using the Schema Browser web interface), I can see
that some n-grams were indexed (8939),
but they all appear to come from the beginning of the
document (roughly the first 1/8 of it).


Other fields index the whole document without problems,
so I was wondering whether solr.ShingleFilterFactory has a limit:
- on the maximum amount of text it can handle?
- on the maximum number of n-grams it produces?

Note that if I try lower values such as minShingleSize="2" maxShingleSize="3",
I get 6465 n-grams (corresponding to the first 1/5 of the document).
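
As a rough way to sanity-check counts like these, the sketch below models in plain Python what a shingle filter with outputUnigrams="false" would be expected to emit for a list of whitespace tokens. It is not Lucene's implementation: the helper name `shingles` and the sample sentence are made up for illustration, and it counts every emitted shingle, whereas the Schema Browser figure may well be a count of distinct terms.

```python
def shingles(tokens, min_size=3, max_size=5, sep="_"):
    """Return word n-grams of min_size..max_size tokens, joined by sep.

    Simplified model of a shingle filter with unigrams disabled
    (hypothetical helper, not the Solr/Lucene code).
    """
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            # Only emit a shingle if enough tokens remain from this position.
            if start + size <= len(tokens):
                out.append(sep.join(tokens[start:start + size]))
    return out


# Made-up sample text, lowercased and split on whitespace.
words = "the quick brown fox jumps over".split()

wide = shingles(words, 3, 5)    # sizes 3, 4 and 5
narrow = shingles(words, 2, 3)  # sizes 2 and 3

print(len(wide), wide[:3])
print(len(narrow), narrow[:3])
```

In this model, N tokens yield (N-2) + (N-3) + (N-4) = 3N - 9 shingles for sizes 3 to 5 (for N >= 5) and (N-1) + (N-2) = 2N - 3 shingles for sizes 2 to 3, so a fully indexed document should produce roughly two to three shingles per word.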

I thought the sky was the limit!
Any ideas?

-- 
+ Pierre
