I'm trying to index word n-grams using solr.ShingleFilterFactory (storing their positions and offsets):

    ...
    <fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="3" maxShingleSize="5" outputUnigrams="false" tokenSeparator="_"/>
      </analyzer>
    ...
    <field name="textengram" type="edge_ngram" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>
    ...

I'm testing it with a (big?) HTML document of about 1,300,000 characters, with lots of tags.

Looking at the index (using the Schema Browser web interface), I can see that some n-grams were indexed (8939), but they all appear to come from the beginning of the document (roughly the first 1/8 of it).

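For reference, this is what I understand these settings should produce. The following is only a rough Python sketch of word shingling with minShingleSize=3, maxShingleSize=5, outputUnigrams=false and tokenSeparator="_"; it is not Solr's code, and the function name `shingles` and the sample sentence are made up for illustration.

```python
def shingles(tokens, min_size=3, max_size=5, sep="_"):
    """Return word shingles of min_size..max_size tokens, joined by sep.

    Illustrative only: mimics the idea of a shingle filter with
    outputUnigrams=false, not Solr's actual implementation.
    """
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            # Only emit shingles that fit entirely inside the token list.
            if start + size <= len(tokens):
                out.append(sep.join(tokens[start:start + size]))
    return out


# Whitespace tokenization plus lowercasing, as in the analyzer chain above.
tokens = "The quick brown fox jumps over".lower().split()
result = shingles(tokens)
for shingle in result:
    print(shingle)
```

With shingling like this, the number of n-grams should keep growing with the number of tokens, which is why I expected them to cover the whole document.
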
Other fields index the whole document without problems, so I was wondering whether solr.ShingleFilterFactory has a limit:
- on the maximum amount of text it can handle?
- on the maximum number of n-grams it can produce?

Note that if I try with lower values, such as minShingleSize="2" maxShingleSize="3", I get 6465 n-grams (corresponding to the first 1/5 of the document).

I thought the sky was the limit!

Any ideas?

--
+ Pierre