> I'm trying to index word-ngrams using the solr.ShingleFilterFactory,
> (storing their positions + offset)
> ...
> <fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
>   <analyzer type="index">
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.ShingleFilterFactory" minShingleSize="3"
>             maxShingleSize="5" outputUnigrams="false" tokenSeparator="_"/>
>   </analyzer>
> ...
> <field name="textengram" type="edge_ngram" indexed="true"
>        stored="true" multiValued="false" termVectors="true"
>        termPositions="true" termOffsets="true"/>
> ...
> i'm testing it with a (big?) html document, [1.300.000 chars], with lots of tags
> Looking at the index (using Schema browser web interface), i can see
> some ngrams were indexed (8939)
> but it appears that they were found only in the beginning of the
> document (first 1/8 of the document)
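It could be the maxFieldLength setting in solrconfig.xml. It caps the number of tokens indexed per field (10000 in the stock example configuration), and anything past that limit is silently dropped, which would explain why only the beginning of a large document shows up in the index. Set it to <maxFieldLength>2147483647</maxFieldLength> (Integer.MAX_VALUE) to effectively remove the limit.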
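
For reference, here is a rough sketch of where the setting usually sits. It assumes a solrconfig.xml laid out like the Solr 1.4 example file, with an <indexDefaults> section and possibly a second copy of the setting under <mainIndex>; those surrounding elements are an assumption about your file, so check them against what you actually have.

```xml
<!-- solrconfig.xml (sketch only; layout assumed from the Solr 1.4 example config) -->
<indexDefaults>
  <!-- ... other index settings left unchanged ... -->

  <!-- Maximum number of tokens indexed per field.
       2147483647 is Integer.MAX_VALUE, i.e. effectively no limit. -->
  <maxFieldLength>2147483647</maxFieldLength>
</indexDefaults>

<mainIndex>
  <!-- If your file repeats the setting here, raise it as well:
       values in this section take precedence over indexDefaults. -->
  <maxFieldLength>2147483647</maxFieldLength>
</mainIndex>
```

After changing the value, restart Solr and reindex the document, since the tokens that were cut off the first time were never written to the index.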