Re: ShingleFilterFactory not indexing the whole doc, where is the limit ?

Ahmet Arslan Tue, 31 Jan 2012 08:24:13 -0800

> I'm trying to index word n-grams using solr.ShingleFilterFactory
> (storing their positions + offsets)
> ...
>     <fieldType name="edge_ngram" class="solr.TextField"
>                positionIncrementGap="1">
>       <analyzer type="index">
>         <charFilter class="solr.HTMLStripCharFilterFactory"/>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.ShingleFilterFactory" minShingleSize="3"
>                 maxShingleSize="5" outputUnigrams="false"
>                 tokenSeparator="_"/>
>       </analyzer>
> ...
>     <field name="textengram" type="edge_ngram" indexed="true"
>            stored="true" multiValued="false" termVectors="true"
>            termPositions="true" termOffsets="true"/>
> ...
> I'm testing it with a (big?) HTML document [1,300,000 chars] with lots
> of tags. Looking at the index (using the Schema Browser web interface),
> I can see that some ngrams were indexed (8939), but it appears that they
> were found only in the beginning of the document (the first 1/8 of it).


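It could be the maxFieldLength setting in solrconfig.xml. It caps the
number of tokens indexed per field (the example config ships with 10000),
and tokens beyond that limit are not indexed. Set it to
<maxFieldLength>2147483647</maxFieldLength>
and re-index the document.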
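
For reference, a rough sketch of where the setting usually lives, assuming
a Solr 3.x-style solrconfig.xml (section names and layout may differ in
your version, so treat this as illustrative rather than exact). The element
may appear in both <indexDefaults> and <mainIndex>, so it is worth checking
both places:

```xml
<!-- solrconfig.xml (fragment; all other settings omitted) -->
<indexDefaults>
  <!-- Maximum number of tokens indexed per field.
       2147483647 is Integer.MAX_VALUE, i.e. effectively no limit. -->
  <maxFieldLength>2147483647</maxFieldLength>
</indexDefaults>

<mainIndex>
  <!-- If the element is also present here, this value is the one that
       applies to the main index, so keep the two in sync. -->
  <maxFieldLength>2147483647</maxFieldLength>
</mainIndex>
```

After changing the value, restart Solr (or reload the core) and re-index
the document; already-indexed documents are not updated retroactively.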
