Re: Issue with large html indexing

Shawn Heisey Thu, 24 Oct 2013 07:22:59 -0700

On 10/24/2013 2:11 AM, Raheel Hasan wrote:
> ok. see this:
> http://s23.postimg.org/yck2s5k1n/html_indexing.png


A recap.  You said your index analysis chain is this:

HTMLStripCharFilterFactory
WhitespaceTokenizerFactory (create tokens)
StopFilterFactory
WordDelimiterFilterFactory
ICUFoldingFilterFactory
PorterStemFilterFactory
RemoveDuplicatesTokenFilterFactory
LengthFilterFactory

Your picture says you have 1 document, and this field contains 1036
terms. The numeric terms are most likely numbers that appear in your
HTML document.  You never showed us the input document, so I can't
confirm that.  It is likely that the whitespace tokenizer and/or the
WordDelimiter filter are producing these numbers as standalone tokens.
The tokenizer is pretty easy to understand - it splits on whitespace.
Please see the following to know what the options for
WordDelimiterFilterFactory will do:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
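
To illustrate how numbers can end up as standalone tokens, here is a
rough Python sketch.  It is not Solr code and does not reproduce the
real filters exactly.  It assumes the char filter drops tags (and
their attributes), the tokenizer splits on whitespace, and the word
delimiter step splits on non-alphanumeric characters and on
letter/digit transitions, which is roughly what the default
WordDelimiter options do.  The function names and the sample input are
made up for the example.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and discards tags (rough stand-in for HTML stripping)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    # Hypothetical helper: keep only the text between tags.
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)


def whitespace_tokenize(text):
    # Split on runs of whitespace, nothing else.
    return text.split()


def split_word_parts(token):
    # Assumption: approximate the word delimiter step by breaking on
    # non-alphanumeric characters and on letter/digit transitions.
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def analyze(markup):
    tokens = []
    for tok in whitespace_tokenize(strip_html(markup)):
        tokens.extend(split_word_parts(tok))
    return tokens


# Made-up input, since the real document was never posted.
sample = '<td width="120">Item-42</td> <td>rev2013</td> <p>10/24/2013</p>'
print(analyze(sample))
# ['Item', '42', 'rev', '2013', '10', '24', '2013']
```

In this sketch a single date such as 10/24/2013 turns into three
numeric tokens, and "rev2013" contributes another one.  If your html
has a lot of values like that, a large count of number-only terms is
what I would expect to see.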

Thanks,
Shawn
