On 10/24/2013 2:11 AM, Raheel Hasan wrote:
> ok. see this:
> http://s23.postimg.org/yck2s5k1n/html_indexing.png
A recap. You said your index analysis chain is this:

HTMLStripCharFilterFactory
WhitespaceTokenizerFactory (create tokens)
StopFilterFactory
WordDelimiterFilterFactory
ICUFoldingFilterFactory
PorterStemFilterFactory
RemoveDuplicatesTokenFilterFactory
LengthFilterFactory

Your picture says you have 1 document, and this field contains 1036 terms. The numbers in the picture are likely numbers that appear in your HTML document. You never showed us the input document.

It is likely that the whitespace tokenizer and/or the WordDelimiterFilter are producing these numbers as standalone tokens. The tokenizer is pretty easy to understand - it splits on whitespace. Please see the following to learn what the options for WordDelimiterFilterFactory do:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Thanks,
Shawn
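
P.S. To illustrate how numbers can end up as standalone terms, here is a rough Python sketch. It is not Solr code and does not reproduce the real filters. It only assumes that the tokenizer splits on whitespace and that the word delimiter step splits each token on non-alphanumeric characters and on letter/digit boundaries, which is a simplification of what WordDelimiterFilterFactory does with typical options. The sample text and the function names are made up for the example.

```python
import re


def whitespace_tokenize(text):
    """Split text on runs of whitespace (stand-in for a whitespace tokenizer)."""
    return text.split()


def split_word_parts(token):
    """Simplified stand-in for a word delimiter step.

    Keeps runs of ASCII letters and runs of digits as separate parts and
    drops everything else. The real filter has many more options
    (case-change splitting, catenation, preserving the original token).
    """
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


# Hypothetical text left over after the HTML markup has been stripped.
sample = "Model X200 costs 1,999 in 2013"

tokens = whitespace_tokenize(sample)
terms = [part for token in tokens for part in split_word_parts(token)]
numeric_terms = [term for term in terms if term.isdigit()]

print(tokens)         # tokens after whitespace splitting
print(terms)          # parts after the simplified delimiter split
print(numeric_terms)  # parts that are purely numeric
```

In this sketch "X200" and "1,999" each survive whitespace tokenization as one token, and it is the delimiter split that turns their digits into separate numeric terms. Whether your index behaves the same way depends on the options set on your WordDelimiterFilterFactory.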