ok. see this: http://s23.postimg.org/yck2s5k1n/html_indexing.png

On Wed, Oct 23, 2013 at 10:45 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Attachments and images are often eaten by the mail server, your image is
> not visible at least to me. Can you describe what you're seeing? Or post
> the image somewhere and provide a link?
>
> Best,
> Erick
>
>
> On Wed, Oct 23, 2013 at 11:07 AM, Raheel Hasan <raheelhasan....@gmail.com> wrote:
>
> > Hi,
> >
> > I have an issue here while indexing large html. Here is the configuration
> > for that:
> >
> > 1) Data is imported via URLDataSource / PlainTextEntityProcessor (DIH)
> >
> > 2) Schema has this for the field:
> > type="text_en_splitting" indexed="true" stored="false" required="false"
> >
> > 3) text_en_splitting has the following work done for indexing:
> > HTMLStripCharFilterFactory
> > WhitespaceTokenizerFactory (create tokens)
> > StopFilterFactory
> > WordDelimiterFilterFactory
> > ICUFoldingFilterFactory
> > PorterStemFilterFactory
> > RemoveDuplicatesTokenFilterFactory
> > LengthFilterFactory
> >
> > However, the indexed data is like this (as in the attached image):
> > [image: Inline image 1]
> >
> >
> > so what are these numbers?
> > If I put small html, it works fine, but as the size of html file
> > increases, this is what happens..
> >
> > --
> > Regards,
> > Raheel Hasan
> >

--
Regards,
Raheel Hasan