Hi, I have an issue when indexing large HTML files. Here is the configuration:
1) Data is imported via DIH using URLDataSource / PlainTextEntityProcessor.
2) The schema defines the field as: type="text_en_splitting" indexed="true" stored="false" required="false"
3) text_en_splitting applies the following chain at index time:
   HTMLStripCharFilterFactory
   WhitespaceTokenizerFactory (creates tokens)
   StopFilterFactory
   WordDelimiterFilterFactory
   ICUFoldingFilterFactory
   PorterStemFilterFactory
   RemoveDuplicatesTokenFilterFactory
   LengthFilterFactory

However, the indexed data looks like this (see the attached image):

[image: Inline image 1]

What are these numbers? With a small HTML file it works fine, but as the size of the HTML file increases, this is what happens.

--
Regards,
Raheel Hasan
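
For reference, the analysis chain listed above corresponds roughly to a schema.xml definition like the one below. This is only a sketch for readers following the thread: the field name `content`, the `positionIncrementGap` value, the stop word file name, and every filter attribute value (for example the `min`/`max` on the length filter) are assumptions for illustration, not values taken from the actual schema. Only the factory order and the field attributes come from the description above.

```xml
<!-- Hypothetical field name "content"; the attributes match the post. -->
<field name="content" type="text_en_splitting"
       indexed="true" stored="false" required="false"/>

<fieldType name="text_en_splitting" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <!-- Char filter runs first and strips HTML markup from the raw input. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Tokens are created by splitting on whitespace only. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Assumed stop word file; the post does not name one. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <!-- Assumed options; the post does not list any. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <!-- Assumed bounds; LengthFilterFactory requires min and max. -->
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
  </analyzer>
</fieldType>
```

Pasting a short sample of the problem HTML into the Analysis screen of the Solr admin UI for this field type shows the output of each stage, which can help narrow down where the numbers first appear.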