Hi,

I have an issue while indexing large HTML documents. Here is the
configuration:

1) Data is imported via URLDataSource / PlainTextEntityProcessor (DIH)

2) Schema has this for the field:
type="text_en_splitting" indexed="true" stored="false" required="false"

3) text_en_splitting applies the following analysis chain at index time:
HTMLStripCharFilterFactory
WhitespaceTokenizerFactory (create tokens)
StopFilterFactory
WordDelimiterFilterFactory
ICUFoldingFilterFactory
PorterStemFilterFactory
RemoveDuplicatesTokenFilterFactory
LengthFilterFactory
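
For reference, the chain above corresponds to a fieldType definition along
these lines. This is only a sketch, not a copy of my schema.xml: the
attribute values (positionIncrementGap, the stopwords file name, and the
LengthFilterFactory min/max) are placeholders I have assumed for
illustration.

```xml
<!-- Sketch of the index-time analysis chain listed above.
     Attribute values are assumed placeholders, not the actual settings. -->
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Strips HTML markup before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Creates tokens by splitting on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <!-- min and max are required; the values here are placeholders -->
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
  </analyzer>
</fieldType>
```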

However, the indexed data is like this (as in the attached image):
[image: Inline image 1]


So what are these numbers?
With a small HTML file indexing works fine, but as the size of the HTML file
increases, this is what happens.

-- 
Regards,
Raheel Hasan
