OK, see this:
http://s23.postimg.org/yck2s5k1n/html_indexing.png



On Wed, Oct 23, 2013 at 10:45 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Attachments and images are often eaten by the mail server, your image is
> not visible at least to me. Can you describe what you're seeing? Or post
> the image somewhere and provide a link?
>
> Best,
> Erick
>
>
> On Wed, Oct 23, 2013 at 11:07 AM, Raheel Hasan <raheelhasan....@gmail.com> wrote:
>
> > Hi,
> >
> > I have an issue when indexing large HTML files. Here is the configuration
> > for that:
> >
> > 1) Data is imported via URLDataSource / PlainTextEntityProcessor (DIH)
> >
> > 2) Schema has this for the field:
> > type="text_en_splitting" indexed="true" stored="false" required="false"
> >
> > 3) text_en_splitting applies the following analysis chain at index time:
> > HTMLStripCharFilterFactory
> > WhitespaceTokenizerFactory (create tokens)
> > StopFilterFactory
> > WordDelimiterFilterFactory
> > ICUFoldingFilterFactory
> > PorterStemFilterFactory
> > RemoveDuplicatesTokenFilterFactory
> > LengthFilterFactory
> >
> > However, the indexed data is like this (as in the attached image):
> > [image: Inline image 1]
> >
> >
> > So what are these numbers?
> > With a small HTML file it works fine, but as the size of the HTML file
> > increases, this is what happens.
> >
> > --
> > Regards,
> > Raheel Hasan
> >
>



-- 
Regards,
Raheel Hasan
