Re: Indexing HTML content... (Embed HTML into XML?)

Ravish Bhagdev Wed, 22 Aug 2007 07:24:50 -0700

Thanks Jérôme!

It seems to work now.  I just hope the provided
HTMLStripWhitespaceTokenizerFactory will strip the right tags now.


I use Java, and the HtmlEncoder class documented at
http://itext.ugent.be/library/api/ did the encoding successfully.
(Noting it here in case someone finds this thread by searching.)
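
For anyone who would rather not pull in a library, here is a minimal
sketch of the same idea in plain Java. It escapes the four characters
Jérôme lists and wraps the result in a Solr <add> message. The class
name SolrHtmlField and the field names "id" and "content" are my own
placeholders, not anything from Solr or iText, so adjust them to match
your schema. It does not deal with control characters that are illegal
in XML.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Solr or iText.
final class SolrHtmlField {

    // Replace the four characters with their XML entities so that HTML
    // markup can sit inside an XML element as ordinary text.
    static String escapeXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '<': out.append("&lt;");   break;
                case '>': out.append("&gt;");   break;
                case '&': out.append("&amp;");  break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Build an <add> message. The field names "id" and "content" are
    // assumptions; use whatever your schema.xml defines.
    static String toAddXml(String id, String html) {
        return "<add><doc>"
             + "<field name=\"id\">" + escapeXml(id) + "</field>"
             + "<field name=\"content\">" + escapeXml(html) + "</field>"
             + "</doc></add>";
    }

    // Solr expects UTF-8 encoded XML, so encode explicitly instead of
    // relying on the platform default charset.
    static byte[] toUtf8(String xml) {
        return xml.getBytes(StandardCharsets.UTF_8);
    }
}
```

Calling SolrHtmlField.toAddXml("doc1", html) and posting the bytes
from toUtf8(...) to the update handler should then go through as an
ordinary string field.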

Ravi

On 8/22/07, Jérôme Etévé <[EMAIL PROTECTED]> wrote:
> You need to encode your HTML content so it can be included as a normal
> 'string' value in your XML element.
>
> As far as I remember, the only unsafe characters you have to encode as
> entities are:
> <  -> &lt;
> >  -> &gt;
> "  -> &quot;
> &  -> &amp;
>
> (google xml entities to be sure).
>
> I don't know what language you use, but in Perl, for instance, you can
> use something like:
> use HTML::Entities;
> my $xmlString = encode_entities($rawHTML, '<>&"');
>
> Also, you need to make sure your HTML is encoded in UTF-8, to comply
> with Solr's requirement for UTF-8 encoded XML.
>
> I hope it helps.
>
> J.
>
> On 8/22/07, Ravish Bhagdev <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > Sorry for the stupid question.  I'm trying to index an HTML file as
> > one of the fields in Solr.  I've set up an appropriate analyzer in
> > the schema, but I'm not sure how to add HTML content to Solr.
> > Encapsulating HTML content within a field tag is obviously not
> > valid.  How do I add HTML content?  I hope the question is clear.
> >
> > Thanks,
> > Ravi
> >
>
>
> --
> Jerome Eteve.
> [EMAIL PROTECTED]
> http://jerome.eteve.free.fr/
>
