Thanks Jérôme! It seems to work now. I just hope the provided HTMLStripWhitespaceTokenizerFactory will strip the right tags now.
I use Java and successfully used the HtmlEncoder provided in http://itext.ugent.be/library/api/ for the encoding (just in case someone happens to search this thread).

Ravi

On 8/22/07, Jérôme Etévé <[EMAIL PROTECTED]> wrote:
> You need to encode your HTML content so it can be included as a normal
> 'string' value in your XML element.
>
> As far as I remember, the only unsafe characters you have to encode as
> entities are:
> < -> &lt;
> > -> &gt;
> " -> &quot;
> & -> &amp;
>
> (Google "xml entities" to be sure.)
>
> I don't know what language you use, but in Perl, for instance, you can
> use something like:
> use HTML::Entities;
> my $xmlString = encode_entities($rawHTML, '<>&"');
>
> You also need to make sure your HTML is encoded in UTF-8, to comply
> with Solr's need for UTF-8 encoded XML.
>
> I hope it helps.
>
> J.
>
> On 8/22/07, Ravish Bhagdev <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > Sorry for the stupid question. I'm trying to index an HTML file as one of
> > the fields in Solr. I've set up the appropriate analyzer in the schema, but
> > I'm not sure how to add HTML content to Solr. Encapsulating HTML content
> > within a field tag is obviously not valid. How do I add HTML content?
> > Hope the query is clear....
> >
> > Thanks,
> > Ravi
> >
>
> --
> Jerome Eteve.
> [EMAIL PROTECTED]
> http://jerome.eteve.free.fr/
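
For anyone who finds this thread and wants to do the escaping without an extra library, here is a rough Java sketch of the approach described above: replace the four unsafe characters with entities, wrap the result in a field element, and send the bytes as UTF-8. The class name SolrHtmlField, the method names, and the field names "id" and "content" are all made up for illustration; they are not part of Solr or iText, and your schema will have its own field names. It also does nothing about control characters that are illegal in XML 1.0, so treat it as a starting point only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Solr or iText.
final class SolrHtmlField {

    // Replace the four characters listed in the thread with XML entities.
    static String escapeXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Build an <add> message; the field names "id" and "content" are assumptions.
    static String toAddXml(String id, String html) {
        return "<add><doc>"
             + "<field name=\"id\">" + escapeXml(id) + "</field>"
             + "<field name=\"content\">" + escapeXml(html) + "</field>"
             + "</doc></add>";
    }

    public static void main(String[] args) {
        String xml = toAddXml("doc1", "<p class=\"x\">Tom & Jerry</p>");
        // Encode the request body as UTF-8 before posting it to Solr.
        byte[] body = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(xml);
        System.out.println(body.length + " bytes");
    }
}
```

The escaped markup travels as ordinary text inside the field element; once the XML is parsed on the Solr side the field value is the original HTML again, which is what an HTML-stripping tokenizer would then see.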