One last one: when you send HTML to Solr, do you also replace special
chars and tags with named entities?  I did this and HTMLStripper
doesn't seem to recognise the tags :-S  Whereas if I try to input the
HTML as is, the indexer throws exceptions (as having tags within XML
tags is obviously not valid).  How do I do this part?

We didn't do anything at all to the HTML. The editor returns valid XHTML (using numeric entities, never named entities, which aren't valid in XML and don't tend to work in XHTML), and we use string concatenation to build up the /update request body, like:

requestBody += "<str name=\"content\">" + xhtmlContent + "</str>";

Solr seems to handle it. From what people are suggesting, though, you'd be better off converting to plain text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net) can parse most HTML that's around, and you can then iterate over the DOM to extract the text.
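
As a rough sketch of the "iterate over the DOM" step, something like the following might work. It is only an illustration: the class and method names (TextExtractor, collectText, extractText) are made up for this example, and it uses the JDK's own XML parser on a well-formed XHTML fragment with no DOCTYPE instead of JTidy. With JTidy you would presumably get the org.w3c.dom.Document from Tidy.parseDOM instead and walk it the same way.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Solr or JTidy.
final class TextExtractor {

    // Recursively append the content of every text node below `node`.
    // A space is added after each text node so that words in adjacent
    // elements (e.g. two <p> blocks) don't run together; the downside is
    // that a word split by inline markup ("<b>w</b>ord") also gets split.
    static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Assumes `xhtml` is well-formed XML with a single root element.
    static String extractText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), out);
        // Collapse runs of whitespace left over from the markup layout.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

The resulting plain text still has to be XML-escaped (at least & and <) before it is concatenated into the request body.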

Regards,

Adrian Sutton
http://www.symphonious.net
