Re: unable to figure out nutch type highlighting in solr....

Adrian Sutton Fri, 05 Oct 2007 04:33:58 -0700

One last one: when you send HTML to Solr, do you also replace special
chars and tags with named entities?  I did this and HTMLStripper
doesn't seem to recognise the tags :-S  Whereas if I try to input the
HTML as is, the indexer throws exceptions (as having tags within XML tags
is obviously not valid).  How do you do this part?

We didn't do anything at all to the HTML. The editor returns valid XHTML (using numeric entities, never named entities, which aren't valid in XML beyond the five predefined ones and don't tend to work in XHTML), and we use string concatenation to build up the /update request body, like:


requestBody += "<str name=\"content\">" + xhtmlContent + "</str>";

Solr seems to handle it. From what people are suggesting, though, you'd be better off converting to plain text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net) can parse most HTML that's around, and you can iterate over the DOM to extract the text from there.
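
As a rough illustration of the "iterate over the DOM to extract the text" idea, here is a minimal sketch. It is not the code we actually run: the class and method names (XhtmlText, extractText, collectText, escapeXml) are made up for this example, and it assumes the input is already well-formed XHTML, so it uses the JDK's built-in XML parser rather than JTidy. For messier HTML you would get the DOM from JTidy instead and hand its root node to the same walker. The extracted text is XML-escaped before being concatenated into the request body.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; none of these names come from Solr or JTidy.
class XhtmlText {

    // Parses a well-formed XHTML fragment and returns its text with whitespace collapsed.
    // Assumption: the fragment has no DOCTYPE and uses only numeric or predefined entities.
    static String extractText(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap in a dummy root so fragments with several top-level elements still parse.
            Document doc = builder.parse(
                    new InputSource(new StringReader("<root>" + xhtml + "</root>")));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            return out.toString().replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    // Depth-first walk over the DOM that appends every text node it finds.
    static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
        // Add a space after each element so words in adjacent blocks do not run together.
        if (type == Node.ELEMENT_NODE) {
            out.append(' ');
        }
    }

    // Escapes the characters that would otherwise break the surrounding XML.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String xhtmlContent = "<p>Caf&#233; <b>menu</b> &amp; prices</p>";
        String plainText = extractText(xhtmlContent);
        String requestBody = "<str name=\"content\">" + escapeXml(plainText) + "</str>";
        System.out.println(requestBody);
    }
}
```

Appending a space after every element is a crude choice: it keeps words in neighbouring paragraphs apart, at the cost of a stray space before punctuation that follows inline markup, which shouldn't matter much for indexing.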


Regards,

Adrian Sutton
http://www.symphonious.net
