Re: unable to figure out nutch type highlighting in solr....

Chris Hostetter Thu, 04 Oct 2007 19:09:00 -0700

: In general, I don't recommend indexing HTML content straight to Solr.  None of
: the Solr contributors do this so the use case hasn't received a lot of love.


I second that comment ... the HTML stripping code was never intended to be
an "HTML Parser"; it was designed to be a workaround for dealing with
"dirty data" where people had unwanted HTML tags in what should be plain
text.  Indexing such data as is with some analyzers would result in words
like "script", "strong", and "class" matching lots of docs where the words
never really appear in the text.

If you have well-formed HTML documents, use an HTML parser to extract the
real content before sending it to Solr.
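
As a rough illustration of that extraction step, here is a minimal sketch
using Python's standard-library html.parser.  The names TextExtractor and
extract_text are hypothetical, made up for this example, and none of this
is part of Solr.  It assumes you only want visible text and that dropping
the bodies of <script> and <style> elements is acceptable; a real pipeline
would likely use a more forgiving parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies.

    Hypothetical helper for illustration only.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments seen so far

    def handle_starttag(self, tag, attrs):
        # Tag names and attributes (e.g. class="...") are never stored,
        # so words like "strong" or "class" cannot leak into the output.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with a space and collapse runs of whitespace.
        # Note: this also puts a space where an inline tag splits a word.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the plain text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        '<html><head><script>var x = "strong";</script></head>'
        '<body><p class="intro">Hello <strong>world</strong></p></body></html>'
    )
    print(extract_text(page))
```

The string returned by extract_text is what you would put in the Solr
field, so the analyzer only ever sees the document's own words.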



-Hoss
