On 5-Oct-07, at 11:59 AM, Ravish Bhagdev wrote:
But a different use-case might be for the highlighting to encompass
the markup rather than just the text, e.g.
<span class="highlighted"><topic type="location">Paris</topic></span>
which would have to be accomplished some other way.
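As a rough illustration of that second use case, here is a minimal Python sketch. It is not how Solr or Nutch do it: `highlight_with_markup` is a hypothetical helper, and the regex only handles an element whose entire text content is the matched term. Anything beyond that (nested tags, terms inside attribute values) would need a real HTML/XML parser.

```python
import re

def highlight_with_markup(fragment: str, term: str, css_class: str = "highlighted") -> str:
    """Wrap a term, plus any element that directly encloses it, in a highlight span.

    Hypothetical helper for illustration only; regexes are not a general
    substitute for a markup parser.
    """
    escaped = re.escape(term)
    # First alternative: an element whose whole text content is the term.
    # Second alternative: the bare term with no enclosing element.
    pattern = re.compile(
        r"<(?P<tag>[A-Za-z][\w-]*)\b[^<>]*>" + escaped + r"</(?P=tag)>|" + escaped
    )
    return pattern.sub(
        lambda m: f'<span class="{css_class}">{m.group(0)}</span>', fragment
    )
```

Calling `highlight_with_markup('<topic type="location">Paris</topic>', "Paris")` produces the span-around-topic form shown above, while a bare "Paris" in plain text is wrapped on its own.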
Yes, exactly. And I think Nutch handles this somehow, as I remember
using it for indexing HTML and then returning snippets with accurate
highlighting placed within HTML snippets.
Is there potential for code reuse from Nutch? Maybe this is a topic
for the Solr developer list? Or has it already been considered?
Last time I looked at the Nutch highlighter I don't remember seeing
anything about handling this correctly (which would involve a
considerable amount of HTML finagling to get perfect).
Also, I don't see the use case for web docs: you absolutely never
want to serve up the raw HTML from an unknown page.
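For what it's worth, the safe alternative is roughly the following: escape everything from the untrusted page and add only your own highlight tags. This is a minimal Python sketch under that assumption, not Solr or Nutch code. `safe_snippet` and the choice of `<em>` are made up for illustration.

```python
import html
import re

def safe_snippet(text: str, term: str) -> str:
    """Escape untrusted text, wrapping case-insensitive matches of term in <em>.

    Hypothetical helper: matching is done on the raw text first, so the
    term can never match inside an entity such as "&lt;".
    """
    if not term:
        return html.escape(text)
    # With one capturing group, re.split alternates non-match, match, non-match, ...
    parts = re.split(f"({re.escape(term)})", text, flags=re.IGNORECASE)
    return "".join(
        f"<em>{html.escape(part)}</em>" if i % 2 else html.escape(part)
        for i, part in enumerate(parts)
    )
```

The only markup that reaches the browser is the `<em>` the function adds itself; whatever tags the source page contained come out as inert text.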
I'm not against improving Solr's handling of HTML data, but it is the
type of thing that is unlikely to happen unless someone who cares
about it steps up.
Patches welcome :)
-Mike