OK - I read a bit more and it appears an appropriate analysis pipeline
(which would extract text from XML using SAX, say) is all that's
required, and existing highlighting ought to be able to accomplish what
I'm after. So I guess the only question I have now, before writing code,
is: where is the existing implementation? :) Anyone?
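
To make the text-extraction step concrete, here is a minimal sketch of it in Python using the standard library's SAX interface. It is not Lucene or Solr code; the names TextExtractor and extract_text are made up for illustration, and a real analysis pipeline would feed the extracted text into a tokenizer rather than return it as one string.

```python
import xml.sax

class TextExtractor(xml.sax.ContentHandler):
    """Collects character data and drops all markup (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate them.
        self.chunks.append(content)

def extract_text(xml_string):
    """Return the concatenated text content of an XML document."""
    handler = TextExtractor()
    xml.sax.parseString(xml_string.encode("utf-8"), handler)
    return "".join(handler.chunks)

doc = "<doc><p>The <i>quick</i> brown fox</p></doc>"
text = extract_text(doc)
```

The catch is that plain SAX events carry no character offsets into the source, so this alone is enough for indexing but not for mapping matches back into the XML.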
-Mike
On 10/9/2010 12:51 PM, Michael Sokolov wrote:
I have a requirement to highlight search results, and to display
documents with matching terms highlighted in the context of the
original XML document structure.
It seems like this must be a very common use case, but I am having
trouble finding a way to accomplish what we need to do using Solr
and/or Lucene. Using the standard highlighting support in Solr, we
have been able to retrieve KWIC text fragments for search results,
which is great. But what we would ideally like to do is to apply
similar highlighting logic while preserving the original document
structure.
1) When the user selects a matching document, we render it as HTML
with paragraphs, headers, text styles such as italics, and so on, so
we need to highlight either the rendered HTML or the original XML and
then process that. We need to find the text fragments that matched
the original query and highlight those. And this has to use the same
logic used by Solr/Lucene to do the searching, so that
tokenization and analysis are applied properly, and query semantics are
respected: if the original query was a phrase query, only phrases
should match, and so on.
2) In addition, we also want to be able to display KWIC phrases that
are rendered with type styles based on the original XML; this requires
some XML tree surgery in order to pull out a fragment of a structured
document while preserving enough XML structure to render type styles,
which we can do, but it also requires a mapping of matching tokens
back into the original document.
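
As an illustration of the "query semantics" requirement in (1), here is a toy sketch in Python of phrase-aware matching that reports character offsets. The analyzer (lowercased word tokens) and the function names are assumptions made for the example; they stand in for whatever analysis chain the index really uses, and this is not how the Lucene highlighter is implemented.

```python
import re

def analyze(text):
    """Toy analyzer: lowercased word tokens with their character offsets."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]

def phrase_spans(text, phrase):
    """Return (start, end) character spans where the whole phrase occurs."""
    tokens = analyze(text)
    wanted = [term for term, _, _ in analyze(phrase)]
    if not wanted:
        return []
    spans = []
    for i in range(len(tokens) - len(wanted) + 1):
        window = tokens[i:i + len(wanted)]
        # Only consecutive tokens in the right order count as a phrase match.
        if [term for term, _, _ in window] == wanted:
            spans.append((window[0][1], window[-1][2]))
    return spans

text = "The quick brown fox; a brown dog, quick!"
spans = phrase_spans(text, "Quick Brown")
```

Here only the adjacent "quick brown" is reported; the later stray "brown" and "quick" are not, which is the behavior a phrase query needs from the highlighter.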
I am hoping this is a solved problem, but if not, I'd also be
interested in pointers to the best places to start an implementation.
I think the problem, at base, is to maintain a map relating positions of
matching terms in the indexed and stored field in Lucene to
corresponding positions in an original XML document. Ideally the
original positions could be stored directly in term vectors, but they
could also be translated at render/highlight time using an additional
lookup.
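
Here is a rough sketch of the "additional lookup" variant, in Python, using the standard library's expat bindings because they expose the byte position of each parse event. The function names and the run-table layout are hypothetical. It also assumes ASCII text with no entity references, since otherwise the character counts and byte counts diverge and a finer-grained map would be needed.

```python
import xml.parsers.expat

def extract_with_offsets(xml_bytes):
    """Return (text, runs); each run is (text_start, xml_byte_start, length)."""
    parser = xml.parsers.expat.ParserCreate()
    pieces, runs = [], []
    pos = 0  # current offset in the extracted text

    def on_chars(data):
        nonlocal pos
        # CurrentByteIndex is the byte offset where this event starts in the XML.
        runs.append((pos, parser.CurrentByteIndex, len(data)))
        pieces.append(data)
        pos += len(data)

    parser.CharacterDataHandler = on_chars
    parser.Parse(xml_bytes, True)
    return "".join(pieces), runs

def to_xml_offset(runs, text_offset):
    """Translate an offset in the extracted text to a byte offset in the XML."""
    for text_start, xml_start, length in runs:
        if text_start <= text_offset < text_start + length:
            return xml_start + (text_offset - text_start)
    raise ValueError("offset is not inside any text run")

xml_doc = b"<doc><p>The <i>quick</i> brown fox</p></doc>"
text, runs = extract_with_offsets(xml_doc)
quick_at = to_xml_offset(runs, text.index("quick"))
```

With a table like this, the offsets a highlighter reports against the extracted text can be translated at render time, and the highlight markup inserted into the original XML at the translated positions.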
I see code in org.apache.lucene.search.highlight in Solr and also
something in lucene/contrib/highlighter. Is that the state of the art
now, or is there anywhere else I should be looking as well?
Thanks for any pointers.
-Mike Sokolov