Re: xml-aware highlighting

Michael Sokolov Sat, 09 Oct 2010 13:32:15 -0700

OK - I read a bit more and it appears an appropriate analysis pipeline (which would extract text from XML using SAX, say) is all that's required, and existing highlighting ought to be able to accomplish what I'm after. So I guess the only question I have now before writing code is where is the existing implementation :) - anyone?


-Mike



On 10/9/2010 12:51 PM, Michael Sokolov wrote:

I have a requirement to highlight search results, and to display documents with matching terms highlighted in the context of the original XML document structure.
It seems like this must be a very common use case, but I am having trouble finding a way to accomplish what we need to do using solr and/or lucene. Using the standard highlighting support in solr, we have been able to retrieve KWIC text fragments for search results, which is great. But what we would ideally like to do is to apply similar highlighting logic while preserving the original document structure.
1) When the user selects a matching document, we render it as HTML with paragraphs, headers, text styles such as italics, and so on, so we need to highlight either the rendered HTML or the original XML and then process that. We need to find the text fragments that matched the original query and highlight those. And this has to use the same logic used by solr/lucene to do the searching, so that the tokenization and analysis is applied properly, and query semantics are respected: if the original query was a phrase query, only phrases should match, and so on.
2) In addition, we also want to be able to display KWIC phrases that are rendered with type styles based on the original XML; this requires some XML tree surgery in order to pull out a fragment of a structured document while preserving enough xml structure to render type styles, which we can do, but it also requires a mapping of matching tokens back into the original document.
I am hoping this is a solved problem, but if not, I'd also be interested in pointers to the best places to start an implementation. I think the problem at base is to maintain a map relating positions of matching terms in the indexed and stored field in lucene to corresponding positions in an original XML document. Ideally the original positions could be stored directly in term vectors, but they could also be translated at render/highlight time using an additional lookup.
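To make the position map concrete, here is a minimal sketch using only the JDK's SAX parser. It is not an existing solr/lucene facility: the class name XmlOffsetMap, the Run type, and the locate method are hypothetical names chosen for illustration. The handler concatenates the text content of an XML document and records, for each run of character data, its start offset in the extracted text and the path of its enclosing element, so that a character offset in the extracted text can be translated back to a place in the original structure.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: extracts text from XML and remembers where each run came from. */
public class XmlOffsetMap extends DefaultHandler {

    /** One run of character data: its start offset in the extracted text and its element path. */
    public static final class Run {
        public final int textStart;
        public final String path;

        Run(int textStart, String path) {
            this.textStart = textStart;
            this.path = path;
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final List<Run> runs = new ArrayList<>();
    private final Deque<String> openElements = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openElements.addLast(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.removeLast();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (length == 0) {
            return;
        }
        // The parser may deliver one text node in several chunks; each chunk gets its own run.
        runs.add(new Run(text.length(), "/" + String.join("/", openElements)));
        text.append(ch, start, length);
    }

    /** The concatenated text content, i.e. what would be handed to the analyzer. */
    public String text() {
        return text.toString();
    }

    /** Returns the run containing the given offset into text(), or null if there is none. */
    public Run locate(int offset) {
        if (offset < 0 || offset >= text.length()) {
            return null;
        }
        Run found = null;
        for (Run r : runs) {          // runs are stored in increasing textStart order
            if (r.textStart > offset) {
                break;
            }
            found = r;
        }
        return found;
    }

    /** Parses an XML string and returns the populated map. */
    public static XmlOffsetMap parse(String xml) throws Exception {
        XmlOffsetMap map = new XmlOffsetMap();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), map);
        return map;
    }
}
```

For example, after parsing `<doc><p>Hello <i>big</i> world</p></doc>`, the extracted text is "Hello big world", and locate(6), the offset of "big", returns a run whose path is /doc/p/i. This sketch maps to element paths rather than byte positions in the source file, since SAX does not report exact character offsets; a real implementation would also need to agree with the offsets the analyzer stores for each token.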
I see code in org.apache.lucene.search.highlight in solr and also something in lucene/contrib/highlighter. Is that the state of the art now, or is there anywhere else I should be looking as well?
Thanks for any pointers

-Mike Sokolov
