[ https://issues.apache.org/jira/browse/LUCENE-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191003#comment-17191003 ]
Dawid Weiss commented on LUCENE-9461:
-------------------------------------

I don't mind reuse. I would mind if it meant increasing their complexity. Right now they're really simple and detached from each other, which allows a variety of use cases to be implemented. The example is already there - the test shows at least one use case that's not possible with the current highlighters (highlighting multiple queries at once).

https://github.com/dweiss/lucene-solr/blob/LUCENE-9464/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L118-L400

> Query hit highlighting components on top of matches API
> -------------------------------------------------------
>
> Key: LUCENE-9461
> URL: https://issues.apache.org/jira/browse/LUCENE-9461
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Fix For: master (9.0)
>
>
> Highlighters. Eventually, you'll have to face them.
> When a Lucene Query is run over an index, it produces a list of documents
> that "matched it" - literally a boolean indication of whether each document
> should be included in the search result or not. In practice, many
> applications need to convey to users not just the fact that a document
> matched the query but also some sort of intuitive explanation of *why* this
> particular query matched it. While in many cases the relationship is trivial
> (term containment), in the case of complex queries it may not be trivial at
> all (think of a really short prefix query, a fuzzy term query or even a
> Boolean disjunction with a high number of possibilities).
> Historically, search engines used to "highlight" the source area of a
> document that caused the "hit". If a document was too long, it was truncated
> and only the area around the hit (or hits) was displayed (a so-called
> "snippet").
>
> In my subjective opinion, highlighters have played a secondary role to
> queries and search in the Lucene API. But once you try to build something
> higher-level, highlighters are a crucial and necessary element of the entire
> system.
> My experience (and user feedback) from implementing a document retrieval
> system that involved highlighting was that it just didn't work as expected.
> Here are the requirements of that system:
> * the query parser uses default field expansion into multiple fields (there
>   is no single "sink" field),
> * the highlights should match *exactly* what caused the hit; a search for
>   'title:foo' must not highlight foo in any other field,
> * the set of fields to be highlighted isn't really fixed - there are some
>   fields that should always be displayed - title, summary - and others that
>   should not be displayed unless they're part of the query (in which case
>   the highlight is important and should be shown to the user),
> * highlights should be accurate for all sorts of queries: fuzzy, phrase,
>   prefix, Boolean, spans, etc.,
> * there can be more than one query at a time and they should all highlight
>   the same content (with different colors).
> Many highlighters are available in Lucene (vector highlighter, postings
> highlighter, unified highlighter) but none of them quite fits the bill above.
> Believe me - we have tried (hard). We ended up using the unified highlighter
> but with subclassing, customizations and all sorts of complex, low-level
> quirks.
> My gut feeling at that point was that it should be the Query that somehow
> *exposes* the information about how a given field's content matched. Then I
> looked at the matches API and built a quick prototype retrieving "match
> regions" on top of that. It works like magic.
> Here are the key insights:
> * the matches API returns exactly what a highlighter needs: for a given
>   query it iterates over the fields and positions (including offsets, if
>   they are available) that caused a document to be included in the search
>   result,
> * when the matches API cannot provide offsets, it provides elements from
>   which offsets can be computed: for example, positions, which can be turned
>   into offsets by re-analyzing the field's value,
> * in extreme cases the matches API may not provide anything useful (a field
>   that is only indexed, with no stored field value, no positions and no
>   offsets), but I assume it is up to the application layer to know how to
>   deal with this (or not deal with it at all and throw an exception),
> * the matches API delegates the work of providing proper match ranges to
>   the query itself (actually, to the weight a query produces); it doesn't
>   need to know anything about different implementations and their specifics.
> The absolute *key* element is the last one. Once you build a match region
> retriever, highlighting is merely about organizing match ranges, dealing
> with potential overlaps, and proper formatting. It becomes a simple,
> tractable problem separated from the internals of Lucene Queries.
> The initial set of "highlighter components" in this issue is a set of
> classes that allows one to assemble a complete pipeline from any query into
> a set of highlighted document fields. Essentially, any highlighter can be
> built by assembling the following steps:
> * retrieving documents and their fields/match ranges, given [Query,
>   IndexSearcher],
> * sanitizing match ranges (overlaps, etc.),
> * selecting the "best" snippet for the given set of match ranges,
> * formatting the output (adding start/end tags for snippets, ellipsis
>   between values, etc.).
> This issue implements components for all of the above steps.
> It isn't about one highlighter class with tons of options, it's about bits
> and pieces that can be put together to build anything one desires. This
> said, an example "high level" highlighter class will also be provided as a
> sub-task.
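
To make the "sanitize match ranges" and "format the output" steps from the issue description more concrete, here is a minimal, self-contained sketch in plain Java. It is not the code attached to this issue: the names HighlightSketch, Range, mergeOverlaps and format are hypothetical, and the sketch assumes that match ranges have already been retrieved as character offsets into a field value (for example via the matches API). Retrieval and snippet selection are left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; none of these names come from the Lucene code base. */
final class HighlightSketch {

  /** A half-open character range [from, to) within a field value. */
  static final class Range {
    final int from;
    final int to;

    Range(int from, int to) {
      this.from = from;
      this.to = to;
    }
  }

  /** Sanitizing step: sort ranges by start and merge those that overlap or touch. */
  static List<Range> mergeOverlaps(List<Range> ranges) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingInt((Range r) -> r.from));

    List<Range> merged = new ArrayList<>();
    for (Range r : sorted) {
      Range last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
      if (last != null && r.from <= last.to) {
        // Extend the previous range instead of emitting a nested or overlapping one.
        merged.set(merged.size() - 1, new Range(last.from, Math.max(last.to, r.to)));
      } else {
        merged.add(r);
      }
    }
    return merged;
  }

  /** Formatting step: wrap every merged range of the value in start/end tags. */
  static String format(String value, List<Range> ranges, String open, String close) {
    StringBuilder sb = new StringBuilder();
    int pos = 0;
    for (Range r : mergeOverlaps(ranges)) {
      sb.append(value, pos, r.from)   // text before the match
        .append(open)
        .append(value, r.from, r.to)  // the matched region itself
        .append(close);
      pos = r.to;
    }
    return sb.append(value, pos, value.length()).toString();
  }
}
```

With the field value "quick brown fox" and the ranges [3, 11), [0, 5) and [12, 15), calling format(value, ranges, "<b>", "</b>") yields "<b>quick brown</b> <b>fox</b>": the first two ranges overlap and are merged into one region before the tags are added. Because the ranges are plain data, ranges from several queries could be kept in separate lists and formatted with different tags, although handling overlaps between queries would need more care than this sketch takes.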