[
https://issues.apache.org/jira/browse/LUCENE-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss resolved LUCENE-9461.
---------------------------------
Resolution: Fixed
> Query hit highlighting components on top of matches API
> -------------------------------------------------------
>
> Key: LUCENE-9461
> URL: https://issues.apache.org/jira/browse/LUCENE-9461
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Fix For: master (9.0)
>
>
> Highlighters. Eventually, you'll have to face them.
> When a Lucene Query is run over an index, it implies a list of documents that
> "matched it" - literally a boolean indication of whether the document should
> be included in the search result or not. In practice, many applications need
> to convey to users not just the fact that a document matched the query but
> also some sort of intuitive explanation of *why* this particular query
> matched it. While in many cases the relationship is trivial (term
> containment), in case of complex queries it may not be trivial at all (think
> of a really short prefix query, a fuzzy term query or even a Boolean
> disjunction with a high number of possibilities).
> Historically, search engines used to "highlight" the source area of a
> document that caused the "hit". If a document was too long, it was truncated
> and only the area around the hit (or hits) was displayed (a so-called
> "snippet").
> In my subjective opinion, in the Lucene API highlighters have played a
> secondary role to queries and search. Yet once you try to build
> something higher-level, highlighters are a crucial and necessary element of
> the entire system.
> My experience (and user feedback) from an implementation of a document
> retrieval system where highlighting was involved was that it just didn't work
> as expected. Here are the requirements of that system:
> * the query parser uses default field expansion into multiple fields (there
> is no single "sink" field),
> * the highlights should match *exactly* what caused the hit; a search for
> 'title:foo' must not highlight foo in any other field,
> * the set of fields to be highlighted isn't really fixed - there are some
> fields that should always be displayed - title, summary - and others that
> should not be displayed unless they're part of the query (in which case the
> highlight is important and should be shown to the user),
> * highlights should be accurate for all sorts of queries: fuzzy, phrase,
> prefix, Boolean, spans, etc.,
> * there can be more than one query at one time and they should highlight the
> same content (with different colors).
> Many highlighters are available in Lucene (vector highlighter, postings
> highlighter, unified highlighter) but none of them quite fit the bill above.
> Believe me - we have tried (hard). We ended up using the unified highlighter but
> with subclassing, customizations and all sorts of complex, low-level quirks.
> My gut feeling at that point was that it should be the Query that somehow
> *exposes* the information about how a given field content matched. Then I
> looked at the matches API and built a quick prototype retrieving "match regions"
> on top of that. It works like magic. Here are the key insights:
> * the matches API returns exactly what a highlighter needs: for a given query it
> iterates over fields and positions (including offsets, if they are available)
> that caused a document to be included in the search result,
> * when the matches API cannot provide offsets, it provides elements from which
> offsets can be computed: positions, for example, which can be mapped to
> offsets by re-analyzing the field's value,
> * in extreme cases it may happen that the matches API doesn't provide anything
> useful (a field only indexed, with no stored field value, no positions, no
> offsets) but I assume it is up to the application layer to know how to deal
> with this then (or not deal with it at all and throw an exception),
> * the matches API delegates the work of providing proper match ranges to the
> query itself (actually, to the weight a query produces); it doesn't need to
> know anything about different implementations and their specifics.
> The absolute *key* element is the last one. Once you build a match region
> retriever, highlighting is merely about organizing match ranges, dealing
> with potential overlaps, and proper formatting. It becomes a simple,
> tractable problem separated from the internals of Lucene Queries.
> The initial set of "highlighter components" in this issue is a set of classes
> that allows one to assemble a complete pipeline from any query into a set of
> highlighted document fields. Any highlighter can essentially be built by
> assembling the following steps:
> * retrieving documents and their fields/match ranges, given [Query,
> IndexSearcher],
> * sanitizing match ranges (overlaps, etc.),
> * selecting the "best" snippet for the given set of match ranges,
> * formatting the output (adding start/end tags for snippets, ellipsis
> between values, etc.).
> This issue implements components for all of the above steps. It isn't about
> one highlighter class with tons of options; it's about bits and pieces that
> can be put together to build anything one desires. That said, an example
> "high level" highlighter class will also be provided as a sub-task.
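
A minimal sketch of the "sanitizing match ranges" and "formatting the output" steps listed in the description above, in plain Java with no Lucene dependency. The names MatchRangeSketch, Range, merge and format are hypothetical and chosen for illustration only; they are not the classes added by this issue. The sketch assumes character offsets have already been obtained (for example from the matches API) and that each range is half-open and lies within the field value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MatchRangeSketch {
  /** Hypothetical half-open character range [start, end) within a field value. */
  static final class Range {
    final int start, end;

    Range(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Sorts ranges by start offset and merges any that overlap or touch. */
  static List<Range> merge(List<Range> ranges) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingInt((Range r) -> r.start));
    List<Range> merged = new ArrayList<>();
    for (Range r : sorted) {
      Range last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
      if (last != null && r.start <= last.end) {
        // Overlapping or adjacent: extend the previous range instead of adding a new one.
        merged.set(merged.size() - 1, new Range(last.start, Math.max(last.end, r.end)));
      } else {
        merged.add(r);
      }
    }
    return merged;
  }

  /** Wraps each merged range in the given tags; assumes ranges lie within the text. */
  static String format(String text, List<Range> ranges, String open, String close) {
    StringBuilder out = new StringBuilder();
    int pos = 0;
    for (Range r : merge(ranges)) {
      out.append(text, pos, r.start).append(open)
         .append(text, r.start, r.end).append(close);
      pos = r.end;
    }
    return out.append(text, pos, text.length()).toString();
  }

  public static void main(String[] args) {
    // Two overlapping matches ("quick" and "quick brown") plus a separate one ("fox").
    List<Range> ranges = List.of(new Range(4, 9), new Range(4, 15), new Range(16, 19));
    System.out.println(format("The quick brown fox and the dog", ranges, "<b>", "</b>"));
    // prints: The <b>quick brown</b> <b>fox</b> and the dog
  }
}
```

Merging before formatting keeps the output tags well nested even when several query clauses hit the same text; snippet selection would sit between the two steps and is left out here.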