[jira] [Commented] (LUCENE-9461) Query hit highlighting components on top of matches API

David Smiley (Jira) Fri, 04 Sep 2020 22:11:10 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190981#comment-17190981
 ]


David Smiley commented on LUCENE-9461:
--------------------------------------

Maybe not as a sub-task, but would it make sense to modify the 
UnifiedHighlighter to use some of these components, thereby reducing 
redundancy?  As I say this, I look at some of these new components and maybe 
not (yet)... but maybe I'll see it better once you get to the example task.

> Query hit highlighting components on top of matches API
> -------------------------------------------------------
>
>                 Key: LUCENE-9461
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9461
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: master (9.0)
>
>
> Highlighters. Eventually, you'll have to face them. 
> When a Lucene Query is run over an index, it implies a list of documents that 
> "matched it" - literally a boolean indication of whether the document should 
> be included in the search result or not. In practice, many applications need 
> to convey to users not just the fact that a document matched the query but 
> also some sort of intuitive explanation of *why* this particular query 
> matched it. While in many cases the relationship is trivial (term 
> containment), in case of complex queries it may not be trivial at all (think 
> of a really short prefix query, a fuzzy term query or even a Boolean 
> disjunction with a high number of possibilities).
> Historically, search engines used to "highlight" the source area of a 
> document that caused the "hit". If a document was too long, it was truncated 
> and only the area around the hit (or hits) was displayed (a so-called 
> "snippet").
> In my subjective opinion, in the Lucene API highlighters have played a 
> secondary role to queries and search. And once you're trying to build 
> something higher-level, highlighters are a crucial and necessary element of 
> the entire system. 
> My experience (and user feedback) from an implementation of a document 
> retrieval system where highlighting was involved was that it just didn't work 
> as expected. Here are the requirements of that system:
> * the query parser uses default field expansion into multiple fields (there 
> is no single "sink" field),
> * the highlights should match *exactly* what caused the hit; a search for 
> 'title:foo' must not highlight foo in any other field,
> * the set of fields to be highlighted isn't really fixed - there are some 
> fields that should always be displayed - title, summary - and others that 
> should not be displayed unless they're part of the query (in which case the 
> highlight is important and should be shown to the user).
> * highlights should be accurate for all sorts of queries: fuzzy, phrase, 
> prefix, Boolean, spans, etc.,
> * there can be more than one query at one time and they should highlight the 
> same content (with different colors).
> Many highlighters are available in Lucene (vector highlighter, postings 
> highlighter, unified highlighter) but none of them quite fit the bill above. 
> Believe me - we have tried (hard). We ended up using the unified highlighter but 
> with subclassing, customizations and all sorts of complex, low-level quirks. 
> My gut feeling at that point was that it should be the Query that somehow 
> *exposes* the information about how a given field content matched. Then I 
> looked at matches API and built a quick prototype retrieving "match regions" 
> on top of that. It works like magic. Here are the key insights:
> * matches API returns exactly what a highlighter needs: for a given query it 
> iterates over fields and positions (including offsets, if they are available) 
> that caused a document to be included in the search result,
> * when matches API cannot provide offsets, it provides elements from which 
> offsets can be computed: positions by re-analyzing the field's value, for 
> example.
> * in extreme cases it may happen the matches API doesn't provide anything 
> useful (a field only indexed, with no stored field value, no positions, no 
> offsets) but I assume it is up to the application layer to know how to deal 
> with this then (or not deal with it at all and throw an exception).
> * matches API delegates the work of providing proper match ranges to the 
> query itself (actually, to the weight a query produces); it doesn't need to 
> know anything about different implementations and their specifics.
> The absolute *key* element is the last one. Once you build a match region 
> retriever, highlighting is merely about organizing match ranges, dealing 
> with potential overlaps, and proper formatting. It becomes a simple, 
> tractable problem separated from the internals of Lucene Queries.
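
To illustrate the "dealing with potential overlaps" part, here is a minimal sketch in plain Java. It assumes match ranges have already been retrieved as half-open character offsets [start, end) for a single field value. The class MatchRangeMerger and its Range type are hypothetical names invented for this example and are not part of Lucene or of the components in this issue. The sketch also merges ranges that merely touch, which is a design choice rather than a requirement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not a Lucene class): merges overlapping match ranges. */
final class MatchRangeMerger {

  /** A half-open character range [start, end) within one field value. */
  static final class Range {
    final int start;
    final int end;

    Range(int start, int end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + start + "," + end + ")";
    }
  }

  /** Returns ranges sorted by start, with overlapping or touching ranges merged. */
  static List<Range> merge(List<Range> ranges) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingInt((Range r) -> r.start));

    List<Range> merged = new ArrayList<>();
    for (Range r : sorted) {
      int lastIndex = merged.size() - 1;
      if (lastIndex >= 0 && r.start <= merged.get(lastIndex).end) {
        // Overlaps (or touches) the previous range: extend it instead of adding a new one.
        Range last = merged.remove(lastIndex);
        merged.add(new Range(last.start, Math.max(last.end, r.end)));
      } else {
        merged.add(r);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Range> input = Arrays.asList(new Range(3, 8), new Range(0, 5), new Range(10, 12));
    System.out.println(merge(input));
  }
}
```

Merging is only one possible policy. An application that highlights several queries in different colors would more likely keep overlapping ranges separate and resolve them at formatting time.
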
> The initial set of "highlighter components" in this issue is a set of classes 
> that allows one to assemble a complete pipeline from any query into a set of 
> highlighted document fields. Any highlighter can be essentially built by 
> assembling the following steps:
> * retrieving documents and their fields/match ranges, given [Query, 
> IndexSearcher],
> * sanitizing match ranges (overlaps, etc.),
> * selecting the "best" snippet for the given set of match ranges,
> * formatting the output (adding start/end tags for snippets, ellipsis 
> between values, etc.).
> This issue implements components for all of the above steps. It isn't about 
> one highlighter class with tons of options, it's about bits and pieces that 
> can be put together to build anything one desires. This said, an example 
> "high level" highlighter class will also be provided as a sub-task.
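
The last two steps of the pipeline described above (snippet selection and output formatting) can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the implementation from this issue: SnippetSketch and bestSnippet are hypothetical names, ranges are assumed to be sorted, non-overlapping, half-open character offsets into a single field value, and "best" is reduced to a deliberately naive heuristic (the fixed-width window, starting at a match, that fully contains the most matches).

```java
/** Hypothetical sketch (not a Lucene class): picks a snippet window and formats it. */
final class SnippetSketch {

  /**
   * Picks the window of at most {@code width} characters, starting at some match,
   * that fully contains the most matches, then wraps those matches in pre/post tags.
   * Assumes {@code ranges} holds sorted, non-overlapping [start, end) offsets.
   */
  static String bestSnippet(String text, int[][] ranges, int width, String pre, String post) {
    int bestFrom = 0;
    int bestCount = -1;
    for (int[] candidate : ranges) {
      int from = candidate[0];
      int count = 0;
      for (int[] r : ranges) {
        if (r[0] >= from && r[1] <= from + width) {
          count++;
        }
      }
      // Strict comparison keeps the earliest window when several tie.
      if (count > bestCount) {
        bestCount = count;
        bestFrom = from;
      }
    }

    int from = bestFrom;
    int to = Math.min(text.length(), from + width);

    StringBuilder sb = new StringBuilder();
    if (from > 0) {
      sb.append("..."); // text was cut off before the window
    }
    int cursor = from;
    for (int[] r : ranges) {
      if (r[0] < from || r[1] > to) {
        continue; // match is not fully inside the window, leave it unmarked
      }
      sb.append(text, cursor, r[0]);
      sb.append(pre).append(text, r[0], r[1]).append(post);
      cursor = r[1];
    }
    sb.append(text, cursor, to);
    if (to < text.length()) {
      sb.append("..."); // text was cut off after the window
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String text = "the quick brown fox jumps over the lazy dog";
    int[][] ranges = {{4, 9}, {35, 39}, {40, 43}};
    System.out.println(bestSnippet(text, ranges, 12, "<b>", "</b>"));
  }
}
```

With the sample input in main, the window starting at "lazy" contains two matches and wins over the one starting at "quick". A real snippet selector would typically also respect word or sentence boundaries, which this sketch ignores.
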



