[ https://issues.apache.org/jira/browse/LUCENE-9439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168613#comment-17168613 ]
Dawid Weiss commented on LUCENE-9439:
-------------------------------------

Hi Alan. I've added a patch that shows the PoC highlighter. The patch is fairly large because it includes a rather complex passage selector, which is needed to make the patch self-contained.

The thing to look at is the test: HitRegionRetrieverTest. It shows how the "hit regions" are retrieved from the Matches API for various query types, including fuzzy queries, contiguous ranges for gap queries, non-default position gaps, synonyms, etc. These hit ranges are then passed to a passage selector and a simple test-only formatter, which clean up potential overlaps and nesting problems (useful for HTML, for example).

I think this works great and is elegantly small (everything is essentially in the HitRegionRetriever class).

The only test that currently fails is testTextFieldNoPositions. There is no way to know which field the match was on (or to retrieve its "value"). If the Matches API could provide this information somehow, it'd be fairly complete, I think.

The patch isn't meant to go in (the ant bit is missing, and I added a dependency on assertj for gradle only) but it shows the big picture of what I'm trying to achieve.

> Matches API should enumerate hit fields that have no positions (no iterator)
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-9439
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9439
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>         Attachments: LUCENE-9439.patch, matchhighlighter.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> I have been fiddling with the Matches API and it's great. There is one corner
> case that doesn't work for me though -- queries that affect fields without
> positions return {{MatchesUtils.MATCH_WITH_NO_TERMS}} but this constant is
> problematic as it doesn't carry the field name that caused it (returns null).
> The associated fromSubMatches combines all these constants into one (or
> swallows them), which is another problem.
> I think it would be more consistent if MATCH_WITH_NO_TERMS was replaced with
> a true match (carrying the field name) returning an empty iterator (or a
> constant "empty" iterator NO_TERMS).
> I have a very compelling use case: I wrote an "auto-highlighter" that runs on
> top of the Matches API and automatically picks up query-relevant fields and
> snippets. Everything works beautifully except for cases where fields are
> searchable but don't have any positions (token-like fields).
> I can work on a patch but wanted to reach out first - [~romseygeek]?
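
To make the proposal in the quoted description concrete, here is a minimal sketch of the idea: a match on a position-less field still carries its field name and simply returns an empty iterator. This is not Lucene code and not what the attached patches contain. The types FieldMatch and HitRange, the helper methods and the field names "body" and "id" are all hypothetical stand-ins used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class NoTermsMatchSketch {

    /** Hypothetical stand-in for one hit region (offsets into the field value). */
    static final class HitRange {
        final int start;
        final int end;

        HitRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical per-field match: always knows its field, may have no ranges. */
    interface FieldMatch {
        String field();

        Iterator<HitRange> ranges();
    }

    /** A match on a field that was indexed with positions. */
    static FieldMatch withRanges(final String field, final List<HitRange> ranges) {
        return new FieldMatch() {
            public String field() { return field; }
            public Iterator<HitRange> ranges() { return ranges.iterator(); }
        };
    }

    /** A match on a field without positions: field name kept, empty iterator. */
    static FieldMatch noTerms(final String field) {
        return new FieldMatch() {
            public String field() { return field; }
            public Iterator<HitRange> ranges() { return Collections.emptyIterator(); }
        };
    }

    /** Lists every field that matched, whether or not it has hit regions. */
    static List<String> matchedFields(List<FieldMatch> matches) {
        List<String> fields = new ArrayList<>();
        for (FieldMatch m : matches) {
            fields.add(m.field());
        }
        return fields;
    }

    /** Counts the hit regions of one match (zero for a position-less field). */
    static int countRanges(FieldMatch m) {
        int n = 0;
        for (Iterator<HitRange> it = m.ranges(); it.hasNext(); it.next()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<FieldMatch> matches = Arrays.asList(
            withRanges("body", Arrays.asList(new HitRange(4, 9), new HitRange(20, 25))),
            noTerms("id"));
        for (FieldMatch m : matches) {
            System.out.println(m.field() + ": " + countRanges(m) + " hit region(s)");
        }
    }
}
```

With this shape, a highlighter can enumerate all matched fields uniformly and treat "no regions" as "highlight the whole value" (or skip it), instead of special-casing a shared constant that has lost the field name.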