[jira] [Created] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8
Dawid Weiss created LUCENE-10284: Summary: Upgrade morfologik-stemming to 2.1.8 Key: LUCENE-10284 URL: https://issues.apache.org/jira/browse/LUCENE-10284 Project: Lucene - Core Issue Type: Task Reporter: Dawid Weiss -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8
[ https://issues.apache.org/jira/browse/LUCENE-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-10284: Fix Version/s: 9.1 > Upgrade morfologik-stemming to 2.1.8 > Key: LUCENE-10284 > URL: https://issues.apache.org/jira/browse/LUCENE-10284 > Project: Lucene - Core > Issue Type: Task > Reporter: Dawid Weiss > Priority: Trivial > Fix For: 9.1
[jira] [Assigned] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8
[ https://issues.apache.org/jira/browse/LUCENE-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss reassigned LUCENE-10284: Assignee: Dawid Weiss > Upgrade morfologik-stemming to 2.1.8 > Key: LUCENE-10284 > URL: https://issues.apache.org/jira/browse/LUCENE-10284 > Project: Lucene - Core > Issue Type: Task > Reporter: Dawid Weiss > Assignee: Dawid Weiss > Priority: Trivial > Fix For: 9.1
[GitHub] [lucene] dweiss merged pull request #514: LUCENE-10284: Upgrade morfologik-stemming to 2.1.8
dweiss merged pull request #514: URL: https://github.com/apache/lucene/pull/514 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8
[ https://issues.apache.org/jira/browse/LUCENE-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453322#comment-17453322 ] ASF subversion and git services commented on LUCENE-10284: Commit d2b7e7a4410e1b4a4e46818e38c89717963b5087 in lucene's branch refs/heads/main from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d2b7e7a ] LUCENE-10284: Upgrade morfologik-stemming to 2.1.8 (#514) > Upgrade morfologik-stemming to 2.1.8 > Key: LUCENE-10284 > URL: https://issues.apache.org/jira/browse/LUCENE-10284 > Project: Lucene - Core > Issue Type: Task > Reporter: Dawid Weiss > Assignee: Dawid Weiss > Priority: Trivial > Fix For: 9.1
[jira] [Commented] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8
[ https://issues.apache.org/jira/browse/LUCENE-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453323#comment-17453323 ] ASF subversion and git services commented on LUCENE-10284: Commit d2563e6f1fd019ecdcbfea45220a168b8a644242 in lucene's branch refs/heads/branch_9x from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d2563e6 ] LUCENE-10284: Upgrade morfologik-stemming to 2.1.8 (#514) > Upgrade morfologik-stemming to 2.1.8 > Key: LUCENE-10284 > URL: https://issues.apache.org/jira/browse/LUCENE-10284 > Project: Lucene - Core > Issue Type: Task > Reporter: Dawid Weiss > Assignee: Dawid Weiss > Priority: Trivial > Fix For: 9.1 > Time Spent: 10m > Remaining Estimate: 0h
[jira] [Resolved] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8
[ https://issues.apache.org/jira/browse/LUCENE-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-10284. Resolution: Fixed > Upgrade morfologik-stemming to 2.1.8 > Key: LUCENE-10284 > URL: https://issues.apache.org/jira/browse/LUCENE-10284 > Project: Lucene - Core > Issue Type: Task > Reporter: Dawid Weiss > Assignee: Dawid Weiss > Priority: Trivial > Fix For: 9.1 > Time Spent: 10m > Remaining Estimate: 0h
[GitHub] [lucene] rmuir opened a new pull request #515: simplify jflex grammars by using difference rather than negation
rmuir opened a new pull request #515: URL: https://github.com/apache/lucene/pull/515

This change uses a new jflex feature (https://github.com/jflex-de/jflex/pull/654) to simplify emoji processing in the grammar. We can do a set difference rather than work around it with complement + De Morgan stuff. It is cosmetic: it doesn't change the resulting tokenizers (see diff), but makes the emoji parts easier to read.

Bonus: a major speed-up when regenerating that huge UAX29UrlEmail DFA.

Before:
```
> Task :lucene:analysis:common:generateUAX29URLEmailTokenizer
Aggregate task times (possibly running in parallel!):
918.87 sec. generateUAX29URLEmailTokenizerInternal
```
After:
```
> Task :lucene:analysis:common:generateUAX29URLEmailTokenizer
Aggregate task times (possibly running in parallel!):
285.26 sec. generateUAX29URLEmailTokenizerInternal
```
This was suggested by jflex developers to help with the very slow regeneration on https://github.com/jflex-de/jflex/issues/715 . It doesn't solve all of our problems there, but it makes things a lot less painful :)
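To make the "difference rather than complement + De Morgan" remark concrete, here is a minimal sketch on toy sets, not on the actual JFlex grammar. It assumes a small hypothetical universe of 16 code points; the class and method names are made up for illustration. It only shows that `A \ B` and `not(not(A) or B)` describe the same set, which is why the rewrite is cosmetic.

```java
import java.util.BitSet;

/** Toy character classes over a small universe [0, UNIVERSE). Illustration only. */
final class SetDifferenceSketch {
    // Hypothetical universe size; real grammars work over all Unicode code points.
    static final int UNIVERSE = 16;

    /** Builds a set from the given members. */
    static BitSet of(int... members) {
        BitSet set = new BitSet(UNIVERSE);
        for (int m : members) {
            set.set(m);
        }
        return set;
    }

    /** Direct set difference: a \ b. */
    static BitSet difference(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.andNot(b);
        return result;
    }

    /** Complement relative to the universe. */
    static BitSet complement(BitSet s) {
        BitSet result = (BitSet) s.clone();
        result.flip(0, UNIVERSE);
        return result;
    }

    /** The same set via complement and De Morgan: a \ b = not(not(a) or b). */
    static BitSet differenceViaDeMorgan(BitSet a, BitSet b) {
        BitSet union = complement(a);
        union.or(b);
        return complement(union);
    }

    public static void main(String[] args) {
        BitSet a = of(1, 2, 3, 4, 5);
        BitSet b = of(4, 5, 6);
        System.out.println(difference(a, b));            // {1, 2, 3}
        System.out.println(differenceViaDeMorgan(a, b)); // {1, 2, 3}
    }
}
```

Both forms yield the same members; the direct difference simply states the intent in one step.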
[GitHub] [lucene] rmuir commented on pull request #515: simplify jflex grammars by using difference rather than negation
rmuir commented on pull request #515: URL: https://github.com/apache/lucene/pull/515#issuecomment-986055369 I'll take care of the precommit. There's some build weirdness in the regeneration where we pull the TLDs and make the new included-TLD.jflex *after* we recompile the grammar. Maybe dependencies are backwards. Anyway, I didn't really want to suck in today's new TLDs and trigger any changes to the parsers, as I wanted to show this change "makes no difference". Oh well, I'll just force-regenerate everything (you really need to do it twice right now to really pull in the new TLDs) and push again.
[GitHub] [lucene] rmuir commented on pull request #515: simplify jflex grammars by using difference rather than negation
rmuir commented on pull request #515: URL: https://github.com/apache/lucene/pull/515#issuecomment-986055851

And don't worry about the gradle build, I think part of the issue is that this TLDs file annoyingly changed *while I was iterating and working on this*:
```
# Version 2021120400, Last Updated Sat Dec 4 07:07:01 2021 UTC
```
[jira] [Created] (LUCENE-10285) gradle regenerate TLDs file / tokenizer dependency is backwards/wrong
Robert Muir created LUCENE-10285: Summary: gradle regenerate TLDs file / tokenizer dependency is backwards/wrong Key: LUCENE-10285 URL: https://issues.apache.org/jira/browse/LUCENE-10285 Project: Lucene - Core Issue Type: Task Reporter: Robert Muir

To reproduce: {{./gradlew regenerate --rerun-tasks}}

You'll see this behavior:
{noformat}
> Task :lucene:analysis:common:generateUAX29URLEmailTokenizerInternal
Regenerating UAX29URLEmailTokenizerImpl. This may take a long time (and requires 12g of memory!).
Recompiling JFlex: lucene/analysis/common/src/java/org/apache/lucene/analysis/email/UAX29URLEmailTokenizerImpl.jflex
...
> Task :lucene:analysis:common:generateTldsInternal
Execution optimizations have been disabled for task ':lucene:analysis:common:generateTldsInternal' to ensure correctness due to the following reasons:
- Gradle detected a problem with the following location: '/home/rmuir/workspace/lucene/lucene/analysis/common/src/java/org/apache/lucene/analysis/email/ASCIITLD.jflex'. Reason: Task ':lucene:analysis:common:generateUAX29URLEmailTokenizerInternal' uses this output of task ':lucene:analysis:common:generateTldsInternal' without declaring an explicit or implicit dependency. This can lead to incorrect results being produced, depending on what order the tasks are executed. Please refer to https://docs.gradle.org/7.2/userguide/validation_problems.html#implicit_dependency for more details about this problem.
Found 1489 TLDs in IANA TLD Database at https://data.iana.org/TLD/tlds-alpha-by-domain.txt
ASCIITLD: 1370 TLDs
ASCIITLDprefix_1CharSuffix: 108 TLDs
ASCIITLDprefix_2CharSuffix: 11 TLDs
Total: 1489 TLDs
You've regenerated the TLD include file, remember to regenerate UAX29URLEmailTokenizerImpl too.
{noformat}
So it regenerates the TLD include file after the UAX29URLEmailTokenizerImpl, which means now you gotta run "gradlew regenerate" again to really pick up the changes.

cc [~dweiss]
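The ordering problem reported above can be modeled outside Gradle. The following is a toy scheduler, not Gradle's actual algorithm: it is an assumption made purely for illustration that undeclared tasks run in registration order. Only the two task names come from the log; everything else (class, methods) is hypothetical. It shows that the TLD include file is only guaranteed to be produced first once the tokenizer task declares the dependency.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy task scheduler: a task runs after everything it declares, otherwise in registration order. */
final class TaskOrderSketch {
    // Task name -> names of the tasks it explicitly depends on.
    private final Map<String, List<String>> dependsOn = new LinkedHashMap<>();

    TaskOrderSketch task(String name, String... dependencies) {
        dependsOn.put(name, List.of(dependencies));
        return this;
    }

    /** Depth-first topological ordering over the declared edges only. */
    List<String> executionOrder() {
        Set<String> done = new LinkedHashSet<>();
        for (String name : dependsOn.keySet()) {
            visit(name, done, new LinkedHashSet<>());
        }
        return new ArrayList<>(done);
    }

    private void visit(String name, Set<String> done, Set<String> inProgress) {
        if (done.contains(name)) {
            return;
        }
        if (!inProgress.add(name)) {
            throw new IllegalStateException("Dependency cycle at " + name);
        }
        for (String dependency : dependsOn.getOrDefault(name, List.of())) {
            visit(dependency, done, inProgress);
        }
        inProgress.remove(name);
        done.add(name);
    }

    public static void main(String[] args) {
        String tokenizer = "generateUAX29URLEmailTokenizerInternal";
        String tlds = "generateTldsInternal";

        // No declared edge: the tokenizer may be regenerated from a stale TLD file.
        System.out.println(new TaskOrderSketch().task(tokenizer).task(tlds).executionOrder());

        // Declared edge: the TLD file is always regenerated first.
        System.out.println(new TaskOrderSketch().task(tokenizer, tlds).task(tlds).executionOrder());
    }
}
```

In the real build the equivalent fix would be to declare the TLD task's output as an input of the tokenizer task; how that is wired in Lucene's build scripts is not shown in this thread.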
[jira] [Assigned] (LUCENE-10285) gradle regenerate TLDs file / tokenizer dependency is backwards/wrong
[ https://issues.apache.org/jira/browse/LUCENE-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss reassigned LUCENE-10285: Assignee: Dawid Weiss > gradle regenerate TLDs file / tokenizer dependency is backwards/wrong > Key: LUCENE-10285 > URL: https://issues.apache.org/jira/browse/LUCENE-10285 > Project: Lucene - Core > Issue Type: Task > Reporter: Robert Muir > Assignee: Dawid Weiss > Priority: Major
[jira] [Commented] (LUCENE-10285) gradle regenerate TLDs file / tokenizer dependency is backwards/wrong
[ https://issues.apache.org/jira/browse/LUCENE-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453453#comment-17453453 ] Dawid Weiss commented on LUCENE-10285: Eh. That crap is so complicated already that it makes me want to run away when I see an issue touching it... :) I'll take a look later - can you run them/ order them manually now? Sorry about it. > gradle regenerate TLDs file / tokenizer dependency is backwards/wrong > Key: LUCENE-10285 > URL: https://issues.apache.org/jira/browse/LUCENE-10285 > Project: Lucene - Core > Issue Type: Task > Reporter: Robert Muir > Assignee: Dawid Weiss > Priority: Major
[GitHub] [lucene] rmuir commented on pull request #515: simplify jflex grammars by using difference rather than negation
rmuir commented on pull request #515: URL: https://github.com/apache/lucene/pull/515#issuecomment-986060813 I opened https://issues.apache.org/jira/browse/LUCENE-10285 about the TLD task dependency. For now, I just ran `regenerate` again and it brought in the TLD changes. It is a lot less painful at least now that it takes 1/3 of the time!
[jira] [Commented] (LUCENE-10285) gradle regenerate TLDs file / tokenizer dependency is backwards/wrong
[ https://issues.apache.org/jira/browse/LUCENE-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453456#comment-17453456 ] Robert Muir commented on LUCENE-10285: Yeah, I just ran {{regenerate}} again to work around it. Mainly just wanted to open the issue for tracking, so we don't forget about it. I'd recommend we merge the speed-up PR (https://github.com/apache/lucene/pull/515) anyway before trying to debug this issue. 5 minutes vs 15 minutes is a lot easier on the liver :) > gradle regenerate TLDs file / tokenizer dependency is backwards/wrong > Key: LUCENE-10285 > URL: https://issues.apache.org/jira/browse/LUCENE-10285 > Project: Lucene - Core > Issue Type: Task > Reporter: Robert Muir > Assignee: Dawid Weiss > Priority: Major
[GitHub] [lucene] dweiss commented on pull request #515: simplify jflex grammars by using difference rather than negation
dweiss commented on pull request #515: URL: https://github.com/apache/lucene/pull/515#issuecomment-986062093 That's a nice improvement!
[jira] [Commented] (LUCENE-10282) morfologik-stemming is not an automatic module
[ https://issues.apache.org/jira/browse/LUCENE-10282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453460#comment-17453460 ] Dawid Weiss commented on LUCENE-10282: This is partially done. The weird part is the Ukrainian dictionary - it's not declared in the module requirements (and it should be), yet it doesn't fail because that jar doesn't have any classes (just the resource)? > morfologik-stemming is not an automatic module > Key: LUCENE-10282 > URL: https://issues.apache.org/jira/browse/LUCENE-10282 > Project: Lucene - Core > Issue Type: Sub-task > Reporter: Dawid Weiss > Priority: Major
[jira] [Commented] (LUCENE-10282) morfologik-stemming is not an automatic module
[ https://issues.apache.org/jira/browse/LUCENE-10282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453461#comment-17453461 ] Dawid Weiss commented on LUCENE-10282: And I think this shows how even a compiling module can be trappy - I don't think the Ukrainian dictionary will work in module mode, even when the corresponding jar is part of the classpath. > morfologik-stemming is not an automatic module > Key: LUCENE-10282 > URL: https://issues.apache.org/jira/browse/LUCENE-10282 > Project: Lucene - Core > Issue Type: Sub-task > Reporter: Dawid Weiss > Priority: Major
[jira] [Created] (LUCENE-10286) Module path to dependencies that are not automatic modules (according to gradle)
Dawid Weiss created LUCENE-10286: Summary: Module path to dependencies that are not automatic modules (according to gradle) Key: LUCENE-10286 URL: https://issues.apache.org/jira/browse/LUCENE-10286 Project: Lucene - Core Issue Type: Sub-task Reporter: Dawid Weiss

So... the workaround [~tomoko] came up with here: https://github.com/dweiss/lucene/pull/8/files will not work. Basically, the workaround is to force gradle's compile java task to use module path with the classpath entries:
{code}
plugins.withType(JavaPlugin) {
  tasks.withType(JavaCompile) {
    doFirst {
      // Pass every classpath entry to javac as a module path entry.
      options.compilerArgs += [ "--module-path", classpath.asPath ]
    }
  }
}
{code}
This is indeed quoted as a solution all over the place but it predates what Gradle currently does (module path inference). There are multiple problems with the above:

1) The java compilation task does not "understand" that we pass a module path argument and happily issues its own version from the inferred path - I confirmed this by looking at the logs. The second option pretty much takes precedence (javac doesn't complain, it just takes the value of the last option) and things may break in weird ways there.

2) The solution is also flawed because the classpath can contain directories that are not modules... think cross-project dependencies (to other projects that are not modules). These directories would not be converted to automatic modules and would in fact fail the compilation.

I looked at gradle source code and the "module inference" is pretty much non-configurable - it's what the docs say (automatic module name or proper module descriptor). Anything else is treated as a classpath entry, with no way of moving it to module path.

There is an open issue that touches on this subject here: https://github.com/gradle/gradle/issues/12630 and Jendrik Johannes of Gradle published a plugin leveraging Gradle's "artifact transforms" that allow you to create full module info for jars that are missing it: https://github.com/jjohannes/extra-java-module-info#how-to-use-this-plugin

As fancy as this solution is... I don't like it that much. It's what Java already does for you (conversion of jars to automatic modules) - I'd rather rely on built-in mechanisms than gradle magic.

The only way out of this that I currently see is to turn off automatic module inference for these projects that contain non-modular dependencies and manually modify the classpath + module path for tasks that may be using them (primarily javac). I'm pretty sure it can be done with relative ease but I don't have any more time today to provide a proof of concept that would do it for one of the subprojects. Will do it tomorrow, hopefully.
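The inference rule described above ("automatic module name or proper module descriptor") can be approximated with plain JDK APIs. This is a minimal sketch of that check, not Gradle's implementation: the class and method names are hypothetical, and it ignores details such as multi-release jars. It may help explain why a jar with neither marker stays on the classpath.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

/** Approximates the "is this jar a module?" rule quoted in the issue. Illustration only. */
final class ModulePathCheckSketch {

    /** True if the jar has a module descriptor or an Automatic-Module-Name manifest entry. */
    static boolean looksLikeModule(Path jar) {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            if (jarFile.getEntry("module-info.class") != null) {
                return true;
            }
            Manifest manifest = jarFile.getManifest();
            return manifest != null
                && manifest.getMainAttributes().getValue("Automatic-Module-Name") != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a manifest-only temporary jar; automaticModuleName may be null. */
    static Path writeJar(String automaticModuleName) {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        if (automaticModuleName != null) {
            manifest.getMainAttributes().putValue("Automatic-Module-Name", automaticModuleName);
        }
        try {
            Path jar = Files.createTempFile("module-check-sketch", ".jar");
            jar.toFile().deleteOnExit();
            try (OutputStream out = Files.newOutputStream(jar);
                 JarOutputStream jarOut = new JarOutputStream(out, manifest)) {
                jarOut.flush(); // The constructor already wrote the manifest entry.
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "org.example.demo" is a made-up module name.
        System.out.println(looksLikeModule(writeJar("org.example.demo"))); // true
        System.out.println(looksLikeModule(writeJar(null)));               // false
    }
}
```

A jar for which this returns false is the kind of dependency the issue is about: javac could still treat it as an automatic module (deriving a name from the file name), but by the description above Gradle will not move it to the module path.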
[jira] [Updated] (LUCENE-10255) Fully embrace the java module system
[ https://issues.apache.org/jira/browse/LUCENE-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-10255: Description:

I've experimented a bit trying to move the code to the JMS. It is _surprisingly difficult_... A PoC that almost passes all checks is here: https://github.com/dweiss/lucene/tree/jms

Here are my conclusions so far:

* The JMS and gradle add a lot of complexity (this applies to any higher-level tooling, including IDEs, I think). For starters, modules have to be JARs. The effect of this is that what was previously a set of directories from dependencies now has to be a JAR. What was previously an incremental update of a single .class file now ripples throughout the build recreating module JARs (ZIPs!)... I didn't realize it at first, but it's a costly thing to do. I'm not even sure how IDEs handle this issue.
* A Java module contains metadata (such as the module version or main class) that is completely detached from any source file. These things live in the class bytecode of the compiled module-info; interestingly, there is no source-level way to specify it - these class attributes are injected by the 'jar' tool. Gradle has some fancy on-the-fly asm conversion filter that injects it.
* Dependencies between modules will effectively live in two places: in gradle build files and in module-info files. And they can go out of sync, although it's probably easy to catch (since javac would complain about missing classes during compilation, even if they're in module path).
* Probably the biggest challenge (not covered in the PoC) is with our custom javadoc and ecj linter tasks - they see the module-info.java and can't cope with it. At the same time, there is no easy way to exclude that one particular file: ecj would have to accept a full set of sources (command argument limit will be a problem), javac can accept a full set of java sources (external file) but then it doesn't copy doc-files properly anymore (this is probably easier to fix).
* There are differences at runtime that are hard to anticipate - for example resource lookups via class loader no longer work (I fixed this in Luke).
* We will have to rethink the long-term strategy of how white-box tests work. There are some guidelines here but all of them have some cons (IDEs being confused). https://docs.gradle.org/current/userguide/java_testing.html#sec:java_testing_modular
* It's pretty much impossible to exclude transitive dependencies from modules we depend on - if they're not compile-time only (static) requirements, they will have to be present on module path.

After poking a bit and trying it out I have to say I have mixed feelings about moving to the JMS. On the one hand, many things are great - the module path, module descriptors and access modes. On the other hand, the tooling tricks required to make it all work make you shiver.

If anybody wants to play with or improve things on that experimental branch (I converted Luke to a full module - it works), please be my guest. I have to sit on this and think whether it's something I really like or not.
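As an illustration of the point about module metadata living in the compiled descriptor rather than in source, the JDK's `java.lang.module.ModuleDescriptor` API exposes a version and a main class alongside the `requires` directives. The sketch below builds a descriptor in memory; the module, dependency, and class names are invented for the example, and this is not how Gradle or the `jar` tool injects the attributes.

```java
import java.lang.module.ModuleDescriptor;

/** Builds an in-memory module descriptor to show which metadata it can carry. */
final class ModuleMetadataSketch {

    /** All names below are hypothetical. */
    static ModuleDescriptor describe() {
        return ModuleDescriptor.newModule("org.example.demo")
            .requires("org.example.dep")        // the part module-info.java can express
            .version("9.1.0")                   // no module-info.java syntax for this
            .mainClass("org.example.demo.Main") // nor for this
            .build();
    }

    public static void main(String[] args) {
        ModuleDescriptor descriptor = describe();
        System.out.println(descriptor.name());                    // org.example.demo
        System.out.println(descriptor.rawVersion().orElse("?"));  // 9.1.0
        System.out.println(descriptor.mainClass().orElse("?"));   // org.example.demo.Main
    }
}
```

The `requires` entry is also the piece that has to be kept in sync with the build file's dependency declarations, which is the duplication mentioned in the list above.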
[GitHub] [lucene] rmuir commented on pull request #515: simplify jflex grammars by using difference rather than negation
rmuir commented on pull request #515: URL: https://github.com/apache/lucene/pull/515#issuecomment-986075965 @dweiss I think it's a good win just for getting a simpler grammar. There is probably evil stuff we could do to speed up the monster. Yes, I am slightly tempted to import `org.apache.lucene.util.automaton` into `GenerateJFlexTLDMacros.java`... but I agree with your thoughts on the JFlex issue, it is better to just generate the simple "transparent" grammar of TLDs, despite how slow it makes things.
[GitHub] [lucene] rmuir merged pull request #504: Make TestNRTReplication.testCrashReplica nightly
rmuir merged pull request #504: URL: https://github.com/apache/lucene/pull/504
[GitHub] [lucene] rmuir merged pull request #505: tone down BaseTermVectorsFormatTestCase.testLotsOfFields in non-nightly
rmuir merged pull request #505:
URL: https://github.com/apache/lucene/pull/505
[GitHub] [lucene] rmuir merged pull request #506: tone down TestIndexWriter.testMaxCompletedSequenceNumber in non-nightly
rmuir merged pull request #506:
URL: https://github.com/apache/lucene/pull/506
[GitHub] [lucene] rmuir opened a new pull request #516: speed up TestSimpleExplanationsWithFillerDocs
rmuir opened a new pull request #516:
URL: https://github.com/apache/lucene/pull/516

This is the slowest test suite, runs for ~ 60s, because between every document it adds 2048 "filler docs". This just adds up to a ton of indexing across all the test methods.

Use 2048 for Nightly, and instead a smaller number (4) for local builds. It saves almost a minute of cpu time in tests.

Before:
```
The slowest suites (exceeding 1s) during this run:
  59.44s TestSimpleExplanationsWithFillerDocs (:lucene:core)
```

After: (no longer on the list < 9s for sure)
```
The slowest suites (exceeding 1s) during this run:
  14.98s TestSimpleTextDocValuesFormat (:lucene:codecs)
  14.06s TestLucene90DocValuesFormat (:lucene:core)
  13.96s TestLucene90DocValuesFormatMergeInstance (:lucene:core)
  13.64s TestFSTPostingsFormat (:lucene:codecs)
  11.41s TestPerFieldDocValuesFormat (:lucene:core)
  11.29s TestSimpleTextTermVectorsFormat (:lucene:codecs)
  11.10s TestAssertingDocValuesFormat (:lucene:test-framework)
  10.62s TestDirectPostingsFormat (:lucene:codecs)
   9.43s TestIndexSorting (:lucene:core)
   9.28s TestLatLonPointQueries (:lucene:core)
```
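The nightly/local split described in this pull request can be illustrated with a small, self-contained sketch. This is not the code from the PR: the class `FillerDocsSketch`, its methods, and the plain `boolean nightly` parameter are hypothetical stand-ins (the real test reads the nightly setting from the Lucene test framework), and the formula assumes fillers are added only between consecutive real documents.

```java
// Hypothetical sketch: names are illustrative and not taken from the Lucene test.
final class FillerDocsSketch {
  static final int NIGHTLY_FILLER_DOCS = 2048; // value quoted in the PR description
  static final int LOCAL_FILLER_DOCS = 4;      // smaller value quoted for local builds

  /** Number of filler documents to add between two real documents. */
  static int fillerDocs(boolean nightly) {
    return nightly ? NIGHTLY_FILLER_DOCS : LOCAL_FILLER_DOCS;
  }

  /**
   * Total documents indexed for one index: the real documents plus one run of
   * fillers in each gap between consecutive real documents (an assumption).
   */
  static long totalDocs(int realDocs, boolean nightly) {
    int gaps = Math.max(0, realDocs - 1);
    return realDocs + (long) gaps * fillerDocs(nightly);
  }

  public static void main(String[] args) {
    int realDocs = 4;
    System.out.println("nightly total docs: " + totalDocs(realDocs, true));
    System.out.println("local total docs:   " + totalDocs(realDocs, false));
  }
}
```

Because the filler count multiplies the number of gaps, even a handful of real documents per test method turns into thousands of indexed documents at 2048, which is consistent with the suite dominating the "slowest suites" list before the change.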
[GitHub] [lucene] dsmiley commented on a change in pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety
dsmiley commented on a change in pull request #412:
URL: https://github.com/apache/lucene/pull/412#discussion_r762497281

## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java ##

@@ -113,118 +112,239 @@
   protected static final LabelledCharArrayMatcher[] ZERO_LEN_AUTOMATA_ARRAY =
       new LabelledCharArrayMatcher[0];
-  protected final IndexSearcher searcher; // if null, can only use highlightWithoutSearcher
+  protected final IndexSearcher searcher;
   protected final Analyzer indexAnalyzer;
-  private boolean defaultHandleMtq = true; // e.g. wildcards
+  private final int maxLength;
-  private boolean defaultHighlightPhrasesStrictly = true; // AKA "accuracy" or "query debugging"
+  private final Supplier defaultBreakIterator;
-  // For analysis, prefer MemoryIndexOffsetStrategy
-  private boolean defaultPassageRelevancyOverSpeed = true;
+  private final Predicate defaultFieldMatcher;
-  private int maxLength = DEFAULT_MAX_LENGTH;
+  private final PassageScorer defaultScorer;
-  // BreakIterator is stateful so we use a Supplier factory method
-  private Supplier defaultBreakIterator =
-      () -> BreakIterator.getSentenceInstance(Locale.ROOT);
+  private final PassageFormatter defaultFormatter;
-  private Predicate defaultFieldMatcher;
+  private final int defaultMaxNoHighlightPassages;
-  private PassageScorer defaultScorer = new PassageScorer();
+  // lazy initialized with double-check locking; protected so subclass can init
+  protected volatile FieldInfos fieldInfos;
-  private PassageFormatter defaultFormatter = new DefaultPassageFormatter();
+  private final int cacheFieldValCharsThreshold;
-  private int defaultMaxNoHighlightPassages = -1;
+  private final Set flags;
-  // lazy initialized with double-check locking; protected so subclass can init
-  protected volatile FieldInfos fieldInfos;
+  /** Builder for UnifiedHighlighter. */
+  public static class Builder {
+    /** If null, can only use highlightWithoutSearcher. */
+    private IndexSearcher searcher;
-  private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+    private Analyzer indexAnalyzer;
+    private boolean handleMultiTermQuery = true;
+    private boolean highlightPhrasesStrictly = true;
+    private boolean passageRelevancyOverSpeed = true;
+    private boolean weightMatches = true;
+    private int maxLength = DEFAULT_MAX_LENGTH;
-  /** Extracts matching terms after rewriting against an empty index */
-  protected static Set extractTerms(Query query) throws IOException {
-    Set queryTerms = new HashSet<>();
-    EMPTY_INDEXSEARCHER.rewrite(query).visit(QueryVisitor.termCollector(queryTerms));
-    return queryTerms;
-  }
+    /** BreakIterator is stateful so we use a Supplier factory method. */
+    private Supplier breakIterator =
+        () -> BreakIterator.getSentenceInstance(Locale.ROOT);
-  /**
-   * Constructs the highlighter with the given index searcher and analyzer.
-   *
-   * @param indexSearcher Usually required, unless {@link #highlightWithoutSearcher(String, Query,
-   *     String, int)} is used, in which case this needs to be null.
-   * @param indexAnalyzer Required, even if in some circumstances it isn't used.
-   */
-  public UnifiedHighlighter(IndexSearcher indexSearcher, Analyzer indexAnalyzer) {
-    this.searcher = indexSearcher; // TODO: make non nullable
-    this.indexAnalyzer =
-        Objects.requireNonNull(
-            indexAnalyzer,
-            "indexAnalyzer is required" + " (even if in some circumstances it isn't used)");
-  }
+    private Predicate fieldMatcher;
+    private PassageScorer scorer = new PassageScorer();
+    private PassageFormatter formatter = new DefaultPassageFormatter();
+    private int maxNoHighlightPassages = -1;
+    private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+    private Set flags;
-  public void setHandleMultiTermQuery(boolean handleMtq) {
-    this.defaultHandleMtq = handleMtq;
-  }
+    /**
+     * Usually required, unless {@link #highlightWithoutSearcher(String, Query, String, int)} is
+     * used, in which case this needs to be null.
+     */
+    public Builder withSearcher(IndexSearcher value) {

Review comment:
Yeah; that's exactly what I meant. Thanks.
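The general shape of the change under review (mutable setters replaced by a builder that produces an object with only final fields) can be sketched in a few lines. This is a simplified, hypothetical stand-in and not the PR's code: `HighlighterSketch`, `analyzerName`, and the `DEFAULT_MAX_LENGTH` value are invented for illustration. Only the builder idea, the null check in the constructor, and the `maxLength` range check echo what appears in the diff.

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for the builder pattern discussed in the review.
final class HighlighterSketch {
  static final int DEFAULT_MAX_LENGTH = 10_000; // illustrative value only

  // All state is final: once built, an instance can be shared between threads.
  private final String analyzerName;
  private final int maxLength;
  private final boolean handleMultiTermQuery;

  private HighlighterSketch(Builder builder) {
    // The required argument is validated here, in the constructor.
    this.analyzerName = Objects.requireNonNull(builder.analyzerName, "analyzerName is required");
    this.maxLength = builder.maxLength;
    this.handleMultiTermQuery = builder.handleMultiTermQuery;
  }

  static Builder builder() {
    return new Builder();
  }

  String analyzerName() { return analyzerName; }
  int maxLength() { return maxLength; }
  boolean handleMultiTermQuery() { return handleMultiTermQuery; }

  /** Mutable and not thread-safe; meant to be used by one thread and then discarded. */
  static final class Builder {
    private String analyzerName;
    private int maxLength = DEFAULT_MAX_LENGTH;
    private boolean handleMultiTermQuery = true;

    Builder withAnalyzerName(String value) {
      this.analyzerName = value;
      return this;
    }

    Builder withMaxLength(int value) {
      if (value < 0 || value == Integer.MAX_VALUE) {
        throw new IllegalArgumentException("maxLength must be >= 0 and < Integer.MAX_VALUE");
      }
      this.maxLength = value;
      return this;
    }

    Builder withHandleMultiTermQuery(boolean value) {
      this.handleMultiTermQuery = value;
      return this;
    }

    HighlighterSketch build() {
      return new HighlighterSketch(this);
    }
  }
}
```

A caller would write something like `HighlighterSketch.builder().withAnalyzerName("standard").withMaxLength(500).build()`. The design point is that configuration happens entirely on the builder, so no setter can race with a concurrent highlight call on the finished object.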
[GitHub] [lucene] apanimesh061 commented on a change in pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety
apanimesh061 commented on a change in pull request #412:
URL: https://github.com/apache/lucene/pull/412#discussion_r762514373

## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java ##

@@ -113,118 +112,239 @@
   protected static final LabelledCharArrayMatcher[] ZERO_LEN_AUTOMATA_ARRAY =
       new LabelledCharArrayMatcher[0];
-  protected final IndexSearcher searcher; // if null, can only use highlightWithoutSearcher
+  protected final IndexSearcher searcher;
   protected final Analyzer indexAnalyzer;
-  private boolean defaultHandleMtq = true; // e.g. wildcards
+  private final int maxLength;
-  private boolean defaultHighlightPhrasesStrictly = true; // AKA "accuracy" or "query debugging"
+  private final Supplier defaultBreakIterator;
-  // For analysis, prefer MemoryIndexOffsetStrategy
-  private boolean defaultPassageRelevancyOverSpeed = true;
+  private final Predicate defaultFieldMatcher;
-  private int maxLength = DEFAULT_MAX_LENGTH;
+  private final PassageScorer defaultScorer;
-  // BreakIterator is stateful so we use a Supplier factory method
-  private Supplier defaultBreakIterator =
-      () -> BreakIterator.getSentenceInstance(Locale.ROOT);
+  private final PassageFormatter defaultFormatter;
-  private Predicate defaultFieldMatcher;
+  private final int defaultMaxNoHighlightPassages;
-  private PassageScorer defaultScorer = new PassageScorer();
+  // lazy initialized with double-check locking; protected so subclass can init
+  protected volatile FieldInfos fieldInfos;
-  private PassageFormatter defaultFormatter = new DefaultPassageFormatter();
+  private final int cacheFieldValCharsThreshold;
-  private int defaultMaxNoHighlightPassages = -1;
+  private final Set flags;
-  // lazy initialized with double-check locking; protected so subclass can init
-  protected volatile FieldInfos fieldInfos;
+  /** Builder for UnifiedHighlighter. */
+  public static class Builder {
+    /** If null, can only use highlightWithoutSearcher. */
+    private IndexSearcher searcher;
-  private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+    private Analyzer indexAnalyzer;
+    private boolean handleMultiTermQuery = true;
+    private boolean highlightPhrasesStrictly = true;
+    private boolean passageRelevancyOverSpeed = true;
+    private boolean weightMatches = true;
+    private int maxLength = DEFAULT_MAX_LENGTH;
-  /** Extracts matching terms after rewriting against an empty index */
-  protected static Set extractTerms(Query query) throws IOException {
-    Set queryTerms = new HashSet<>();
-    EMPTY_INDEXSEARCHER.rewrite(query).visit(QueryVisitor.termCollector(queryTerms));
-    return queryTerms;
-  }
+    /** BreakIterator is stateful so we use a Supplier factory method. */
+    private Supplier breakIterator =
+        () -> BreakIterator.getSentenceInstance(Locale.ROOT);
-  /**
-   * Constructs the highlighter with the given index searcher and analyzer.
-   *
-   * @param indexSearcher Usually required, unless {@link #highlightWithoutSearcher(String, Query,
-   *     String, int)} is used, in which case this needs to be null.
-   * @param indexAnalyzer Required, even if in some circumstances it isn't used.
-   */
-  public UnifiedHighlighter(IndexSearcher indexSearcher, Analyzer indexAnalyzer) {
-    this.searcher = indexSearcher; // TODO: make non nullable
-    this.indexAnalyzer =
-        Objects.requireNonNull(
-            indexAnalyzer,
-            "indexAnalyzer is required" + " (even if in some circumstances it isn't used)");
-  }
+    private Predicate fieldMatcher;
+    private PassageScorer scorer = new PassageScorer();
+    private PassageFormatter formatter = new DefaultPassageFormatter();
+    private int maxNoHighlightPassages = -1;
+    private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+    private Set flags;
-  public void setHandleMultiTermQuery(boolean handleMtq) {
-    this.defaultHandleMtq = handleMtq;
-  }
+    /**
+     * Usually required, unless {@link #highlightWithoutSearcher(String, Query, String, int)} is
+     * used, in which case this needs to be null.
+     */
+    public Builder withSearcher(IndexSearcher value) {
+      this.searcher = value;
+      return self();
+    }
-  public void setHighlightPhrasesStrictly(boolean highlightPhrasesStrictly) {
-    this.defaultHighlightPhrasesStrictly = highlightPhrasesStrictly;
-  }
+    /**
+     * This method sets the analyzer for the UH object. Required, even if in some circumstances it
+     * isn' used. The null check is performed in the constructor.
+     */
+    public Builder withIndexAnalyzer(Analyzer value) {
+      this.indexAnalyzer = value;
+      return self();
+    }
-  public void setMaxLength(int maxLength) {
-    if (maxLength < 0 || maxLength == Integer.MAX_VALUE) {
-      // two reasons: no overflow problems in BreakIterator.preceding(offset+1),
-
[GitHub] [lucene] apanimesh061 commented on a change in pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety
apanimesh061 commented on a change in pull request #412:
URL: https://github.com/apache/lucene/pull/412#discussion_r762514390

## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java ##

@@ -823,24 +943,13 @@ public void visitLeaf(Query query) {
     return filteredTerms.toArray(new BytesRef[filteredTerms.size()]);
   }
-  /** Customize the highlighting flags to use by field. */
+  /**
+   * Customize the highlighting flags to use by field. Here the user can either specify the set of
+   * {@link HighlightFlag}s to be applied or use the boolean flags to populate final list of {@link

Review comment:
Will rectify the comment.
[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets
[ https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453526#comment-17453526 ]

Haoyu Zhai commented on LUCENE-10229:

Seems for {{containedBy}} this inconsistency is introduced [here|https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/ConjunctionMatchesIterator.java#L60,L75], perhaps we could further subclass the {{ConjunctionMatchesIterator}} to a {{FilterMatchesIterator}} to let the offset methods return only offset of "source"?

> Match offsets should be consistent for fields with positions and fields with offsets
>
> Key: LUCENE-10229
> URL: https://issues.apache.org/jira/browse/LUCENE-10229
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Priority: Major
>
> This is a follow-up of LUCENE-10223 in which it was discovered that fields with offsets don't highlight some more complex interval queries properly. Alan says:
> {quote}
> It's because it returns the position of the inner match, but the offsets of the outer. And so if you're re-analyzing and retrieving offsets by looking at the positions, you get the 'right' thing. It's not obvious to me what the correct response is here, but thinking about it the current behaviour is kind of the worst of both worlds, and perhaps we should change it so that you get offsets of the inner match as standard, and then the outer match is returned as part of the sub matches.
> {quote}
> Intervals are nicely separated into "basic intervals" and "filters" which restrict some other source of intervals, here is the original documentation:
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50
> My experience from an extended period of using interval queries in a frontend where they're highlighted is that filters are restrictions that should not be highlighted - it's the source intervals that people care about. Filters are what you remove or where you give proper context to source intervals.
> The test code contributed in LUCENE-10223 contains numerous query-highlight examples (on fields with positions) where this intuition is demonstrated on all kinds of interval functions:
> https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542
> This issue is about making the internals work consistently for fields with positions and fields with offsets.
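The intuition in the issue description, that a filter restricts which source intervals match but should not contribute to what gets highlighted, can be shown with a toy model. This sketch does not use the Lucene intervals API: `ContainedBySketch`, `Match`, and the `containedBy` helper below are hypothetical, and real matches are produced by iterators rather than lists. It only illustrates the proposed outcome, where both positions and offsets of a reported match come from the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only; not the Lucene API. All names here are hypothetical.
final class ContainedBySketch {

  /** A match with inclusive token positions and character offsets into the text. */
  static final class Match {
    final int startPos;
    final int endPos;
    final int startOffset;
    final int endOffset;

    Match(int startPos, int endPos, int startOffset, int endOffset) {
      this.startPos = startPos;
      this.endPos = endPos;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  /**
   * Keeps each source match whose positions lie inside some filter match.
   * The reported match is the source match itself, so its offsets are the
   * source's offsets and never those of the enclosing filter match.
   */
  static List<Match> containedBy(List<Match> source, List<Match> filter) {
    List<Match> result = new ArrayList<>();
    for (Match s : source) {
      for (Match f : filter) {
        if (f.startPos <= s.startPos && s.endPos <= f.endPos) {
          result.add(s);
          break; // one enclosing filter match is enough
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Text "x b y z w b": the term "b" occurs at positions 1 and 5.
    List<Match> source = Arrays.asList(new Match(1, 1, 2, 3), new Match(5, 5, 10, 11));
    // A filter interval covering positions 0..2 ("x b y"), offsets 0..5.
    List<Match> filter = Arrays.asList(new Match(0, 2, 0, 5));
    for (Match m : containedBy(source, filter)) {
      System.out.println("highlight offsets " + m.startOffset + ".." + m.endOffset);
    }
  }
}
```

In this model a highlighter that reads offsets directly and one that re-analyzes the text from positions would mark the same span, which is the consistency the issue asks for.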
[GitHub] [lucene] apanimesh061 commented on pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety
apanimesh061 commented on pull request #412:
URL: https://github.com/apache/lucene/pull/412#issuecomment-986174116

@dsmiley Updated the PR as per the comments and added the new builder functions.
[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets
[ https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453538#comment-17453538 ]

Dawid Weiss commented on LUCENE-10229:

Thanks for looking at this, [~zhai7631]! I honestly don't know what the problem is - didn't have time to look at it as I was absorbed by other issues. If you'd like to try to file a patch, please go ahead. [~romseygeek] is the expert on this code and I'm sure he'd guide us both here.

I am just convinced about what the outcome should be - consistent with what the positions currently return. It is really more logical this way.