[jira] [Created] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8

2021-12-04 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-10284:


 Summary: Upgrade morfologik-stemming to 2.1.8
 Key: LUCENE-10284
 URL: https://issues.apache.org/jira/browse/LUCENE-10284
 Project: Lucene - Core
  Issue Type: Task
Reporter: Dawid Weiss






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8

2021-12-04 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-10284:
-
Fix Version/s: 9.1

> Upgrade morfologik-stemming to 2.1.8
> 
>
> Key: LUCENE-10284
> URL: https://issues.apache.org/jira/browse/LUCENE-10284
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Priority: Trivial
> Fix For: 9.1
>
>







[jira] [Assigned] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8

2021-12-04 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss reassigned LUCENE-10284:


Assignee: Dawid Weiss

> Upgrade morfologik-stemming to 2.1.8
> 
>
> Key: LUCENE-10284
> URL: https://issues.apache.org/jira/browse/LUCENE-10284
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
> Fix For: 9.1
>
>







[GitHub] [lucene] dweiss merged pull request #514: LUCENE-10284: Upgrade morfologik-stemming to 2.1.8

2021-12-04 Thread GitBox


dweiss merged pull request #514:
URL: https://github.com/apache/lucene/pull/514


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[jira] [Commented] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8

2021-12-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453322#comment-17453322
 ] 

ASF subversion and git services commented on LUCENE-10284:
--

Commit d2b7e7a4410e1b4a4e46818e38c89717963b5087 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d2b7e7a ]

LUCENE-10284: Upgrade morfologik-stemming to 2.1.8 (#514)






[jira] [Commented] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8

2021-12-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453323#comment-17453323
 ] 

ASF subversion and git services commented on LUCENE-10284:
--

Commit d2563e6f1fd019ecdcbfea45220a168b8a644242 in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d2563e6 ]

LUCENE-10284: Upgrade morfologik-stemming to 2.1.8 (#514)



> Upgrade morfologik-stemming to 2.1.8
> 
>
> Key: LUCENE-10284
> URL: https://issues.apache.org/jira/browse/LUCENE-10284
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
> Fix For: 9.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[jira] [Resolved] (LUCENE-10284) Upgrade morfologik-stemming to 2.1.8

2021-12-04 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-10284.
--
Resolution: Fixed




[GitHub] [lucene] rmuir opened a new pull request #515: simplify jflex grammars by using difference rather than negation

2021-12-04 Thread GitBox


rmuir opened a new pull request #515:
URL: https://github.com/apache/lucene/pull/515


   This change uses a new jflex feature 
(https://github.com/jflex-de/jflex/pull/654) to simplify emoji processing in 
the grammar. We can do a set difference rather than work around it with 
complement + De Morgan stuff.
   
   It is cosmetic: doesn't change the resulting tokenizers (see diff), but 
makes the emoji parts easier to read.
   
   bonus: major speed up to regenerating that huge UAX29UrlEmail DFA.
   
   Before:
   ```
   > Task :lucene:analysis:common:generateUAX29URLEmailTokenizer
   Aggregate task times (possibly running in parallel!):
918.87 sec.  generateUAX29URLEmailTokenizerInternal
   ```
   
   After:
   ```
   > Task :lucene:analysis:common:generateUAX29URLEmailTokenizer
   Aggregate task times (possibly running in parallel!):
285.26 sec.  generateUAX29URLEmailTokenizerInternal
   ```
   
   This was suggested by jflex developers to help with the 
very-slow-regeneration on https://github.com/jflex-de/jflex/issues/715 . It 
doesn't solve all of our problems there, but it makes things a lot less painful 
:)
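
To illustrate why the two formulations are interchangeable, here is a minimal sketch. It is not JFlex syntax and not code from the PR. It models character classes as `java.util.BitSet` over a toy 128-code-point universe, and the class and method names (`SetDifferenceSketch`, `differenceViaDeMorgan`, and so on) are hypothetical. It shows that a direct difference `a \ b` matches the workaround `not(not(a) or b)`.

```java
import java.util.BitSet;

public class SetDifferenceSketch {
  // Assumption for illustration: a toy alphabet of code points 0..127.
  public static final int UNIVERSE = 128;

  /** Direct set difference: everything in a that is not in b. */
  public static BitSet difference(BitSet a, BitSet b) {
    BitSet result = (BitSet) a.clone();
    result.andNot(b);
    return result;
  }

  /** Complement within the toy universe. */
  public static BitSet complement(BitSet s) {
    BitSet result = (BitSet) s.clone();
    result.flip(0, UNIVERSE);
    return result;
  }

  /** The workaround: a \ b expressed as not(not(a) or b), using only union and complement. */
  public static BitSet differenceViaDeMorgan(BitSet a, BitSet b) {
    BitSet union = complement(a);
    union.or(b);
    return complement(union);
  }

  /** Inclusive range of code points, like a [from-to] character class. */
  public static BitSet range(int from, int toInclusive) {
    BitSet s = new BitSet(UNIVERSE);
    s.set(from, toInclusive + 1);
    return s;
  }

  public static void main(String[] args) {
    BitSet letters = range('a', 'z');
    BitSet vowels = new BitSet(UNIVERSE);
    for (char c : "aeiou".toCharArray()) {
      vowels.set(c);
    }
    BitSet direct = difference(letters, vowels);
    BitSet workaround = differenceViaDeMorgan(letters, vowels);
    System.out.println("same set: " + direct.equals(workaround));
    System.out.println("consonants: " + direct.cardinality());
  }
}
```

Both forms describe the same set, which is consistent with the change being cosmetic for the generated tokenizers. The difference form is simply shorter to read.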





[GitHub] [lucene] rmuir commented on pull request #515: simplify jflex grammars by using difference rather than negation

2021-12-04 Thread GitBox


rmuir commented on pull request #515:
URL: https://github.com/apache/lucene/pull/515#issuecomment-986055369


   I'll take care of the precommit. There's some build weirdness in the 
regeneration where we pull the TLDs and make the new included-TLD.jflex *after* 
we recompile the grammar. Maybe the dependencies are backwards. Anyway, I didn't 
really want to suck in today's new TLDs and trigger any changes to the parsers, 
as I wanted to show this change "makes no difference".
   
   Oh well, I'll just force-regenerate everything (you really need to do it 
twice right now to really pull in the new TLDs) and push again.





[GitHub] [lucene] rmuir commented on pull request #515: simplify jflex grammars by using difference rather than negation

2021-12-04 Thread GitBox


rmuir commented on pull request #515:
URL: https://github.com/apache/lucene/pull/515#issuecomment-986055851


   and don't worry about the gradle build, I think part of the issue is that 
this TLDs file annoyingly changed *while I was iterating and working on this*:
   ```
   # Version 2021120400, Last Updated Sat Dec  4 07:07:01 2021 UTC
   ```





[jira] [Created] (LUCENE-10285) gradle regenerate TLDs file / tokenizer dependency is backwards/wrong

2021-12-04 Thread Robert Muir (Jira)
Robert Muir created LUCENE-10285:


 Summary: gradle regenerate TLDs file / tokenizer dependency is 
backwards/wrong
 Key: LUCENE-10285
 URL: https://issues.apache.org/jira/browse/LUCENE-10285
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir


To reproduce: {{./gradlew regenerate --rerun-tasks}}

You'll see this behavior:
{noformat}
> Task :lucene:analysis:common:generateUAX29URLEmailTokenizerInternal
Regenerating UAX29URLEmailTokenizerImpl. This may take a long time (and 
requires 12g of memory!).
Recompiling JFlex: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/email/UAX29URLEmailTokenizerImpl.jflex

...

> Task :lucene:analysis:common:generateTldsInternal
Execution optimizations have been disabled for task 
':lucene:analysis:common:generateTldsInternal' to ensure correctness due to the 
following reasons:
  - Gradle detected a problem with the following location: 
'/home/rmuir/workspace/lucene/lucene/analysis/common/src/java/org/apache/lucene/analysis/email/ASCIITLD.jflex'.
 Reason: Task ':lucene:analysis:common:generateUAX29URLEmailTokenizerInternal' 
uses this output of task ':lucene:analysis:common:generateTldsInternal' without 
declaring an explicit or implicit dependency. This can lead to incorrect 
results being produced, depending on what order the tasks are executed. Please 
refer to 
https://docs.gradle.org/7.2/userguide/validation_problems.html#implicit_dependency
 for more details about this problem.
Found 1489 TLDs in IANA TLD Database at 
https://data.iana.org/TLD/tlds-alpha-by-domain.txt
  ASCIITLD: 1370 TLDs
ASCIITLDprefix_1CharSuffix:  108 TLDs
ASCIITLDprefix_2CharSuffix:   11 TLDs
 Total: 1489 TLDs
You've regenerated the TLD include file, remember to regenerate 
UAX29URLEmailTokenizerImpl too.
{noformat}

So it regenerates the TLD include file after the UAX29URLEmailTokenizerImpl, 
which means now you gotta run "gradlew regenerate" again to really pick up the 
changes.
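
A toy model of the ordering problem, for illustration only: this is not Gradle's scheduler and not the Lucene build code, and the class `TaskOrderSketch` is hypothetical. It only shows that when the tokenizer task does not declare the TLD task as a dependency, nothing forces the TLD task to run first, while a declared dependency does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TaskOrderSketch {
  // Task name -> names of tasks that must run before it.
  private final Map<String, Set<String>> dependencies = new LinkedHashMap<>();

  /** Registers a task together with the tasks it declares as dependencies. */
  public TaskOrderSketch task(String name, String... dependsOn) {
    dependencies.computeIfAbsent(name, k -> new LinkedHashSet<>());
    for (String dep : dependsOn) {
      dependencies.computeIfAbsent(dep, k -> new LinkedHashSet<>());
      dependencies.get(name).add(dep);
    }
    return this;
  }

  /** Depth-first order: a task is emitted only after all of its declared dependencies. */
  public List<String> executionOrder() {
    List<String> order = new ArrayList<>();
    Set<String> visiting = new LinkedHashSet<>();
    for (String name : dependencies.keySet()) {
      visit(name, visiting, order);
    }
    return order;
  }

  private void visit(String name, Set<String> visiting, List<String> order) {
    if (order.contains(name)) {
      return;
    }
    if (!visiting.add(name)) {
      throw new IllegalStateException("dependency cycle at " + name);
    }
    for (String dep : dependencies.get(name)) {
      visit(dep, visiting, order);
    }
    visiting.remove(name);
    order.add(name);
  }

  public static void main(String[] args) {
    // No declared dependency: this toy model falls back to registration order,
    // so the tokenizer is regenerated before the TLD include file.
    System.out.println(new TaskOrderSketch()
        .task("generateUAX29URLEmailTokenizerInternal")
        .task("generateTldsInternal")
        .executionOrder());

    // Declared dependency: the TLD include file is always regenerated first.
    System.out.println(new TaskOrderSketch()
        .task("generateUAX29URLEmailTokenizerInternal", "generateTldsInternal")
        .task("generateTldsInternal")
        .executionOrder());
  }
}
```

In this model the second configuration yields the TLD task before the tokenizer task, which is the order a single "gradlew regenerate" run would need.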

cc [~dweiss]






[jira] [Assigned] (LUCENE-10285) gradle regenerate TLDs file / tokenizer dependency is backwards/wrong

2021-12-04 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss reassigned LUCENE-10285:


Assignee: Dawid Weiss

> gradle regenerate TLDs file / tokenizer dependency is backwards/wrong
> -
>
> Key: LUCENE-10285
> URL: https://issues.apache.org/jira/browse/LUCENE-10285
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Assignee: Dawid Weiss
>Priority: Major
>



[jira] [Commented] (LUCENE-10285) gradle regenerate TLDs file / tokenizer dependency is backwards/wrong

2021-12-04 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453453#comment-17453453
 ] 

Dawid Weiss commented on LUCENE-10285:
--

Eh. That crap is so complicated already that it makes me want to run away when 
I see an issue touching it... :) I'll take a look later - can you run them/ 
order them manually now? Sorry about it.




[GitHub] [lucene] rmuir commented on pull request #515: simplify jflex grammars by using difference rather than negation

2021-12-04 Thread GitBox


rmuir commented on pull request #515:
URL: https://github.com/apache/lucene/pull/515#issuecomment-986060813


   I opened https://issues.apache.org/jira/browse/LUCENE-10285 about the TLD 
task dependency. For now, I just ran `regenerate` again and it brought in the 
TLD changes. It is a lot less painful at least now that it takes 1/3 of the 
time!





[jira] [Commented] (LUCENE-10285) gradle regenerate TLDs file / tokenizer dependency is backwards/wrong

2021-12-04 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453456#comment-17453456
 ] 

Robert Muir commented on LUCENE-10285:
--

Yeah, I just ran {{regenerate}} again to work around it. Mainly just wanted to 
open the issue for tracking, so we don't forget about it.

I'd recommend we merge the speed-up PR 
(https://github.com/apache/lucene/pull/515) anyway before trying to debug 
this issue. 5 minutes vs 15 minutes is a lot easier on the liver :)




[GitHub] [lucene] dweiss commented on pull request #515: simplify jflex grammars by using difference rather than negation

2021-12-04 Thread GitBox


dweiss commented on pull request #515:
URL: https://github.com/apache/lucene/pull/515#issuecomment-986062093


   That's a nice improvement!





[jira] [Commented] (LUCENE-10282) morfologik-stemming is not an automatic module

2021-12-04 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453460#comment-17453460
 ] 

Dawid Weiss commented on LUCENE-10282:
--

This is partially done. The weird part is the Ukrainian dictionary - it's not 
declared in the module requirements (and it should be), yet it doesn't fail 
because that jar doesn't have any classes (just the resource)?

> morfologik-stemming is not an automatic module
> --
>
> Key: LUCENE-10282
> URL: https://issues.apache.org/jira/browse/LUCENE-10282
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Priority: Major
>







[jira] [Commented] (LUCENE-10282) morfologik-stemming is not an automatic module

2021-12-04 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453461#comment-17453461
 ] 

Dawid Weiss commented on LUCENE-10282:
--

And I think this shows how even a compiling module can be trappy - I don't 
think the Ukrainian dictionary will work in module mode, even when the 
corresponding jar is part of the classpath.




[jira] [Created] (LUCENE-10286) Module path to dependencies that are not automatic modules (according to gradle)

2021-12-04 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-10286:


 Summary: Module path to dependencies that are not automatic 
modules (according to gradle)
 Key: LUCENE-10286
 URL: https://issues.apache.org/jira/browse/LUCENE-10286
 Project: Lucene - Core
  Issue Type: Sub-task
Reporter: Dawid Weiss


So... the workaround [~tomoko] came up with here: 
https://github.com/dweiss/lucene/pull/8/files will not work. Basically, the 
workaround is to force gradle's compile java task to use module path with the 
classpath entries:
{code}
plugins.withType(JavaPlugin) {
  tasks.withType(JavaCompile) {
    doFirst {
      // Pass every classpath entry to javac as the module path.
      options.compilerArgs += [
        "--module-path", classpath.asPath
      ]
    }
  }
}
{code}

this is indeed quoted as a solution all over the place but it predates what 
Gradle currently does (module path inference).

There are multiple problems with the above:

1) java compilation task does not "understand" that we pass a module path 
argument and happily issues its own version from the inferred path - I 
confirmed this by looking at the logs. The second option pretty much takes 
precedence (javac doesn't complain, it just takes the value of the last option) 
and things may break in weird ways there.

2) the solution is also flawed because the classpath can contain directories that 
are not modules... think cross-project dependencies (to other projects that are 
not modules). These directories would not be converted to automatic modules and 
would in fact fail the compilation.
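
A small sketch of the second point, using the JDK's own `java.lang.module.ModuleFinder` rather than anything from Gradle or the Lucene build. The class `ModulePathEntrySketch`, the jar name `plain-lib-1.0.jar` and the `classes` directory are made up for illustration. It asks the finder what it discovers in a plain jar and in a directory of loose class files.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.lang.module.ModuleFinder;
import java.lang.module.ModuleReference;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ModulePathEntrySketch {

  /** Names of the modules the JDK finder discovers for a single path entry. */
  public static Set<String> moduleNames(Path entry) {
    Set<String> names = new TreeSet<>();
    for (ModuleReference ref : ModuleFinder.of(entry).findAll()) {
      String suffix = ref.descriptor().isAutomatic() ? " (automatic)" : "";
      names.add(ref.descriptor().name() + suffix);
    }
    return names;
  }

  /** Creates a scratch directory for the demo. */
  public static Path tempDir() {
    try {
      return Files.createTempDirectory("module-path-sketch");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Writes a jar holding only a manifest (no module-info.class): a stand-in for a non-modular dependency. */
  public static Path plainJar(Path dir, String fileName) {
    Path jar = dir.resolve(fileName);
    Manifest manifest = new Manifest();
    manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
    try (OutputStream out = Files.newOutputStream(jar);
        JarOutputStream jarOut = new JarOutputStream(out, manifest)) {
      // Nothing to add: the manifest entry is written by the constructor.
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return jar;
  }

  /** Creates classes/org/example/Foo.class (an empty placeholder), like another project's output directory. */
  public static Path classesDir(Path parent) {
    Path root = parent.resolve("classes");
    try {
      Path pkg = Files.createDirectories(root.resolve("org").resolve("example"));
      Files.createFile(pkg.resolve("Foo.class"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return root;
  }

  public static void main(String[] args) {
    Path tmp = tempDir();
    System.out.println("jar entry:       " + moduleNames(plainJar(tmp, "plain-lib-1.0.jar")));
    System.out.println("directory entry: " + moduleNames(classesDir(tmp)));
  }
}
```

Under these assumptions the jar is picked up as an automatic module whose name is derived from the file name, while the class directory yields no module at all. That matches the concern that classpath directories cannot simply be moved onto the module path.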

I looked at gradle source code and the "module inference" is pretty much 
non-configurable - it's what the docs say (automatic module name or proper 
module descriptor). Anything else is treated as a classpath entry, with no way 
of moving it to module path. There is an open issue that touches on this 
subject here:

https://github.com/gradle/gradle/issues/12630

and Jendrik Johannes of Gradle published a plugin leveraging Gradle's "artifact 
transforms" that allow you to create full module info for jars that are missing 
it:

https://github.com/jjohannes/extra-java-module-info#how-to-use-this-plugin

As fancy as this solution is... I don't like it that much. It's what Java 
already does for you (conversion of jars to automatic modules) - I'd rather 
rely on built-in mechanisms than gradle magic.

The only way out of this that I currently see is to turn off automatic module 
inference for these projects that contain non-modular dependencies and manually 
modify the classpath + module path for tasks that may be using them (primarily 
javac). I'm pretty sure it can be done with relative ease but I don't have any 
more time today to provide a proof of concept that would do it for one of the 
subprojects. Will do it tomorrow, hopefully.






[jira] [Updated] (LUCENE-10255) Fully embrace the java module system

2021-12-04 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-10255:
-
Description: 
I've experimented a bit trying to move the code to the JMS. It is _surprisingly 
difficult_... A PoC that almost passes all checks is here:
https://github.com/dweiss/lucene/tree/jms

Here are my conclusions so far:
* The JMS and gradle add a lot of complexity (this applies to any higher-level 
tooling, including IDEs, I think). For starters, modules have to be JARs. The 
effect of this is that what was previously a set of directories from 
dependencies now has to be a JAR. What was previously an incremental update of 
a single .class file now ripples throughout the build recreating module JARs 
(ZIPs!)... I didn't realize it at first, but it's a costly thing to do. I'm not 
even sure how IDEs handle this issue.
* A Java module contains metadata (such as the module version or main class) 
that is completely detached from any source file. These things live in a class 
bytecode of the compiled module-info; interestingly, there is no source-level 
way to specify it - these class attributes are injected by the 'jar' tool. 
Gradle has some fancy on-the-fly asm conversion filter that injects it.
* Dependencies between modules will effectively live in two places: in gradle 
build files and in module-info files. And they can go out of sync, although 
it's probably easy to catch (since javac would complain about missing classes 
during compilation, even if they're in module path).
* Probably the biggest challenge (not covered in the PoC) is with our custom 
javadoc and ecj linter tasks - they see the module-info.java and can't cope 
with it. At the same time, there is no easy way to exclude that one particular 
file: ecj would have to accept a full set of sources (command argument limit 
will be a problem), javac can accept a full set of java sources (external file) 
but then it doesn't copy doc-files properly anymore (this is probably easier to 
fix). 
* There are differences at runtime that are hard to anticipate - for example 
resource lookups via class loader no longer work (I fixed this in Luke).
* We will have to rethink the long-term strategy of how white-box tests work. 
There are some guidelines here but all of them have some cons (IDEs being 
confused). 
https://docs.gradle.org/current/userguide/java_testing.html#sec:java_testing_modular
* It's pretty much impossible to exclude transitive dependencies from modules 
we depend on - if they're not compile-time only (static) requirements, they 
will have to be present on module path.
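
As a small illustration of the metadata point above, the compiled descriptor can be read back at runtime through `java.lang.module.ModuleDescriptor`. This is only a sketch. The class name `ModuleMetadataSketch` is hypothetical, and `java.base` is used as the example module because it is always present.

```java
import java.lang.module.ModuleDescriptor;

public class ModuleMetadataSketch {

  /** One-line summary of the metadata recorded in a module's compiled descriptor. */
  public static String describe(Module module) {
    ModuleDescriptor descriptor = module.getDescriptor();
    if (descriptor == null) {
      // Code loaded from the classpath lives in an unnamed module without a descriptor.
      return "unnamed module (no descriptor)";
    }
    // Version and main class are optional attributes of the compiled module-info.
    String version = descriptor.version()
        .map(ModuleDescriptor.Version::toString)
        .orElse("<no version>");
    String mainClass = descriptor.mainClass().orElse("<no main class>");
    return descriptor.name()
        + " version=" + version
        + " mainClass=" + mainClass
        + " automatic=" + descriptor.isAutomatic();
  }

  public static void main(String[] args) {
    System.out.println(describe(String.class.getModule()));
    System.out.println(describe(ModuleMetadataSketch.class.getModule()));
  }
}
```

The version and main class show up here as optional values, which fits the observation that they are attached to the compiled module-info rather than written in module-info.java.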

After poking a bit and trying it out I have to say I have mixed feelings about 
moving to the JMS. On the one hand, many things are great - the module path, 
module descriptors and access modes. On the other hand, the tooling tricks 
required to make it all work make you shiver.

If anybody wants to play with or improve things on that experimental branch (I 
converted Luke to a full module - it works), please be my guest. I have to sit 
on this and think whether it's something I really like or not.


[GitHub] [lucene] rmuir commented on pull request #515: simplify jflex grammars by using difference rather than negation

2021-12-04 Thread GitBox


rmuir commented on pull request #515:
URL: https://github.com/apache/lucene/pull/515#issuecomment-986075965


   @dweiss I think it's a good win just for getting a simpler grammar. 
   
   There is probably evil stuff we could do to speed up the monster. Yes, I am 
slightly tempted to import `org.apache.lucene.util.automaton` into 
`GenerateJFlexTLDMacros.java`... but I agree with your thoughts on the JFlex 
issue, it is better to just generate the simple "transparent" grammar of TLDs, 
despite how slow it makes things.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir merged pull request #504: Make TestNRTReplication.testCrashReplica nightly

2021-12-04 Thread GitBox


rmuir merged pull request #504:
URL: https://github.com/apache/lucene/pull/504


   





[GitHub] [lucene] rmuir merged pull request #505: tone down BaseTermVectorsFormatTestCase.testLotsOfFields in non-nightly

2021-12-04 Thread GitBox


rmuir merged pull request #505:
URL: https://github.com/apache/lucene/pull/505


   





[GitHub] [lucene] rmuir merged pull request #506: tone down TestIndexWriter.testMaxCompletedSequenceNumber in non-nightly

2021-12-04 Thread GitBox


rmuir merged pull request #506:
URL: https://github.com/apache/lucene/pull/506


   





[GitHub] [lucene] rmuir opened a new pull request #516: speed up TestSimpleExplanationsWithFillerDocs

2021-12-04 Thread GitBox


rmuir opened a new pull request #516:
URL: https://github.com/apache/lucene/pull/516


   This is the slowest test suite: it runs for ~60s because it adds 2048
   "filler docs" between every two documents. This adds up to a ton of
   indexing across all the test methods.
   
   Use 2048 for Nightly runs and a smaller number (4) for local builds. It 
saves almost a minute of CPU time in tests.
   
   Before:
   ```
   The slowest suites (exceeding 1s) during this run:
 59.44s TestSimpleExplanationsWithFillerDocs (:lucene:core)
   ```
   
   After: (no longer on the list, so < 9s for sure)
   ```
   The slowest suites (exceeding 1s) during this run:
 14.98s TestSimpleTextDocValuesFormat (:lucene:codecs)
 14.06s TestLucene90DocValuesFormat (:lucene:core)
 13.96s TestLucene90DocValuesFormatMergeInstance (:lucene:core)
 13.64s TestFSTPostingsFormat (:lucene:codecs)
 11.41s TestPerFieldDocValuesFormat (:lucene:core)
 11.29s TestSimpleTextTermVectorsFormat (:lucene:codecs)
 11.10s TestAssertingDocValuesFormat (:lucene:test-framework)
 10.62s TestDirectPostingsFormat (:lucene:codecs)
  9.43s TestIndexSorting (:lucene:core)
  9.28s TestLatLonPointQueries (:lucene:core)
   ```
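
The change described above boils down to choosing the filler count from the run mode. A minimal sketch of that idea follows; the class, the method names and the `tests.nightly` property lookup are assumptions made for illustration (Lucene's test framework has its own nightly switch), and the document layout in `totalIndexed` is a simplification.

```java
// Hypothetical helper, not the code from the pull request.
final class FillerDocsSketch {
  static final int NIGHTLY_FILLER_DOCS = 2048;
  static final int LOCAL_FILLER_DOCS = 4;

  /** Picks the number of filler docs added per real document. */
  static int fillerDocsPerDoc(boolean nightly) {
    return nightly ? NIGHTLY_FILLER_DOCS : LOCAL_FILLER_DOCS;
  }

  /**
   * Total documents indexed, assuming each real document comes with one
   * batch of fillers (a simplification of the real test layout).
   */
  static long totalIndexed(int realDocs, int fillerPerDoc) {
    return (long) realDocs * (1 + fillerPerDoc);
  }

  public static void main(String[] args) {
    // Assumed property name; read here only to make the sketch runnable.
    boolean nightly = Boolean.getBoolean("tests.nightly");
    int filler = fillerDocsPerDoc(nightly);
    System.out.println("filler docs per doc: " + filler);
    System.out.println("docs indexed for 10 real docs: " + totalIndexed(10, filler));
  }
}
```

Under this simplified layout, ten real documents mean 20490 indexed documents in nightly mode and 50 locally, which is where the saving comes from.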





[GitHub] [lucene] dsmiley commented on a change in pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety

2021-12-04 Thread GitBox


dsmiley commented on a change in pull request #412:
URL: https://github.com/apache/lucene/pull/412#discussion_r762497281



##
File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java
##
@@ -113,118 +112,239 @@
   protected static final LabelledCharArrayMatcher[] ZERO_LEN_AUTOMATA_ARRAY =
   new LabelledCharArrayMatcher[0];
 
-  protected final IndexSearcher searcher; // if null, can only use 
highlightWithoutSearcher
+  protected final IndexSearcher searcher;
 
   protected final Analyzer indexAnalyzer;
 
-  private boolean defaultHandleMtq = true; // e.g. wildcards
+  private final int maxLength;
 
-  private boolean defaultHighlightPhrasesStrictly = true; // AKA "accuracy" or 
"query debugging"
+  private final Supplier<BreakIterator> defaultBreakIterator;
 
-  // For analysis, prefer MemoryIndexOffsetStrategy
-  private boolean defaultPassageRelevancyOverSpeed = true;
+  private final Predicate<String> defaultFieldMatcher;
 
-  private int maxLength = DEFAULT_MAX_LENGTH;
+  private final PassageScorer defaultScorer;
 
-  // BreakIterator is stateful so we use a Supplier factory method
-  private Supplier<BreakIterator> defaultBreakIterator =
-  () -> BreakIterator.getSentenceInstance(Locale.ROOT);
+  private final PassageFormatter defaultFormatter;
 
-  private Predicate<String> defaultFieldMatcher;
+  private final int defaultMaxNoHighlightPassages;
 
-  private PassageScorer defaultScorer = new PassageScorer();
+  // lazy initialized with double-check locking; protected so subclass can init
+  protected volatile FieldInfos fieldInfos;
 
-  private PassageFormatter defaultFormatter = new DefaultPassageFormatter();
+  private final int cacheFieldValCharsThreshold;
 
-  private int defaultMaxNoHighlightPassages = -1;
+  private final Set<HighlightFlag> flags;
 
-  // lazy initialized with double-check locking; protected so subclass can init
-  protected volatile FieldInfos fieldInfos;
+  /** Builder for UnifiedHighlighter. */
+  public static class Builder {
+/** If null, can only use highlightWithoutSearcher. */
+private IndexSearcher searcher;
 
-  private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+private Analyzer indexAnalyzer;
+private boolean handleMultiTermQuery = true;
+private boolean highlightPhrasesStrictly = true;
+private boolean passageRelevancyOverSpeed = true;
+private boolean weightMatches = true;
+private int maxLength = DEFAULT_MAX_LENGTH;
 
-  /** Extracts matching terms after rewriting against an empty index */
-  protected static Set<Term> extractTerms(Query query) throws IOException {
-Set<Term> queryTerms = new HashSet<>();
-
EMPTY_INDEXSEARCHER.rewrite(query).visit(QueryVisitor.termCollector(queryTerms));
-return queryTerms;
-  }
+/** BreakIterator is stateful so we use a Supplier factory method. */
+private Supplier<BreakIterator> breakIterator =
+() -> BreakIterator.getSentenceInstance(Locale.ROOT);
 
-  /**
-   * Constructs the highlighter with the given index searcher and analyzer.
-   *
-   * @param indexSearcher Usually required, unless {@link 
#highlightWithoutSearcher(String, Query,
-   * String, int)} is used, in which case this needs to be null.
-   * @param indexAnalyzer Required, even if in some circumstances it isn't 
used.
-   */
-  public UnifiedHighlighter(IndexSearcher indexSearcher, Analyzer 
indexAnalyzer) {
-this.searcher = indexSearcher; // TODO: make non nullable
-this.indexAnalyzer =
-Objects.requireNonNull(
-indexAnalyzer,
-"indexAnalyzer is required" + " (even if in some circumstances it 
isn't used)");
-  }
+private Predicate<String> fieldMatcher;
+private PassageScorer scorer = new PassageScorer();
+private PassageFormatter formatter = new DefaultPassageFormatter();
+private int maxNoHighlightPassages = -1;
+private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+private Set<HighlightFlag> flags;
 
-  public void setHandleMultiTermQuery(boolean handleMtq) {
-this.defaultHandleMtq = handleMtq;
-  }
+/**
+ * Usually required, unless {@link #highlightWithoutSearcher(String, 
Query, String, int)} is
+ * used, in which case this needs to be null.
+ */
+public Builder withSearcher(IndexSearcher value) {

Review comment:
   Yeah; that's exactly what I meant. Thanks.







[GitHub] [lucene] apanimesh061 commented on a change in pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety

2021-12-04 Thread GitBox


apanimesh061 commented on a change in pull request #412:
URL: https://github.com/apache/lucene/pull/412#discussion_r762514373



##
File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java
##
@@ -113,118 +112,239 @@
   protected static final LabelledCharArrayMatcher[] ZERO_LEN_AUTOMATA_ARRAY =
   new LabelledCharArrayMatcher[0];
 
-  protected final IndexSearcher searcher; // if null, can only use 
highlightWithoutSearcher
+  protected final IndexSearcher searcher;
 
   protected final Analyzer indexAnalyzer;
 
-  private boolean defaultHandleMtq = true; // e.g. wildcards
+  private final int maxLength;
 
-  private boolean defaultHighlightPhrasesStrictly = true; // AKA "accuracy" or 
"query debugging"
+  private final Supplier<BreakIterator> defaultBreakIterator;
 
-  // For analysis, prefer MemoryIndexOffsetStrategy
-  private boolean defaultPassageRelevancyOverSpeed = true;
+  private final Predicate<String> defaultFieldMatcher;
 
-  private int maxLength = DEFAULT_MAX_LENGTH;
+  private final PassageScorer defaultScorer;
 
-  // BreakIterator is stateful so we use a Supplier factory method
-  private Supplier<BreakIterator> defaultBreakIterator =
-  () -> BreakIterator.getSentenceInstance(Locale.ROOT);
+  private final PassageFormatter defaultFormatter;
 
-  private Predicate<String> defaultFieldMatcher;
+  private final int defaultMaxNoHighlightPassages;
 
-  private PassageScorer defaultScorer = new PassageScorer();
+  // lazy initialized with double-check locking; protected so subclass can init
+  protected volatile FieldInfos fieldInfos;
 
-  private PassageFormatter defaultFormatter = new DefaultPassageFormatter();
+  private final int cacheFieldValCharsThreshold;
 
-  private int defaultMaxNoHighlightPassages = -1;
+  private final Set<HighlightFlag> flags;
 
-  // lazy initialized with double-check locking; protected so subclass can init
-  protected volatile FieldInfos fieldInfos;
+  /** Builder for UnifiedHighlighter. */
+  public static class Builder {
+/** If null, can only use highlightWithoutSearcher. */
+private IndexSearcher searcher;
 
-  private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+private Analyzer indexAnalyzer;
+private boolean handleMultiTermQuery = true;
+private boolean highlightPhrasesStrictly = true;
+private boolean passageRelevancyOverSpeed = true;
+private boolean weightMatches = true;
+private int maxLength = DEFAULT_MAX_LENGTH;
 
-  /** Extracts matching terms after rewriting against an empty index */
-  protected static Set<Term> extractTerms(Query query) throws IOException {
-Set<Term> queryTerms = new HashSet<>();
-
EMPTY_INDEXSEARCHER.rewrite(query).visit(QueryVisitor.termCollector(queryTerms));
-return queryTerms;
-  }
+/** BreakIterator is stateful so we use a Supplier factory method. */
+private Supplier<BreakIterator> breakIterator =
+() -> BreakIterator.getSentenceInstance(Locale.ROOT);
 
-  /**
-   * Constructs the highlighter with the given index searcher and analyzer.
-   *
-   * @param indexSearcher Usually required, unless {@link 
#highlightWithoutSearcher(String, Query,
-   * String, int)} is used, in which case this needs to be null.
-   * @param indexAnalyzer Required, even if in some circumstances it isn't 
used.
-   */
-  public UnifiedHighlighter(IndexSearcher indexSearcher, Analyzer 
indexAnalyzer) {
-this.searcher = indexSearcher; // TODO: make non nullable
-this.indexAnalyzer =
-Objects.requireNonNull(
-indexAnalyzer,
-"indexAnalyzer is required" + " (even if in some circumstances it 
isn't used)");
-  }
+private Predicate<String> fieldMatcher;
+private PassageScorer scorer = new PassageScorer();
+private PassageFormatter formatter = new DefaultPassageFormatter();
+private int maxNoHighlightPassages = -1;
+private int cacheFieldValCharsThreshold = DEFAULT_CACHE_CHARS_THRESHOLD;
+private Set<HighlightFlag> flags;
 
-  public void setHandleMultiTermQuery(boolean handleMtq) {
-this.defaultHandleMtq = handleMtq;
-  }
+/**
+ * Usually required, unless {@link #highlightWithoutSearcher(String, 
Query, String, int)} is
+ * used, in which case this needs to be null.
+ */
+public Builder withSearcher(IndexSearcher value) {
+  this.searcher = value;
+  return self();
+}
 
-  public void setHighlightPhrasesStrictly(boolean highlightPhrasesStrictly) {
-this.defaultHighlightPhrasesStrictly = highlightPhrasesStrictly;
-  }
+/**
+ * This method sets the analyzer for the UH object. Required, even if in 
some circumstances it
+ * isn't used. The null check is performed in the constructor.
+ */
+public Builder withIndexAnalyzer(Analyzer value) {
+  this.indexAnalyzer = value;
+  return self();
+}
 
-  public void setMaxLength(int maxLength) {
-if (maxLength < 0 || maxLength == Integer.MAX_VALUE) {
-  // two reasons: no overflow problems in 
BreakIterator.preceding(offset+1),
-   

[GitHub] [lucene] apanimesh061 commented on a change in pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety

2021-12-04 Thread GitBox


apanimesh061 commented on a change in pull request #412:
URL: https://github.com/apache/lucene/pull/412#discussion_r762514390



##
File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java
##
@@ -823,24 +943,13 @@ public void visitLeaf(Query query) {
 return filteredTerms.toArray(new BytesRef[filteredTerms.size()]);
   }
 
-  /** Customize the highlighting flags to use by field. */
+  /**
+   * Customize the highlighting flags to use by field. Here the user can 
either specify the set of
+   * {@link HighlightFlag}s to be applied or use the boolean flags to populate 
final list of {@link

Review comment:
   Will rectify the comment.







[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets

2021-12-04 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453526#comment-17453526
 ] 

Haoyu Zhai commented on LUCENE-10229:
-

It seems that for {{containedBy}} this inconsistency is introduced 
[here|https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/ConjunctionMatchesIterator.java#L60,L75]. 
Perhaps we could further subclass {{ConjunctionMatchesIterator}} into a 
{{FilterMatchesIterator}} so that the offset methods return only the offsets of 
the "source"?
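
To make the inconsistency concrete, here is a toy model of the two behaviours. It assumes nothing about Lucene's real classes: `Match`, `mixed` and `sourceOnly` are hypothetical names, and the min/max offset merge is only an approximation of what the linked conjunction code is described as doing.

```java
// Toy model of the behaviour discussed in this issue; all names are hypothetical.
final class MatchOffsetsSketch {

  /** A match with token positions and character offsets. */
  static final class Match {
    final int startPos, endPos, startOffset, endOffset;

    Match(int startPos, int endPos, int startOffset, int endOffset) {
      this.startPos = startPos;
      this.endPos = endPos;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  /** Described current behaviour: positions of the source, offsets spanning both. */
  static Match mixed(Match source, Match filter) {
    return new Match(
        source.startPos,
        source.endPos,
        Math.min(source.startOffset, filter.startOffset),
        Math.max(source.endOffset, filter.endOffset));
  }

  /** Suggested behaviour: both positions and offsets come from the source. */
  static Match sourceOnly(Match source, Match filter) {
    return source;
  }

  public static void main(String[] args) {
    // Text: "the quick brown fox". Source matches "brown",
    // the containing filter interval matches "quick brown fox".
    Match source = new Match(2, 2, 10, 15);
    Match filter = new Match(1, 3, 4, 19);
    Match m = mixed(source, filter);
    Match s = sourceOnly(source, filter);
    System.out.println("mixed:      pos " + m.startPos + " offsets " + m.startOffset + "-" + m.endOffset);
    System.out.println("sourceOnly: pos " + s.startPos + " offsets " + s.startOffset + "-" + s.endOffset);
  }
}
```

With the mixed variant a highlighter working from offsets would mark "quick brown fox", while one working from positions would mark only "brown".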

> Match offsets should be consistent for fields with positions and fields with 
> offsets
> 
>
> Key: LUCENE-10229
> URL: https://issues.apache.org/jira/browse/LUCENE-10229
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Major
>
> This is a follow-up of LUCENE-10223 in which it was discovered that fields 
> with
> offsets don't highlight some more complex interval queries properly.  Alan 
> says:
> {quote}
> It's because it returns the position of the inner match, but the offsets of 
> the outer.  And so if you're re-analyzing and retrieving offsets by looking 
> at the positions, you get the 'right' thing.  It's not obvious to me what the 
> correct response is here, but thinking about it the current behaviour is kind 
> of the worst of both worlds, and perhaps we should change it so that you get 
> offsets of the inner match as standard, and then the outer match is returned 
> as part of the sub matches.
> {quote}
> Intervals are nicely separated into "basic intervals" and "filters" which 
> restrict some other source of intervals, here is the original documentation:
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50
> My experience from an extended period of using interval queries in a frontend 
> where they're highlighted is that filters are restrictions that should not be 
> highlighted - it's the source intervals that people care about. Filters are 
> what you remove or where you give proper context to source intervals.
> The test code contributed in LUCENE-10223 contains numerous query-highlight 
> examples (on fields with positions) where this intuition is demonstrated on 
> all kinds of interval functions:
> https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542
> This issue is about making the internals work consistently for fields with 
> positions and fields with offsets.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] apanimesh061 commented on pull request #412: LUCENE-10197: UnifiedHighlighter should use builders for thread-safety

2021-12-04 Thread GitBox


apanimesh061 commented on pull request #412:
URL: https://github.com/apache/lucene/pull/412#issuecomment-986174116


   @dsmiley 
   
   Updated the PR as per the comments and added the new builder functions.
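
For readers skimming the thread, the general shape of the builder approach looks roughly like the sketch below. It is not the PR's code: `HighlighterSketch`, its two settings and the default value are hypothetical, and only the range check on the maximum length mirrors the one visible in the removed `setMaxLength` in the diff quoted earlier.

```java
// Hypothetical stand-in for a highlighter configured through a builder.
final class HighlighterSketch {
  // All state is final, so a built instance can be shared between threads.
  private final int maxLength;
  private final boolean handleMultiTermQuery;

  private HighlighterSketch(Builder builder) {
    this.maxLength = builder.maxLength;
    this.handleMultiTermQuery = builder.handleMultiTermQuery;
  }

  int maxLength() {
    return maxLength;
  }

  boolean handleMultiTermQuery() {
    return handleMultiTermQuery;
  }

  static Builder builder() {
    return new Builder();
  }

  /** Mutable only while configuring; never shared after build(). */
  static final class Builder {
    private int maxLength = 10_000; // hypothetical default
    private boolean handleMultiTermQuery = true;

    Builder withMaxLength(int value) {
      if (value < 0 || value == Integer.MAX_VALUE) {
        throw new IllegalArgumentException("maxLength must be >= 0 and < Integer.MAX_VALUE");
      }
      this.maxLength = value;
      return this;
    }

    Builder withHandleMultiTermQuery(boolean value) {
      this.handleMultiTermQuery = value;
      return this;
    }

    HighlighterSketch build() {
      return new HighlighterSketch(this);
    }
  }

  public static void main(String[] args) {
    HighlighterSketch h =
        HighlighterSketch.builder().withMaxLength(500).withHandleMultiTermQuery(false).build();
    System.out.println(h.maxLength() + " " + h.handleMultiTermQuery());
  }
}
```

The point of the pattern is that setters disappear from the built object, so there is no window in which one thread reconfigures a highlighter another thread is using.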





[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets

2021-12-04 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453538#comment-17453538
 ] 

Dawid Weiss commented on LUCENE-10229:
--

Thanks for looking at this, [~zhai7631]! I honestly don't know what the problem 
is - I didn't have time to look at it as I was absorbed by other issues. If 
you'd like to try to file a patch, please go ahead. [~romseygeek] is the expert 
on this code and I'm sure he'd guide us both here. I am only convinced about 
what the outcome should be: consistent with what the positions currently 
return. It is really more logical this way.



