[jira] [Created] (LUCENE-9920) Remove binary gradle-wrapper.jar from the repository

2021-04-10 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-9920:
---

 Summary: Remove binary gradle-wrapper.jar from the repository
 Key: LUCENE-9920
 URL: https://issues.apache.org/jira/browse/LUCENE-9920
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Dawid Weiss
Assignee: Dawid Weiss






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9920) Remove binary gradle-wrapper.jar from the repository

2021-04-10 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318503#comment-17318503
 ] 

Dawid Weiss commented on LUCENE-9920:
-

Just noticed that gradle-wrapper.jar is back in the repo. I'll add it to 
.gitignore and remove it.

> Remove binary gradle-wrapper.jar from the repository
> 
>
> Key: LUCENE-9920
> URL: https://issues.apache.org/jira/browse/LUCENE-9920
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>







[jira] [Commented] (LUCENE-9920) Remove binary gradle-wrapper.jar from the repository

2021-04-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318504#comment-17318504
 ] 

ASF subversion and git services commented on LUCENE-9920:
-

Commit 4818a83cb204864595c352cf95855918410e20d5 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4818a83 ]

LUCENE-9920: Remove binary gradle-wrapper.jar from the repository


> Remove binary gradle-wrapper.jar from the repository
> 
>
> Key: LUCENE-9920
> URL: https://issues.apache.org/jira/browse/LUCENE-9920
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
> Fix For: main (9.0)
>
>







[jira] [Resolved] (LUCENE-9920) Remove binary gradle-wrapper.jar from the repository

2021-04-10 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9920.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> Remove binary gradle-wrapper.jar from the repository
> 
>
> Key: LUCENE-9920
> URL: https://issues.apache.org/jira/browse/LUCENE-9920
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
> Fix For: main (9.0)
>
>







[jira] [Commented] (LUCENE-9914) Modernize Emoji regeneration scripts

2021-04-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318517#comment-17318517
 ] 

Robert Muir commented on LUCENE-9914:
-

I don't think we should do this at all. We are generating a .jflex file so that 
tokens get assigned the appropriate type by the tokenizer. It isn't similar to 
the domain name tokenizer at all, which is why regenerating StandardTokenizer 
is fast: around 600 ms for me.

The UnicodeData really isn't a very efficient structure for storing this kind 
of data: it's just a SparseFixedBitSet. It is better than a naive approach but 
not as efficient as the ICU data structures (also in the JDK). I think we should 
try to keep our usage of it limited.

> Modernize Emoji regeneration scripts
> 
>
> Key: LUCENE-9914
> URL: https://issues.apache.org/jira/browse/LUCENE-9914
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>
> These are perl scripts... I don't think they had ant tasks in 8x and they 
> haven't been used in a while. They don't seem too scary (for perl) - just 
> fetch emoji unicode descriptions and parse them into a jflex macro and a test 
> case.
> It'd be good to convert them to use python, groovy or even java so that they 
> fit better in the build system. Alternatively - perhaps there is a way to get 
> these codepoint properties from Java directly?






[jira] [Created] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-04-10 Thread Robert Muir (Jira)
Robert Muir created LUCENE-9921:
---

 Summary: Can ICU regeneration tasks treat icu version as input?
 Key: LUCENE-9921
 URL: https://issues.apache.org/jira/browse/LUCENE-9921
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir


ICU 69 was released, so I was playing with the upgrade to test it out and to 
test our regeneration.

Naively running {{gradlew regenerate}} wasn't helpful: the regeneration tasks 
were SKIPPED by the build.

So I'm curious whether the ICU version can be treated as an "input" to these 
tasks, such that if it changes, the tasks know the generated output is stale?






[jira] [Created] (LUCENE-9922) checksums files should use deterministic sort order?

2021-04-10 Thread Robert Muir (Jira)
Robert Muir created LUCENE-9922:
---

 Summary: checksums files should use deterministic sort order?
 Key: LUCENE-9922
 URL: https://issues.apache.org/jira/browse/LUCENE-9922
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir


When regenerating, I notice some changed checksum files (when nothing actually 
changed).
For example when regenerating ICU, it produced this diff: none of the checksums 
are changed but files are simply listed in a different order.

{noformat}
diff --git a/lucene/analysis/icu/src/generated/checksums/genRbbi.json b/lucene/analysis/icu/src/generated/checksums/genRbbi.json
index 7607c4edb94..f6ee7833c07 100644
--- a/lucene/analysis/icu/src/generated/checksums/genRbbi.json
+++ b/lucene/analysis/icu/src/generated/checksums/genRbbi.json
@@ -1,6 +1,6 @@
 {
 "lucene/analysis/icu/src/data/uax29/Default.rbbi": "71bfaee5e81ac272aff828d1e44d0612be1b8363",
 "lucene/analysis/icu/src/data/uax29/MyanmarSyllable.rbbi": "4c6817658b454add5ec1f9ac8c0015ce8eb3b5f2",
-"lucene/analysis/icu/src/resources/org/apache/lucene/analysis/icu/segmentation/Default.brk": "1b9013b7ef4ba32a851a330c58a8fa820b9dda79",
-"lucene/analysis/icu/src/resources/org/apache/lucene/analysis/icu/segmentation/MyanmarSyllable.brk": "cc023ec17e0148518086098691785a32b88ee09a"
+"lucene/analysis/icu/src/resources/org/apache/lucene/analysis/icu/segmentation/MyanmarSyllable.brk": "cc023ec17e0148518086098691785a32b88ee09a",
+"lucene/analysis/icu/src/resources/org/apache/lucene/analysis/icu/segmentation/Default.brk": "1b9013b7ef4ba32a851a330c58a8fa820b9dda79"
 }
\ No newline at end of file
{noformat}

cc: [~dweiss]






[jira] [Commented] (LUCENE-9922) checksums files should use deterministic sort order?

2021-04-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318525#comment-17318525
 ] 

Robert Muir commented on LUCENE-9922:
-

I tend to also produce random file ordering changes for {{utilGenLev.json}} and 
{{utilGenPacked.json}}

> checksums files should use deterministic sort order?
> 
>
> Key: LUCENE-9922
> URL: https://issues.apache.org/jira/browse/LUCENE-9922
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-04-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318568#comment-17318568
 ] 

Uwe Schindler commented on LUCENE-9921:
---

For the unicode data file task it's done like that.

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major






[jira] [Commented] (LUCENE-9922) checksums files should use deterministic sort order?

2021-04-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318569#comment-17318569
 ] 

Robert Muir commented on LUCENE-9922:
-

I have a PR coming: just wrap the checksums in a TreeMap when we pass them to 
the JSON writer, so that they are sorted by key. Then the output doesn't depend 
on HashMap or filesystem order.
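
The idea of sorting by key before serializing can be sketched as follows. This is an illustrative Python stand-in, not the code from the PR (which, per the comment above, wraps the map in a TreeMap). The function name `write_checksums` and the abbreviated file names are assumptions made for this example; the checksum values are copied from the diff in the issue description.

```python
import json


def write_checksums(checksums):
    """Serialize a path -> checksum mapping with keys in sorted order."""
    return json.dumps(checksums, indent=4, sort_keys=True)


# Same two entries, inserted in opposite orders (as a filesystem listing might).
# File names are abbreviated for the example.
first = {
    "Default.brk": "1b9013b7ef4ba32a851a330c58a8fa820b9dda79",
    "MyanmarSyllable.brk": "cc023ec17e0148518086098691785a32b88ee09a",
}
second = dict(reversed(list(first.items())))

# Plain serialization follows insertion order, so the two texts differ.
unsorted_differs = json.dumps(first) != json.dumps(second)
# Sorting by key makes the output independent of insertion order.
sorted_matches = write_checksums(first) == write_checksums(second)

print(unsorted_differs, sorted_matches)
```

With sorted keys, regenerating without any real change should leave the checksum file byte-for-byte identical, so it no longer shows up in `git diff`.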

> checksums files should use deterministic sort order?
> 
>
> Key: LUCENE-9922
> URL: https://issues.apache.org/jira/browse/LUCENE-9922
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-04-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318570#comment-17318570
 ] 

Robert Muir commented on LUCENE-9921:
-

Yes it worked correctly for that task. I'll look into it, thanks Uwe

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major






[jira] [Commented] (LUCENE-9922) checksums files should use deterministic sort order?

2021-04-10 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318572#comment-17318572
 ] 

Dawid Weiss commented on LUCENE-9922:
-

It's a bug, Robert. We should have the keys ordered - it also simplifies 
reading that file.

> checksums files should use deterministic sort order?
> 
>
> Key: LUCENE-9922
> URL: https://issues.apache.org/jira/browse/LUCENE-9922
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major






[GitHub] [lucene] rmuir opened a new pull request #75: LUCENE-9922: checksum files should use a deterministic sort order

2021-04-10 Thread GitBox


rmuir opened a new pull request #75:
URL: https://github.com/apache/lucene/pull/75


   This way the files don't unnecessarily change, depending on filesystem order 
or anything else.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[jira] [Created] (LUCENE-9923) remove timestamp from generated ASCIITLD.jflex

2021-04-10 Thread Robert Muir (Jira)
Robert Muir created LUCENE-9923:
---

 Summary: remove timestamp from generated ASCIITLD.jflex
 Key: LUCENE-9923
 URL: https://issues.apache.org/jira/browse/LUCENE-9923
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir


This causes the generation to always make changes to the file, even if the list 
of TLDs has not changed. Let's avoid this, as generating the resultant 
tokenizer is not trivial :)

Also, if we fix this, it allows us to run ./gradlew regenerate --rerun-tasks and 
have no local changes (idempotency everywhere).

Diffs currently look like this:
{noformat}
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/email/ASCIITLD.jflex
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/email/ASCIITLD.jflex
@@ -16,7 +16,7 @@
  */
 // Generated from IANA Root Zone Database 
 // file version from 2021 Apr 10, Sat 17:37:00 Coordinated Universal Time
-// generated on 2021 Apr 10, Sat 17:55:26 Coordinated Universal Time
+// generated on 2021 Apr 10, Sat 18:13:07 Coordinated Universal Time
 // by org.apache.lucene.analysis.standard.GenerateJflexTLDMacros

 // LUCENE-8278: None of the TLDs in {ASCIITLD} is a 1-character-shorter prefix of another TLD
{noformat}






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-04-10 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318575#comment-17318575
 ] 

Dawid Weiss commented on LUCENE-9921:
-

> so I'm curious if the ICU version can be treated as an "input" to these 
> tasks, such that if it changes, tasks know the generated output is stale?

The checksumming hack currently uses file inputs/outputs only (while gradle 
tasks can use many different input types - properties, serializable types, 
etc.). This is indeed a problem for tasks that have non-file inputs (such as 
properties, library versions or files downloaded from the web, which don't 
exist in the repository for checksum checks - a chicken-and-egg problem). 

I don't know if it makes sense to put much effort into making it smarter than 
it is, though. It serves its purpose (checksum validation of output files) and 
if you know what you're doing you can force regeneration via {{--rerun-tasks}}. 
Adding properties or other stuff will make that "wrapper" code even more 
complex, and it is pretty complex already.
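
To make the trade-off concrete, here is a minimal Python sketch of file-based checksum validation and of what folding in a non-file input would look like. It is not the build's actual wrapper code. All function names, the `property:` key prefix and the version string "68" are hypothetical, and SHA-1 is an assumption (it is consistent with the 40-hex-digit values in the checksum files, but this thread does not confirm it).

```python
import hashlib
import tempfile
from pathlib import Path


def sha1_of_file(path):
    """Hex SHA-1 digest of a file's contents."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def current_checksums(files, properties=None):
    """Checksums for file inputs/outputs, plus optional non-file inputs."""
    sums = {str(p): sha1_of_file(p) for p in files}
    # Hypothetical extension: hash property values under a "property:" prefix.
    for name, value in (properties or {}).items():
        sums["property:" + name] = hashlib.sha1(str(value).encode("utf-8")).hexdigest()
    return dict(sorted(sums.items()))


def is_up_to_date(saved, files, properties=None):
    """A task may be skipped only if every recorded checksum still matches."""
    return saved == current_checksums(files, properties)


with tempfile.TemporaryDirectory() as tmp:
    rules = Path(tmp) / "Default.rbbi"
    rules.write_text("rules v1", encoding="utf-8")

    saved = current_checksums([rules], {"icu.version": "68"})
    unchanged = is_up_to_date(saved, [rules], {"icu.version": "68"})
    # Same files, different library version: the recorded state is stale.
    after_upgrade = is_up_to_date(saved, [rules], {"icu.version": "69"})
    # With file-only checksums, the version change goes unnoticed.
    file_only = is_up_to_date(current_checksums([rules]), [rules])

print(unchanged, after_upgrade, file_only)
```

The file-only case is the situation described in the issue: nothing on disk changed, so the task is considered up to date. The extra branch for properties hints at the added complexity the comment above is wary of.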


> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major






[jira] [Commented] (LUCENE-9914) Modernize Emoji regeneration scripts

2021-04-10 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318576#comment-17318576
 ] 

Dawid Weiss commented on LUCENE-9914:
-

Haven't forgotten about it. I'm taking another look at how to handle multiple 
ICU versions.

> Modernize Emoji regeneration scripts
> 
>
> Key: LUCENE-9914
> URL: https://issues.apache.org/jira/browse/LUCENE-9914
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor






[jira] [Commented] (LUCENE-9923) remove timestamp from generated ASCIITLD.jflex

2021-04-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318578#comment-17318578
 ] 

Robert Muir commented on LUCENE-9923:
-

I think we just want to remove the {{generated on}} date here (which changes 
every time you run it regardless of inputs) and keep the {{Last-Modified}} 
date, which corresponds to the root.zone input file.
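
The effect of dropping the run timestamp can be sketched as below. This is a Python illustration only: the real generator is the Java class org.apache.lucene.analysis.standard.GenerateJflexTLDMacros, and the function `tld_header`, its parameters and its timestamp format are assumptions made for the example. The dates are taken from the diff in the issue description.

```python
from datetime import datetime, timezone


def tld_header(file_version, generated_on=None):
    """Build the comment header; `generated_on` models the removed timestamp."""
    lines = [
        "// Generated from IANA Root Zone Database",
        "// file version from " + file_version,
    ]
    if generated_on is not None:
        # This line changes on every run, even when the input is identical.
        lines.append("// generated on " + generated_on.strftime("%Y-%m-%d %H:%M:%S"))
    return "\n".join(lines)


file_version = "2021 Apr 10, Sat 17:37:00 Coordinated Universal Time"
run_1 = datetime(2021, 4, 10, 17, 55, 26, tzinfo=timezone.utc)
run_2 = datetime(2021, 4, 10, 18, 13, 7, tzinfo=timezone.utc)

# With the run timestamp, two runs over the same input produce different text.
changes_every_run = tld_header(file_version, run_1) != tld_header(file_version, run_2)
# Without it, the header depends only on the input file's date.
stable = tld_header(file_version) == tld_header(file_version)

print(changes_every_run, stable)
```

Keeping only the input file's date means the generated file, and therefore its checksum, changes only when the input itself changes.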

> remove timestamp from generated ASCIITLD.jflex
> --
>
> Key: LUCENE-9923
> URL: https://issues.apache.org/jira/browse/LUCENE-9923
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-04-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318579#comment-17318579
 ] 

Robert Muir commented on LUCENE-9921:
-

Can we use the location of the actual resolved ICU jar file as a file input?

For ICU I just really feel like it is *THE* input :)

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major






[GitHub] [lucene] rmuir opened a new pull request #76: LUCENE-9923: remove always-changing timestamp from ASCIITLD.jflex generation

2021-04-10 Thread GitBox


rmuir opened a new pull request #76:
URL: https://github.com/apache/lucene/pull/76


   Every time you regenerate, it results in changes:
   * checksums/generateTlds.json
   * checksums/generateUAX29URLEmailTokenizer.json
   * ASCIITLD.jflex
   
   Remove the `new Date()` that causes this to happen. We already write the 
"good" date: the `Last-Modified` date of the root.zone input file.





[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-04-10 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318581#comment-17318581
 ] 

Michael McCandless commented on LUCENE-9335:


bq. Not sure if this benchmark result is valid, given the hit count differs?

I *think* (not certain!) luceneutil verifies the top N hits are the same 
between candidate and baseline?

Since you are changing the BMW/BMM optimization, I think it is expected that 
the (estimated) total hit count would change?  In that case, it's fine to 
ignore that difference.
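
The kind of check described here (compare the top N hits, ignore the estimated total) could look roughly like the sketch below. This is not luceneutil's code, and whether luceneutil does exactly this is not confirmed in this thread. The result layout, the function `same_top_hits` and every number in the sample data are invented for illustration.

```python
import math


def same_top_hits(baseline, candidate, n=10, rel_tol=1e-6):
    """True if the first n (doc_id, score) pairs agree; total hits are ignored."""
    top_a, top_b = baseline["hits"][:n], candidate["hits"][:n]
    if len(top_a) != len(top_b):
        return False
    return all(
        doc_a == doc_b and math.isclose(score_a, score_b, rel_tol=rel_tol)
        for (doc_a, score_a), (doc_b, score_b) in zip(top_a, top_b)
    )


# Invented sample results: identical top hits, different estimated totals.
baseline = {"total_hits": 48210, "hits": [(7, 3.2), (42, 2.9), (13, 2.5)]}
candidate = {"total_hits": 1001, "hits": [(7, 3.2), (42, 2.9), (13, 2.5)]}
# Same totals as the baseline, but the first two hits are swapped.
reordered = {"total_hits": 48210, "hits": [(42, 2.9), (7, 3.2), (13, 2.5)]}

print(same_top_hits(baseline, candidate), same_top_hits(baseline, reordered))
```

Under this check, a candidate that prunes more aggressively and so reports a lower estimated total still passes, as long as the ranked top hits are unchanged.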

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
> Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more. I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-04-10 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318583#comment-17318583
 ] 

Dawid Weiss commented on LUCENE-9921:
-

The path is different for every developer - it's outside of the project root 
(in gradle's caches). 

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major






[GitHub] [lucene] rmuir merged pull request #75: LUCENE-9922: checksum files should use a deterministic sort order

2021-04-10 Thread GitBox


rmuir merged pull request #75:
URL: https://github.com/apache/lucene/pull/75


   





[jira] [Commented] (LUCENE-9922) checksums files should use deterministic sort order?

2021-04-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318586#comment-17318586
 ] 

ASF subversion and git services commented on LUCENE-9922:
-

Commit 15bfb28d7f30361d56aa8f60e30b4f3170bd233c in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=15bfb28 ]

LUCENE-9922: checksum files should use a deterministic sort order (#75)

This way the files don't unnecessarily change, depending on filesystem
order or anything else.

> checksums files should use deterministic sort order?
> 
>
> Key: LUCENE-9922
> URL: https://issues.apache.org/jira/browse/LUCENE-9922
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When regenerating, I notice some changed checksum files (when nothing 
> actually changed).
> For example when regenerating ICU, it produced this diff: none of the 
> checksums are changed but files are simply listed in a different order.
> {noformat}
> diff --git a/lucene/analysis/icu/src/generated/checksums/genRbbi.json b/lucene/analysis/icu/src/generated/checksums/genRbbi.json
> index 7607c4edb94..f6ee7833c07 100644
> --- a/lucene/analysis/icu/src/generated/checksums/genRbbi.json
> +++ b/lucene/analysis/icu/src/generated/checksums/genRbbi.json
> @@ -1,6 +1,6 @@
>  {
>  "lucene/analysis/icu/src/data/uax29/Default.rbbi": "71bfaee5e81ac272aff828d1e44d0612be1b8363",
>  "lucene/analysis/icu/src/data/uax29/MyanmarSyllable.rbbi": "4c6817658b454add5ec1f9ac8c0015ce8eb3b5f2",
> - "lucene/analysis/icu/src/resources/org/apache/lucene/analysis/icu/segmentation/Default.brk": "1b9013b7ef4ba32a851a330c58a8fa820b9dda79",
> - "lucene/analysis/icu/src/resources/org/apache/lucene/analysis/icu/segmentation/MyanmarSyllable.brk": "cc023ec17e0148518086098691785a32b88ee09a"
> + "lucene/analysis/icu/src/resources/org/apache/lucene/analysis/icu/segmentation/MyanmarSyllable.brk": "cc023ec17e0148518086098691785a32b88ee09a",
> + "lucene/analysis/icu/src/resources/org/apache/lucene/analysis/icu/segmentation/Default.brk": "1b9013b7ef4ba32a851a330c58a8fa820b9dda79"
>  }
> \ No newline at end of file
> {noformat}
> cc: [~dweiss]
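
The ordering problem in the diff above can be sketched in a few lines. This is
only an illustration in Python, not the actual Lucene build code: the helper
name render_checksums, the shortened paths, and the JSON layout are all
assumptions. The point is that serializing with sorted keys makes the output
independent of the order in which the files happened to be listed.

```python
import json


def render_checksums(checksums):
    """Serialize a {path: sha1} mapping with keys in sorted order.

    Hypothetical helper for illustration only; the real build is not Python.
    """
    return json.dumps(checksums, sort_keys=True, indent=4)


# The same two entries inserted in two different orders, as could happen
# when a directory listing returns files in a different order.
# Paths are shortened for the example.
run_a = {
    "segmentation/Default.brk": "1b9013b7ef4ba32a851a330c58a8fa820b9dda79",
    "segmentation/MyanmarSyllable.brk": "cc023ec17e0148518086098691785a32b88ee09a",
}
run_b = dict(reversed(list(run_a.items())))

# Insertion order differs, but the rendered text does not.
same_output = render_checksums(run_a) == render_checksums(run_b)
```

With sorted keys, regenerating without real changes leaves the checksum file
byte-for-byte identical, so it no longer shows up as a spurious diff.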



[jira] [Resolved] (LUCENE-9922) checksums files should use deterministic sort order?

2021-04-10 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-9922.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> checksums files should use deterministic sort order?
> 
>
> Key: LUCENE-9922
> URL: https://issues.apache.org/jira/browse/LUCENE-9922
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>



[GitHub] [lucene] rmuir merged pull request #76: LUCENE-9923: remove always-changing timestamp from ASCIITLD.jflex generation

2021-04-10 Thread GitBox


rmuir merged pull request #76:
URL: https://github.com/apache/lucene/pull/76


   


[jira] [Commented] (LUCENE-9923) remove timestamp from generated ASCIITLD.jflex

2021-04-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318589#comment-17318589
 ] 

ASF subversion and git services commented on LUCENE-9923:
-

Commit f5157d0cdeb8c3600b3ef98683d12c4eaf03 in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f51 ]

LUCENE-9923: remove always-changing timestamp from ASCIITLD.jflex generation 
(#76)

This makes regenerate idempotent by removing the new Date() from the
output.

We already have the root.zone's Last-Modified date, which is the one
that matters and only changes when the root.zone changes.

> remove timestamp from generated ASCIITLD.jflex
> --
>
> Key: LUCENE-9923
> URL: https://issues.apache.org/jira/browse/LUCENE-9923
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This causes the generation to always make changes to the file, even if the 
> list of TLDs has not changed. Let's avoid this as generating the resultant 
> tokenizer is not trivial :)
> Also if we fix this, it allows us to run ./gradlew regenerate --rerun-tasks 
> and have no local changes (idempotency everywhere)
> Diffs currently look like this:
> {noformat}
> --- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/email/ASCIITLD.jflex
> +++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/email/ASCIITLD.jflex
> @@ -16,7 +16,7 @@
>   */
>  // Generated from IANA Root Zone Database 
> 
>  // file version from 2021 Apr 10, Sat 17:37:00 Coordinated Universal Time
> -// generated on 2021 Apr 10, Sat 17:55:26 Coordinated Universal Time
> +// generated on 2021 Apr 10, Sat 18:13:07 Coordinated Universal Time
>  // by org.apache.lucene.analysis.standard.GenerateJflexTLDMacros
>  // LUCENE-8278: None of the TLDs in {ASCIITLD} is a 1-character-shorter prefix of another TLD
> {noformat}
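
A rough sketch of why the "generated on" line defeats idempotency, written in
Python rather than the generator's own language. The function render_header
and its parameters are hypothetical; only the two comment lines are modeled on
the diff above. A header built solely from the source file's Last-Modified
value is the same on every run, while one that also embeds the wall-clock time
never is.

```python
from datetime import datetime, timezone


def render_header(source_last_modified, generated_at=None):
    """Build the comment header of a generated file.

    Hypothetical helper: 'generated_at' models the old behaviour of
    stamping the current time into the output.
    """
    lines = [
        "// Generated from IANA Root Zone Database",
        "// file version from " + source_last_modified,
    ]
    if generated_at is not None:
        # Wall-clock stamp: differs on every run, so the file always changes.
        lines.append("// generated on " + generated_at.isoformat())
    return "\n".join(lines)


last_modified = "2021 Apr 10, Sat 17:37:00 Coordinated Universal Time"
first_run = datetime(2021, 4, 10, 17, 55, 26, tzinfo=timezone.utc)
second_run = datetime(2021, 4, 10, 18, 13, 7, tzinfo=timezone.utc)

# Old behaviour: two runs over identical input produce different output.
old_differs = render_header(last_modified, first_run) != render_header(last_modified, second_run)

# New behaviour: output depends only on the input's Last-Modified value.
new_stable = render_header(last_modified) == render_header(last_modified)
```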



[jira] [Resolved] (LUCENE-9923) remove timestamp from generated ASCIITLD.jflex

2021-04-10 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-9923.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> remove timestamp from generated ASCIITLD.jflex
> --
>
> Key: LUCENE-9923
> URL: https://issues.apache.org/jira/browse/LUCENE-9923
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>



[jira] [Created] (LUCENE-9924) regenerate TLD list from IANA TLD db, rather than root zone db

2021-04-10 Thread Robert Muir (Jira)
Robert Muir created LUCENE-9924:
---

 Summary: regenerate TLD list from IANA TLD db, rather than root 
zone db
 Key: LUCENE-9924
 URL: https://issues.apache.org/jira/browse/LUCENE-9924
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir


Currently the TLD list comes from root zone database (DNS records) and these 
are parsed with regular expressions. Instead we can use 
https://data.iana.org/TLD/tlds-alpha-by-domain.txt which is a simple list.



[GitHub] [lucene] rmuir opened a new pull request #77: LUCENE-9924: generate TLD list from IANA TLD db, rather than root zone db

2021-04-10 Thread GitBox


rmuir opened a new pull request #77:
URL: https://github.com/apache/lucene/pull/77


   This adds a bit of simplicity as the file is a simple domain list,
   rather than a DNS zone. So the regexes parsing DNS can be removed.
   
   Also the file may change less often as it contains JUST the list of
   TLDs, and not any additional DNS metadata.
   
   As you can see, it produces the same output as what we are doing today.
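
   To illustrate the difference between the two inputs, here is a small
   sketch in Python (not the generator's actual code). The sample data is
   made up, the function names are hypothetical, and the regular expression
   only assumes what an NS record line in a zone file roughly looks like.
   The plain list needs nothing more than skipping '#' comment lines.

```python
import re


def tlds_from_list(text):
    """Parse the plain list format: one TLD per line, '#' starts a comment."""
    tlds = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tlds.append(line.lower())
    return sorted(tlds)


# Assumed shape of an NS record line in a zone file; shown only for contrast.
_NS_RECORD = re.compile(r"^([A-Za-z0-9-]+)\.\s+\d+\s+IN\s+NS\s+\S+$")


def tlds_from_zone(text):
    """Extract TLD labels from zone-style NS records using a regex."""
    found = set()
    for line in text.splitlines():
        match = _NS_RECORD.match(line.strip())
        if match:
            found.add(match.group(1).lower())
    return sorted(found)


# Illustrative sample inputs, not real downloads.
SAMPLE_LIST = "# sample header comment\nCOM\nORG\nXN--P1AI\n"

SAMPLE_ZONE = (
    "com.\t172800\tIN\tNS\tns1.example.\n"
    "com.\t172800\tIN\tNS\tns2.example.\n"
    "org.\t172800\tIN\tNS\tns1.example.\n"
    "xn--p1ai.\t172800\tIN\tNS\tns1.example.\n"
    ".\t518400\tIN\tNS\troot.example.\n"
)

from_list = tlds_from_list(SAMPLE_LIST)
from_zone = tlds_from_zone(SAMPLE_ZONE)
```

   On these samples both functions return the same list, but the first one
   has no record syntax to get wrong and no duplicates to collapse.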


[jira] [Commented] (LUCENE-9924) regenerate TLD list from IANA TLD db, rather than root zone db

2021-04-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318596#comment-17318596
 ] 

Uwe Schindler commented on LUCENE-9924:
---

I just noticed: there are some Punycode domains. Should they be converted to 
readable form?

> regenerate TLD list from IANA TLD db, rather than root zone db
> --
>
> Key: LUCENE-9924
> URL: https://issues.apache.org/jira/browse/LUCENE-9924
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Currently the TLD list comes from root zone database (DNS records) and these 
> are parsed with regular expressions. Instead we can use 
> https://data.iana.org/TLD/tlds-alpha-by-domain.txt which is a simple list.



[jira] [Commented] (LUCENE-9924) regenerate TLD list from IANA TLD db, rather than root zone db

2021-04-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318598#comment-17318598
 ] 

Robert Muir commented on LUCENE-9924:
-

In other words, this change creates the same output as what we are doing today 
:) You can see it from the diff. I only fixed the generated comments to refer 
to "IANA TLD db" rather than "root zone DB"

> regenerate TLD list from IANA TLD db, rather than root zone db
> --
>
> Key: LUCENE-9924
> URL: https://issues.apache.org/jira/browse/LUCENE-9924
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the TLD list comes from root zone database (DNS records) and these 
> are parsed with regular expressions. Instead we can use 
> https://data.iana.org/TLD/tlds-alpha-by-domain.txt which is a simple list.



[jira] [Commented] (LUCENE-9924) regenerate TLD list from IANA TLD db, rather than root zone db

2021-04-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318597#comment-17318597
 ] 

Robert Muir commented on LUCENE-9924:
-

No, this generator script explicitly wants only ASCII and Punycode domains.




[jira] [Commented] (LUCENE-9924) regenerate TLD list from IANA TLD db, rather than root zone db

2021-04-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318601#comment-17318601
 ] 

Uwe Schindler commented on LUCENE-9924:
---

Hi, yes that's a separate issue.

I just noticed that there are more and more Punycode root domains. So a 
further improvement would be to allow the tokenizer to match both variants, 
because in full text nobody would use the Punycode one.

So in short: I would open an issue to add both variants: the ASCII variant 
(Punycode) and the decoded Unicode version.

A Punycode decoder is part of ICU.
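
To make the "both variants" idea concrete, here is a minimal sketch. It uses
Python's built-in "punycode" codec instead of the ICU decoder mentioned above,
and the function names decode_tld and both_variants are hypothetical. The
example label xn--p1ai is the ASCII form of the TLD .рф.

```python
ACE_PREFIX = "xn--"


def decode_tld(tld):
    """Return the Unicode form of a TLD label.

    Labels without the 'xn--' prefix are plain ASCII and returned lowercased.
    """
    label = tld.lower()
    if label.startswith(ACE_PREFIX):
        # Strip the prefix and decode the rest with the stdlib codec.
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def both_variants(tld):
    """List the forms a tokenizer would need to match for one TLD."""
    ascii_form = tld.lower()
    unicode_form = decode_tld(ascii_form)
    if unicode_form == ascii_form:
        return [ascii_form]
    return [ascii_form, unicode_form]


variants = both_variants("XN--P1AI")
plain = both_variants("COM")
```

Whether the tokenizer should match the decoded form at all is a separate
question; the sketch only shows that producing it from the list is cheap.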




[jira] [Comment Edited] (LUCENE-9924) regenerate TLD list from IANA TLD db, rather than root zone db

2021-04-10 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318601#comment-17318601
 ] 

Uwe Schindler edited comment on LUCENE-9924 at 4/10/21, 9:05 PM:
-

Hi, yes that's a separate issue.

I just noticed that there are more and more Punycode root domains. So a 
further improvement would be to allow the tokenizer to match both variants, 
because in full text nobody would use the Punycode one.

So in short: I would open an issue to add both variants: the ASCII variant 
(Punycode) and the decoded Unicode version.

A Punycode decoder is part of ICU.





[jira] [Commented] (LUCENE-9924) regenerate TLD list from IANA TLD db, rather than root zone db

2021-04-10 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318602#comment-17318602
 ] 

Robert Muir commented on LUCENE-9924:
-

Matching the Punycode-decoded form is a different matter; I'm not even sure we 
should do it. Here we are just getting the exact same data in an easier way.



