[jira] [Comment Edited] (LUCENE-9060) Fix the files generated python scripts in lucene/util/packed to not use RamUsageEstimator.NUM_BYTES_INT

Erick Erickson (Jira) Sun, 24 Nov 2019 18:58:03 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981264#comment-16981264
 ]


Erick Erickson edited comment on LUCENE-9060 at 11/25/19 2:56 AM:
------------------------------------------------------------------

[~jpountz] I'm circling back around to this now. At this point, I can't get it 
to run to completion after fixing the use of RamUsageEstimator.NUM_BYTES_INT. 
As you can see from the attached patch (differences.patch, don't confuse it 
with a real patch!) the output is vastly different. 

The failure manifests itself as an integer parsing exception (see below), but 
that looks like a consequence of the files nfc.txt, nfkc.txt and nfkc_cf.txt 
being wrong. It doesn't look like the real content was downloaded: the trace 
below shows an HTML page where the mapping data should be.

HTMLCharacterEntities.jflex is also different in a minor way. That was in fact 
the difference between this and the gradle_8 branch that got me started on 
this, and I think it's why HTMLStripCharFilter.java is different. I'm ignoring 
that until I get a run to completion.

Also, I have several unversioned files in 
lucene/core/src/java/org/apache/lucene/util/packed: Direct*.java and 
Packed*ThreeBlocks.java. IDK whether those get cleaned up on success or not, so 
I'm ignoring them too until I get a clean run.

------------- This is almost certainly irrelevant. It's just the run error that 
started me looking at this after I fixed RamUsageEstimator.NUM_BYTES_INT and 
ran with -Xmx24G.

The failing line is around line 194:
{code}
            int ch = Integer.parseInt(outputCodePoint, 16);
{code}

Here's the result of the extra tracing I put in to see what was happening here.
 {code}
gen-utr30-data-files:
     [java] Downloading nfkc.txt ... done.
     [java] Downloading nfkc_cf.txt ... done.
     [java] EOE: line '<html xmlns="http://www.w3.org/1999/xhtml" itemscope="" itemtype="http://schema.org/WebPage">'
     [java] EOE: lefthandside '<html xmlns'
     [java] EOE: righthandside '"http://www.w3.org/1999/xhtml" itemscope="" itemtype="http://schema.org/WebPage">'
     [java] EOE: about to parseInt on: '"http://www.w3.org/1999/xhtml"'
     [java] Downloading nfkc_cf.txt and making diacritic rules one-way ... java.lang.NumberFormatException: For input string: ""http://www.w3.org/1999/xhtml""
     [java]     at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
     [java]     at java.base/java.lang.Integer.parseInt(Integer.java:638)
     [java]     at org.apache.lucene.analysis.icu.GenerateUTR30DataFiles.getNFKCDataFilesFromIcuProject(GenerateUTR30DataFiles.java:200) <== really about line 194 or so.
{code}
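
Since the trace shows an HTML page being fed to Integer.parseInt, one option might be to check the downloaded content before parsing, so the failure names the bad file instead of surfacing as a NumberFormatException. The following is only a rough sketch of that idea: the class MappingLineCheck and its methods are hypothetical names for illustration, not code from GenerateUTR30DataFiles.

```java
import java.util.Locale;

public class MappingLineCheck {

    /** Returns true if the line looks like HTML markup rather than a mapping line. */
    public static boolean looksLikeHtml(String line) {
        String t = line.trim().toLowerCase(Locale.ROOT);
        return t.startsWith("<!doctype") || t.startsWith("<html");
    }

    /**
     * Parses a hex code point, rethrowing with a message that names the file.
     * (Hypothetical helper; the real generator calls Integer.parseInt directly.)
     */
    public static int parseCodePoint(String token, String fileName) {
        try {
            return Integer.parseInt(token, 16);
        } catch (NumberFormatException e) {
            throw new IllegalStateException(
                "Unexpected content in " + fileName + ": '" + token
                    + "' is not a hex code point; was the file really downloaded?", e);
        }
    }

    public static void main(String[] args) {
        String bad = "<html xmlns=\"http://www.w3.org/1999/xhtml\" itemscope=\"\">";
        System.out.println(looksLikeHtml(bad));                  // HTML page, not data
        System.out.println(parseCodePoint("00C0", "nfc.txt"));   // valid hex token
    }
}
```

With a check like this, a run that fetched an HTML page instead of the data file would stop at the first line with a message pointing at the download rather than at the parser.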



> Fix the files generated python scripts in lucene/util/packed to not use 
> RamUsageEstimator.NUM_BYTES_INT
> -------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9060
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9060
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Erick Erickson
>            Priority: Major
>         Attachments: LUCENE-9060.patch, differences.patch
>
>
> RamUsageEstimator.NUM_BYTES_INT has been removed. But the Python code still 
> puts it in the generated code. Once you run "ant regenerate" (and I had to 
> run it with 24G!) you can no longer build.
> We should verify that warnings against hand-editing end up in the generated 
> code, although they weren't hand-edited in this case.
> It looks like the constants were removed as part of LUCENE-8745.
> I think it's just a straightforward substitution of "Integer.BYTES".
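
For illustration, the substitution suggested in the description could look roughly like this in generated Java. This is a sketch only: IntBytesExample and intArrayPayloadBytes are made-up names, not the actual generated classes or methods in lucene/util/packed.

```java
public class IntBytesExample {

    /** Hypothetical stand-in for a size computation in generated code. */
    public static long intArrayPayloadBytes(int length) {
        // Previously this kind of expression would have used the removed
        // constant RamUsageEstimator.NUM_BYTES_INT; Integer.BYTES is the
        // JDK constant for the size of an int in bytes.
        return (long) length * Integer.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(intArrayPayloadBytes(10));
    }
}
```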


