[ https://issues.apache.org/jira/browse/LUCENE-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981264#comment-16981264 ]
Erick Erickson edited comment on LUCENE-9060 at 11/25/19 2:56 AM:
------------------------------------------------------------------

[~jpountz] I'm circling back around to this now. At this point, I can't get it to run to completion after fixing the use of RamUsageEstimator.NUM_BYTES_INT. As you can see from the attached patch (differences.patch, don't confuse it with a real patch!) the output is vastly different.

The failure manifests itself as an integer parsing exception (see below), but that looks like a consequence of the files nfc.txt, nfkc.txt and nfkc_cf.txt being wrong. It doesn't look like the content was actually downloaded...

HTMLCharacterEntities.jflex is also different in a minor way; that was in fact the difference between this and the gradle_8 branch that got me started on this, and I think it is why HTMLStripCharFilter.java is different. I'm ignoring that until I get a run to completion.

Also, I have several unversioned files in lucene/core/src/java/org/apache/lucene/util/packed: Direct*.java and Packed*ThreeBlocks.java. IDK whether those get cleaned up on success or not, so I'm ignoring them too until I get a clean run.

-------------

This is almost certainly irrelevant. It's just the run error that started me looking at this after I fixed RamUsageEstimator.NUM_BYTES_INT and ran with -Xmx24G.

The line that's failing is around line 194; it's this line:

{code}
int ch = Integer.parseInt(outputCodePoint, 16);
{code}

Here's the result of the extra tracing I put in to see what was happening here.

{code}
gen-utr30-data-files:
[java] Downloading nfkc.txt ... done.
[java] Downloading nfkc_cf.txt ... done.
[java] EOE: line '<html xmlns="http://www.w3.org/1999/xhtml" itemscope="" itemtype="http://schema.org/WebPage">'
[java] EOE: lefthandside '<html xmlns'
[java] EOE: righthandside '"http://www.w3.org/1999/xhtml" itemscope="" itemtype="http://schema.org/WebPage">'
[java] EOE: about to parseInt on: '"http://www.w3.org/1999/xhtml"'
[java] Downloading nfkc_cf.txt and making diacritic rules one-way ... java.lang.NumberFormatException: For input string: ""http://www.w3.org/1999/xhtml""
[java] at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
[java] at java.base/java.lang.Integer.parseInt(Integer.java:638)
[java] at org.apache.lucene.analysis.icu.GenerateUTR30DataFiles.getNFKCDataFilesFromIcuProject(GenerateUTR30DataFiles.java:200) <== really about line 194 or so.
{code}

> Fix the files generated python scripts in lucene/util/packed to not use RamUsageEstimator.NUM_BYTES_INT
> -------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9060
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9060
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Erick Erickson
>            Priority: Major
>         Attachments: LUCENE-9060.patch, differences.patch
>
>
> RamUsageEstimator.NUM_BYTES_INT has been removed. But the Python code still
> puts it in the generated code. Once you run "ant regenerate" (and I had to
> run it with 24G!) you can no longer build.
> We should verify that warnings against hand-editing end up in the generated
> code, although they weren't hand-edited in this case.
> It looks like the constants were removed as part of LUCENE-8745.
> I think it's just a straightforward substitution of "Integer.BYTES".
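
For what it's worth, the NumberFormatException in the trace is what you'd expect when a download returns an HTML page instead of the data file and each line is still handed to Integer.parseInt(..., 16). Below is a minimal sketch of a fail-fast check, not a proposed patch. The class name CodePointLineCheck and both helper methods are hypothetical and are not part of GenerateUTR30DataFiles. The "first non-blank line starts with <html or <!doctype" test is only an assumed heuristic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: illustrates failing fast when a
// downloaded "data file" is really an HTML page.
public class CodePointLineCheck {

  // Assumed heuristic: the first non-blank line of an HTML page starts with
  // "<html" or "<!doctype" (case-insensitive).
  public static boolean looksLikeHtml(List<String> lines) {
    for (String line : lines) {
      String trimmed = line.trim().toLowerCase(Locale.ROOT);
      if (trimmed.isEmpty()) {
        continue; // skip leading blank lines
      }
      return trimmed.startsWith("<html") || trimmed.startsWith("<!doctype");
    }
    return false; // empty input is not treated as HTML
  }

  // Parses a hexadecimal code point, adding file/line context on failure so
  // the error points at the bad input rather than only at parseInt.
  public static int parseCodePoint(String token, String fileName, int lineNumber) {
    try {
      return Integer.parseInt(token.trim(), 16);
    } catch (NumberFormatException e) {
      throw new IllegalStateException(
          fileName + ":" + lineNumber + ": expected a hex code point but got '" + token + "'", e);
    }
  }

  public static void main(String[] args) {
    // First line of the bogus download, copied from the trace.
    List<String> downloaded = Arrays.asList(
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" itemscope=\"\" "
            + "itemtype=\"http://schema.org/WebPage\">");
    if (looksLikeHtml(downloaded)) {
      System.out.println("nfkc.txt looks like HTML, not data; check the download URL");
    }
    // A well-formed hex token parses as usual.
    System.out.println(parseCodePoint("00E9", "nfkc.txt", 1));
  }
}
```

With a check like this before the parsing loop, the run would stop with a message about the download itself instead of a parse error a couple of hundred lines into the generator.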