[ https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luís Filipe Nassif resolved LUCENE-10681.
-----------------------------------------
    Resolution: Duplicate

Seems to be a duplicate of LUCENE-8118

> ArrayIndexOutOfBoundsException while indexing large binary file
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10681
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10681
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 9.2
>         Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1
>            Reporter: Luís Filipe Nassif
>            Priority: Major
>
> Hello,
>
> I looked for a similar issue but didn't find one, so I'm creating this one; sorry if it was reported before. We recently upgraded from Lucene 5.5.5 to 9.2.0, and a user reported the error below while indexing a huge binary file in a parent-children schema, where strings extracted from the huge binary file (using the strings command) are indexed as thousands of ~10MB child text documents of the parent metadata document:
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
> 	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> 	at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?]
> 	at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) ~[iped-engine-4.0.2.jar:?]
> {noformat}
>
> This looks like an integer overflow to me, though I'm not sure. It didn't happen with the previous Lucene 5.5.5, and indexing files like this is pretty common for us. With Lucene 5.5.5 we used to break that huge file manually before indexing and call the IndexWriter.addDocument(Document) method once per 10MB chunk; now we are passing all the chunks at once to the IndexWriter.addDocuments(Iterable) method with Lucene 9.2.0. Any thoughts?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
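To make the reporter's integer-overflow hypothesis concrete, here is a minimal, self-contained sketch (not Lucene's actual code) of how it can produce a negative array index like the `-65536` in the trace. If per-document byte offsets are tracked in an `int`, then a single `addDocuments(Iterable)` batch whose inverted data grows past `Integer.MAX_VALUE` (e.g. thousands of ~10MB children indexed as one unit) wraps the offset negative, and any subsequent array access with a value derived from it throws `ArrayIndexOutOfBoundsException`. The chunk size and loop are hypothetical, chosen only to mirror the ~10MB children described above.

```java
public class IntOffsetOverflowDemo {
    public static void main(String[] args) {
        final int CHUNK = 10 * 1024 * 1024; // hypothetical ~10MB child document
        int offset = 0;   // int-typed running offset, as a buffer writer might keep
        long written = 0; // wide counter, just to know when we crossed 2 GiB

        // Keep appending chunks for one logical "document batch" until the
        // total written passes Integer.MAX_VALUE (~2 GiB).
        while (written <= Integer.MAX_VALUE) {
            offset += CHUNK;   // silently wraps around once it exceeds 2^31 - 1
            written += CHUNK;
        }

        // The int offset has wrapped negative while the true total has not.
        System.out.println("true bytes written = " + written);
        System.out.println("int offset         = " + offset); // negative

        // Using a negative offset to index a buffer fails exactly like the
        // report: "Index ... out of bounds for length 71428".
        byte[] pool = new byte[71428];
        try {
            byte b = pool[offset];
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("AIOOBE: " + e.getMessage());
        }
    }
}
```

This is why the 5.5.5 workflow (one `addDocument` call per 10MB chunk) avoided the symptom: each call kept the per-unit byte count far below 2 GiB, whereas a single `addDocuments(Iterable)` batch accumulates all children before the writer flushes.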