[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359100#comment-17359100
 ] 

Nam-Quang Tran commented on LUCENE-8118:
----------------------------------------

Update on my previous post: after some back and forth with the reporting user, 
it turned out that the crash was caused by specific files named 
"delete.pdf". Unfortunately, the user deleted these files before I could get my 
hands on them. Sharing them might have been problematic anyway, as they were 
old bank statements. In any case, it seems that feeding the contents of 
some bad PDF files to Lucene can cause ArrayIndexOutOfBoundsExceptions.

> ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-8118
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8118
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 7.2
>         Environment: Debian/Stretch
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>            Reporter: Laura Dietz
>            Priority: Major
>         Attachments: LUCENE-8118_test.patch
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Indexing a large collection of about 20 million paragraph-sized documents 
> results in an ArrayIndexOutOfBoundsException in 
> org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace 
> below). 
> The bug is possibly related to the issues described 
> [here|http://lucene.472066.n3.nabble.com/ArrayIndexOutOfBoundsException-65536-td3661945.html]
> and in [SOLR-10936|https://issues.apache.org/jira/browse/SOLR-10936], but I 
> am not using Solr; I am using Lucene Core directly.
> The issue can be reproduced using code from [GitHub 
> trec-car-tools-example|https://github.com/TREMA-UNH/trec-car-tools/tree/lucene-bug/trec-car-tools-example]:
> - compile with `mvn compile assembly:single`
> - run with `java -cp ./target/treccar-tools-example-0.1-jar-with-dependencies.jar edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir`
> where paragraphCorpus.cbor is contained in this 
> [archive|http://trec-car.cs.unh.edu/datareleases/v2.0-snapshot/archive-paragraphCorpus.tar.xz].
>
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
>         at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>         at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
>         at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
>         at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>         at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
>         at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>         at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>         at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
>         at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
>         at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
>         at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
>         at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)
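
A note on the index value -65536 in the trace above: it is the number one would get if a 32-bit int address wrapped past Integer.MAX_VALUE and was then shifted right by 15 bits to pick a buffer. The sketch below only demonstrates that arithmetic. It is not Lucene code; the class and method names are hypothetical, and the shift of 15 is an assumption taken from the 32 KB block size of Lucene's ByteBlockPool. It says nothing about what drives the address that high in this report.

```java
// Illustrative only: not Lucene code. BLOCK_SHIFT = 15 (32 KB blocks) is an
// assumption modelled on ByteBlockPool; the names here are hypothetical.
class AddressOverflowSketch {
    static final int BLOCK_SHIFT = 15;

    // Maps a flat int address to the index of the block that would hold it.
    static int bufferIndex(int address) {
        return address >> BLOCK_SHIFT; // arithmetic shift keeps the sign
    }

    public static void main(String[] args) {
        int lastValid = Integer.MAX_VALUE;
        int wrapped = lastValid + 1; // int overflow wraps to Integer.MIN_VALUE
        System.out.println(bufferIndex(lastValid)); // 65535
        System.out.println(bufferIndex(wrapped));   // -65536
    }
}
```

Under these assumptions, every address from Integer.MIN_VALUE up to Integer.MIN_VALUE + 32767 maps to the same index, so -65536 is the first bad index a wrapped address would produce.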



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

