Well, it seems that it'll be fixed in PDFBox 2.0.16.

On Wed, May 15, 2019 at 5:35 PM Slava G <[email protected]> wrote:
> Will definitely try, is this rc available via maven?
>
> On Wed, May 15, 2019, 17:20 Tim Allison <[email protected]> wrote:
>
>> Yay! Tilman and colleagues on PDFBox really are _that_ fast. :)
>>
>> You can try Tika’s integration w 2.0.15 in our 1.21-rc2:
>>
>> https://lists.apache.org/thread.html/2c027535156cc6862149490b289552d72ba5a9bff985fb7cce794e21@%3Cdev.tika.apache.org%3E
>>
>> On Wed, May 15, 2019 at 10:01 AM Slava G <[email protected]> wrote:
>>
>> > Sure, I can share it privately.
>> > But it seems that it's already fixed in PDFBox 2.0.15: when I ran tika-app
>> > (1.20) it caused the same issue, but when I ran extractText in PDFBox 2.0.15
>> > I got the following:
>> > May 15, 2019 4:59:11 PM org.apache.pdfbox.filter.FlateFilter decompress
>> > WARNING: FlateFilter: premature end of stream due to a DataFormatException
>> > May 15, 2019 4:59:11 PM org.apache.pdfbox.filter.FlateFilter decode
>> > SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
>> > Exception in thread "main" java.io.IOException: java.util.zip.DataFormatException: invalid literal/lengths set
>> >     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:58)
>> >     at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
>> >     at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:77)
>> >     at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
>> >     at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
>> >     at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:170)
>> >     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:91)
>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
>> >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
>> >     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>> >     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>> >     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>> >     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>> >     at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:375)
>> >     at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:272)
>> >     at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
>> >     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
>> > Caused by: java.util.zip.DataFormatException: invalid literal/lengths set
>> >     at java.util.zip.Inflater.inflateBytes(Native Method)
>> >     at java.util.zip.Inflater.inflate(Inflater.java:259)
>> >     at java.util.zip.Inflater.inflate(Inflater.java:280)
>> >     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:83)
>> >     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
>> >     ... 17 more
>> >
>> >
>> > On Wed, May 15, 2019 at 4:54 PM Tim Allison <[email protected]> wrote:
>> >
>> > > Sounds like it might be a bug.
>> > >
>> > > PDFBox colleagues, any recs?
>> > >
>> > > Slava, if you’re able to share the file even if only privately, that’ll
>> > > help.
>> > >
>> > > On Wed, May 15, 2019 at 9:49 AM Slava G <[email protected]> wrote:
>> > >
>> > > > I have a small PDF file (142kb). When I try to parse it with Tika, my
>> > > > entire app crashes with an OOM and a 36gb heap dump (nothing else in the
>> > > > code, just parsing this PDF).
>> > > > With possible error: FlateFilter: stop reading corrupt stream due to a
>> > > > DataFormatException
>> > > > And stack trace (at the moment of OOM):
>> > > > "main" #1 prio=5 os_prio=0 tid=0x00007f6460009000 nid=0x4876 waiting for monitor entry [0x00007f646680d000]
>> > > >    java.lang.Thread.State: BLOCKED (on object monitor)
>> > > >     at java.util.HashMap.newNode(HashMap.java:1734)
>> > > >     at java.util.HashMap.putVal(HashMap.java:630)
>> > > >     at java.util.HashMap.put(HashMap.java:611)
>> > > >     at org.apache.fontbox.cmap.CMap.addCharMapping(CMap.java:191)
>> > > >     at org.apache.fontbox.cmap.CMapParser.parseBeginbfrange(CMapParser.java:398)
>> > > >     at org.apache.fontbox.cmap.CMapParser.parse(CMapParser.java:136)
>> > > >     at org.apache.pdfbox.pdmodel.font.CMapManager.parseCMap(CMapManager.java:75)
>> > > >     at org.apache.pdfbox.pdmodel.font.PDFont.readCMap(PDFont.java:197)
>> > > >     at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:137)
>> > > >     at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:176)
>> > > >     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:83)
>> > > >     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>> > > >     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>> > > >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
>> > > >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
>> > > >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
>> > > >     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>> > > >     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>> > > >     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>> > > >     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>> > > >     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>> > > >     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>> > > >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>> > > >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>> > > >
>> > > >
>> > > > Please advise how I can detect that this can happen and skip such a file
>> > > > during parsing. Or is this a bug?
>> > > >
>> > > > Thanks
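
On the "how can I detect this" question at the bottom of the thread: the first trace ends in java.util.zip.Inflater throwing a DataFormatException, so one low-level symptom can be checked with the JDK alone. Below is a rough sketch, assuming you already have the raw Flate-compressed bytes of a stream in hand (getting them out of a PDF is not shown). The class name FlateCheck, its two methods, and the size cap are all made up for illustration and are not part of PDFBox or Tika. Note also that the OOM trace sits in CMap parsing, so a check like this is not guaranteed to catch the file in question.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not a PDFBox or Tika API.
public class FlateCheck {

    /**
     * Returns true only if the zlib/Flate data inflates completely,
     * without a DataFormatException, to at most maxOutput bytes.
     */
    public static boolean inflatesCleanly(byte[] data, long maxOutput) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        byte[] buf = new byte[8192];
        long total = 0;
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                total += n;
                if (total > maxOutput) {
                    return false; // decompressed size exceeds the cap
                }
                if (n == 0 && !inflater.finished()) {
                    return false; // no progress: truncated input or dictionary needed
                }
            }
            return true;
        } catch (DataFormatException e) {
            return false; // corrupt stream, as in the trace above
        } finally {
            inflater.end();
        }
    }

    /** Compresses raw bytes with the default zlib settings (used for the demo). */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

A caller could skip any file for which inflatesCleanly returns false on one of its streams. That only covers corrupt or oversized Flate data, so running the parse in a separate JVM with a capped heap is probably still the safer guard against an OOM like the one reported.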

