PDF extraction using Tika

Srinivas Kashyap Mon, 24 Aug 2020 08:09:44 -0700

Hello,

We are using TikaEntityProcessor to extract the content of PDF files and make 
it searchable.


When Jetty runs on a Windows machine, we can successfully load documents using 
a DIH full import (Tika entity). In this case the PDFs are stored on the 
Windows file system.

But when Solr (on Jetty) runs on a Linux machine and we run the same DIH full 
import, we get the exception below. (Here the PDFs are stored on the Linux 
file system.)

Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
    at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
    ... 4 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
    ... 6 more
Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)
    ... 10 more
Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
    at org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:226)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:163)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:510)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    ... 15 more
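
One check that might help narrow this down, assuming the PDFs on the Linux 
machine were copied over from the Windows machine, is to compare a checksum of 
the same file on both systems. The innermost exception comes from PDFBox while 
it parses a page content stream, and one possible (unconfirmed) cause is a copy 
that altered the bytes, for example a text-mode transfer that rewrites line 
endings. Below is a minimal sketch using only the JDK (Java 8 compatible); the 
class name PdfChecksum and the path argument are hypothetical and not part of 
Solr or Tika:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: prints the SHA-256 hash of a file's raw bytes.
class PdfChecksum {

    // Streams the file through SHA-256 and returns the digest as lowercase hex.
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            // Mask to 0..255 so negative bytes format as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfChecksum <path-to-pdf>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(sha256Hex(file) + "  " + file);
    }
}
```

Running it against the same PDF on both machines and comparing the two hashes 
would show whether the file changed in transit. If the hashes differ, the 
problem is likely in how the files reached the Linux file system rather than 
in Tika or the DIH configuration.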

Could you please suggest how to extract PDF content from a Linux file system?

Thanks,
Srinivas Kashyap
