Unable to extract images content (OCR) from PDF files using Solr

Damien Picard Thu, 22 Oct 2015 03:06:08 -0700

Hi,

I'm using Solr 5.3.0 on Red Hat EL 7, and I'm trying to extract content from
PDF, Word, LibreOffice, etc. documents using the ExtractingRequestHandler.


Everything works fine, except when I want to extract content from images
embedded in PDF, Word, etc. documents:

I send an extract request like this:
POST /update/extract?literal.id=ocrpdf8&fmap.content=attr_content&uprefix=attr_
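
For illustration, the request above could be sent from a script roughly like this. It is only a minimal sketch: the host, port and core name in SOLR_CORE_URL are placeholders, the helper names are hypothetical, and sending the file as a raw request body is an assumption about how the document is uploaded. Only the three query parameters come from the request shown above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder: host, port and core name are assumptions, not the real setup.
SOLR_CORE_URL = "http://localhost:8983/solr/mycore"


def build_extract_url(core_url, doc_id):
    """Build the /update/extract URL with the parameters used in this post."""
    params = {
        "literal.id": doc_id,            # unique id stored with the document
        "fmap.content": "attr_content",  # map Tika's content field to attr_content
        "uprefix": "attr_",              # prefix for fields unknown to the schema
    }
    return core_url + "/update/extract?" + urlencode(params)


def post_document(core_url, doc_id, path, content_type="application/pdf"):
    """POST the file at `path` as the raw request body; return the HTTP status."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = Request(
        build_extract_url(core_url, doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(req) as resp:
        return resp.status
```

Calling post_document(SOLR_CORE_URL, "ocrpdf8", "/data/docs/imagepdf.pdf") would send the PDF. The sketch does not issue a commit.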

In attr_content, I get :
\n \n date 2015-08-28T13:23:03Z \n
pdf:PDFVersion 1.4 \n
xmp:CreatorTool PDFCreator Version 1.2.3 \n
 stream_content_type application/pdf \n
 Keywords \n
 subject \n
 dc:creator S050735 \n
 dcterms:created 2015-08-28T13:23:03Z \n
 Last-Modified 2015-08-28T13:23:03Z \n
 dcterms:modified 2015-08-28T13:23:03Z \n
 dc:format application/pdf; version=1.4 \n
 Last-Save-Date 2015-08-28T13:23:03Z \n
 stream_name imagepdf.pdf \n
 meta:save-date 2015-08-28T13:23:03Z \n
 pdf:encrypted false \n
 dc:title imagepdf \n
 modified 2015-08-28T13:23:03Z \n
 cp:subject \n
 Content-Type application/pdf \n
 stream_size 423660 \n
 X-Parsed-By org.apache.tika.parser.DefaultParser \n
 X-Parsed-By org.apache.tika.parser.pdf.PDFParser \n
 creator S050735 \n
 meta:author S050735 \n
 dc:subject \n
 meta:creation-date 2015-08-28T13:23:03Z \n
 stream_source_info the-file \n
 created Fri Aug 28 13:23:03 UTC 2015 \n
 xmpTPg:NPages 1 \n
 Creation-Date 2015-08-28T13:23:03Z \n
 meta:keyword \n
 Author S050735 \n
 producer GPL Ghostscript 9.04 \n
 imagepdf \n
 \n
 page \n
 Page 1 sur 1\n \n
 28/08/2015
http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4...
\n \n embedded:image0.jpg image0.jpg embedded:image1.jpg image1.jpg
embedded:image2.jpg image2.jpg \n

So Tika works fine, but it doesn't apply OCR to the
embedded images.

When I post an image (JPG) to /update/extract, its content is indexed
through Tesseract OCR (attr_content field):
\n \n stream_size 55422 \n
 X-Parsed-By org.apache.tika.parser.DefaultParser \n
 X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n
 stream_content_type image/jpeg \n
 stream_name OM_1.jpg \n
 stream_source_info the-file \n
 Content-Type image/jpeg \n \n \n
 ‘ '\"I“ \" \"' ./\nlrast. Shortly before the classes started I was
visiting a.\ncertain public school, a school set in a typically
English\ncountryside, which on the June clay of my visit was wonder-\nfully
beauliful. The Head Master—-no less typical than his\nschool and the
country-side—pointed out the charms of\nboth, and his pride came out in the
ﬁnal remark which he made\nbeforehe left me. He explained that he had a
class to take\nin'I'heocritus. Then (with a. buoyant gesture); “ Can
you\n\n, conceive anything more delightful than a class in
Theocritus,\n\non such a day and in such a place?\"\n\n \n \n \n
stream_size 55422 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n X-Parsed-By
org.apache.tika.parser.jpeg.JpegParser \n stream_content_type image/jpeg \n
Resolution Units inch \n stream_source_info the-file \n Compression Type
Progressive, Huffman \n Data Precision 8 bits \n Number of Components 3 \n
tiff:ImageLength 286 \n Component 2 Cb component: Quantization table 1,
Sampling factors 1 horiz/1 vert \n Component 1 Y component: Quantization
table 0, Sampling factors 2 horiz/2 vert \n Image Height 286 pixels \n X
Resolution 72 dots \n Image Width 690 pixels \n stream_name OM_1.jpg \n
Component 3 Cr component: Quantization table 1, Sampling factors 1 horiz/1
vert \n tiff:BitsPerSample 8 \n tiff:ImageWidth 690 \n Content-Type
image/jpeg \n Y Resolution 72 dots

I saw in the Tika JIRA that I have to enable extractInlineImages in
org/apache/tika/parser/pdf/PDFParser.properties to force image extraction
from PDFs. So I did that, and packaged a tika-app-1.7.jar containing a
tika-parsers-1.7.jar in which this property is set to true.
Then I tested my Tika JAR from the command line:

# java -jar tika-app-1.7.jar -t /data/docs/imagepdf.pdf

In this case, I get the text from the images:


Page 1 sur 1

28/08/2015
http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4...

Simple Evan!
Use Case
Sdsedulet
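
For reference, the properties change described above could be scripted along these lines. This is a rough sketch, not the exact procedure used here: the function names are hypothetical, and it assumes the properties file holds one "key value" pair per line (lines using "=" or ":" as the separator are recognised too, but the rewritten line always uses a space).

```python
import re
import zipfile

# Path of the PDF parser settings inside tika-parsers-1.7.jar (from this post).
PROPS_PATH = "org/apache/tika/parser/pdf/PDFParser.properties"


def set_property(text, key, value):
    """Return properties text with `key` set to `value`, appended if absent."""
    out, found = [], False
    for line in text.splitlines():
        # A properties key ends at the first '=', ':' or whitespace character.
        name = re.split(r"[=:\s]", line.strip(), maxsplit=1)[0]
        if name == key:
            out.append(f"{key} {value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key} {value}")
    return "\n".join(out) + "\n"


def patch_jar(src_jar, dst_jar, key="extractInlineImages", value="true"):
    """Copy src_jar to dst_jar, rewriting only the PDFParser.properties entry."""
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == PROPS_PATH:
                text = data.decode("iso-8859-1")
                data = set_property(text, key, value).encode("iso-8859-1")
            dst.writestr(item, data)
```

Writing to a new file rather than editing in place keeps the original jar available for comparison.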

So I replaced solr/contrib/extraction/lib/tika-parsers-1.7.jar with my
modified one, but the images in my PDF are still not extracted.
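
One sanity check after swapping the jar is to confirm that the file in Solr's lib directory really carries the modified property. The sketch below is only an illustration of that check: the helper names are hypothetical, and it inspects the jar files on disk, not what the running Solr instance has actually loaded.

```python
import glob
import os
import re
import zipfile

# Path of the PDF parser settings inside the tika-parsers jar (from this post).
PROPS_PATH = "org/apache/tika/parser/pdf/PDFParser.properties"


def read_property(jar_path, key):
    """Return the value of `key` in the jar's PDFParser.properties, or None."""
    with zipfile.ZipFile(jar_path) as jar:
        if PROPS_PATH not in jar.namelist():
            return None
        text = jar.read(PROPS_PATH).decode("iso-8859-1")
    for line in text.splitlines():
        # Split "key value", "key=value" or "key: value" into at most two parts.
        parts = re.split(r"\s*[=:]\s*|\s+", line.strip(), maxsplit=1)
        if parts[0] == key:
            return parts[1] if len(parts) > 1 else ""
    return None


def scan_lib_dir(lib_dir, key="extractInlineImages"):
    """Map each tika-parsers jar found in lib_dir to its value for `key`."""
    jars = sorted(glob.glob(os.path.join(lib_dir, "tika-parsers-*.jar")))
    return {os.path.basename(j): read_property(j, key) for j in jars}
```

For example, scan_lib_dir("solr/contrib/extraction/lib") lists every tika-parsers jar in that directory with the value it holds, which also shows whether more than one copy is present.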

Does anybody know what I'm doing wrong?

Thank you.

-- 
Damien Picard
Expert GWT
<http://www.editions-eni.fr/livres/gwt-google-web-toolkit-developpez-des-applications-internet-riches-ria-en-java/.97a1a26e7d5be94763fc45ac2a1e961a.html>
Mob : 06 11 51 47 78
