Re: Unable to extract images content (OCR) from PDF files using Solr

Zheng Lin Edwin Yeo Tue, 22 Dec 2015 19:11:41 -0800

Hi,

I'm facing the same issue that you ran into two months ago: I can extract
the content of images in .jpg or .png format, but not of images embedded in
a PDF, even after setting "extractInlineImages true" in
PDFParser.properties.
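
One way to rule out a packaging mistake is to read the setting straight out of
the jar. The following is only a minimal sketch: the function name
read_pdf_parser_flag is made up for illustration, the properties parsing is
simplified, and it only reports what is stored in the given jar, not what Solr
actually does with it at runtime.

```python
import re
import zipfile

# Location of the properties file inside tika-parsers, as quoted in this thread.
PROPS_PATH = "org/apache/tika/parser/pdf/PDFParser.properties"


def read_pdf_parser_flag(jar_path, key="extractInlineImages"):
    """Return the value stored for `key` in the jar's PDFParser.properties.

    Returns None if the jar has no such file or the key is not present.
    (Hypothetical helper, not part of Tika or Solr.)
    """
    with zipfile.ZipFile(jar_path) as jar:
        try:
            raw = jar.read(PROPS_PATH)
        except KeyError:
            return None  # no PDFParser.properties at that path in this jar
    for line in raw.decode("latin-1").splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue  # skip blank lines and comments
        # Simplified parsing: key and value separated by '=', ':' or whitespace.
        match = re.match(r"([^\s=:]+)\s*[=:\s]\s*(.*)", line)
        if match and match.group(1) == key:
            return match.group(2).strip()
    return None
```

For example, read_pdf_parser_flag("solr/contrib/extraction/lib/tika-parsers-1.7.jar")
should return "true" if the modified jar is the one sitting in that directory.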


Have you managed to find alternative solutions to this problem?

Regards,
Edwin

On 22 October 2015 at 18:05, Damien Picard <picard.dam...@gmail.com> wrote:

> Hi,
>
> I'm using Solr 5.3.0 on Red Hat EL 7 and I am trying to extract content from
> PDF, Word, LibreOffice, etc. docs using the ExtractingRequestHandler.
>
> Everything works fine, except when I want to extract content from images
> embedded in PDF/Word etc. documents:
>
> I send an extract request like this:
> POST /update/extract?literal.id=ocrpdf8&fmap.content=attr_content&uprefix=attr_
>
> In attr_content, I get :
> \n \n date 2015-08-28T13:23:03Z \n
> pdf:PDFVersion 1.4 \n
> xmp:CreatorTool PDFCreator Version 1.2.3 \n
>  stream_content_type application/pdf \n
>  Keywords \n
>  subject \n
>  dc:creator S050735 \n
>  dcterms:created 2015-08-28T13:23:03Z \n
>  Last-Modified 2015-08-28T13:23:03Z \n
>  dcterms:modified 2015-08-28T13:23:03Z \n
>  dc:format application/pdf; version=1.4 \n
>  Last-Save-Date 2015-08-28T13:23:03Z \n
>  stream_name imagepdf.pdf \n
>  meta:save-date 2015-08-28T13:23:03Z \n
>  pdf:encrypted false \n
>  dc:title imagepdf \n
>  modified 2015-08-28T13:23:03Z \n
>  cp:subject \n
>  Content-Type application/pdf \n
>  stream_size 423660 \n
>  X-Parsed-By org.apache.tika.parser.DefaultParser \n
>  X-Parsed-By org.apache.tika.parser.pdf.PDFParser \n
>  creator S050735 \n
>  meta:author S050735 \n
>  dc:subject \n
>  meta:creation-date 2015-08-28T13:23:03Z \n
>  stream_source_info the-file \n
>  created Fri Aug 28 13:23:03 UTC 2015 \n
>  xmpTPg:NPages 1 \n
>  Creation-Date 2015-08-28T13:23:03Z \n
>  meta:keyword \n
>  Author S050735 \n
>  producer GPL Ghostscript 9.04 \n
>  imagepdf \n
>  \n
>  page \n
>  Page 1 sur 1\n \n
>  28/08/2015
> http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4.
> ..
> \n \n embedded:image0.jpg image0.jpg embedded:image1.jpg image1.jpg
> embedded:image2.jpg image2.jpg \n
>
> So Tika works fine, but it doesn't apply OCR content extraction to the
> embedded images.
>
> When I post an image (JPG) to /update/extract, I get its content indexed
> through Tesseract OCR (attr_content field):
> \n \n stream_size 55422 \n
>  X-Parsed-By org.apache.tika.parser.DefaultParser \n
>  X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n
>  stream_content_type image/jpeg \n
>  stream_name OM_1.jpg \n
>  stream_source_info the-file \n
>  Content-Type image/jpeg \n \n \n
>  ‘ '\"I“ \" \"' ./\nlrast. Shortly before the classes started I was
> visiting a.\ncertain public school, a school set in a typically
> English\ncountryside, which on the June clay of my visit was wonder-\nfully
> beauliful. The Head Master—-no less typical than his\nschool and the
> country-side—pointed out the charms of\nboth, and his pride came out in the
> ﬁnal remark which he made\nbeforehe left me. He explained that he had a
> class to take\nin'I'heocritus. Then (with a. buoyant gesture); “ Can
> you\n\n, conceive anything more delightful than a class in
> Theocritus,\n\non such a day and in such a place?\"\n\n \n \n \n
> stream_size 55422 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
> X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n X-Parsed-By
> org.apache.tika.parser.jpeg.JpegParser \n stream_content_type image/jpeg \n
> Resolution Units inch \n stream_source_info the-file \n Compression Type
> Progressive, Huffman \n Data Precision 8 bits \n Number of Components 3 \n
> tiff:ImageLength 286 \n Component 2 Cb component: Quantization table 1,
> Sampling factors 1 horiz/1 vert \n Component 1 Y component: Quantization
> table 0, Sampling factors 2 horiz/2 vert \n Image Height 286 pixels \n X
> Resolution 72 dots \n Image Width 690 pixels \n stream_name OM_1.jpg \n
> Component 3 Cr component: Quantization table 1, Sampling factors 1 horiz/1
> vert \n tiff:BitsPerSample 8 \n tiff:ImageWidth 690 \n Content-Type
> image/jpeg \n Y Resolution 72 dots
>
> I saw in the Tika JIRA that I have to enable extractInlineImages in
> org/apache/tika/parser/pdf/PDFParser.properties to force image extraction
> from PDFs. So I did, and I packaged a tika-app-1.7.jar containing a
> tika-parsers-1.7.jar in which this property is set to true.
> Then I tested my Tika JAR from the CLI:
>
> # java -jar tika-app-1.7.jar -t /data/docs/imagepdf.pdf
>
> In this case, I get the image content:
>
>
> Page 1 sur 1
>
> 28/08/2015
> http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4
> .
> ..
>
> Simple Evan!
> Use Case
> Sdsedulet
>
> So I replaced solr/contrib/extraction/lib/tika-parsers-1.7.jar with my
> modified one, but the images in my PDF are still not extracted.
>
> Does anybody know what I'm doing wrong ?
>
> Thank you.
>
> --
> Damien Picard
> Expert GWT
> <
> http://www.editions-eni.fr/livres/gwt-google-web-toolkit-developpez-des-applications-internet-riches-ria-en-java/.97a1a26e7d5be94763fc45ac2a1e961a.html
> >
> Mob : 06 11 51 47 78
>
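
For anyone who wants to reproduce the extract request quoted above, it can be
assembled roughly as follows. This is a sketch under assumptions: the base URL
(host, port and core name) is hypothetical, build_extract_request is an
illustrative helper, and sending the PDF as the raw request body with an
application/pdf content type is one common way to call the handler, not
necessarily how the original request was sent. The function only builds the
request and does not contact Solr.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(base_url, doc_id, pdf_bytes):
    """Build (but do not send) a POST to /update/extract.

    The query parameters mirror the request quoted in the thread:
    literal.id, fmap.content=attr_content and uprefix=attr_.
    """
    params = urlencode({
        "literal.id": doc_id,
        "fmap.content": "attr_content",
        "uprefix": "attr_",
    })
    return Request(
        url=f"{base_url}/update/extract?{params}",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


# Hypothetical base URL; replace host, port and core name with your own.
EXAMPLE_BASE_URL = "http://localhost:8983/solr/mycore"
```

The request can then be sent with urllib.request.urlopen(req), after reading
the PDF from disk into pdf_bytes; the attr_content field of the indexed
document shows whether any OCR text came back.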
