Some PDFs are purely scanned documents: each page is just a bitmap image with no text at all. Others carry the output of a mediocre OCR pass over the page images, which leaves a fair amount of garbage in the text. My recollection is that PDFBox parses the PostScript for the text layer, which for a scanned file may be empty, or may hold some special/hidden text that relates to the processing of the content rather than the actual content (the page image).

You can usually tell if a PDF is "scanned" by zooming in and looking for edge artifacts on curved letters, which won't appear for PostScript fonts.

Try a "normal" PDF for comparison.
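
If you want to make that comparison less manual, here is a rough sketch of a check you could run on the extracted body text after indexing. It only counts alphabetic words per page and flags documents that fall below a threshold. The function name `looks_scanned` and the cutoff of 20 words per page are my own assumptions for illustration, not anything Solr Cell or PDFBox provides, and it expects the body text and page count to be passed in separately from the metadata.

```python
import re


def looks_scanned(body_text, page_count, min_words_per_page=20):
    """Guess whether extracted text came from an image-only (scanned) PDF.

    Hypothetical heuristic: a real text layer usually yields many words
    per page, while an empty or junk layer yields very few. The default
    threshold is an arbitrary starting point, not a tuned value.
    """
    # Count runs of two or more letters (Unicode-aware), ignoring digits,
    # underscores, and punctuation.
    words = re.findall(r"[^\W\d_]{2,}", body_text)
    # Guard against a missing or zero page count.
    pages = max(page_count, 1)
    return len(words) / pages < min_words_per_page


if __name__ == "__main__":
    # Body text resembling the four-page scanned file in the message below.
    print(looks_scanned("page page page page", 4))
```

A body like the `page page page page` in your content field gets flagged, while a PDF with a real text layer should clear the threshold easily. A bad OCR layer full of garbage may still pass, so treat this as a first filter only.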

-- Jack Krupansky

-----Original Message-----
From: Ahmet Arslan
Sent: Tuesday, August 14, 2012 7:30 PM
To: solr-user@lucene.apache.org
Subject: scanned pdf with solr cell

Hi All,

I have a set of rich documents. Some of them are scanned PDF files. When I send a scanned PDF to the extraction request handler, the icon below appears in my Dock.

http://tinypic.com/r/2mpmo7o/6
http://tinypic.com/r/28ukxhj/6

Does anyone know what this is?

curl "http://localhost:8983/solr/documents/update/extract?literal.ID=ticaret_sicil_gazetesi&literal.URL=ticaret_sicil_gazetesi&commit=true" -F "myfile=@ticaret_sicil_gazetesi.pdf"

No exception appears in the Solr logs. The doc is indexed, and its content field is:

xmpTPg:NPages 4 Creation-Date 2011-08-24T13:03:16Z stream_source_info myfile created Wed Aug 24 16:03:16 EEST 2011 stream_content_type application/octet-stream stream_size 2302337 producer Image Recognition Integrated Systems, Autoformat5,0,0,229 stream_name ticaret_sicil_gazetesi.pdf Content-Type application/pdf creator I.R.I.S. page page page page

Environment: solr-trunk, Mac OS X Version 10.7.4, Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode), jetty.

Same thing happens with Solr 4.0-beta and Tomcat too.

Thanks,
