Re: scanned pdf with solr cell

Jack Krupansky Tue, 14 Aug 2012 17:01:34 -0700

Some PDFs are purely scanned documents and have only a bitmap image for each page with no text, or sometimes they do a mediocre OCR on the page images which produces a fair amount of garbage in the text. My recollection is that PDFBox is parsing the PostScript for the text layer, which may be either empty or maybe has some special/hidden text that relates to the processing of the content but may not be the actual content (page image.)

You can usually tell if a PDF is "scanned" by zooming in and looking for edge artifacts on curved letters, which won't appear for PostScript fonts.


Try a "normal" PDF for comparison.
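If zooming in by eye is not practical for a large batch, the same comparison can be roughed out on the extracted text. The following is only a sketch under stated assumptions: the function name looks_scanned and the threshold of 200 characters per page are made up for illustration, and it assumes you already have the extracted body text and the page count from somewhere (for example the content field and the NPages metadata value).

```python
def looks_scanned(extracted_text: str, page_count: int,
                  min_chars_per_page: int = 200) -> bool:
    """Guess whether a PDF is a scan from how little text was extracted.

    Hypothetical helper: the name and the default threshold are
    assumptions for illustration, not part of Solr, Tika, or PDFBox.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only letters and digits so that whitespace and punctuation
    # do not inflate the total.
    alnum = sum(1 for ch in extracted_text if ch.isalnum())
    # A page of real text normally yields far more characters than
    # a page whose only content is a bitmap image.
    return alnum / page_count < min_chars_per_page


if __name__ == "__main__":
    # Four pages that produced almost no text: likely a scan.
    print(looks_scanned("page page page page", 4))
    # Four pages with plenty of text: likely a "normal" PDF.
    print(looks_scanned("word " * 1000, 4))
```

The threshold needs tuning against a few known "normal" PDFs, and a scan with a poor OCR layer can still pass the check while carrying garbage text.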

-- Jack Krupansky

-----Original Message-----
From: Ahmet Arslan

Sent: Tuesday, August 14, 2012 7:30 PM
To: solr-user@lucene.apache.org
Subject: scanned pdf with solr cell

Hi All,

I have a set of rich documents. Some of them are scanned PDF files. When I send a scanned PDF to the extraction request handler, the icon below appears in my Dock.


http://tinypic.com/r/2mpmo7o/6
http://tinypic.com/r/28ukxhj/6

Does anyone know what this is?

curl "http://localhost:8983/solr/documents/update/extract?literal.ID=ticaret_sicil_gazetesi&literal.URL=ticaret_sicil_gazetesi&commit=true" -F "myfile=@ticaret_sicil_gazetesi.pdf"


No exception appears in the Solr logs. The document is indexed, and its content field is:

xmpTPg:NPages 4 Creation-Date 2011-08-24T13:03:16Z stream_source_info myfile created Wed Aug 24 16:03:16 EEST 2011 stream_content_type application/octet-stream stream_size 2302337 producer Image Recognition Integrated Systems, Autoformat5,0,0,229 stream_name ticaret_sicil_gazetesi.pdf Content-Type application/pdf creator I.R.I.S. page page page page

Environment: solr-trunk, Mac OS X Version 10.7.4, Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode), jetty.


Same thing happens with Solr 4.0-beta and Tomcat too.

Thanks,
