Some PDFs are purely scanned documents, with only a bitmap image for each
page and no text. Others carry the output of a mediocre OCR pass over the
page images, which leaves a fair amount of garbage in the text. My
recollection is that PDFBox parses the PostScript-like content of the PDF
for the text layer, which for a scanned document may be empty, or may hold
some special/hidden text that relates to the processing of the content
rather than the actual content (the page image).
You can usually tell if a PDF is "scanned" by zooming in and looking for
edge artifacts on curved letters, which won't appear for PostScript fonts.
Try a "normal" PDF for comparison.
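A rough way to check this programmatically is sketched below in Python. It
guesses whether the text extracted from a PDF is too sparse to be a real
text layer. It assumes you have already separated the body text from the
metadata fields and know the page count (for example from xmpTPg:NPages in
output like that quoted below). The function name, the threshold of 20
words per page, and the choice to discard bare "page" tokens are my own
assumptions for illustration, not anything PDFBox or Solr provides.

```python
import re


def looks_like_scanned_pdf(body_text, pages, min_words_per_page=20):
    """Guess whether extracted PDF text is too sparse to be a real text layer.

    Hypothetical heuristic: the threshold is arbitrary and should be tuned
    against documents you know to be scanned or not scanned.
    """
    # Runs of two or more letters; the pattern is Unicode-aware, so
    # non-English (e.g. Turkish) words are counted too.
    words = re.findall(r"[^\W\d_]{2,}", body_text)
    # Assumption: bare "page" tokens are placeholders, not content, as in
    # the "page page page page" body shown in this thread.
    real_words = [w for w in words if w.lower() != "page"]
    # Treat a page count of zero as one page to keep the threshold positive.
    return len(real_words) < min_words_per_page * max(pages, 1)
```

For a four-page document whose body is just "page page page page", this
returns True, since no real words remain after filtering. It will not
catch a scanned PDF with a long but garbage OCR layer; that case needs a
dictionary or language check instead of a word count.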
-- Jack Krupansky
-----Original Message-----
From: Ahmet Arslan
Sent: Tuesday, August 14, 2012 7:30 PM
To: solr-user@lucene.apache.org
Subject: scanned pdf with solr cell
Hi All,
I have a set of rich documents. Some of them are scanned PDF files. When I
send a scanned PDF to the extraction request handler, the icon below appears
in my Dock.
http://tinypic.com/r/2mpmo7o/6
http://tinypic.com/r/28ukxhj/6
Does anyone know what this is?
curl "http://localhost:8983/solr/documents/update/extract?literal.ID=ticaret_sicil_gazetesi&literal.URL=ticaret_sicil_gazetesi&commit=true" \
  -F "myfile=@ticaret_sicil_gazetesi.pdf"
No exception appears in the Solr logs. The doc is indexed, and the content field is:
xmpTPg:NPages 4 Creation-Date 2011-08-24T13:03:16Z stream_source_info
myfile created Wed Aug 24 16:03:16 EEST 2011 stream_content_type
application/octet-stream stream_size 2302337 producer Image Recognition
Integrated Systems, Autoformat5,0,0,229 stream_name
ticaret_sicil_gazetesi.pdf Content-Type application/pdf creator I.R.I.S.
page page page page
Environment: solr-trunk, Mac OS X Version 10.7.4, Java HotSpot(TM) 64-Bit
Server VM (build 20.8-b03-424, mixed mode), jetty.
Same thing happens with Solr 4.0-beta and Tomcat too.
Thanks,