Re: Issues when indexing PDF files

Zheng Lin Edwin Yeo Fri, 18 Dec 2015 00:59:20 -0800

Thanks for all your replies.

I did come across this question on Stack Overflow, whose answer is said to
solve the issue:
http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/


However, when I tried running it, I still got the same "?????" output in
the content, the same as what I get from the Tika app.
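
For checking many files at once, here is a minimal sketch of how the
"?????" or empty-content symptom could be flagged once the extracted text is
available as a string. It does not call Tika or Solr. The function names and
the 0.5 threshold are hypothetical choices for illustration only, and real
question marks in normal text are counted as placeholders too.

```python
def placeholder_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are '?' or U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(1 for ch in visible if ch in ("?", "\ufffd"))
    return bad / len(visible)


def looks_unextractable(text: str, threshold: float = 0.5) -> bool:
    """Flag empty content, or content dominated by placeholder characters.

    The threshold is an assumed value, not something taken from Tika or Solr.
    """
    if not text.strip():
        return True
    return placeholder_ratio(text) >= threshold
```

A file whose extracted content is flagged this way would be a candidate for
inspection with PDF-specific tools, as suggested earlier in the thread.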

Regards,
Edwin


On 17 December 2015 at 23:58, Walter Underwood <wun...@wunderwood.org>
wrote:

> PDF isn’t really text. For example, it doesn’t have spaces, it just moves
> the next letter over farther. Letters might not be in reading order — two
> column text could be printed as horizontal scans. Custom fonts might not
> use an encoding that matches Unicode, which makes them encrypted (badly).
> And so on.
>
> As one of my coworkers said, trying to turn a PDF into structured text is
> like trying to turn hamburger back into a cow.
>
> PDF is where text goes to die.
>
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Dec 17, 2015, at 2:48 AM, Charlie Hull <char...@flax.co.uk> wrote:
> >
> > On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> >> Hi Alexandre,
> >>
> >> Thanks for your reply.
> >>
> >> So the only way to solve this issue is to explore with PDF specific
> >> tools and change the encoding of the file?
> >> Is there any way to configure it in Solr?
> >
> > Solr uses Tika to extract plain text from PDFs. If the PDFs have been
> > created in a way that Tika cannot easily extract the text, there's nothing
> > you can do in Solr that will help.
> >
> > Unfortunately PDF isn't a content format but a presentation format - so
> > extracting plain text is fraught with difficulty. You may see a character
> > on a PDF page, but exactly how that character is generated (using a
> > specific encoding, font, or even by drawing a picture) is outside your
> > control. There are various businesses built on this premise - they charge
> > for creating clean extracted text from PDFs - and even they have trouble
> > with some PDFs.
> >
> > HTH
> >
> > Charlie
> >
> >>
> >> Regards,
> >> Edwin
> >>
> >>
> >> On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com>
> >> wrote:
> >>
> >>> They could be using custom fonts and non-Unicode characters. That's
> >>> probably something to explore with PDF specific tools.
> >>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
> >>> wrote:
> >>>
> >>>> I've checked all the files that have problems with their content in the
> >>>> Solr index using the Tika app. All of them show the same issues as what
> >>>> I see in the Solr index.
> >>>>
> >>>> So does the issue lie with the encoding of the file? Are we able to
> >>>> check the encoding of the file?
> >>>>
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>>
> >>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Erik,
> >>>>>
> >>>>> I've shared the file on Dropbox, which you can access via the link
> >>>>> here:
> >>>>>
> >>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> >>>>>
> >>>>> This is what I get from the Tika app after dropping the file in.
> >>>>>
> >>>>> Content-Length: 75092
> >>>>> Content-Type: application/pdf
> >>>>> Type: COSName{Info}
> >>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
> >>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> >>>>> X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> >>>>> access_permission:assemble_document: true
> >>>>> access_permission:can_modify: true
> >>>>> access_permission:can_print: true
> >>>>> access_permission:can_print_degraded: true
> >>>>> access_permission:extract_content: true
> >>>>> access_permission:extract_for_accessibility: true
> >>>>> access_permission:fill_in_form: true
> >>>>> access_permission:modify_annotations: true
> >>>>> dc:format: application/pdf; version=1.3
> >>>>> pdf:PDFVersion: 1.3
> >>>>> pdf:encrypted: false
> >>>>> producer: null
> >>>>> resourceName: Desmophen+670+BAe.pdf
> >>>>> xmpTPg:NPages: 3
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>>
> >>>>>
> >>>>> On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Edwin - Can you share one of those PDF files?
> >>>>>>
> >>>>>> Also, drop the file into the Tika app and see what it sees directly -
> >>>>>> get the tika-app JAR and run that desktop application.
> >>>>>>
> >>>>>> Could be an encoding issue?
> >>>>>>
> >>>>>>         Erik
> >>>>>>
> >>>>>> —
> >>>>>> Erik Hatcher, Senior Solutions Architect
> >>>>>> http://www.lucidworks.com <http://www.lucidworks.com/>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'm using Solr 5.3.0
> >>>>>>>
> >>>>>>> I'm indexing some PDF documents. However, certain PDF files contain
> >>>>>>> Chinese text, and after indexing, what is indexed in the content is
> >>>>>>> either a series of "??????" or empty content.
> >>>>>>>
> >>>>>>> I'm using the post.jar that comes together with Solr.
> >>>>>>>
> >>>>>>> What could be the reason that causes this?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Edwin
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> >
> > --
> > Charlie Hull
> > Flax - Open Source Enterprise Search
> >
> > tel/fax: +44 (0)8700 118334
> > mobile:  +44 (0)7767 825828
> > web: www.flax.co.uk
>
>
