Re: Issues when indexing PDF files

Zheng Lin Edwin Yeo Fri, 18 Dec 2015 09:52:51 -0800

Hi Erick,

Thanks for your reply.


However, it is unlikely to be a browser issue, as the same result occurs
when I try it in the Tika app.
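
One way to tell a display problem from an extraction problem is to look at
the code points in the extracted text rather than at how it renders. The
following is only a rough Python sketch and is not part of Tika or Solr; the
function name classify_extracted_text, the 0.5 threshold and the labels it
returns are made up for illustration. The idea is that literal "?" (U+003F)
or U+FFFD characters mean the text was already lost before any browser saw
it, while real CJK code points mean the content is intact.

```python
import unicodedata


def classify_extracted_text(text):
    """Roughly classify extracted text by inspecting its code points.

    Hypothetical helper for illustration only; not a Tika or Solr API.
    """
    # Ignore whitespace so that blank output is reported as "empty".
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "empty"
    # Count literal question marks and U+FFFD replacement characters.
    lost = sum(1 for c in chars if c in ("?", "\ufffd"))
    # CJK ideographs have Unicode names that start with "CJK".
    cjk = sum(1 for c in chars if unicodedata.name(c, "").startswith("CJK"))
    if lost / len(chars) > 0.5:  # arbitrary threshold, an assumption
        return "lost"
    if cjk > 0:
        return "cjk-present"
    return "other"
```

Running this on the content copied from the Tika app or from the Solr
response should make it clear which case applies: "lost" would point at
extraction, while "cjk-present" would point at display.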

Regards,
Edwin


On 18 December 2015 at 23:39, Erick Erickson <erickerick...@gmail.com>
wrote:

> This could also simply be that your browser isn't set up to
> display UTF-8; the characters may be just fine.
>
> Best,
> Erick
>
> On Fri, Dec 18, 2015 at 12:58 AM, Zheng Lin Edwin Yeo
> <edwinye...@gmail.com> wrote:
> > Thanks for all your replies.
> >
> > I did chance upon this question on Stack Overflow, which is said to
> > solve the issue:
> >
> > http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/
> >
> > However, when I tried to run it, I still got the same "?????" output in
> > the content, the same as what I get from the Tika app.
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 December 2015 at 23:58, Walter Underwood <wun...@wunderwood.org>
> > wrote:
> >
> >> PDF isn’t really text. For example, it doesn’t have spaces, it just moves
> >> the next letter over farther. Letters might not be in reading order:
> >> two-column text could be printed as horizontal scans. Custom fonts might not
> >> use an encoding that matches Unicode, which makes them encrypted (badly).
> >> And so on.
> >>
> >> As one of my coworkers said, trying to turn a PDF into structured text is
> >> like trying to turn hamburger back into a cow.
> >>
> >> PDF is where text goes to die.
> >>
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
> >> > On Dec 17, 2015, at 2:48 AM, Charlie Hull <char...@flax.co.uk> wrote:
> >> >
> >> > On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> >> >> Hi Alexandre,
> >> >>
> >> >> Thanks for your reply.
> >> >>
> >> >> So the only way to solve this issue is to explore with PDF-specific tools
> >> >> and change the encoding of the file?
> >> >> Is there any way to configure it in Solr?
> >> >
> >> > Solr uses Tika to extract plain text from PDFs. If the PDFs have been
> >> > created in a way that Tika cannot easily extract the text, there's nothing
> >> > you can do in Solr that will help.
> >> >
> >> > Unfortunately PDF isn't a content format but a presentation format, so
> >> > extracting plain text is fraught with difficulty. You may see a character
> >> > on a PDF page, but exactly how that character is generated (using a
> >> > specific encoding, font, or even by drawing a picture) is outside your
> >> > control. There are various businesses built on this premise - they charge
> >> > for creating clean extracted text from PDFs - and even they have trouble
> >> > with some PDFs.
> >> >
> >> > HTH
> >> >
> >> > Charlie
> >> >
> >> >>
> >> >> Regards,
> >> >> Edwin
> >> >>
> >> >>
> >> >> On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> They could be using custom fonts and non-Unicode characters. That's
> >> >>> probably something to explore with PDF specific tools.
> >> >>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
> >> >>> wrote:
> >> >>>
> >> >>>> I've used the Tika app to check all the files whose content has problems
> >> >>>> in the Solr index. All of them show the same issues as what I see
> >> >>>> in the Solr index.
> >> >>>>
> >> >>>> So does the issue lie with the encoding of the file? Are we able to
> >> >>>> check the encoding of the file?
> >> >>>>
> >> >>>>
> >> >>>> Regards,
> >> >>>> Edwin
> >> >>>>
> >> >>>>
> >> >>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >> >>>> wrote:
> >> >>>>
> >> >>>>> Hi Erik,
> >> >>>>>
> >> >>>>> I've shared the file on Dropbox, which you can access via the link
> >> >>>>> here:
> >> >>>>>
> >> >>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> >> >>>>>
> >> >>>>> This is what I get from the Tika app after dropping the file in.
> >> >>>>>
> >> >>>>> Content-Length: 75092
> >> >>>>> Content-Type: application/pdf
> >> >>>>> Type: COSName{Info}
> >> >>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
> >> >>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> >> >>>>> X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> >> >>>>> access_permission:assemble_document: true
> >> >>>>> access_permission:can_modify: true
> >> >>>>> access_permission:can_print: true
> >> >>>>> access_permission:can_print_degraded: true
> >> >>>>> access_permission:extract_content: true
> >> >>>>> access_permission:extract_for_accessibility: true
> >> >>>>> access_permission:fill_in_form: true
> >> >>>>> access_permission:modify_annotations: true
> >> >>>>> dc:format: application/pdf; version=1.3
> >> >>>>> pdf:PDFVersion: 1.3
> >> >>>>> pdf:encrypted: false
> >> >>>>> producer: null
> >> >>>>> resourceName: Desmophen+670+BAe.pdf
> >> >>>>> xmpTPg:NPages: 3
> >> >>>>>
> >> >>>>>
> >> >>>>> Regards,
> >> >>>>> Edwin
> >> >>>>>
> >> >>>>>
> >> >>>>> On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>>> Edwin - Can you share one of those PDF files?
> >> >>>>>>
> >> >>>>>> Also, drop the file into the Tika app and see what it sees directly -
> >> >>>>>> get the tika-app JAR and run that desktop application.
> >> >>>>>>
> >> >>>>>> Could be an encoding issue?
> >> >>>>>>
> >> >>>>>>         Erik
> >> >>>>>>
> >> >>>>>> —
> >> >>>>>> Erik Hatcher, Senior Solutions Architect
> >> >>>>>> http://www.lucidworks.com <http://www.lucidworks.com/>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >> >>>>>>> wrote:
> >> >>>>>>>
> >> >>>>>>> Hi,
> >> >>>>>>>
> >> >>>>>>> I'm using Solr 5.3.0
> >> >>>>>>>
> >> >>>>>>> I'm indexing some PDF documents. However, certain PDF files contain
> >> >>>>>>> Chinese text, and after indexing, what is indexed in the content
> >> >>>>>>> is either a series of "??????" or empty content.
> >> >>>>>>>
> >> >>>>>>> I'm using the post.jar that comes together with Solr.
> >> >>>>>>>
> >> >>>>>>> What could be the reason that causes this?
> >> >>>>>>>
> >> >>>>>>> Regards,
> >> >>>>>>> Edwin
> >> >>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>
> >> >
> >> >
> >> > --
> >> > Charlie Hull
> >> > Flax - Open Source Enterprise Search
> >> >
> >> > tel/fax: +44 (0)8700 118334
> >> > mobile:  +44 (0)7767 825828
> >> > web: www.flax.co.uk
> >>
> >>
>
