Re: Issues when indexing PDF files

Erick Erickson Fri, 18 Dec 2015 07:39:46 -0800

It could also simply be that your browser isn't set up to display
UTF-8; the characters themselves may be just fine.
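
One way to tell the two cases apart is to look at the code points in the
stored field value rather than at what the browser renders. Below is a
minimal Python sketch, assuming the content has already been fetched from
Solr as a Unicode string. The function name is hypothetical, and the CJK
check only covers the basic ideograph block (U+4E00 to U+9FFF).

```python
def summarize_codepoints(text):
    """Count literal '?', U+FFFD replacement chars, and basic CJK ideographs."""
    counts = {"question_mark": 0, "replacement_char": 0, "cjk": 0, "other": 0}
    for ch in text:
        cp = ord(ch)
        if cp == 0x3F:                  # a real ASCII question mark
            counts["question_mark"] += 1
        elif cp == 0xFFFD:              # Unicode replacement character
            counts["replacement_char"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs (basic block only)
            counts["cjk"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample values standing in for the indexed content field.
    print(summarize_codepoints("??????"))
    print(summarize_codepoints("\u4e2d\u6587 PDF"))
```

If the stored value really consists of U+003F characters, the original
text was lost before display, and no browser setting will bring it back.
If it contains CJK code points, the index is fine and only the display
is at fault.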


Best,
Erick

On Fri, Dec 18, 2015 at 12:58 AM, Zheng Lin Edwin Yeo
<edwinye...@gmail.com> wrote:
> Thanks for all your replies.
>
> I did come across this question on Stack Overflow, which is said to
> solve the issue:
> http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/
>
> However, when I tried to run it, I still got the same "?????" output in
> the content, the same as what I get from the Tika app.
>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 23:58, Walter Underwood <wun...@wunderwood.org>
> wrote:
>
>> PDF isn’t really text. For example, it doesn’t have spaces, it just moves
>> the next letter over farther. Letters might not be in reading order — two
>> column text could be printed as horizontal scans. Custom fonts might not
>> use an encoding that matches Unicode, which makes them encrypted (badly).
>> And so on.
>>
>> As one of my coworkers said, trying to turn a PDF into structured text is
>> like trying to turn hamburger back into a cow.
>>
>> PDF is where text goes to die.
>>
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Dec 17, 2015, at 2:48 AM, Charlie Hull <char...@flax.co.uk> wrote:
>> >
>> > On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
>> >> Hi Alexandre,
>> >>
>> >> Thanks for your reply.
>> >>
>> >> So the only way to solve this issue is to explore with PDF specific
>> >> tools and change the encoding of the file?
>> >> Is there any way to configure it in Solr?
>> >
>> > Solr uses Tika to extract plain text from PDFs. If the PDFs have been
>> > created in a way that Tika cannot easily extract the text, there's nothing
>> > you can do in Solr that will help.
>> >
>> > Unfortunately PDF isn't a content format but a presentation format - so
>> > extracting plain text is fraught with difficulty. You may see a character
>> > on a PDF page, but exactly how that character is generated (using a
>> > specific encoding, font, or even by drawing a picture) is outside your
>> > control. There are various businesses built on this premise - they charge
>> > for creating clean extracted text from PDFs - and even they have trouble
>> > with some PDFs.
>> >
>> > HTH
>> >
>> > Charlie
>> >
>> >>
>> >> Regards,
>> >> Edwin
>> >>
>> >>
>> >> On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com>
>> >> wrote:
>> >>
>> >>> They could be using custom fonts and non-Unicode characters. That's
>> >>> probably something to explore with PDF specific tools.
>> >>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> I've used the Tika app to check all the files whose content has
>> >>>> problems in the Solr index. All of them show the same issues as what I
>> >>>> see in the Solr index.
>> >>>>
>> >>>> So does the issue lie with the encoding of the file? Are we able to
>> >>>> check the encoding of the file?
>> >>>>
>> >>>>
>> >>>> Regards,
>> >>>> Edwin
>> >>>>
>> >>>>
>> >>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo
>> >>>> <edwinye...@gmail.com> wrote:
>> >>>>
>> >>>>> Hi Erik,
>> >>>>>
>> >>>>> I've shared the file on dropbox, which you can access via the link
>> >>>>> here:
>> >>>>>
>> >>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>> >>>>>
>> >>>>> This is what I get from the Tika app after dropping the file in.
>> >>>>>
>> >>>>> Content-Length: 75092
>> >>>>> Content-Type: application/pdf
>> >>>>> Type: COSName{Info}
>> >>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>> >>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
>> >>>>> X-TIKA:digest:SHA256:
>> >>>>> d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
>> >>>>> access_permission:assemble_document: true
>> >>>>> access_permission:can_modify: true
>> >>>>> access_permission:can_print: true
>> >>>>> access_permission:can_print_degraded: true
>> >>>>> access_permission:extract_content: true
>> >>>>> access_permission:extract_for_accessibility: true
>> >>>>> access_permission:fill_in_form: true
>> >>>>> access_permission:modify_annotations: true
>> >>>>> dc:format: application/pdf; version=1.3
>> >>>>> pdf:PDFVersion: 1.3
>> >>>>> pdf:encrypted: false
>> >>>>> producer: null
>> >>>>> resourceName: Desmophen+670+BAe.pdf
>> >>>>> xmpTPg:NPages: 3
>> >>>>>
>> >>>>>
>> >>>>> Regards,
>> >>>>> Edwin
>> >>>>>
>> >>>>>
>> >>>>> On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Edwin - Can you share one of those PDF files?
>> >>>>>>
>> >>>>>> Also, drop the file into the Tika app and see what it sees directly -
>> >>>>>> get the tika-app JAR and run that desktop application.
>> >>>>>>
>> >>>>>> Could be an encoding issue?
>> >>>>>>
>> >>>>>>         Erik
>> >>>>>>
>> >>>>>> —
>> >>>>>> Erik Hatcher, Senior Solutions Architect
>> >>>>>> http://www.lucidworks.com <http://www.lucidworks.com/>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo
>> >>>>>>> <edwinye...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> I'm using Solr 5.3.0
>> >>>>>>>
>> >>>>>>> I'm indexing some PDF documents. However, certain PDF files contain
>> >>>>>>> Chinese text, and after indexing, what is indexed in the content is
>> >>>>>>> either a series of "??????" or empty content.
>> >>>>>>>
>> >>>>>>> I'm using the post.jar that comes together with Solr.
>> >>>>>>>
>> >>>>>>> What could be the reason that causes this?
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> Edwin
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>> >
>> > --
>> > Charlie Hull
>> > Flax - Open Source Enterprise Search
>> >
>> > tel/fax: +44 (0)8700 118334
>> > mobile:  +44 (0)7767 825828
>> > web: www.flax.co.uk
>>
>>
