Hi Erick,

Thanks for your reply.

However, it is unlikely to be a browser issue, as I get the same result when I try it in the Tika app.

Regards,
Edwin

On 18 December 2015 at 23:39, Erick Erickson <erickerick...@gmail.com> wrote:

> This could also simply be your browser isn't set up to
> display UTF-8, the characters may be just fine.
>
> Best,
> Erick
>
> On Fri, Dec 18, 2015 at 12:58 AM, Zheng Lin Edwin Yeo
> <edwinye...@gmail.com> wrote:
> > Thanks for all your replies.
> >
> > I did chance upon this question on Stack Overflow, which says it is able to
> > solve the issue:
> > http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/
> >
> > However, when I tried to run it, I still get the same "?????" output in
> > the content, the same as what I get from the Tika app.
> >
> > Regards,
> > Edwin
> >
> > On 17 December 2015 at 23:58, Walter Underwood <wun...@wunderwood.org>
> > wrote:
> >
> >> PDF isn’t really text. For example, it doesn’t have spaces, it just moves
> >> the next letter over farther. Letters might not be in reading order: two-column
> >> text could be printed as horizontal scans. Custom fonts might not
> >> use an encoding that matches Unicode, which makes them encrypted (badly).
> >> And so on.
> >>
> >> As one of my coworkers said, trying to turn a PDF into structured text is
> >> like trying to turn hamburger back into a cow.
> >>
> >> PDF is where text goes to die.
> >>
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/ (my blog)
> >>
> >> > On Dec 17, 2015, at 2:48 AM, Charlie Hull <char...@flax.co.uk> wrote:
> >> >
> >> > On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> >> >> Hi Alexandre,
> >> >>
> >> >> Thanks for your reply.
> >> >>
> >> >> So the only way to solve this issue is to explore with PDF specific tools
> >> >> and change the encoding of the file?
> >> >> Is there any way to configure it in Solr?
> >> >
> >> > Solr uses Tika to extract plain text from PDFs. If the PDFs have been
> >> > created in a way that Tika cannot easily extract the text, there's nothing
> >> > you can do in Solr that will help.
> >> >
> >> > Unfortunately PDF isn't a content format but a presentation format - so
> >> > extracting plain text is fraught with difficulty. You may see a character
> >> > on a PDF page, but exactly how that character is generated (using a
> >> > specific encoding, font, or even by drawing a picture) is outside your
> >> > control. There are various businesses built on this premise - they charge
> >> > for creating clean extracted text from PDFs - and even they have trouble
> >> > with some PDFs.
> >> >
> >> > HTH
> >> >
> >> > Charlie
> >> >
> >> >> Regards,
> >> >> Edwin
> >> >>
> >> >> On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> They could be using custom fonts and non-Unicode characters. That's
> >> >>> probably something to explore with PDF specific tools.
> >> >>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
> >> >>> wrote:
> >> >>>
> >> >>>> I've checked all the files that have problems with the content in the Solr
> >> >>>> index using the Tika app. All of them show the same issues as what I see
> >> >>>> in the Solr index.
> >> >>>>
> >> >>>> So does the issue lie with the encoding of the file? Are we able to
> >> >>>> check the encoding of the file?
> >> >>>>
> >> >>>> Regards,
> >> >>>> Edwin
> >> >>>>
> >> >>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >> >>>> wrote:
> >> >>>>
> >> >>>>> Hi Erik,
> >> >>>>>
> >> >>>>> I've shared the file on dropbox, which you can access via the link here:
> >> >>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> >> >>>>>
> >> >>>>> This is what I get from the Tika app after dropping the file in.
> >> >>>>>
> >> >>>>> Content-Length: 75092
> >> >>>>> Content-Type: application/pdf
> >> >>>>> Type: COSName{Info}
> >> >>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
> >> >>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> >> >>>>> X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> >> >>>>> access_permission:assemble_document: true
> >> >>>>> access_permission:can_modify: true
> >> >>>>> access_permission:can_print: true
> >> >>>>> access_permission:can_print_degraded: true
> >> >>>>> access_permission:extract_content: true
> >> >>>>> access_permission:extract_for_accessibility: true
> >> >>>>> access_permission:fill_in_form: true
> >> >>>>> access_permission:modify_annotations: true
> >> >>>>> dc:format: application/pdf; version=1.3
> >> >>>>> pdf:PDFVersion: 1.3
> >> >>>>> pdf:encrypted: false
> >> >>>>> producer: null
> >> >>>>> resourceName: Desmophen+670+BAe.pdf
> >> >>>>> xmpTPg:NPages: 3
> >> >>>>>
> >> >>>>> Regards,
> >> >>>>> Edwin
> >> >>>>>
> >> >>>>> On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>>> Edwin - Can you share one of those PDF files?
> >> >>>>>>
> >> >>>>>> Also, drop the file into the Tika app and see what it sees directly -
> >> >>>>>> get the tika-app JAR and run that desktop application.
> >> >>>>>>
> >> >>>>>> Could be an encoding issue?
> >> >>>>>>
> >> >>>>>> Erik
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Erik Hatcher, Senior Solutions Architect
> >> >>>>>> http://www.lucidworks.com <http://www.lucidworks.com/>
> >> >>>>>>
> >> >>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo
> >> >>>>>>> <edwinye...@gmail.com> wrote:
> >> >>>>>>>
> >> >>>>>>> Hi,
> >> >>>>>>>
> >> >>>>>>> I'm using Solr 5.3.0
> >> >>>>>>>
> >> >>>>>>> I'm indexing some PDF documents. However, for certain PDF files, there
> >> >>>>>>> is Chinese text in the documents, but after indexing, what is indexed in
> >> >>>>>>> the content is either a series of "??????" or empty content.
> >> >>>>>>>
> >> >>>>>>> I'm using the post.jar that comes together with Solr.
> >> >>>>>>>
> >> >>>>>>> What could be the reason that causes this?
> >> >>>>>>>
> >> >>>>>>> Regards,
> >> >>>>>>> Edwin
> >> >
> >> > --
> >> > Charlie Hull
> >> > Flax - Open Source Enterprise Search
> >> >
> >> > tel/fax: +44 (0)8700 118334
> >> > mobile: +44 (0)7767 825828
> >> > web: www.flax.co.uk
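
One way to tell a display problem from an extraction problem, as discussed in this thread, is to look at the code points in the extracted text rather than at how they render. The Python sketch below is only an illustration: the function name classify_extracted_text and the 50% threshold are assumptions and not part of Tika or Solr, and the CJK check covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). It also shows how a lossy charset conversion turns Chinese characters into literal question marks.

```python
def classify_extracted_text(text: str) -> str:
    """Roughly classify extracted text by its code points (hypothetical helper)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return "empty"
    # Literal '?' (U+003F) or the replacement character U+FFFD means the
    # original characters are already gone; no display setting can restore them.
    placeholders = sum(1 for ch in visible if ch in ("?", "\ufffd"))
    if placeholders / len(visible) > 0.5:  # threshold is an arbitrary assumption
        return "lost"
    # Basic CJK Unified Ideographs block only.
    if any("\u4e00" <= ch <= "\u9fff" for ch in visible):
        return "cjk-present"
    return "other"


# A charset that cannot represent the characters substitutes '?' for each one.
lossy = "中文".encode("ascii", errors="replace").decode("ascii")

print(classify_extracted_text(lossy))
print(classify_extracted_text("中文"))
```

If the extracted content classifies as "lost" or "empty" outside the browser as well, the problem most likely lies in the PDF or in the extraction step rather than in how the text is displayed.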