Thanks for all your replies. I came across this question on Stack Overflow, which is said to solve the issue: http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/
However, when I tried to run it, I still get the same "?????" output in the content, the same as what I get from the Tika app.

Regards,
Edwin

On 17 December 2015 at 23:58, Walter Underwood <wun...@wunderwood.org> wrote:

> PDF isn’t really text. For example, it doesn’t have spaces, it just moves
> the next letter over farther. Letters might not be in reading order - two
> column text could be printed as horizontal scans. Custom fonts might not
> use an encoding that matches Unicode, which makes them encrypted (badly).
> And so on.
>
> As one of my coworkers said, trying to turn a PDF into structured text is
> like trying to turn hamburger back into a cow.
>
> PDF is where text goes to die.
>
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
> > On Dec 17, 2015, at 2:48 AM, Charlie Hull <char...@flax.co.uk> wrote:
> >
> > On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> >> Hi Alexandre,
> >>
> >> Thanks for your reply.
> >>
> >> So the only way to solve this issue is to explore with PDF specific tools
> >> and change the encoding of the file?
> >> Is there any way to configure it in Solr?
> >
> > Solr uses Tika to extract plain text from PDFs. If the PDFs have been
> > created in a way that Tika cannot easily extract the text, there's nothing
> > you can do in Solr that will help.
> >
> > Unfortunately PDF isn't a content format but a presentation format - so
> > extracting plain text is fraught with difficulty. You may see a character
> > on a PDF page, but exactly how that character is generated (using a
> > specific encoding, font, or even by drawing a picture) is outside your
> > control. There are various businesses built on this premise - they charge
> > for creating clean extracted text from PDFs - and even they have trouble
> > with some PDFs.
> >
> > HTH
> >
> > Charlie
> >
> >>
> >> Regards,
> >> Edwin
> >>
> >>
> >> On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com>
> >> wrote:
> >>
> >>> They could be using custom fonts and non-Unicode characters. That's
> >>> probably something to explore with PDF specific tools.
> >>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
> >>> wrote:
> >>>
> >>>> I've checked all the files which has problem with the content in the Solr
> >>>> index using the Tika app. All of them shows the same issues as what I see
> >>>> in the Solr index.
> >>>>
> >>>> So does the issues lies with the encoding of the file? Are we able to check
> >>>> the encoding of the file?
> >>>>
> >>>>
> >>>> Regards,
> >>>> Edwin
> >>>>
> >>>>
> >>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Erik,
> >>>>>
> >>>>> I've shared the file on dropbox, which you can access via the link here:
> >>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> >>>>>
> >>>>> This is what I get from the Tika app after dropping the file in.
> >>>>>
> >>>>> Content-Length: 75092
> >>>>> Content-Type: application/pdf
> >>>>> Type: COSName{Info}
> >>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
> >>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> >>>>> X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> >>>>> access_permission:assemble_document: true
> >>>>> access_permission:can_modify: true
> >>>>> access_permission:can_print: true
> >>>>> access_permission:can_print_degraded: true
> >>>>> access_permission:extract_content: true
> >>>>> access_permission:extract_for_accessibility: true
> >>>>> access_permission:fill_in_form: true
> >>>>> access_permission:modify_annotations: true
> >>>>> dc:format: application/pdf; version=1.3
> >>>>> pdf:PDFVersion: 1.3
> >>>>> pdf:encrypted: false
> >>>>> producer: null
> >>>>> resourceName: Desmophen+670+BAe.pdf
> >>>>> xmpTPg:NPages: 3
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>> Edwin
> >>>>>
> >>>>>
> >>>>> On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Edwin - Can you share one of those PDF files?
> >>>>>>
> >>>>>> Also, drop the file into the Tika app and see what it sees directly - get
> >>>>>> the tika-app JAR and run that desktop application.
> >>>>>>
> >>>>>> Could be an encoding issue?
> >>>>>>
> >>>>>> Erik
> >>>>>>
> >>>>>> --
> >>>>>> Erik Hatcher, Senior Solutions Architect
> >>>>>> http://www.lucidworks.com <http://www.lucidworks.com/>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'm using Solr 5.3.0
> >>>>>>>
> >>>>>>> I'm indexing some PDF documents. However, for certain PDF files, there are
> >>>>>>> chinese text in the documents, but after indexing, what is indexed in the
> >>>>>>> content is either a series of "??????" or an empty content.
> >>>>>>>
> >>>>>>> I'm using the post.jar that comes together with Solr.
> >>>>>>>
> >>>>>>> What could be the reason that causes this?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Edwin
> >
> > --
> > Charlie Hull
> > Flax - Open Source Enterprise Search
> >
> > tel/fax: +44 (0)8700 118334
> > mobile: +44 (0)7767 825828
> > web: www.flax.co.uk
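
Since the Tika app and the Solr index show the same "?????" content for these files, it may help to flag affected documents automatically instead of inspecting them one by one. The following is only a minimal Python sketch of that idea and is not part of Tika or Solr: the function names `garbled_ratio` and `looks_garbled` and the 0.5 threshold are assumptions made for illustration. It counts question marks, U+FFFD replacement characters, and private-use or unassigned code points among the non-whitespace characters of the extracted text, and treats empty content as a failed extraction.

```python
import unicodedata


def garbled_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look unmapped.

    Hypothetical helper: "unmapped" here means a literal '?', the Unicode
    replacement character U+FFFD, or a private-use / unassigned code point,
    which is what custom font encodings often turn into after extraction.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # Empty content is treated as a failed extraction.
        return 1.0
    bad = sum(
        1
        for ch in visible
        if ch in ("?", "\ufffd") or unicodedata.category(ch) in ("Co", "Cn")
    )
    return bad / len(visible)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Flag extracted content whose unmapped share reaches the threshold.

    The 0.5 default is an arbitrary assumption; tune it on real documents.
    """
    return garbled_ratio(text) >= threshold
```

Documents flagged this way are candidates for PDF-specific tools or OCR, as suggested in the replies above; ordinary text containing a few real question marks stays well below the threshold.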