This could also simply be that your browser isn't set up to display UTF-8; the characters themselves may be just fine.
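One way to tell the two cases apart is to look at the code points in the returned content rather than at how the browser renders them. Below is a minimal Python sketch, assuming the content value has already been copied into a string. The sample strings and the helper names `describe_code_points` and `looks_lost` are hypothetical and are not part of Solr or Tika.

```python
import unicodedata


def describe_code_points(text, limit=20):
    """Return (code point, Unicode name) pairs for the first `limit` characters."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text[:limit]
    ]


def looks_lost(text):
    """True if every non-whitespace character is '?' or U+FFFD.

    That pattern suggests the original characters were lost before the
    text was stored, so no browser setting can bring them back.
    """
    visible = [ch for ch in text if not ch.isspace()]
    return bool(visible) and all(ch in "?\ufffd" for ch in visible)


# Hypothetical sample values standing in for an indexed content field.
literal_question_marks = "??????"
real_chinese = "中文内容"

# Literal question marks report as U+003F; real Chinese text reports as
# CJK ideographs even when the display cannot draw them.
print(describe_code_points(literal_question_marks))
print(describe_code_points(real_chinese))
print(looks_lost(literal_question_marks), looks_lost(real_chinese))

# One common way text degrades into '?': encoding it to a charset that
# has no representation for the characters.
degraded = real_chinese.encode("ascii", errors="replace").decode("ascii")
print(degraded)
```

If the stored content is made of U+003F characters, the text was lost during extraction or indexing. If it is made of CJK code points, the data is intact and only the display is at fault.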
Best,
Erick

On Fri, Dec 18, 2015 at 12:58 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> Thanks for all your replies.
>
> I did chance upon this question on Stack Overflow, which says it can
> solve the issue:
> http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files/
>
> However, when I tried to run it, I still get the same "?????" output in
> the content, the same as what I get from the Tika app.
>
> Regards,
> Edwin
>
> On 17 December 2015 at 23:58, Walter Underwood <wun...@wunderwood.org>
> wrote:
>
>> PDF isn’t really text. For example, it doesn’t have spaces, it just moves
>> the next letter over farther. Letters might not be in reading order: two
>> column text could be printed as horizontal scans. Custom fonts might not
>> use an encoding that matches Unicode, which makes them encrypted (badly).
>> And so on.
>>
>> As one of my coworkers said, trying to turn a PDF into structured text is
>> like trying to turn hamburger back into a cow.
>>
>> PDF is where text goes to die.
>>
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>> > On Dec 17, 2015, at 2:48 AM, Charlie Hull <char...@flax.co.uk> wrote:
>> >
>> > On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
>> >> Hi Alexandre,
>> >>
>> >> Thanks for your reply.
>> >>
>> >> So the only way to solve this issue is to explore with PDF specific
>> >> tools and change the encoding of the file?
>> >> Is there any way to configure it in Solr?
>> >
>> > Solr uses Tika to extract plain text from PDFs. If the PDFs have been
>> > created in a way that Tika cannot easily extract the text, there's
>> > nothing you can do in Solr that will help.
>> >
>> > Unfortunately PDF isn't a content format but a presentation format - so
>> > extracting plain text is fraught with difficulty. You may see a character
>> > on a PDF page, but exactly how that character is generated (using a
>> > specific encoding, font, or even by drawing a picture) is outside your
>> > control. There are various businesses built on this premise - they charge
>> > for creating clean extracted text from PDFs - and even they have trouble
>> > with some PDFs.
>> >
>> > HTH
>> >
>> > Charlie
>> >
>> >> Regards,
>> >> Edwin
>> >>
>> >> On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com>
>> >> wrote:
>> >>
>> >>> They could be using custom fonts and non-Unicode characters. That's
>> >>> probably something to explore with PDF specific tools.
>> >>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> I've checked all the files which have problems with the content in
>> >>>> the Solr index using the Tika app. All of them show the same issues
>> >>>> that I see in the Solr index.
>> >>>>
>> >>>> So does the issue lie with the encoding of the file? Are we able to
>> >>>> check the encoding of the file?
>> >>>>
>> >>>> Regards,
>> >>>> Edwin
>> >>>>
>> >>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>>> Hi Erik,
>> >>>>>
>> >>>>> I've shared the file on Dropbox, which you can access via the link
>> >>>>> here:
>> >>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>> >>>>>
>> >>>>> This is what I get from the Tika app after dropping the file in.
>> >>>>>
>> >>>>> Content-Length: 75092
>> >>>>> Content-Type: application/pdf
>> >>>>> Type: COSName{Info}
>> >>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>> >>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
>> >>>>> X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
>> >>>>> access_permission:assemble_document: true
>> >>>>> access_permission:can_modify: true
>> >>>>> access_permission:can_print: true
>> >>>>> access_permission:can_print_degraded: true
>> >>>>> access_permission:extract_content: true
>> >>>>> access_permission:extract_for_accessibility: true
>> >>>>> access_permission:fill_in_form: true
>> >>>>> access_permission:modify_annotations: true
>> >>>>> dc:format: application/pdf; version=1.3
>> >>>>> pdf:PDFVersion: 1.3
>> >>>>> pdf:encrypted: false
>> >>>>> producer: null
>> >>>>> resourceName: Desmophen+670+BAe.pdf
>> >>>>> xmpTPg:NPages: 3
>> >>>>>
>> >>>>> Regards,
>> >>>>> Edwin
>> >>>>>
>> >>>>> On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Edwin - Can you share one of those PDF files?
>> >>>>>>
>> >>>>>> Also, drop the file into the Tika app and see what it sees
>> >>>>>> directly - get the tika-app JAR and run that desktop application.
>> >>>>>>
>> >>>>>> Could be an encoding issue?
>> >>>>>>
>> >>>>>> Erik
>> >>>>>>
>> >>>>>> --
>> >>>>>> Erik Hatcher, Senior Solutions Architect
>> >>>>>> http://www.lucidworks.com
>> >>>>>>
>> >>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo
>> >>>>>>> <edwinye...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> I'm using Solr 5.3.0
>> >>>>>>>
>> >>>>>>> I'm indexing some PDF documents. However, for certain PDF files,
>> >>>>>>> there is Chinese text in the documents, but after indexing, what
>> >>>>>>> is indexed in the content is either a series of "??????" or an
>> >>>>>>> empty content.
>> >>>>>>>
>> >>>>>>> I'm using the post.jar that comes together with Solr.
>> >>>>>>>
>> >>>>>>> What could be the reason that causes this?
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> Edwin
>> >
>> > --
>> > Charlie Hull
>> > Flax - Open Source Enterprise Search
>> >
>> > tel/fax: +44 (0)8700 118334
>> > mobile: +44 (0)7767 825828
>> > web: www.flax.co.uk