> On Wed, Jan 21, 2009 at 10:56 AM, Natraj Kadur
> <[email protected]> wrote:
> > I am using the PDFBox for one of the application. What I am doing
> > is I am extracting the PDF text from the PDF and generating the TOC
> > entries. But I am facing one problem, that is, if the PDF contains
> > these two characters "&#10016;" (✠) and "&#9402;" (Ⓔ) then the
> > processpage(PDPage, COSStream) gives an IOException "Unknown
> > encoding for 'UniJIS-UCS2-H' ". Can you let us know is there any
> > way as to overcome this problem?
>
> Unfortunately not. Unless someone else has a good answer,
> you'll probably need to look at the relevant source code in
> PDFBox to figure out what to do with this. If you do that,
> we'd be happy to apply any fix you may come up with.

I don't have a better answer than Jukka, but perhaps a hint on where to
look for the solution.
As far as I understand, there are several Unicode mappings defined in
Resources/cmap. You have to check whether the two characters you
mentioned above are part of the mapping table "UniJIS-UCS2-H". If not,
the question will be: is there a problem with the mapping file or with
the software that produced the document?
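
A rough sketch of that check, in case it helps. It assumes the mapping
file is a plain-text CMap with the usual "begincidrange" and
"begincidchar" entries (lines of the form "<lo> <hi> cid" and
"<code> cid"). CMapCoverageCheck and covers() are names I made up for
illustration; they are not part of PDFBox. The codes 0x2720 and 0x24BA
are the Unicode code points of the two characters (10016 and 9402 in
decimal).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not PDFBox API: scans the text of a CMap file
// and reports whether a character code appears in any CID mapping entry.
class CMapCoverageCheck {

    // A range entry on its own line: "<lo> <hi> cid"
    private static final Pattern RANGE = Pattern.compile(
        "(?m)^[ \\t]*<([0-9A-Fa-f]+)>[ \\t]*<([0-9A-Fa-f]+)>[ \\t]+\\d+[ \\t]*$");

    // A single-code entry on its own line: "<code> cid"
    private static final Pattern SINGLE = Pattern.compile(
        "(?m)^[ \\t]*<([0-9A-Fa-f]+)>[ \\t]+\\d+[ \\t]*$");

    static boolean covers(String cmapText, int code) {
        Matcher range = RANGE.matcher(cmapText);
        while (range.find()) {
            long lo = Long.parseLong(range.group(1), 16);
            long hi = Long.parseLong(range.group(2), 16);
            if (lo <= code && code <= hi) {
                return true;
            }
        }
        Matcher single = SINGLE.matcher(cmapText);
        while (single.find()) {
            if (Long.parseLong(single.group(1), 16) == code) {
                return true;
            }
        }
        return false;
    }
}
```

To use it, read the contents of the UniJIS-UCS2-H file into a String
and call covers(text, 0x2720) and covers(text, 0x24BA). If both return
true, the mapping file itself is probably not the problem. The sketch
ignores any "usecmap" reference to another CMap file, so a false result
is not conclusive on its own.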

HTH
Andreas