> This is not purely a PDFBox problem, but I'm using PDFBox and hoping I can
> solve the issue using PDFBox. Thanks for your help.

It's already known that there is a problem, but there isn't a solution yet.

> I am analyzing and modifying PDF text using PDFBox and regular expressions.
> Every PDF that needs to be analyzed comes from Microsoft Word. Therefore
> they contain embedded fonts. When I analyze the text and then replace it, I
> get text running together like this:
>
> http://criminy.webfactional.com/media/images/PDFError02/a_zA_Z0_9_symbols.png
>
> Where it should be: akbcdef...pqr...za....@$^...
>
> What I've noticed is that MS word writes it's embedded fonts with width
> values of 0 for some of the letters, which differs on the fonts used and
> version of MS Word used. I'm able to fix this by running:
>
> font.getWidths().set(ascii('K')-32,new COSFloat((float)690.0));
>
> for each offending letter (usually, this is letters with a width of 0). Now
> I am trying to determine the best way to compute the width of these letters
> as I would like to be able to apply a general case font width correction,
> rather than hope that the MS Word pdf generation doesn't mess up the widths
> any more than they currently are.

Is this problem independent of the font type, e.g. TrueType, Type1, OpenType etc.?

> The worst case scenario, I think, is that I can render each letter, crop it
> and take the pixel width of it, and then convert the pixel width to the text
> space width. That seems hardly ideal, though. I also do not think that the
> width of the character is guaranteed to be the same for two differing fonts,
> or a properties file listing the text space widths would be the easy
> solution.

Which version of PDFBox are you using?

> Please let me know your thoughts

I already found a similar problem with missing font widths (some of them were null). This was fixed last year in May. But obviously there are still some issues left.

Andreas Lehmkühler
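
For illustration only, the "general case" correction asked about in the quoted text could be sketched roughly as below. It is plain Java and deliberately does not call PDFBox. The class name WidthRepair and both method names are hypothetical. It assumes the widths list is indexed by character code minus the font's FirstChar (the quoted snippet assumes 32), and it substitutes the mean of the non-zero widths, which is only a crude placeholder. The actual advance width of a glyph would have to come from the embedded font program itself (for TrueType, the hmtx table).

```java
import java.util.ArrayList;
import java.util.List;

public class WidthRepair {

    /** Index into a widths list for a character code, given the font's FirstChar (assumed known). */
    public static int widthIndex(int charCode, int firstChar) {
        return charCode - firstChar;
    }

    /**
     * Returns a copy of the widths in which every null or non-positive entry
     * is replaced by the mean of the positive entries.
     * Assumption: a mean width is an acceptable stand-in; it is not the true glyph advance.
     */
    public static List<Float> repairZeroWidths(List<Float> widths) {
        float sum = 0f;
        int count = 0;
        for (Float w : widths) {
            if (w != null && w > 0f) {
                sum += w;
                count++;
            }
        }
        // If no entry is usable there is nothing to average; zeros are left as zeros.
        float fallback = count > 0 ? sum / count : 0f;

        List<Float> repaired = new ArrayList<>(widths.size());
        for (Float w : widths) {
            repaired.add(w == null || w <= 0f ? fallback : w);
        }
        return repaired;
    }

    public static void main(String[] args) {
        List<Float> widths = new ArrayList<>(List.of(500f, 0f, 700f));
        System.out.println(repairZeroWidths(widths));
    }
}
```

The repaired values would then be written back entry by entry in the same way as the single-letter fix in the quoted text. Note that this does not answer whether the zero widths occur for every font type.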
