Hi Tilman!
Thank you very much for your attention!
You can find the file "p4_alt.pdf" in this folder
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
The file "Extra infos.pdf" shows some output from PDFBox Debugger and other tools.
I'm sorry: I sent the PDF file as an attachment in my first message
because I didn't know that attachments wouldn't work.
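
To make the question more concrete, here is a minimal sketch of the two readings of <004200430044> that would produce the two outputs. It assumes the font uses 2-byte character codes, and the lookup table in it is hypothetical: I made it up to reproduce "abc", and it is not copied from the file's /ToUnicode CMap. I also don't know what the other tools really do internally; this only illustrates the difference I am asking about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HexStringCodes {

    // Split the body of a PDF hex string (without the < >) into 2-byte codes.
    // Assumption: every character code in this font is exactly 2 bytes long.
    static List<Integer> splitCodes(String hex) {
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    // Reading 1: treat each character code as if it were the Unicode code point.
    static String decodeDirect(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    // Reading 2: look each code up in a ToUnicode-style table.
    // Codes missing from the table become U+FFFD (replacement character).
    static String decodeWithTable(List<Integer> codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = splitCodes("004200430044");

        // Hypothetical table, invented for illustration only.
        Map<Integer, String> hypotheticalToUnicode =
                Map.of(0x0042, "a", 0x0043, "b", 0x0044, "c");

        System.out.println(decodeDirect(codes));
        System.out.println(decodeWithTable(codes, hypotheticalToUnicode));
    }
}
```

If the real ToUnicode CMap in the file looks like the hypothetical table, then the second reading would explain the PDFBox result, and the question becomes why the other tools do not use it.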
On Thu, Mar 14, 2024 at 07:16, Tilman Hausherr <[email protected]>
wrote:
> Hi,
>
> please upload your file to a sharehoster.
>
> Tilman
>
> On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
> > Hi everyone,
> >
> > I'm not sure if this is the same as FAQ "How come I am getting
> > gibberish(G38G43G36G51G5) when extracting text?"...
> >
> > I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
> > (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
> >
> > I'm trying to understand how this PDF chunk (from the attached
> > p4_fix.pdf)
> >
> > BT
> > /G1F7 6.0 Tf
> > 94.871 773.806 Td
> > <004200430044> Tj
> > ET
> >
> > becomes "BCD" in PDFBox Debugger (and likewise in qpdfview, Adobe
> > Reader, Chrome, ...) but becomes "abc" in the PDFBox text extraction tool.
> >
> > Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
> >
> > The renderers that allow me to copy the text also give me "BCD".
> >
> > It seems that the PDFBox extraction tool follows section "9.10.2
> > Mapping character codes to Unicode values" (ISO 32000-2:2020), while all
> > the others take a different approach.
> >
> > Could you help me understand whether the problem is in the
> > PDF file, in the renderers, or in the text extraction tool?
> >
> > Thank you!
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]