[Bug 158329] Can't find text with Niqqud in exported PDF

bugzilla-daemon Mon, 30 Sep 2024 11:52:32 -0700

https://bugs.documentfoundation.org/show_bug.cgi?id=158329


--- Comment #12 from ⁨خالد حسني⁩ <[email protected]> ---
(In reply to David Huggins-Daines from comment #11)
> (In reply to ⁨خالد حسني⁩ from comment #10)
> > instead have 2+
> > glyphs mapped to 2+ characters which requires /ActualText which in turn is
> > badly supported in PDF readers and lead to this and the duplicate bug.
> 
> Hi!  Thank you for tracking down this problem!
> 
> In the case of the duplicate bug (#161514) I am not convinced that, as you
> say, "The PDF has valid character data".  The problem there is that the
> character <02> is not mapped to anything in the ToUnicode CMap:

That is still a fully compliant and valid PDF, and all of its character data is
valid too. The use of ActualText is by design; the lack of support in PDF
readers is an unfortunate limitation, but so is the state of text extraction
from PDF in general.

Using ActualText is unavoidable in general, although it can be avoided in the
particular cases here.


> The problem with /ActualText (aside from not being supported by any PDF
> readers except Acrobat...) is that there's no way to tell which characters
> in the /ActualText correspond to which characters in the text object, which
> becomes an issue for layout analysis and low-level text extraction in
> libraries like pdfminer/pdfplumber.  I'm looking at implementing support for
> it there and this is a real stumbling block.

We use ActualText for the smallest range of glyphs that we can map to a range
of characters, so if an ActualText tag is used, then we don’t have any
information that can tell which glyphs in this sequence belong to which
characters (this regression notwithstanding, of course).
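
As a rough illustration of what "smallest range" means here (this is not
LibreOffice's actual code), the sketch below groups shaped glyphs by their
cluster value. It assumes left-to-right text and a hypothetical list of
per-glyph cluster values, each being the index of the first character the
glyph's cluster covers; the function name and the data layout are made up for
the example.

```python
def group_by_cluster(clusters, text_len):
    """Group glyphs that share a cluster value.

    clusters: per-glyph cluster values, non-decreasing (LTR assumption).
    text_len: number of characters in the shaped text.
    Returns a list of ((glyph_start, glyph_end), (char_start, char_end))
    pairs, each range half-open.
    """
    groups = []
    i = 0
    n = len(clusters)
    while i < n:
        j = i
        # Extend the glyph range while the cluster value stays the same.
        while j < n and clusters[j] == clusters[i]:
            j += 1
        # The character range ends where the next cluster starts.
        char_end = clusters[j] if j < n else text_len
        groups.append(((i, j), (clusters[i], char_end)))
        i = j
    return groups


# Four glyphs over five characters: glyphs 1 and 2 share one cluster.
example = group_by_cluster([0, 1, 1, 3], text_len=5)
```

Inside a group with more than one glyph there is nothing left that says which
glyph belongs to which character; the group is the finest granularity
available.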

When shaping text, there are 4 possible glyph-to-character relationships:
1. one glyph to one character: this is the common case and it can be handled by
ToUnicode.
2. one glyph to many characters, AKA ligatures: this can also be handled by
ToUnicode.
3. many glyphs to one character, AKA decomposition: this cannot be handled by
ToUnicode and ActualText tags must be used.
4. many glyphs to many characters, which can happen in scripts that reorder
input text. Again, this cannot be handled by ToUnicode and ActualText tags
must be used.
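
The four cases above can be sketched as a small decision function. This is
only an illustration of the list, not code from LibreOffice or any PDF
library; the function name and the returned labels are made up.

```python
def classify_cluster(num_glyphs, num_chars):
    """Return (relationship, mechanism) for one glyph/character group."""
    if num_glyphs < 1 or num_chars < 1:
        raise ValueError("a group needs at least one glyph and one character")
    if num_glyphs == 1:
        # A single glyph can carry any number of characters in ToUnicode.
        kind = "one-to-one" if num_chars == 1 else "one-to-many (ligature)"
        return kind, "ToUnicode"
    # Several glyphs cannot share one ToUnicode entry.
    if num_chars == 1:
        return "many-to-one (decomposition)", "ActualText"
    return "many-to-many (reordering)", "ActualText"


ligature = classify_cluster(1, 2)
decomposition = classify_cluster(2, 1)
```

In other words, whether ActualText is needed depends only on whether the group
contains more than one glyph.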

On top of that, a ToUnicode mapping must be unique: a glyph can appear there
only once. But fonts might map different characters to the same glyph, and in
this case ToUnicode can be used for only one of these mappings, and all the
others will need ActualText.
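
A minimal sketch of that uniqueness constraint, assuming a "first mapping
wins" policy (that policy is an assumption made for the example, as are the
function name and the glyph IDs):

```python
def split_mappings(runs):
    """Split (glyph_id, text) pairs into ToUnicode entries and leftovers.

    runs: iterable of (glyph_id, text) pairs in document order.
    Returns (to_unicode, needs_actual_text).
    """
    to_unicode = {}
    needs_actual_text = []
    for glyph_id, text in runs:
        # Assumed policy: the first text seen for a glyph claims its entry.
        existing = to_unicode.setdefault(glyph_id, text)
        if existing != text:
            # Same glyph, different characters: ToUnicode cannot express it.
            needs_actual_text.append((glyph_id, text))
    return to_unicode, needs_actual_text


# Hypothetical font where Latin A and Greek Alpha share glyph 5.
table, leftovers = split_mappings(
    [(5, "A"), (5, "\u0391"), (7, "fi"), (5, "A")]
)
```

Here glyph 5 keeps "A" in ToUnicode, and the occurrence that stands for Greek
Alpha has to be wrapped in an ActualText span.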

The case here can be fixed. Using HarfBuzz cluster level 0 is not required, but
it was the quickest way to fix bug 151350 and I didn’t think about the
implications this has on PDF text extraction.

-- 
You are receiving this mail because:
You are the assignee for the bug.
