https://bugs.documentfoundation.org/show_bug.cgi?id=158329
--- Comment #12 from خالد حسني <[email protected]> ---
(In reply to David Huggins-Daines from comment #11)
> (In reply to خالد حسني from comment #10)
> > instead have 2+
> > glyphs mapped to 2+ characters which requires /ActualText which in turn is
> > badly supported in PDF readers and lead to this and the duplicate bug.
>
> Hi! Thank you for tracking down this problem!
>
> In the case of the duplicate bug (#161514) I am not convinced that, as you
> say, "The PDF has valid character data". The problem there is that the
> character <02> is not mapped to anything in the ToUnicode CMap:

That is still a fully compliant and valid PDF, and all the character data is
valid. The use of ActualText is by design. Lack of support in PDF readers is an
unfortunate limitation, but so is the state of text extraction from PDF in
general. Using ActualText is unavoidable in general, although it can be avoided
in the particular cases here.

> The problem with /ActualText (aside from not being supported by any PDF
> readers except Acrobat...) is that there's no way to tell which characters
> in the /ActualText correspond to which characters in the text object, which
> becomes an issue for layout analysis and low-level text extraction in
> libraries like pdfminer/pdfplumber. I'm looking at implementing support for
> it there and this is a real stumbling block.

We use ActualText for the smallest range of glyphs that we can map to a range
of characters, so if an ActualText tag is used then we don’t have any
information that can tell which glyphs in this sequence belong to which
characters (this regression notwithstanding, of course).

When shaping text, there are 4 possible glyph to character relationships:
1. one glyph to one character: this is the common case and it can be handled
   by ToUnicode.
2. one glyph to many characters, AKA ligatures: this can also be handled by
   ToUnicode.
3. many glyphs to one character, AKA decomposition: this cannot be handled by
   ToUnicode and ActualText tags must be used.
4. many glyphs to many characters, which can happen in scripts that reorder
   input text. Again, this cannot be handled by ToUnicode and ActualText tags
   must be used.

On top of that, the ToUnicode mapping must be unique: a glyph can appear there
only once. Fonts might map different characters to the same glyph, and in this
case ToUnicode can be used for only one of these mappings, and all the others
will need ActualText.

The case here can be fixed. Using HarfBuzz cluster level 0 is not required, but
it was the quickest way to fix bug 151350 and I didn’t think about the
implications this has on PDF text extraction.
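
To illustrate the four relationships and the uniqueness constraint described above, here is a rough Python sketch. It is not LibreOffice's actual code: the `Cluster` type and the `plan_text_mapping` function are hypothetical names made up for this example, and the glyph IDs are arbitrary. It assumes the shaper has already grouped glyphs and characters into minimal clusters.

```python
from typing import Dict, List, NamedTuple, Tuple


class Cluster(NamedTuple):
    """Hypothetical shaping cluster: some glyph IDs and the text they render."""
    glyphs: Tuple[int, ...]
    text: str


def plan_text_mapping(
    clusters: List[Cluster],
) -> Tuple[Dict[int, str], List[Tuple[str, Cluster]]]:
    """Decide, per cluster, whether ToUnicode suffices or ActualText is needed."""
    to_unicode: Dict[int, str] = {}  # each glyph may appear here only once
    plan: List[Tuple[str, Cluster]] = []
    for cluster in clusters:
        if len(cluster.glyphs) != 1:
            # Cases 3 and 4: many glyphs to one or many characters.
            plan.append(("ActualText", cluster))
            continue
        # Cases 1 and 2: a single glyph mapped to one or many characters.
        glyph = cluster.glyphs[0]
        mapped = to_unicode.setdefault(glyph, cluster.text)
        if mapped == cluster.text:
            plan.append(("ToUnicode", cluster))
        else:
            # The glyph is already taken by a different mapping.
            plan.append(("ActualText", cluster))
    return to_unicode, plan


if __name__ == "__main__":
    example = [
        Cluster((10,), "a"),           # 1. one glyph, one character
        Cluster((20,), "fi"),          # 2. ligature
        Cluster((30, 31), "\u00e9"),   # 3. decomposition
        Cluster((40, 41), "xy"),       # 4. many to many
        Cluster((10,), "\u0430"),      # same glyph reused for another character
    ]
    mapping, decisions = plan_text_mapping(example)
    for method, cluster in decisions:
        print(method, cluster.glyphs, repr(cluster.text))
```

In this sketch the first mapping seen for a glyph wins the ToUnicode entry. That is an arbitrary choice made for the example, not a statement about how the exporter picks one.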
