Well, I did write in https://lists.debian.org/debian-user/2016/09/msg00653.html that "This is one area where a bit of experimentation will help much more than trying to understand the scattered documentation."
On Tue 20 Sep 2016 at 15:08:58 (+0100), Brian wrote:
> On Mon 19 Sep 2016 at 22:41:23 -0500, David Wright wrote:
>
> > On Sun 18 Sep 2016 at 16:14:37 (-0400), Haines Brown wrote:
> > > I've begun to experience problems using the mouse to select a passage in
> > > a PDF displayed with xpdf 3.03-10 in order to paste it elsewhere.
> > >
> > > The ends of lines are truncated to varying degrees. For example in a
> > > PDF with this:
> > >
> > > 123456789
> > > 123456789
> > > 1234567
> > >
> > > The paste might look like
> > >
> > > 12345678
> > > 1234567
> > > 123456
> >
> > Can you confirm that dragging your mouse produces a black rectangle,
> > and that the rectangle has the last digits (the ones that get lost)
> > highlighted thus.
>
> Could be a possible cause. My mouse skills aren't brilliant and not
> precisely positioning the rectangle has often led to my having to redo
> the copying.
>
> What could also be tried is a search for '123456789'. Searching is just
> another form of text extraction. If it cannot be found a string cannot
> be copied correctly after highlighting it.

That's a good idea, and it seems to correlate with pdftotext's
behaviour but is much quicker.

> > My own experience is all or nothing. What I get correlates with the
> > output of pdftotext; if that can extract the text, I can copy it
> > with the mouse, if not then I can't. PDFs I produce with paps, for
> > example, don't work: I don't know why this is the case.
>
> How do you produce a PDF using paps?

Sorry, missed out a step. The paps output is filtered through ps2pdf
so that could explain a lot. Thanks for reminding me. (The clue is in
the name!)

> > Actually, there is a third case: the pasted text is garbage. I think
> > this happens if the fonts are stripped of unused glyphs and then
> > packed into the minimum number of fonts to save memory. I may be
> > wrong here, though.
>
> One table in a PDF stores character shapes (glyphs). This table is used
> by mupdf (say) to draw the page.
> mupdf does this without knowing that it
> is text; it is interested only in the shapes.
>
> A second table (the ToUnicode map) is used to work out what the text
> says. The first table says that first shape in the word "Debian" looks
> like a "D". The second table says that that shape has a particular
> unicode value.
>
> A defective or missing ToUnicode map has mupdf having no idea what the
> shapes mean, although it will render them correctly on the screen
> or in print. So it resorts to a default mapping. The result is garbage
> for copy/paste. However, it can be logical garbage; every "D" becomes
> "X", every "b" a "P" etc. When searching, the string being looked for
> will not be found. ("Debian" is "XGP?yL", for example).
>
> https://github.com/angea/PDF101/tree/master/handcoded/textextract
>
> is of interest.

Useful reference, thanks.

> > > Evince apparently does not support selecting text for copying. This does
> > > not happen on other machines.
> >
> > My experience here is similar to xpdf but with a few differences: when
> > it works (the same files do), the selection is line by line (ie like
> > an xterm) rather than a strict rectangle; if it can't do it, it
> > doesn't highlight (whereas xpdf "lies": it highlights but fails to
> > copy); the highlighting may be coloured (white→blue, black→white) or
> > black (which hides the text).
>
> Evince seems to be aware if *all* the text is not copiable and will then
> not allow it to be selected. It does not appear to be aware when only
> portions of a document are not copiable/searchable and these portions
> are selectable.

Well, man xpdf says baldly "Dragging the mouse with the left button
held down will highlight an arbitrary rectangle." I guess I hadn't
realised just how bald that rectangle can be.

It's tedious ascertaining anything about xpdf in the "jessie period"
because so much of it is broken; I have to repeat everything in wheezy
to make sure the problem is ephemeral.
(Will these problems go away?)

Cheers,
David.
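
P.S. For anyone following along, here is a toy Python sketch of the
two-table idea Brian describes above. It is only an illustration: the
character codes, the map and the function names are all invented, the
"default mapping" is assumed to be a plain Latin-1 reading of each code,
and no real PDF viewer or library is being modelled. It just shows why
a page can render correctly yet paste as consistent garbage and fail a
search.

```python
# Toy model of the two tables Brian describes. All codes and mappings
# here are invented for illustration; real PDF fonts are more involved.

# Character codes as they might appear on the page for the word "Debian"
# after the font has been subsetted and re-encoded (an assumption).
page_codes = [ord(c) for c in "XGP?yL"]

# Hypothetical ToUnicode map: character code -> the text it stands for.
to_unicode = dict(zip(page_codes, "Debian"))


def extract_text(codes, cmap=None):
    """Return the text for codes, using cmap when one is available.

    Without a map, fall back to reading each code as a Latin-1
    character (the assumed default mapping), which gives the same
    wrong letter every time a given shape appears.
    """
    if cmap is None:
        return "".join(chr(code) for code in codes)
    return "".join(cmap.get(code, chr(code)) for code in codes)


def can_find(needle, codes, cmap=None):
    """Searching is just another form of text extraction."""
    return needle in extract_text(codes, cmap)


if __name__ == "__main__":
    print(extract_text(page_codes, to_unicode))        # Debian
    print(extract_text(page_codes))                    # XGP?yL
    print(can_find("Debian", page_codes, to_unicode))  # True
    print(can_find("Debian", page_codes))              # False
```

The drawing side never consults to_unicode at all, which is why the
page still looks right on screen in both cases.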