RE: [Groff] On copying text from PDF files that started with gro

Ted Harding Thu, 25 Jan 2007 12:39:58 -0800

On 25-Jan-07 Stephen Holland wrote:
> [...]
> Recently I was using my Mac's spotlight search engine to look for  
> files through keyword searches and found I was having trouble  
> locating pages.  Also, when I copy text from a PDF so generated the  
> text copies with odd spacing.
> 
> This behavior seems to be related to kerning functions generated by  
> groff.  Following is an example of the problem.
> [...]


This is an interesting question! I hadn't been aware of this problem
before, so have just carried out a test with my usual method of
accessing text in PDF files, with a PDF file generated from groff
(groff -> PS -> PDF via pstopdf [a tweaked version of ghostscript's
ps2pdf]).

The paragraph of text as displayed in the PDF file reads as follows.
Because the lines are long, wherever I have introduced my own line
break I have marked it with " \" and begun the continuation on the
next line with ">>  ". All other ends of line are where the lines
ended in the PDF display:

  The usage defined in (1) ensures that the final \
>>  value $Y sub N$ of the CUSUM is zero, while the
  first value will be $X sub 1 - M$ and therefore relatively \
>>  small. Hence the plot of the CUM-
  SUM will (if there is a change in level) start off \
>>  heading in one direction, then reverse and
  head in the opposite direction. A suitable vertical \
>>  scaling can then be chosen to give the
  best effect. This method therefore ensures optimum \
>>  visibility of any change in level.

I have also written "$Y sub N$" and "$X sub 1 - M$" to indicate
(using eqn code) that there is properly formatted mathematical
printing at these points.

The above paragraph was produced by groff without suppressing
any of its normal layout functions (line-filling with stretchable
interword spaces and hyphenation, kerning, and ligatures, etc.).

Now for the Test!

I opened a new text file for editing, and used the Acrobat
Reader "Text Selection" tool to enable me to highlight and
copy blocks of text from the PDF window to the text window
using the mouse. Here is what got copied into the text window
(where I have made line-breaks arbitrarily, but otherwise
have changed nothing):

  The usage deåned in (1) ensures that the ånal value YN of
  the CUSUM is zero, while the årst value will be X1 - M
  and therefore relatively small. Hence the plot of the
  CUMSUM will (if there is a change in level) start off
  heading in one direction, then reverse and head in the
  opposite direction. A suitable vertical scaling can then
  be chosen to give the best effect. This method therefore
  ensures optimum visibility of any change in level.

So what has happened here, compared with the PDF display?

First, note that the ligature "fi" has come through as "å".
This is fair enough, since I allowed ligatures originally,
and the result is legitimately a single character; also,
the text-window is using ISO-8859-1, so would not cope with
Unicode/UTF-8 characters anyway.

The rest, however, is (to my mind) surprisingly faithful to
the intent of the original.

In particular, the hyphenated CUM-SUM (end of 2nd line in PDF)
has been reconstituted to a single word, as it originally was.

What was printed (as formatted equations) as

  $Y sub N$ came out as "YN"

and as

  $X sub 1 - M$ came out as "X1 - M"

and I guess you can't ask for better than that in mere text
extraction. It's certainly good enough to search for, if
you're hunting for equations as well as ordinary words.

In no case have separate words been run together, nor have
any words been split.

Finally, although there is plenty of kerning active in the
PS file from which the PDF file was derived (and therefore
presumably carried over to the PDF file), none of this shows
up as any kind of break in the text copied to the text file.

The only feature that could "break" a program indexing the
output might be the representation of the ligatures. This
might not be a problem anyway, in the context of a suitable
locale; but in any case it might be wise (when the document
is intended to be used in this way) to turn off the ligature
feature in groff. Then that issue would not arise at all,
and the result would be perfectly readable in the first place.
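
A minimal sketch of turning the ligature feature off, assuming the
standard troff/groff request ".lg" (ligature mode) is placed near the
top of the source, before any text is set. Where exactly it belongs
relative to a given macro package's initialisation is an assumption
here, not something tested above:

```groff
.\" Turn ligature mode off, so that "fi", "fl", etc. are set as
.\" separate letters and should then copy out of the PDF as
.\" ordinary characters.
.lg 0
.\" ... document text follows ...
.\" ".lg 1" would switch ligatures back on later in the document.
```

Kerning and hyphenation are left untouched by this request, which
matches the observation above that only the ligatures caused any
trouble in the copied text.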

Thus, this little test shows that using the mouse to copy
text from a PDF file (displayed by Acrobat Reader) does not
generate any of the problems described by Stephen Holland.

This leads me to suspect that the comments by Gunnar and
Ralph, relating to the capacity of some PDF->text programs,
may well be close to the mark!

My mouse (with its tail plugged into X Windows on Linux)
has very little brain, yet it has worked really well.

Hmmm.
Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 25-Jan-07                                       Time: 20:39:40
------------------------------ XFMail ------------------------------


_______________________________________________
Groff mailing list
Groff@gnu.org
http://lists.gnu.org/mailman/listinfo/groff
