On 25-Jan-07 Stephen Holland wrote:
> [...]
> Recently I was using my Mac's spotlight search engine to look for
> files through keyword searches and found I was having trouble
> locating pages. Also, when I copy text from a PDF so generated the
> text copies with odd spacing.
>
> This behavior seems to be related to kerning functions generated by
> groff. Following is an example of the problem.
> [...]

This is an interesting question! I hadn't been aware of this problem
before, so I have just carried out a test with my usual method of
accessing text in PDF files, using a PDF file generated from groff
(groff -> PS -> PDF via pstopdf [a tweaked version of ghostscript's
ps2pdf]).

The paragraph of text as displayed in the PDF file reads as follows.
Because of long lines, I have marked each line-break that I introduced
myself with " \" and begun its continuation on the next line below
with ">> ". All other ends of line are where the lines ended in the
PDF display:

  The usage defined in (1) ensures that the final \
  >> value $Y sub N$ of the CUSUM is zero, while the first value
  will be $X sub 1 - M$ and therefore relatively \
  >> small. Hence the plot of the CUM-
  SUM will (if there is a change in level) start off \
  >> heading in one direction, then reverse and head in the
  opposite direction. A suitable vertical \
  >> scaling can then be chosen to give the best effect. This
  method therefore ensures optimum \
  >> visibility of any change in level.

I have written "$Y sub N$" and "$X sub 1 - M$" to indicate (using eqn
code) that there is properly formatted mathematical printing at these
points.

The above paragraph was produced by groff without suppressing any of
its normal layout functions (line-filling with stretchable interword
spaces, hyphenation, kerning, ligatures, etc.).

Now for the Test! I opened a new text file for editing, and used the
Acrobat Reader "Text Selection" tool to highlight and copy blocks of
text from the PDF window to the text window using the mouse.

Here is what got copied into the text window (where I have made
line-breaks arbitrarily, but otherwise have changed nothing):

  The usage deåned in (1) ensures that the ånal value YN of the CUSUM
  is zero, while the årst value will be X1 - M and therefore
  relatively small. Hence the plot of the CUMSUM will (if there is a
  change in level) start off heading in one direction, then reverse
  and head in the opposite direction. A suitable vertical scaling can
  then be chosen to give the best effect. This method therefore
  ensures optimum visibility of any change in level.

So what has happened here, compared with the PDF display?

First, note that the ligature "fi" has come through as "å". This is
fair enough, since I allowed ligatures originally, and the result is
legitimately a single character; also, the text-window is using
iso-8859-1, so would not cope with unicode/utf8 characters anyway.

The rest, however, is (to my mind) surprisingly faithful to the intent
of the original. In particular, the hyphenated CUM-SUM (end of 2nd
line in PDF) has been reconstituted to a single word, as it originally
was.

What was printed (as formatted equations) as $Y sub N$ came out as
"YN", and $X sub 1 - M$ came out as "X1 - M", and I guess you can't
ask for better than that in mere text extraction. It's certainly good
enough to search for, if you're hunting for equations as well as
ordinary words.

In no case have separate words been run together, nor have any words
been split.

Finally, although there is plenty of kerning active in the PS file
from which the PDF file was derived (and therefore presumably carried
over to the PDF file), none of this shows up as any kind of break in
the text copied to the text file.

The only feature that could "break" a program indexing the output
might be the representation of the ligatures. This might not be a
problem anyway, in the context of a suitable locale; but in any case
it might be wise (when the document is intended to be used in this
way) to turn off the ligature feature in groff. Then that issue would
not arise at all, and the result would be perfectly readable in the
first place.

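If the PDF already exists and cannot be regenerated, the ligature
issue can also be dealt with after extraction. What follows is only a
rough sketch in Python, not something groff or Acrobat Reader
provides. It assumes the extractor delivers ligatures as the Unicode
presentation forms (U+FB00 to U+FB06), which NFKC normalization folds
back into plain letters. The function name normalize_ligatures and its
extra_map argument (for a font-specific stand-in such as the "å" seen
above) are hypothetical names made up for this illustration.

```python
import unicodedata


def normalize_ligatures(text, extra_map=None):
    """Fold typographic ligatures in extracted text back to plain letters.

    extra_map is a hypothetical hook: a dict mapping a stand-in character
    (for example the "å" that replaced "fi" in the test above) to the
    letters it stands for. It is only safe when the stand-in cannot occur
    legitimately in the text being indexed.
    """
    if extra_map:
        for stand_in, replacement in extra_map.items():
            text = text.replace(stand_in, replacement)
    # NFKC replaces compatibility characters such as U+FB01 (the "fi"
    # ligature) with their plain equivalents ("f" followed by "i").
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Ligature delivered as a Unicode presentation form.
    print(normalize_ligatures("The usage de\ufb01ned in (1)"))
    # Ligature delivered as a stand-in character, as in the test above.
    print(normalize_ligatures("the ånal value", extra_map={"å": "fi"}))
```

Note that NFKC also rewrites other compatibility characters (for
instance superscript digits), so it is better suited to building a
search index than to producing text for display. On the groff side,
the request ".lg 0" switches ligatures off at the source.
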
Thus, this little test shows that using the mouse to copy text from a
PDF file (displayed by Acrobat Reader) does not generate any of the
problems described by Stephen Holland.

This leads me to suspect that the comments by Gunnar and Ralph,
relating to the capacity of some PDF->text programs, may well be close
to the mark! My mouse (with its tail plugged into X Windows on Linux)
has very little brain, yet it has worked really well.

Hmmm.

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 25-Jan-07                                       Time: 20:39:40
------------------------------ XFMail ------------------------------

_______________________________________________
Groff mailing list
Groff@gnu.org
http://lists.gnu.org/mailman/listinfo/groff