On 06/24/2024 12:35 AM, Richard wrote:
Hello,
this very much depends on what you are expecting it to do. In general, PDFs
are only meant to be viewed - and printed - they were never meant for
anything else. ...
Second sentence should read:
... only meant to be viewed by those with *NORMAL* vision ...
I'm attempting to read a USDA document.[1]
The printed version of this document is marginally readable.
Tools such as "Atril Document Viewer" provide selected magnification.
For this particular document and monitor, 150% is comfortable, but reading
the document then requires re-positioning the viewpoint 500 to 600 times.
For _this_ document, Atril can select all the text on a page so that it
pastes into a Pluma document in a "reasonable" form.
It will:
a. ignore actual graphics.
b. put title/headings/??? on a separate line.
c. treat all text between full page-width title/headings/??? as a
logical unit.
It will not:
1. put a blank line between paragraphs.
2. put a blank line above/below lines containing title/headings/???.
3. identify superscripts in some manner.
All this suggests that it should be able to extract text from a PDF and
create an HTML document, likely using only <p>, <br>, <sup>, and <li> in
its <body>.
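As a rough illustration of the idea, here is a minimal Python sketch that
takes text already pasted out of a viewer and wraps it in <p> and <li>
elements. Everything in it is an assumption of mine, not something Atril or
any PDF tool does: the function name text_to_html, the bullet markers, and
the guess that a line noticeably shorter than the longest line ends a
paragraph. Superscripts are left out because the pasted text does not mark
them.

```python
import html

# Assumed bullet markers; the real document may use others.
BULLETS = ("• ", "- ", "* ")


def text_to_html(text, short_ratio=0.8):
    """Wrap pasted page text in <p> and <li> elements (heuristic sketch)."""
    lines = [raw.strip() for raw in text.splitlines()]
    # Longest line approximates the wrapped column width.
    width = max((len(line) for line in lines), default=0)
    parts = []
    paragraph = []  # escaped lines of the paragraph being collected

    def flush():
        if paragraph:
            parts.append("<p>" + " ".join(paragraph) + "</p>")
            paragraph.clear()

    for line in lines:
        if not line:
            # A blank line, if present, always ends the paragraph.
            flush()
            continue
        if line.startswith(BULLETS):
            flush()
            parts.append("<li>" + html.escape(line[2:].strip()) + "</li>")
            continue
        paragraph.append(html.escape(line))
        # Assumption: a short line is a heading or the last line of a paragraph.
        if len(line) < short_ratio * width:
            flush()
    flush()
    return "\n".join(parts)
```

Calling text_to_html() on the text pasted from one page returns a string
that can be placed inside <body>. The short-line guess will misfire on some
pages, so the result would still need a quick read-through.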
[1]
https://fns-prod.azureedge.us/sites/default/files/resource-files/TFP2021.pdf
_Thrifty Food Plan, 2021_
Food and Nutrition Service
August 2021
FNS-916