On 06/24/2024 12:35 AM, Richard wrote:
Hello,
this very much depends on what you are expecting it to do. In general, PDFs
are only meant to be viewed - and printed - they were never meant for
anything else. ...

Second sentence should read:
... only meant to be viewed by those with *NORMAL* vision ...

I'm attempting to read a USDA document.[1]
The printed version of this document is marginally readable.

Tools such as "Atril Document Viewer" provide selected magnification.
For this particular document and monitor, 150% is comfortable, but at that magnification reading the whole document requires re-positioning the viewport 500 to 600 times.

For _this_ document, Atril can select all the text on a page, and the selection pastes into a Pluma document in a "reasonable" form.

It will:
   a. ignore actual graphics.
   b. put title/headings/??? on a separate line.
   c. treat all text between full page-width title/headings/??? as a
      logical unit.
It will not:
   1. put a blank line between paragraphs.
   2. put a blank line above/below lines containing title/headings/???.
   3. identify superscripts in some manner.

All this suggests that it should be possible to extract the text from a PDF and create an HTML document, likely using only <p>, <br>, <sup>, and <li> in its <body>.
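
As a rough illustration of that idea, here is a minimal Python sketch that turns text pasted from a viewer into simple HTML. It is only a sketch under assumptions of mine: the function names, the "short line without closing punctuation is a heading" rule, and the bullet characters are all hypothetical heuristics, not anything Atril or another tool actually does. It emits only <p> and <li>; plain pasted text carries no superscript information, so <sup> is not attempted, and since the paste has no blank lines between paragraphs, everything between two headings ends up in one <p>.

```python
import html
import re

# Assumption: list items in the pasted text start with one of these markers.
BULLET = re.compile(r"^[•*-]\s+")


def is_heading(line, max_len):
    """Hypothetical heuristic: a short line with no closing punctuation."""
    return len(line) <= max_len and not line.endswith((".", ",", ";", ":", "?", "!"))


def text_to_html(text, max_heading_len=60):
    """Convert pasted plain text into a small HTML document."""
    parts = []  # finished HTML fragments, in order
    para = []   # escaped lines of the paragraph being collected

    def flush():
        # Close the paragraph collected so far, if any.
        if para:
            parts.append("<p>" + " ".join(para) + "</p>")
            para.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            flush()  # a blank line, when present, ends the paragraph
            continue
        bullet = BULLET.match(line)
        if bullet:
            flush()
            parts.append("<li>" + html.escape(line[bullet.end():]) + "</li>")
        elif is_heading(line, max_heading_len):
            flush()
            parts.append("<p>" + html.escape(line) + "</p>")
        else:
            para.append(html.escape(line))
    flush()
    return "<html><body>\n" + "\n".join(parts) + "\n</body></html>"


if __name__ == "__main__":
    sample = (
        "Thrifty Food Plan, 2021\n"
        "This first line of body text is long enough that it is not mistaken for\n"
        "a heading.\n"
        "• a list item\n"
    )
    print(text_to_html(sample))
```

The heading threshold would need tuning per document, and strictly valid HTML would also want the <li> elements wrapped in a list container.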


[1] https://fns-prod.azureedge.us/sites/default/files/resource-files/TFP2021.pdf
    _Thrifty Food Plan, 2021_
    Food and Nutrition Service
    August 2021
    FNS-916
