On 24.08.2021 at 06:17, flywire wrote:
> With a bit of customisation, PDFBox should be able to parse pdf to md
> <https://www.markdownguide.org/cheat-sheet/>. This probably involves a
> process like PDFText2HTML.java
> <https://svn.apache.org/repos/asf/pdfbox/branches/2.0/tools/src/main/java/org/apache/pdfbox/tools/PDFText2HTML.java>,
> possibly just modifying that processor, but I'm open to advice.
>
> I can find tutorials on how to program in Java but I'd like to know the
> approach (how to go about it) with PDFBox. A lot of syntax is just matching
> patterns, an approach that lets me use leaflet.js without knowing js.
> Hopefully, any code given in an answer is explained clearly enough so I can
> understand it.

The problem is that this would require code changes: the command-line
utilities only cover some mainstream requirements, and PDF-to-Markdown
conversion is not one of them.

Some of your requirements (separating lines) can probably be met by
using the methods of PDFTextStripper. Have a look at its javadoc. And
yes, PDFText2HTML.java does some of this, so it is a reasonable model
for a Markdown variant.
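
To give an idea of the pattern-matching part, here is a minimal sketch
in plain Java (no PDFBox imports, not tested against real PDFs). It
assumes you already have each line's text and font size; in PDFBox 2.0
those would typically come from overriding
writeString(String, List<TextPosition>) in a PDFTextStripper subclass
and reading TextPosition.getFontSizeInPt(). The names MarkdownLines and
Line, the size thresholds and the bullet rule are all made up for
illustration, they are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps (text, font size) pairs to Markdown lines. */
final class MarkdownLines {

    /** One extracted line of text plus the font size it was drawn in. */
    static final class Line {
        final String text;
        final float fontSize;

        Line(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Converts one line to Markdown. bodySize is the font size of normal
     * paragraph text; the 1.5 and 1.2 ratios are arbitrary assumptions.
     */
    static String toMarkdown(Line line, float bodySize) {
        String text = line.text.trim();
        if (text.isEmpty()) {
            return "";
        }
        float ratio = line.fontSize / bodySize;
        if (ratio >= 1.5f) {
            return "# " + text;      // much larger than body text: top-level heading
        }
        if (ratio >= 1.2f) {
            return "## " + text;     // somewhat larger: second-level heading
        }
        if (text.startsWith("\u2022")) {
            // A leading bullet character becomes a Markdown list item.
            return "- " + text.substring(1).trim();
        }
        return text;                 // everything else is left as plain text
    }

    /** Converts all lines and joins them with newlines. */
    static String render(List<Line> lines, float bodySize) {
        StringBuilder sb = new StringBuilder();
        for (Line line : lines) {
            sb.append(toMarkdown(line, bodySize)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Line> lines = new ArrayList<>();
        lines.add(new Line("Introduction", 24f));
        lines.add(new Line("Background", 15f));
        lines.add(new Line("\u2022 first point", 12f));
        lines.add(new Line("Ordinary paragraph text.", 12f));
        System.out.print(render(lines, 12f));
    }
}
```

The point of keeping the rules in one small method is that you can
adjust them by trial and error against your own PDFs without touching
the extraction code. PDFText2HTML.java follows the same general idea,
only it emits HTML tags instead of Markdown markers.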
Tilman