Hi,

[email protected] wrote:
If you know where everything you want to highlight is, it is not
too much work. It is a lot of work to extract this information.

What do you mean by "where everything is"? Is it enough to know the
"word", or do I have to know the exact coordinates, height and width
of every word I want to highlight?

Depends... In many cases it is easy to find a word you are searching for in the PDF (e.g. if it is short and not often broken over lines, columns or pages), but it can be difficult for long names. Imagine trying to find "Dihydrogen monoxide" in a small table, where it may be broken into

Di-
hydrogen
mono-
xide

or even over pages.
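
To make the line-break case concrete, here is a minimal sketch in Python. It assumes the text has already been extracted from the PDF into a plain string with newlines preserved. The names broken_word_pattern and find_term are hypothetical helpers made up for this illustration; they are not part of PDFBox or of any tool mentioned in this thread. The idea is to allow an optional "hyphen plus line break" between any two letters of the search term and any whitespace between its words.

```python
import re

# Optional end-of-line hyphenation: "-", trailing blanks, newline, leading blanks.
BREAK = r"(?:-[ \t]*\n\s*)?"


def broken_word_pattern(term):
    """Build a regex matching `term` even if it is hyphenated across lines."""
    word_patterns = []
    for word in term.split():
        # Allow a hyphenated line break between any two characters of a word.
        word_patterns.append(BREAK.join(re.escape(ch) for ch in word))
    # Words may be separated by any whitespace, including line breaks.
    return re.compile(r"\s+".join(word_patterns), re.IGNORECASE)


def find_term(text, term):
    """Return the (start, end) character span of the first match, or None."""
    match = broken_word_pattern(term).search(text)
    return match.span() if match else None


if __name__ == "__main__":
    extracted = "Di-\nhydrogen\nmono-\nxide\n"
    print(find_term(extracted, "Dihydrogen monoxide"))
```

This only gives character offsets in the extracted string, not coordinates on the page, and it does not cope with text from neighbouring table columns or page headers and footers being interleaved with the term, which is where the real work starts.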


Did you use PDFBox to extract the
text or another (open source) tool?

We are using several different tools -- perhaps I'll write a comparison in some years ;-).


Perhaps you can get an idea of the problems from a poster we presented at a workshop:
http://dx.doi.org/10.1038/npre.2009.3141.1
Thanks for the link. It seems that you have really invested a lot of
time in this topic.

It's already more than I expected when I started ;-).

Regards,
 Roman

