Hi,
[email protected] wrote:
If you know where everything you want to highlight is, it is not
too much work. It is a lot of work to extract this information.
What do you mean by "where everything is"? Is it enough to know the
"word" or do I have to know the exact coordinates, height and width
of every word I want to highlight?
Depends... in many cases it is easy to find a word you search for in the
PDF (e.g. if it is short and not often broken over lines, columns or
pages), but it can be difficult for long names (imagine trying to find
"Dihydrogen monoxide" in a small table), e.g. broken into
Di-
hydrogen
mono-
xide
or even over pages.
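
To illustrate the line-break problem, here is a minimal sketch in Python. It is not taken from any of the tools discussed in this thread. It assumes the text has already been extracted to a plain string with newlines preserved, and the names `build_pattern` and `find_spans` are made up for this example. It only locates the term in the extracted text; mapping the match back to page coordinates would still need position data from the extraction tool, and breaks across columns or pages are not handled.

```python
import re

# Optional end-of-line hyphenation: a hyphen, trailing blanks, a newline,
# then any leading whitespace on the next line.
_BREAK = r"(?:-[ \t]*\n\s*)?"


def build_pattern(term):
    """Build a regex that matches `term` even if its words are hyphenated
    across line breaks or separated by newlines instead of spaces."""
    words = []
    for word in term.split():
        # Allow a hyphenated line break between any two characters of a word.
        words.append(_BREAK.join(re.escape(ch) for ch in word))
    # Words may be separated by any whitespace, including newlines.
    return re.compile(r"\s+".join(words), re.IGNORECASE)


def find_spans(term, text):
    """Return (start, end) character offsets of every match in `text`."""
    return [m.span() for m in build_pattern(term).finditer(text)]


# The broken table cell from the example above.
SAMPLE = "Di-\nhydrogen\nmono-\nxide"
SPANS = find_spans("Dihydrogen monoxide", SAMPLE)
```

Allowing a break between every pair of characters is deliberately permissive, since the extracted text alone does not show where a hyphenation rule would have split the word. The cost is occasional false positives on terms that legitimately contain a hyphen at a line end.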
Did you use PDFBox to extract the
text or another (open source) tool?
We are using several different tools -- perhaps I'll write a comparison
in some years ;-).
Perhaps you can get an idea of the problems from a poster we presented
at a workshop:
http://dx.doi.org/10.1038/npre.2009.3141.1
Thanks for the link. Seems that you really invested a lot of time in
this topic.
It's already more than I expected when I started ;-).
Regards,
Roman