We have used PDFBox (through our extension PDF2SVG) to extract vectors and overlays in scientific images and, as part of this, to find "questionable practice". As an example, here is a scientific graph: http://www.slideshare.net/petermurrayrust/contentmining-in-neuroscience (slides 22, 23, 24). Slide 22 shows an acceptable spectrum, but analysis of the vectors and layers in slide 23 shows a white square which has been used to obscure an impurity peak. The student involved confessed and has probably ruined their career.
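As a rough illustration of the kind of check involved (this is not the PDF2SVG code, just a minimal sketch): once a page has been converted to SVG, one can walk the elements in paint order and flag any white-filled rectangle that is drawn on top of vertices of an earlier curve. The sketch assumes the SVG contains plain `rect` and `polyline` elements with a `fill` attribute; real converter output may well use `path` elements and styles instead. The names `find_masking_rects` and `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
# Common spellings of an opaque white fill (assumption: fill is a plain attribute).
WHITE = {"white", "#fff", "#ffffff", "rgb(255,255,255)"}


def polyline_points(el):
    """Parse 'x,y x,y ...' from a polyline's points attribute."""
    pts = []
    for pair in el.get("points", "").split():
        x, y = pair.split(",")
        pts.append((float(x), float(y)))
    return pts


def rect_box(el):
    """Return (xmin, ymin, xmax, ymax) for a rect element."""
    x = float(el.get("x", 0))
    y = float(el.get("y", 0))
    return (x, y, x + float(el.get("width", 0)), y + float(el.get("height", 0)))


def find_masking_rects(svg_text):
    """Return boxes of white rects painted over earlier polyline vertices."""
    root = ET.fromstring(svg_text)
    seen_points = []  # vertices of curves painted so far
    hits = []
    # root.iter() yields elements in document order, which is SVG paint order.
    for el in root.iter():
        if el.tag == SVG_NS + "polyline":
            seen_points.extend(polyline_points(el))
        elif el.tag == SVG_NS + "rect":
            fill = el.get("fill", "").replace(" ", "").lower()
            if fill not in WHITE:
                continue
            x0, y0, x1, y1 = rect_box(el)
            if any(x0 <= px <= x1 and y0 <= py <= y1 for px, py in seen_points):
                hits.append((x0, y0, x1, y1))
    return hits


# Toy page: a spectrum with a small peak at (40, 30) hidden by a white box.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <polyline points="0,50 10,50 20,10 30,50 40,30 50,50" fill="none" stroke="black"/>
  <rect x="35" y="25" width="10" height="30" fill="#FFFFFF"/>
  <rect x="60" y="0" width="5" height="5" fill="white"/>
</svg>"""

print(find_masking_rects(SAMPLE))  # [(35.0, 25.0, 45.0, 55.0)]
```

Only the first rectangle is reported, because the second covers no part of the curve. A white rectangle is of course often legitimate (a background, for instance), so a hit is a prompt to look at the figure, not proof of anything.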
The white square may even have been introduced *before* the PDF was created: if rich graphical formats are imported into PDF they often preserve all the vectors and layers. The positive side is that we can often use this to extract high-quality data from PDFs, as long as the images have not been mutilated into bitmaps. An example is the reconstruction of high-quality astronomical data from graphs: see slides 34/36/37/38.

If anyone is interested in data extraction from graphs using PDF2SVG, contact me offlist. It's alpha but may be useful for those who are happy to do some hacking.

On Sat, Oct 31, 2015 at 12:02 PM, Tilman Hausherr <[email protected]> wrote:

> On 31.10.2015 at 04:05, Sriram Varadharajan wrote:
>
>> Is there any other alternative like overlaying an opaque rectangle on top
>> of the rectangle box that has the data. I know the coordinates as i use it
>> to extract the data from the PDF at the first place.
>>
>> I am also OK filling out rectangles with dark colors. At the end i need
>> only the borders and no data.
>
> Heh heh...:
>
> http://news.bbc.co.uk/2/hi/europe/4504589.stm
>
> Tilman
>
>> On Fri, Oct 30, 2015 at 7:11 PM, John Hewson <[email protected]> wrote:
>>
>>> This is a very hard thing to get right, especially if you have compliance
>>> needs. There are just so many ways that sensitive data could remain
>>> embedded in the resulting document.
>>>
>>> If you want my advice, don't attempt this.
>>>
>>> -- John
>>>
>>> On 30 Oct 2015, at 18:37, Sriram Varadharajan <[email protected]> wrote:
>>>
>>>> We are using PDFBox to process PDF that contains sensitive data.
>>>> Currently we don't store these PDF (even after encrypting) due to
>>>> security compliance. If there is an ability to strip the data out of
>>>> PDF we can save the file and we can use them for analytical purposes
>>>>
>>>> Question is Does PDF box or any other utility out there gives the
>>>> ability to blank out all the Data in the PDF and just save the
>>>> skeleton alone ? Please share any custom solutions or ideas if any !!
>>>>
>>>> Thanks
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

