We have used PDFBox (through our extension PDF2SVG) to extract vectors and
overlays from scientific images and, as part of this, to find "questionable
practice". As an example, here is a scientific graph:
http://www.slideshare.net/petermurrayrust/contentmining-in-neuroscience
(slides 22, 23 and 24). Slide 22 shows an acceptable spectrum, but analysis
of the vectors and layers in slide 23 shows a white square, which has been
used to obscure an impurity peak. The student involved confessed and has
probably ruined their career.
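
To illustrate the kind of check involved, here is a minimal sketch (not the
PDF2SVG code itself) that scans an SVG document for white-filled shapes drawn
on top of earlier shapes. It assumes the converter emits plain <rect> and
<path> elements with a "fill" attribute; the function name find_white_fills
and the sample SVG are hypothetical, and fills set through "style" attributes
or CSS are not handled.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
# Common spellings of opaque white; an assumption, not an exhaustive list.
WHITE = {"white", "#fff", "#ffffff", "rgb(255,255,255)"}


def is_white(value):
    """True if a fill attribute value is one of the known spellings of white."""
    return value is not None and value.replace(" ", "").lower() in WHITE


def find_white_fills(svg_text):
    """Return (index, tag, attributes) for each white-filled rect or path that
    is drawn after at least one other shape and so may hide what lies below."""
    root = ET.fromstring(svg_text)
    shapes = [el for el in root.iter()
              if el.tag in (SVG_NS + "rect", SVG_NS + "path")]
    hits = []
    for i, el in enumerate(shapes):
        # Index 0 is skipped: a white shape drawn first is usually a background.
        if i > 0 and is_white(el.get("fill")):
            hits.append((i, el.tag.replace(SVG_NS, ""), dict(el.attrib)))
    return hits


# Hypothetical example: a spectrum trace with a white rectangle drawn over it.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <path d="M0 90 L40 20 L50 60 L100 90" fill="none" stroke="black"/>
  <rect x="35" y="10" width="20" height="60" fill="#FFFFFF"/>
</svg>"""

for hit in find_white_fills(SAMPLE):
    print(hit)
```

A hit is only a candidate for inspection: white rectangles are also used
legitimately (for example behind legends), so each one still needs a human
to look at what it covers.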

The white square may even have been introduced *before* creating the PDF: if
rich graphical formats are imported into PDF, they often preserve all the
vectors and layers. The positive side is that we can often use this to
extract high quality data from PDFs, as long as the images have not been
mutilated into bitmaps. An example is the reconstruction of high quality
astronomical data from graphs: see slides 34, 36, 37 and 38.
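
As a rough sketch of why preserved vectors make this possible: once a curve
is available as page coordinates, two known tick marks per axis are enough to
map it back to data values. The code below assumes linear axes; the names
make_axis_map and polyline_to_data and all the numbers are hypothetical and
are not taken from the slides or from PDF2SVG.

```python
def make_axis_map(pix0, val0, pix1, val1):
    """Build a linear map from page coordinates to data values, calibrated
    from two tick marks at pix0 and pix1 with known values val0 and val1."""
    scale = (val1 - val0) / (pix1 - pix0)
    return lambda pix: val0 + (pix - pix0) * scale


def polyline_to_data(points, x_map, y_map):
    """Convert (x, y) page coordinates of a curve into data coordinates."""
    return [(x_map(px), y_map(py)) for px, py in points]


# Hypothetical calibration: x ticks at 50 and 450 stand for 0 and 10;
# y ticks at 300 and 100 stand for 0 and 5 (page y grows downwards).
x_map = make_axis_map(50.0, 0.0, 450.0, 10.0)
y_map = make_axis_map(300.0, 0.0, 100.0, 5.0)

curve = [(50.0, 300.0), (250.0, 200.0), (450.0, 100.0)]
data = polyline_to_data(curve, x_map, y_map)
print(data)
```

Logarithmic axes would need the same map applied to the logarithm of the
values, and a bitmap image offers no such coordinates to start from.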

If anyone is interested in data extraction from graphs using PDF2SVG, please
contact me offlist. It's alpha, but may be useful for those who are happy to
do some hacking.

On Sat, Oct 31, 2015 at 12:02 PM, Tilman Hausherr <[email protected]>
wrote:

> Am 31.10.2015 um 04:05 schrieb Sriram Varadharajan:
>
>> Is there any other alternative, like overlaying an opaque rectangle on top
>> of the rectangle box that has the data? I know the coordinates, as I use
>> them to extract the data from the PDF in the first place.
>>
>> I am also OK with filling the rectangles with dark colors. In the end I
>> need only the borders and no data.
>>
>
> Heh heh...:
>
> http://news.bbc.co.uk/2/hi/europe/4504589.stm
>
> Tilman
>
>> On Fri, Oct 30, 2015 at 7:11 PM, John Hewson <[email protected]> wrote:
>>
>>> This is a very hard thing to get right, especially if you have compliance
>>> needs.
>>> There are just so many ways that sensitive data could remain embedded in
>>> the resulting document.
>>>
>>> If you want my advice, don’t attempt this.
>>>
>>> — John
>>>
>>> On 30 Oct 2015, at 18:37, Sriram Varadharajan <[email protected]>
>>> wrote:
>>>
>>>> We are using PDFBox to process PDF that contains sensitive data .
>>>> Currently we don't store these PDF (even after encrypting) due to
>>>> security compliance . If there is an ability to strip the data out of
>>>> PDF we can save the file and we can use them for analytical purposes
>>>>
>>>> Question is  Does PDF box or any other utility out there gives the
>>>> ability to blank out all the Data in the PDF and just save the skeleton
>>>> alone ?
>>>> Please share any custom solutions or ideas if any !!
>>>>
>>>> Thanks
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
