On Oct 15, 2009, at 3:43 AM, Biedermann, Jürgen wrote:
Hi there,
I'm trying to decide whether it would be possible to transform several
more or less complex PDF files into an R table format, or whether it
has to be done manually. I think it would be impudent to expect a
complete solution, but I would be grateful if anyone could give me
advice on what the structure of such an R program could look like,
and whether it's possible in general.
Here is the problem:
Each PDF file belongs to a person. The PDF files actually represent
the anonymized criminal record of a person. Each entry should lead to
one row, with the person number as the key. The different lines should
form the columns. A criminal record looks like this:
---------------------------------------------------
Header with irrelevant text for us | Date: xx.xx.xxxx (relevant for us)
Anonymous person number: xxxxxxxxxxx
Entries in the register
1. xx.xx.1902 -City-
Be in force since: xx.xx.1902
Date of offense: xx.xx.xxxx
Elements of the offence: For example "Rape"
Section in law: §176, §178 Abs. 1
Sentenced to 5 years imprisonment
"Irrelevant text for us"
Accommodation in a forensic psychiatric facility
Accommodation sentenced on probation
Rest of sentence suspended on probation until xx.xx.xxxx
2. xx.xx.1910
Be in force since: ....
.....
-----------------------------------------------------------------------
The problem is that the entries do not always have the same
structure. The first six lines are structurally the same in each entry
of the criminal record (each entry has a line for the judgement
date, the "be in force" date, the date of offense, the elements of
the offense, the sections in law, and the sentence).
But then, depending on the sentence, different lines emerge which
contain information on whether the person was sentenced on probation,
whether the probation was later withdrawn, when the person was
released, etc.
So I think these lines should be allocated to different columns
depending on keywords. Defining the keywords would actually not be a
problem in most cases. If a certain column is not relevant in an
entry (i.e., the keyword does not appear), NA should be put in its
place.
But because the entries sometimes (in rare cases) contain spelling
errors, all lines of an entry that could not be allocated to a column
should, at the end, be put in a separate column so they can be checked
manually.
In the end, the table should look more or less like this:
--------------------------------------------------
"Per
.Numb";"EntryNumber";"Judg.Date";"DateOffen.";...;"Probation.until";
"Released";"Not allocated"
xxxx1 1 xx.xx.1902 xx.xx.1901 ... xx.xx.1905 NA "blablabla"
xxxx1 2 xx.xx.1910 xx.xx.1909 ... NA 1925 "blablabla"
xxxx2 1 xx.xx.1924 xx.xx.1923 ... NA NA "blablabla"
------------------------------------------------------------------
Could anyone help me?
Thanks
Greetings
Jürgen
You don't indicate the OS you are on, but you will want to get a hold
of 'pdftotext', which is a command line application that can extract
the textual content from the PDF files. On most Linuxen, it is already
installed, but for Windows and OSX you will likely need to Google for
it.
The basic approach is to loop over each PDF file, use pdftotext to get
the text content and dump it into a regular text file. That file can
then be read into R using ?readLines.
This can all be done within R using the ?system command. Get the names
of the PDF files in a given folder by using ?list.files with a
"\\.pdf" or "\\.PDF" search pattern. Then ?paste together the full
command using a prefix along the lines of "pdftotext -layout
-nopgbrk", presuming that the pdftotext command is in your $PATH. The
suffix to be paste()d will be the name of the input PDF file and the
name of the output text file. So you end up with a command line
character vector along the lines of:
"pdftotext -layout -nopgbrk xxxxx.pdf xxxxx.txt"
where the x's are the specific file basenames. Review the pdftotext
options to understand what is being done and whether you need to
modify them for your particular files.
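For example, a minimal sketch of that loop (the "records" folder name
is a placeholder, and pdftotext is assumed to be on your $PATH):

## Sketch: convert each PDF in a folder to text, then read it into R.
pdf.files <- list.files("records", pattern = "\\.pdf$",
                        ignore.case = TRUE, full.names = TRUE)

all.records <- lapply(pdf.files, function(pdf) {
  txt <- sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
  system(paste("pdftotext -layout -nopgbrk",
               shQuote(pdf), shQuote(txt)))
  readLines(txt)
})
names(all.records) <- basename(pdf.files)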
Once you have the data in R for each file, you will then need to
process the content line by line, looking for the keywords that are
associated with the content you require. Using ?grep is perhaps the
easiest way to accomplish that. You can then use ?gsub to
replace/strip the keywords, leaving you with only the data for each
line. For multi-line scenarios, you will need to keep track of where
the keyword for the first line is, and then look for the subsequent
keyword, or perhaps a blank line, to know when to stop aggregating
the data for that initial keyword.
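A hedged sketch of the keyword matching might look like this; the
keyword strings and the get.field() helper are placeholders, not a
finished solution:

## Sketch: return the text following a keyword on a matching line,
## or NA if the keyword does not appear in this entry's lines.
get.field <- function(lines, keyword) {
  hit <- grep(keyword, lines, value = TRUE)
  if (length(hit) == 0) return(NA)
  sub(paste(".*", keyword, "[: ]*", sep = ""), "", hit[1])
}

## e.g., for the lines of one entry:
## get.field(entry.lines, "Be in force since")
## get.field(entry.lines, "Date of offense")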
It then becomes a matter of reorganizing the content that you need
into the format you require for subsequent work.
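Once each entry has been reduced to named fields, something along
these lines (again just a sketch; the keyword list and column names
are placeholders) would stack the entries into the table described
above:

## Sketch: reduce one entry to a one-row data frame; any line that
## matches no keyword is kept in a "Not.allocated" column for manual
## review, as requested.
keywords <- c("Be in force since", "Date of offense")  # extend as needed

entry.to.row <- function(person, entry.no, entry.lines) {
  matched <- Reduce("|", lapply(keywords, grepl, entry.lines))
  data.frame(Per.Numb      = person,
             EntryNumber   = entry.no,
             InForce       = get.field(entry.lines, keywords[1]),
             Offense.Date  = get.field(entry.lines, keywords[2]),
             Not.allocated = paste(entry.lines[!matched],
                                   collapse = " | "),
             stringsAsFactors = FALSE)
}

## Then rbind the rows for all entries of all persons, e.g.:
## result <- do.call(rbind, list.of.entry.rows)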
I have not looked for 'text processing' related packages on CRAN, so
you may wish to look there first in case there is anything relevant.
HTH,
Marc Schwartz