[R] "Complex?" import of pdf files (criminal records) into R table

Biedermann, Jürgen Thu, 15 Oct 2009 07:01:46 -0700

Hi there,

I'm trying to decide whether several more or less complex pdf files can be transformed into an R table format automatically, or whether this has to be done manually. I think it would be impudent to expect a complete solution, but I would be grateful if anyone could give me advice on what the structure of such an R program could look like, and whether it's possible in general.


Here is the problem:

Each pdf file belongs to a person. The pdf files actually represent the anonymous criminal record of a person. Each entry should lead to one row with the person number as key. The different lines should form the columns. The criminal record actually looks like this:



---------------------------------------------------
Header with irrelevant text for us   |  Date: xx.xx.xxxx (relevant for us)

Anonymous person number: xxxxxxxxxxx

Entries in the register

1. xx.xx.1902  -City-
   Be in force since: xx.xx.1902
   Date of offence: xx.xx.xxxx
   Elements of the offence: For example "Rape"
   Section in law: §176, §178 Abs. 1
   Sentenced to 5 years imprisonment
   "Irrelevant text for us"
   Accommodation in a forensic psychiatry
   Accommodation sentenced on probation
   Rest of sentence sentenced on probation until the xx.xx.xxxx

2. xx.xx.1910
   Be in force since: ....
   .....

-----------------------------------------------------------------------
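A minimal sketch of how such a record might be split into entries, assuming the pdf text has already been extracted into a character vector with one element per line (the extraction step itself is not shown). The function name `split_record` and the regular expressions are illustrative assumptions; they accept the placeholder "x" only because the sample above uses it in place of real digits.

```r
# Hypothetical helper: split the text lines of one record into the person
# number and a list of entries. Assumes the layout of the sample above.
split_record <- function(lines) {
  lines <- trimws(lines)

  # Person number: whatever follows the label on its line.
  id_line <- grep("^Anonymous person number:", lines, value = TRUE)[1]
  person <- trimws(sub("^Anonymous person number:", "", id_line))

  # An entry starts with a running number, a dot, and a date-like token.
  start_pattern <- "^[0-9]+\\.[[:space:]]+[0-9x]{2}\\.[0-9x]{2}\\.[0-9x]{4}"
  starts <- grep(start_pattern, lines)
  if (length(starts) == 0) {
    return(list(person = person, entries = list()))
  }

  # Each entry runs up to the line before the next entry starts.
  ends <- c(starts[-1] - 1, length(lines))
  entries <- Map(function(s, e) {
    block <- lines[s:e]
    block[nzchar(block)]  # drop blank lines
  }, starts, ends)

  list(person = person, entries = entries)
}

# Example with a shortened version of the sample record
txt <- c(
  "Anonymous person number: xxxxxxxxxxx",
  "",
  "Entries in the register",
  "",
  "1. xx.xx.1902  -City-",
  "   Be in force since: xx.xx.1902",
  "   Date of offence: xx.xx.xxxx",
  "",
  "2. xx.xx.1910",
  "   Be in force since: ...."
)
rec <- split_record(txt)
```

The first line of each entry then carries the entry number and the judgement date, and the remaining lines can be handled one by one.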

The problem is that the entries do not always have the same structure. The first 6 lines are structurally the same in each entry of the criminal record (each entry has a line for the judgement date, the "be in force" date, the date of offence, the elements of the offence, the sections in law, and the sentence).

But then, depending on the sentence, different lines appear which indicate whether the person was sentenced on probation, whether the probation was withdrawn again, when the person was released, etc. So I think these lines should be allocated to different columns depending on key words. Defining the key words would not be a problem in most cases. If a certain column is not relevant for an entry (i.e. the key word did not appear), NA should be put in its place. But because the entries sometimes (in rare cases) contain spelling errors, all lines of an entry that could not be allocated to a column should finally be put in a separate column so that they can be checked manually.
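The key-word allocation described above could be sketched roughly as follows. This assumes the lines of one entry are available as a character vector; the function name `allocate_lines`, the column names, and the regular expressions are hypothetical and would have to be adapted to the real records.

```r
# Hypothetical key words, one regular expression per target column.
patterns <- c(
  In.Force        = "^Be in force since:",
  DateOffen.      = "^Date of offen[cs]e:",
  Elements        = "^Elements of the offen[cs]e:",
  Sentence        = "^Sentenced to",
  Probation.until = "on probation until",
  Released        = "^Released"
)

# Allocate each line of an entry to the first column whose key word matches.
# Columns without a match stay NA; lines without a match are collected in
# "Not.allocated" for manual checking.
allocate_lines <- function(lines, patterns) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  out <- setNames(rep(NA_character_, length(patterns)), names(patterns))
  used <- rep(FALSE, length(lines))

  for (col in names(patterns)) {
    hit <- which(!used & grepl(patterns[[col]], lines, ignore.case = TRUE))
    if (length(hit) > 0) {
      out[[col]] <- lines[hit[1]]   # keep the whole line, key word included
      used[hit[1]] <- TRUE
    }
  }

  rest <- if (any(!used)) paste(lines[!used], collapse = " | ") else NA_character_
  c(out, Not.allocated = rest)
}

# Example with lines taken from the sample entry
entry <- c(
  "   Be in force since: xx.xx.1902",
  "   Date of offence: xx.xx.xxxx",
  "   Elements of the offence: Rape",
  "   Sentenced to 5 years imprisonment",
  "   Accommodation sentenced on probation",
  "   Rest of sentence sentenced on probation until the xx.xx.xxxx"
)
res <- allocate_lines(entry, patterns)
```

In this example no key word covers the line "Accommodation sentenced on probation", so it ends up in the "Not.allocated" column, and "Released" stays NA. A line with a spelling error in its key word would be caught the same way.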


In the end the table should look more or less like this:

--------------------------------------------------

"Per.Numb";"EntryNumber";"Judg.Date";"DateOffen.";...;"Probation.until";"Released";"Not allocated"


xxxx1   1   xx.xx.1902  xx.xx.1901 ... xx.xx.1905 NA  "blablabla"
xxxx1   2   xx.xx.1910  xx.xx.1909 ... NA        1925  "blablabla"
xxxx2   1   xx.xx.1924  xx.xx.1923 ... NA        NA  "blablabla"
------------------------------------------------------------------
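Assembling such a table could look roughly like the sketch below. It assumes that each entry has already been turned into a named character vector with the same names in the same order; the values are placeholders, and only a few of the columns are shown.

```r
# One named character vector per entry (placeholder values only).
rows <- list(
  c(Per.Numb = "xxxx1", EntryNumber = "1", Judg.Date = "xx.xx.1902", Released = NA),
  c(Per.Numb = "xxxx1", EntryNumber = "2", Judg.Date = "xx.xx.1910", Released = "1925"),
  c(Per.Numb = "xxxx2", EntryNumber = "1", Judg.Date = "xx.xx.1924", Released = NA)
)

# Stack the vectors row by row and convert the result to a data frame.
tab <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
tab$EntryNumber <- as.integer(tab$EntryNumber)

# Write a semicolon-separated file with quoted column names.
out_file <- tempfile(fileext = ".csv")
write.table(tab, file = out_file, sep = ";", quote = TRUE,
            row.names = FALSE, na = "NA")
```

Missing values are written as NA, as in the target table above.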

Could anyone help me?
Thanks

Greetings
Jürgen
