Re: [R] How to remove square brackets, etc. from address strings?

Rui Barradas Sun, 27 May 2012 10:11:12 -0700

Hello,

Though I've not been following this thread, it seems like a regular expressions problem.

In the code below, I've created a 'testdata' variable based on your post.


# create a vector with two elements.
x <- "[Engel, Kathrin M. Y.; Schroeck, ... etc ...
y <- gsub("Germany", "Portugal", x)
testdata <- c(x, y)

# 's' is a list of character vectors; each element's final word is a country

s <- strsplit(testdata, ";[[:space:]]+\\[")
lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
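
Since 'x' is abbreviated above, here is a minimal self-contained sketch of the same two steps. The two records, including all names and institutes, are made up for illustration; only the two regular expressions come from the code above.

```r
# Hypothetical two-record input; names and institutes are invented.
testdata <- c(
  "[Doe, Jane; Roe, Richard] Univ A, Dept X, Leipzig, Germany; [Poe, Ann] Univ B, Lisbon, Portugal",
  "[Moe, Max] Univ C, Berlin, Germany"
)

# Split each record at the "; [" that opens the next address.
# Semicolons between author names are not followed by "[", so they are left alone.
s <- strsplit(testdata, ";[[:space:]]+\\[")

# For every address, keep the last blank-separated word.
countries <- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
countries
```

The result is a list with one character vector per record, so the countries stay attached to their original line. Note that taking the last word would shorten a country such as "New Zealand" to "Zealand".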


If this isn't it, sorry for the intrusion.

Rui Barradas

On 27-05-2012 17:29, Sabina Arndt wrote:

Hello r-help members,
I'm very grateful for the reply which Sarah Goslee sent to me in such a prompt and helpful manner. It took me some time, but with a few amendments her suggestion now works not only for an example but for my entire data file as well:
> results
  [1] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
  [5] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
...

Thank you very much for that, dear Sarah!
All these names actually belong to the very first record, though, which contains eight addresses instead of only one:
> testdata[1]
[1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
> results[1]
  [1] "GERMANY"

How can I put the country names back into their original lines / order?
This is an example of the correct result I'd like to receive:

> results[1]
[1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY"
How can I achieve this result?
I think counting the semicolons outside square brackets, i.e. the ones before a "[" but behind a "]", would be helpful in this regard, but I'm not sure how to do that, unfortunately. These semicolons directly follow the country names, like this, e.g.: "... Germany; [...". If I add "+ 1" to their number, it results in the number of addresses for each record / line.
Thank you very much in advance!
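
One way to sketch the counting idea described above is to delete the bracketed author blocks first, so that only the semicolons between addresses are left. The records below are made up for illustration, and the variable names (records, no_brackets, n_sep, n_addresses) are hypothetical. lengths() needs R 3.2.0 or later.

```r
# Hypothetical records; names and institutes are invented.
records <- c(
  "[Doe, Jane; Roe, Richard] Univ A, Leipzig, Germany; [Poe, Ann] Univ B, Berlin, Germany",
  "[Moe, Max; Loe, Lee] Univ C, Berlin, Germany"
)

# Remove every "[...]" block, including the semicolons inside it.
no_brackets <- gsub("\\[[^]]*\\]", "", records)

# Count the semicolons that remain in each record.
n_sep <- lengths(regmatches(no_brackets, gregexpr(";", no_brackets, fixed = TRUE)))

# Number of addresses per record = separators + 1.
n_addresses <- n_sep + 1
n_addresses
```

This assumes that no record ends with a trailing semicolon; if one does, the count would be one too high.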

Faithfully yours,

Sabina Arndt


On 26.05.2012 00:19, Sarah Goslee wrote:
Part of your problem is that your regexes have spaces in them, so
that's what you're matching.
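
A minimal sketch of that effect, assuming a separator that sometimes lacks the space after the semicolon. The input string and both patterns are made up for illustration and are not taken from the linked scripts.

```r
# Hypothetical input in which the second address follows ";" with no space.
x <- "[Doe, Jane] Univ A, Leipzig, Germany;[Poe, Ann] Univ B, Berlin, Germany"

# A pattern with a literal space only matches "; [", so nothing is split here.
with_space <- strsplit(x, "; \\[")[[1]]

# Making the whitespace optional matches both ";[" and "; [".
flexible <- strsplit(x, ";[[:space:]]*\\[")[[1]]

length(with_space)
length(flexible)
```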

A small reproducible example would be more useful. I'm not feeling
inclined to wade through all your linked files on Friday evening, but
see if this helps:
testdata <- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
results <- gsub("\\[.*?\\]", "", testdata)
results <- unlist(strsplit(results, ";"))
results <- sapply(results, function(x) sub("^.*, ([A-Za-z ]*)$", "\\1", x))
names(results) <- NULL
results
[1] "New Zealand" "USA"         "Germany"     "Germany"     "Germany"
[6] "Germany"     "Germany"     "Germany"

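To keep the countries grouped by record when there is more than one line of input, the same three steps can be wrapped in a function and applied with lapply(). This is only a sketch: the function name extract_countries and the two short records are made up for illustration, while the regular expressions are the ones used above.

```r
# Hypothetical records; names and institutes are invented.
testdata <- c(
  "[Doe, Jane; Roe, Richard] Univ A, Dept X, Leipzig, Germany; [Poe, Ann] Univ B, Auckland, New Zealand",
  "[Moe, Max] Univ C, Berlin, Germany"
)

# Apply the bracket removal, split and country extraction to one record.
extract_countries <- function(record) {
  no_authors <- gsub("\\[.*?\\]", "", record)        # drop the "[...]" author blocks
  addresses  <- unlist(strsplit(no_authors, ";"))    # one element per address
  sub("^.*, ([A-Za-z ]*)$", "\\1", addresses)        # keep the text after the last ", "
}

# One list element per record, so the original line order is preserved.
results <- lapply(testdata, extract_countries)
results
```

Calling unlist() on the whole vector at once is what flattens all records into a single vector; working record by record avoids that.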

Sarah
On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.ar...@hotmail.de> wrote:
Hello r-help members,
the solutions which Sarah Goslee and arun sent to me in such a prompt and
helpful manner work well with the examples I cut from the data.frame I'm
analyzing. Thank you very much for that!
I incorporated them into my R-script and discovered that it still doesn't
work properly, unfortunately. I have no idea why that's the case.
You see, I want to extract country names from the contents of tab-delimited
text files. This is an example of the data I'm using:
http://pastebin.com/mYZNDXg6
This is the script I'm using to import the data:
http://pastebin.com/Z10UUH3z (It requires the text files to be in a folder
which doesn't contain any other .txt files.)
This is the script I'm using to extract the country names:
http://pastebin.com/G37fuPba
This is the string that's in the relevant field of the first record I'm
working on:

[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz,
Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser,
Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst
Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern,
Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept
Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ
Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany;
[Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol,
Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel]
Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany;
[Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin,
Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany
This is the incorrect result my extraction script gives me for the first
record:
C1s[1]
  [1] "[ENGEL,  KATHRIN M. Y." "KRISTIN"                "TORSTEN"
  [4] "GERMANY"                "DANIEL"                 "LESCA MIRIAM"
  [7] "GERMANY"                "ANKE"                   "MATTHIAS"
 [10] "MATTHIAS"               "GERMANY"                "KERSTIN"
 [13] "GERMANY"                "GERMANY"                "[SCHEIDT, HOLGER A."
 [16] "JUERGEN"                "GERMANY"                "HUMBOLDT"
 [19] "GERMANY"
For some reason the first and sixth pair of the eight square brackets are
not removed ... Do you understand why?
Instead I'd like to get this result, though:
C1s[1]
  [1] "GERMANY"        "GERMANY"        "GERMANY"
  [4] "GERMANY"        "GERMANY"        "GERMANY"
  [7] "HUMBOLDT"        "GERMANY"

What am I doing wrong? What are the errors in my R-script?
Would anybody be so kind as to take a look and help me out, please?
Thank you very much in advance!

Faithfully yours,

Sabina Arndt
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

