Re: [R] selecting dataframe columns based on substring of col name(s)

David Winsemius Wed, 21 Jun 2017 12:28:48 -0700

> On Jun 21, 2017, at 9:11 AM, Evan Cooch <evan.co...@gmail.com> wrote:
> 
> Suppose I have the following sort of dataframe, where each column name has a 
> common structure: prefix, followed by a number (for this example, col1, col2, 
> col3 and col4):
> 
> d = data.frame( col1=runif(10), col2=runif(10), col3=runif(10),col4=runif(10))
> 
> What I haven't been able to suss out is how to efficiently 
> 'extract/manipulate/play with' columns from the data frame, making use of 
> this common structure.
> 
> Suppose, for example, I want to 'work with' col2, col3, and col4. Now, I 
> could subset the dataframe d in any number of ways -- for example
> 
> piece <- d[,c("col2","col3","col4")]
> 
> Works as expected, but for *big* problems (where I might have dozens -> 
> hundreds of columns -- often the case with big design matrices output by some 
> linear models program or another), having to write them all out using 
> c("col2","col3",...."colXXXXX") takes a lot of time. What I'm wondering about 
> is if there is a way to simply select over the "changing part" of the column 
> name (you can do this relatively easily in a data step in SAS, for example). 
> Heuristically, something like:
> 
> piece <- df[,col2:col4]
> 
> where the heuristic col2:col4 is interpreted as col2 -> col4 (parse the 
> prefix 'col', and then simply select over the changing suffic -- i.e., column 
> number).
> 
> Now, if I use the "to" function in the lessR package, I can get there from 
> here fairly easily:
> 
> piece <- d[,to("col",4,from=2,same.size=FALSE)]
> 
> But, is there a better way? Beyond 'efficiency' (ease of implementation), 
> part of what constitutes 'better' might be something in base R, rather than 
> relying on a package?


After staring at the code for the base function subset with a thought to 
hacking it to do this I realized that should be already part of the evaluation 
result from its current form:

 names(airquality)
#[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"  

subset(airquality, 
          Temp > 90,             # this is the row selection
          select = Ozone:Solar.R) # and this selects columns
#--------
    Ozone Solar.R
42     NA     259
43     NA     250
69     97     267
70     97     272
75     NA     291
102    NA     222
120    76     203
121   118     225
122    84     237
123    85     188
124    96     167
125    78     197
126    73     183
127    91     189

Bert's advice to work with the numbers is good, but conversion to numeric 
designations of columns inside the `select`-expression is actually what is 
occurring inside `subset`.

-- 

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] selecting dataframe columns based on substring of col name(s)

Reply via email to