Re: [R] sorting variable names containing digits

John Fox Sun, 21 Dec 2008 21:34:13 -0800

Dear Gabor,

Thank you (again) for this second suggestion, which does exactly what I
want. At the risk of appearing ungrateful, and although the judgment is
admittedly subjective, I don't find it simpler than mysort().


Out of curiosity, I timed the two functions on the sample problems that
I supplied:

> system.time(for (i in 1:100) mysort(s))
   user  system elapsed 
  1.498   0.006   1.503 

> system.time(for (i in 1:100) mysort2(s))
   user  system elapsed 
  6.026   0.028   6.059 

> system.time(for (i in 1:100) mysort(t))
   user  system elapsed 
  0.858   0.003   0.874 

> system.time(for (i in 1:100) mysort2(t))
   user  system elapsed 
  2.736   0.014   2.757 

These timings are from a 2.4 GHz Core 2 Duo MacBook. I don't know, of
course, whether the results generalize to other problems. I suspect that
the recursive solution will look worse as the number of "components" in
the names increases, but names are unlikely to have a large number of
components.
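
For anyone who wants to experiment without installing gsubfn, the split,
pad, and order steps that Gabor describes in the message quoted below can
be sketched in base R. This is only an illustration, not either of the
functions timed above: natural_sort() is a hypothetical name, and it
departs from mysort2() by zero-padding the digit runs and ordering with
method = "radix" (which needs R 3.3.0 or later), on the assumption that a
locale-independent comparison is wanted.

```r
# Sketch only: natural_sort() is a hypothetical name, not from the thread.
natural_sort <- function(x) {
  if (length(x) == 0) return(x)

  # Split each string into alternating runs of digits and non-digits.
  parts <- regmatches(x, gregexpr("[0-9]+|[^0-9]+", x))

  # Zero-pad digit runs to 9 characters so that string comparison agrees
  # with numeric comparison (assumes no digit run exceeds 9 digits).
  parts <- lapply(parts, function(p) {
    is_num <- grepl("^[0-9]+$", p)
    p[is_num] <- sprintf("%09d", as.integer(p[is_num]))
    p
  })

  n <- max(lengths(parts))
  if (n == 0) return(x)  # nothing to compare (all strings empty)

  # Pad shorter component vectors with "" and stack them as matrix rows.
  keys <- do.call(rbind, lapply(parts, function(p) c(p, rep("", n - length(p)))))

  # Order by the first component, then the second, and so on.
  cols <- lapply(seq_len(n), function(j) keys[, j])
  ord <- do.call(order, c(cols, list(method = "radix")))
  x[ord]
}

s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
       "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
t_names <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")

natural_sort(s)
natural_sort(t_names)
```

Because of the zero-padding, names such as "x2" and "x02" compare as
equal and keep their input order, and names that begin with a digit are
compared component by component against names that begin with a letter,
which may or may not be the order wanted.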

Best,
 John

On Sun, 21 Dec 2008 23:28:51 -0500
 "Gabor Grothendieck" <ggrothendi...@gmail.com> wrote:
> Another possibility is to use strapply in gsubfn giving a solution
> that is non-recursive and shorter:
> 
> library(gsubfn)
> 
> mysort2 <- function(s) {
>       L <- strapply(s, "([0-9]+)|([^0-9]+)",
>               ~ if (nchar(x)) sprintf("%9d", as.numeric(x)) else y)
>       L2 <- t(do.call(cbind, lapply(L, ts)))
>       L3 <- replace(L2, is.na(L2), "")
>       ord <- do.call(order, as.data.frame(L3, stringsAsFactors = FALSE))
>       s[ord]
> }
> 
> 
> First strapply breaks up each string into a character vector of the
> numeric and non-numeric components.  We pad each numeric component on
> the left with spaces using sprintf so they are all 9 wide.  The next
> line turns that into a matrix L2 and then we replace the NAs giving L3.
> Finally we order it and apply the ordering, ord, to get the sorted
> version.
> 
> The gsubfn home page is at:
> http://gsubfn.googlecode.com
> 
> Here is some sample output:
> 
> > mysort2(s)
>  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
>  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
> [13] "y10a10"
> > mysort(s)
>  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
>  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
> [13] "y10a10"
> 
> > mysort2(t)
> [1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
> > mysort(t)
> [1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
> 
> 
> On Sun, Dec 21, 2008 at 9:57 PM, John Fox <j...@mcmaster.ca> wrote:
> > Dear Gabor,
> >
> > Thanks for this -- I was unaware of mixedsort(). As you point out,
> > however, mixedsort() doesn't cover all of the cases in which I'm
> > interested and which are handled by mysort().
> >
> > Regards,
> >  John
> >
> > On Sun, 21 Dec 2008 20:51:17 -0500
> >  "Gabor Grothendieck" <ggrothendi...@gmail.com> wrote:
> >> mixedsort in gtools will give the same result as mysort(s) but
> >> differs in the case of t.
> >>
> >> On Sun, Dec 21, 2008 at 8:33 PM, John Fox <j...@mcmaster.ca> wrote:
> >> > Dear r-helpers,
> >> >
> >> > I'm looking for a way of sorting variable names in a "natural" order,
> >> > when the names are composed of digits and other characters. I know that
> >> > this is a vague idea, and that sorting character strings is a complex
> >> > topic, but perhaps a couple of examples will clarify what I mean:
> >> >
> >> >> s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
> >> > +   "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
> >> >
> >> >> sort(s)
> >> >  [1] "var10a2" "var2"    "x02"     "x02a"    "x02b"    "x1a"
> >> >  [7] "x1b"     "y10"     "y10a1"   "y10a10"  "y10a2"   "y1a1"
> >> > [13] "y2"
> >> >
> >> >> mysort(s)
> >> >  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
> >> >  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
> >> > [13] "y10a10"
> >> >
> >> >> t <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")
> >> >
> >> >> sort(t)
> >> > [1] "q10.1.1"  "q10.10.2" "q10.2.1"  "q2.1.1"
> >> >
> >> >> mysort(t)
> >> > [1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
> >> >
> >> > Here, sort() is the standard R function and mysort() is a replacement,
> >> > which sorts the names into the order that seems natural to me, at least
> >> > in the cases that I've tried:
> >> >
> >> > mysort <- function(x){
> >> >  sort.helper <- function(x){
> >> >    prefix <- strsplit(x, "[0-9]")
> >> >    prefix <- sapply(prefix, "[", 1)
> >> >    prefix[is.na(prefix)] <- ""
> >> >    suffix <- strsplit(x, "[^0-9]")
> >> >    suffix <- as.numeric(sapply(suffix, "[", 2))
> >> >    suffix[is.na(suffix)] <- -Inf
> >> >    remainder <- sub("[^0-9]+", "", x)
> >> >    remainder <- sub("[0-9]+", "", remainder)
> >> >    if (all (remainder == "")) list(prefix, suffix)
> >> >    else c(list(prefix, suffix), Recall(remainder))
> >> >    }
> >> >  ord <- do.call("order", sort.helper(x))
> >> >  x[ord]
> >> >   }
> >> >
> >> > I have a couple of applications in mind, one of which is recognizing
> >> > repeated-measures variables in "wide" longitudinal datasets, which often
> >> > are named in the form x1, x2, ... , xn.
> >> >
> >> > mysort(), which works by recursively slicing off pairs of non-digit and
> >> > digit strings, seems more complicated than it should have to be, and I
> >> > wonder whether anyone has a more elegant solution. I don't think that
> >> > efficiency is a serious issue for the applications I'm considering, but
> >> > of course a more efficient solution would be of interest.
> >> >
> >> > Thanks,
> >> >  John
> >> >
> >> > ------------------------------
> >> > John Fox, Professor
> >> > Department of Sociology
> >> > McMaster University
> >> > Hamilton, Ontario, Canada
> >> > web: socserv.mcmaster.ca/jfox
> >> >
> >> > ______________________________________________
> >> > R-help@r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >
> >
> > --------------------------------
> > John Fox, Professor
> > Department of Sociology
> > McMaster University
> > Hamilton, Ontario, Canada
> > http://socserv.mcmaster.ca/jfox/
> >

--------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
http://socserv.mcmaster.ca/jfox/

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
