Re: [R] sorting variable names containing digits

Gabor Grothendieck Mon, 22 Dec 2008 04:41:47 -0800

Note that mysort2 is slightly more general, as it also handles
strings that begin with digits:


> u <- c("51a2", "2a4")
> mysort(u)
[1] "51a2" "2a4"
> mysort2(u)
[1] "2a4"  "51a2"
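
As far as I can tell, mysort() puts "51a2" first because its helper takes the text before the first digit as the prefix and the second piece of a split on non-digits as the number, so a leading number is skipped. A small check using only those two strsplit() steps from the sort.helper() quoted below (prefix and suffix are the names used there; this is a sketch of the diagnosis, not a fix):

```r
u <- c("51a2", "2a4")

# First piece before any digit: empty for both strings,
# since each of them starts with a digit
prefix <- sapply(strsplit(u, "[0-9]"), "[", 1)

# Second piece when splitting on non-digits: the leading
# number (51 or 2) is the first piece, so it is never used
suffix <- as.numeric(sapply(strsplit(u, "[^0-9]"), "[", 2))

prefix                  # both ""
suffix                  # 2 and 4, rather than 51 and 2
order(prefix, suffix)   # 1 2, so "51a2" stays first
```

With the prefixes tied, the order is decided by 2 versus 4, which are the trailing numbers rather than the leading ones.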

On Mon, Dec 22, 2008 at 12:32 AM, John Fox <j...@mcmaster.ca> wrote:
> Dear Gabor,
>
> Thank you (again) for this second suggestion, which does exactly what I
> want. At the risk of appearing ungrateful, and although the judgment is
> admittedly subjective, I don't find it simpler than mysort().
>
> For curiosity, I tried some timings of the two functions for the sample
> problems that I supplied:
>
>> system.time(for (i in 1:100) mysort(s))
>   user  system elapsed
>  1.498   0.006   1.503
>
>> system.time(for (i in 1:100) mysort2(s))
>   user  system elapsed
>  6.026   0.028   6.059
>
>> system.time(for (i in 1:100) mysort(t))
>   user  system elapsed
>  0.858   0.003   0.874
>
>> system.time(for (i in 1:100) mysort2(t))
>   user  system elapsed
>  2.736   0.014   2.757
>
> This is on a  2.4 GHz Core 2 Duo MacBook. I don't know of course
> whether this generalizes to other problems. I suspect that the
> recursive solution will look worse as the number of "components" of the
> names increases, but of course names are unlikely to have a large
> number of components.
>
> Best,
>  John
>
> On Sun, 21 Dec 2008 23:28:51 -0500
>  "Gabor Grothendieck" <ggrothendi...@gmail.com> wrote:
>> Another possibility is to use strapply in gsubfn giving a solution
>> that is non-recursive and shorter:
>>
>> library(gsubfn)
>>
>> mysort2 <- function(s) {
>>       L <- strapply(s, "([0-9]+)|([^0-9]+)",
>>               ~ if (nchar(x)) sprintf("%9d", as.numeric(x)) else y)
>>       L2 <- t(do.call(cbind, lapply(L, ts)))
>>       L3 <- replace(L2, is.na(L2), "")
>>       ord <- do.call(order, as.data.frame(L3, stringsAsFactors = FALSE))
>>       s[ord]
>> }
>>
>>
>> First strapply breaks up each string into a character vector of the
>> numeric and non-numeric components.  We pad each numeric component on
>> the left with spaces using sprintf so they are all 9 wide.  The next
>> line turns that into a matrix L2 and then we replace the NAs giving L3.
>> Finally we order it and apply the ordering, ord, to get the sorted
>> version.
>>
>> The gsubfn home page is at:
>> http://gsubfn.googlecode.com
>>
>> Here is some sample output:
>>
>> > mysort2(s)
>>  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
>>  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
>> [13] "y10a10"
>> > mysort(s)
>>  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
>>  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
>> [13] "y10a10"
>>
>> > mysort2(t)
>> [1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
>> > mysort(t)
>> [1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
>>
>>
>> On Sun, Dec 21, 2008 at 9:57 PM, John Fox <j...@mcmaster.ca> wrote:
>> > Dear Gabor,
>> >
>> > Thanks for this -- I was unaware of mixedsort(). As you point out,
>> > however, mixedsort() doesn't cover all of the cases in which I'm
>> > interested and which are handled by mysort().
>> >
>> > Regards,
>> >  John
>> >
>> > On Sun, 21 Dec 2008 20:51:17 -0500
>> >  "Gabor Grothendieck" <ggrothendi...@gmail.com> wrote:
>> >> mixedsort in gtools will give the same result as mysort(s) but
>> >> differs in the case of t.
>> >>
>> >> On Sun, Dec 21, 2008 at 8:33 PM, John Fox <j...@mcmaster.ca> wrote:
>> >> > Dear r-helpers,
>> >> >
>> >> > I'm looking for a way of sorting variable names in a "natural"
>> >> > order, when the names are composed of digits and other characters.
>> >> > I know that this is a vague idea, and that sorting character
>> >> > strings is a complex topic, but perhaps a couple of examples will
>> >> > clarify what I mean:
>> >> >
>> >> >> s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
>> >> > +   "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
>> >> >
>> >> >> sort(s)
>> >> >  [1] "var10a2" "var2"    "x02"     "x02a"    "x02b"    "x1a"
>> >> >  [7] "x1b"     "y10"     "y10a1"   "y10a10"  "y10a2"   "y1a1"
>> >> > [13] "y2"
>> >> >
>> >> >> mysort(s)
>> >> >  [1] "var2"    "var10a2" "x1a"     "x1b"     "x02"     "x02a"
>> >> >  [7] "x02b"    "y1a1"    "y2"      "y10"     "y10a1"   "y10a2"
>> >> > [13] "y10a10"
>> >> >
>> >> >> t <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")
>> >> >
>> >> >> sort(t)
>> >> > [1] "q10.1.1"  "q10.10.2" "q10.2.1"  "q2.1.1"
>> >> >
>> >> >> mysort(t)
>> >> > [1] "q2.1.1"   "q10.1.1"  "q10.2.1"  "q10.10.2"
>> >> >
>> >> > Here, sort() is the standard R function and mysort() is a
>> >> > replacement, which sorts the names into the order that seems
>> >> > natural to me, at least in the cases that I've tried:
>> >> >
>> >> > mysort <- function(x){
>> >> >  sort.helper <- function(x){
>> >> >    prefix <- strsplit(x, "[0-9]")
>> >> >    prefix <- sapply(prefix, "[", 1)
>> >> >    prefix[is.na(prefix)] <- ""
>> >> >    suffix <- strsplit(x, "[^0-9]")
>> >> >    suffix <- as.numeric(sapply(suffix, "[", 2))
>> >> >    suffix[is.na(suffix)] <- -Inf
>> >> >    remainder <- sub("[^0-9]+", "", x)
>> >> >    remainder <- sub("[0-9]+", "", remainder)
>> >> >    if (all (remainder == "")) list(prefix, suffix)
>> >> >    else c(list(prefix, suffix), Recall(remainder))
>> >> >    }
>> >> >  ord <- do.call("order", sort.helper(x))
>> >> >  x[ord]
>> >> >   }
>> >> >
>> >> > I have a couple of applications in mind, one of which is
>> >> > recognizing repeated-measures variables in "wide" longitudinal
>> >> > datasets, which often are named in the form x1, x2, ... , xn.
>> >> >
>> >> > mysort(), which works by recursively slicing off pairs of
>> >> > non-digit and digit strings, seems more complicated than it should
>> >> > have to be, and I wonder whether anyone has a more elegant
>> >> > solution. I don't think that efficiency is a serious issue for the
>> >> > applications I'm considering, but of course a more efficient
>> >> > solution would be of interest.
>> >> >
>> >> > Thanks,
>> >> >  John
>> >> >
>> >> > ------------------------------
>> >> > John Fox, Professor
>> >> > Department of Sociology
>> >> > McMaster University
>> >> > Hamilton, Ontario, Canada
>> >> > web: socserv.mcmaster.ca/jfox
>> >> >
>> >> > ______________________________________________
>> >> > R-help@r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-help
>> >> > PLEASE do read the posting guide
>> >> > http://www.R-project.org/posting-guide.html
>> >> > and provide commented, minimal, self-contained, reproducible code.
>> >> >
>> >
>> > --------------------------------
>> > John Fox, Professor
>> > Department of Sociology
>> > McMaster University
>> > Hamilton, Ontario, Canada
>> > http://socserv.mcmaster.ca/jfox/
>> >
>
> --------------------------------
> John Fox, Professor
> Department of Sociology
> McMaster University
> Hamilton, Ontario, Canada
> http://socserv.mcmaster.ca/jfox/
>
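
For anyone who would rather not depend on gsubfn, the steps described for mysort2 in the quoted message (split each string into digit and non-digit runs, left-pad the digit runs to 9 characters, fill missing pieces with "", then order column by column) can be sketched in base R. The name natsort is hypothetical, and the use of method = "radix" in order() (available from R 3.3.0, compares strings in the C locale) is my own choice so that the padding spaces sort before digits and letters regardless of locale. It assumes every number fits in 9 digits.

```r
# Hypothetical base-R sketch of the split / pad / order idea
natsort <- function(s) {
  # Split each string into alternating runs of digits and non-digits
  tokens <- regmatches(s, gregexpr("[0-9]+|[^0-9]+", s))
  n <- max(lengths(tokens))

  # Build one sort key per token position; strings with fewer
  # tokens get "" so that they sort ahead of longer ones
  keys <- lapply(seq_len(n), function(i) {
    vapply(tokens, function(tk) {
      if (i > length(tk)) return("")
      if (grepl("^[0-9]+$", tk[i])) {
        # Left-pad numbers with spaces so they compare by value
        sprintf("%9d", as.integer(tk[i]))
      } else {
        tk[i]
      }
    }, character(1))
  })

  # Order by the first key, breaking ties with later keys,
  # comparing strings in the C locale
  s[do.call(order, c(keys, list(method = "radix")))]
}

s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2",
       "y10a10", "y10a1", "y2", "var10a2", "var2", "y10")
t <- c("q10.1.1", "q10.2.1", "q2.1.1", "q10.10.2")
u <- c("51a2", "2a4")

natsort(s)
natsort(t)
natsort(u)
```

On the three test vectors from this thread it should give the same orderings as the mysort2 output shown above, including "2a4" before "51a2". I have not timed it against either function.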

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
