Re: [R] Mixed sorting/ordering of strings acknowledging roman numerals?

David Winsemius Sun, 07 Sep 2014 20:49:05 -0700

On Sep 7, 2014, at 7:40 PM, Henrik Bengtsson wrote:

> Thank you David - it took me awhile to get back to this and dig into
> it.  It's clever to imitate gtools::mixedorder() as far as possible.
> A few comments:
> 
> 1. It took me a while to understand why you picked 3899 in your
> Roman-to-integer table; it's because roman(x) is NA for x > 3899.
> (BTW, in 'utils', there's utils:::.roman2numeric() which could be
> utilized, but it's currently internal.)


Yes, that was the reason. I didn't think I needed a Roman-to-numeric function 
because I discovered the roman numbers were actually simple numeric vectors to 
which a class had been assigned and that it was the class-facilities that did 
all the work. The standard Ops functions were just acting on numeric vectors.

If one doesn't take care, their "romanity" can be lost:

> R <- as.roman(10^(0:4))
> R
[1] I    X    C    M    <NA>

> unclass(R)
[1]    1   10  100 1000   NA

> sum(R, na.rm=TRUE)
[1] 1111
> as.roman(sum(R, na.rm=TRUE))
[1] MCXI

> 
> 2. I think you forgot D=500 and M=1000.

Quite possible. I suspect Greg will have corrected the omission, but if not, 
this will be helpful to him.

> 
> 3. There was a typo in your code; I think you meant rank.roman instead
> of rank.numeric in one place.
> 

I understood Greg's intention to wrap this into the mixedorder and mixed sort 
duo.

Best;
David.

> 4. The idea behind nonnumeric() is to identify non-numeric substrings
> by is.na(as.numeric()).  Unfortunately, for romans that does not work.
> Instead, we need to use is.na(numeric(x)) here, i.e.
> 
>  nonnumeric <- function(x) {
>      suppressWarnings(ifelse(is.na(numeric(x)), toupper(x), NA))
>  }
> 
> Actually, gtools::mixedorder() could use the same.
> 
> 5. I undid your ".numeric" to ".roman" to minimize any differences to
> gtools::mixedorder().
> 
> 
> With the above fixes, we now have:
> 
> mixedorderRoman <- function (x)
> {
>    if (length(x) < 1)
>        return(NULL)
>    else if (length(x) == 1)
>        return(1)
>    if (is.numeric(x))
>        return(order(x))
>    delim = "\\$\\@\\$"
>    # NOTE: Note that as.roman(x) is NA for x > 3899
>    romanC <- as.character( as.roman(1:3899) )
>    numeric <- function(x) {
>        suppressWarnings(match(x, romanC))
>    }
>    nonnumeric <- function(x) {
>        suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
>            NA))
>    }
>    x <- as.character(x)
>    which.nas <- which(is.na(x))
>    which.blanks <- which(x == "")
>    if (length(which.blanks) > 0)
>        x[which.blanks] <- -Inf
>    if (length(which.nas) > 0)
>        x[which.nas] <- Inf
>    delimited <- gsub("([IVXCLM]+)",
>        paste(delim, "\\1", delim, sep = ""), x)
>    step1 <- strsplit(delimited, delim)
>    step1 <- lapply(step1, function(x) x[x > ""])
>    step1.numeric <- lapply(step1, numeric)
>    step1.character <- lapply(step1, nonnumeric)
>    maxelem <- max(sapply(step1, length))
>    step1.numeric.t <- lapply(1:maxelem, function(i) sapply(step1.numeric,
>        function(x) x[i]))
>    step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
>        function(x) x[i]))
>    rank.numeric <- sapply(step1.numeric.t, rank)
>    rank.character <- sapply(step1.character.t, function(x)
> as.numeric(factor(x)))
>    rank.numeric[!is.na(rank.character)] <- 0
>    rank.character <- t(t(rank.character) + apply(matrix(rank.numeric),
>        2, max, na.rm = TRUE))
>    rank.overall <- ifelse(is.na(rank.character), rank.numeric,
>        rank.character)
>    order.frame <- as.data.frame(rank.overall)
>    if (length(which.nas) > 0)
>        order.frame[which.nas, ] <- Inf
>    retval <- do.call("order", order.frame)
>    return(retval)
> }
> 
> 
> The difference to gtools::mixedorder() is minimal:
> 
> <     romanC <- as.character( as.roman(1:3899) )
> 21c11
> <         suppressWarnings(match(x, romanC))
> ---
>>        suppressWarnings(as.numeric(x))
> 24c14
> <         suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
> ---
>>        suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
> 34c24
> <     delimited <- gsub("([IVXCLDM]+)",
> ---
>>    delimited <- 
>> gsub("([+-]{0,1}[0-9]+\\.{0,1}[0-9]*([eE][\\+\\-]{0,1}[0-9]+\\.{0,1}[0-9]*){0,1})",
> 59,62d48
> 
> This difference is so small that the above could now be an option to
> mixedorder() with minimal overhead added, e.g. mixedorder(y,
> type=c("decimal", "roman")).  One could even imagine adding support
> for "binary", "octal" and "hexadecimal" (not done).
> 
> Greg (maintainer of gtools; cc:ed), is this something you would
> consider adding to gtools?  I've modified the gtools source code
> available on CRAN (that's the only source I found), added package
> tests, updated the Rd and verified it passes R CMD check.  If
> interested, please find the updates at:
> 
>  https://github.com/HenrikBengtsson/gtools/compare/cran:master...master
> 
> Thanks
> 
> Henrik
> 
> On Tue, Aug 26, 2014 at 6:46 PM, David Winsemius <dwinsem...@comcast.net> 
> wrote:
>> 
>> On Aug 26, 2014, at 5:24 PM, Henrik Bengtsson wrote:
>> 
>>> Hi,
>>> 
>>> does anyone know of an implementation/function that sorts strings that
>>> *contain* roman numerals (I, II, III, IV, V, ...) which are treated as
>>> numbers.  In 'gtools' there is mixedsort() which does this for strings
>>> that contains (decimal) numbers.  I'm looking for a "mixedsortroman()"
>>> function that does the same but with roman numbers, e.g.
>> 
>> It's pretty easy to sort something you know to be congruent with the 
>> existing roman class:
>> 
>> romanC <- as.character( as.roman(1:3899) )
>> match(c("I", "II", "III","X","V"), romanC)
>> #[1]  1  2  3 10  5
>> 
>> But I guess you already know that, so you want a regex approach to parsing. 
>> Looking at the path taken by Warnes, it would involve doing something like 
>> his regex based insertion of a delimiter for "Roman numeral" but simpler 
>> because he needed to deal with decimal points and signs and exponent 
>> notation, none of which you appear to need. If you only need to consider 
>> character and Roman, then this hack of Warnes tools succeeds:
>> 
>> mixedorderRoman <- function (x)
>> {
>>    if (length(x) < 1)
>>        return(NULL)
>>    else if (length(x) == 1)
>>        return(1)
>>    if (is.numeric(x))
>>        return(order(x))
>>    delim = "\\$\\@\\$"
>>    roman <- function(x) {
>>        suppressWarnings(match(x, romanC))
>>    }
>>    nonnumeric <- function(x) {
>>        suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
>>            NA))
>>    }
>>    x <- as.character(x)
>>    which.nas <- which(is.na(x))
>>    which.blanks <- which(x == "")
>>    if (length(which.blanks) > 0)
>>        x[which.blanks] <- -Inf
>>    if (length(which.nas) > 0)
>>        x[which.nas] <- Inf
>>    delimited <- gsub("([IVXCL]+)",
>>        paste(delim, "\\1", delim, sep = ""), x)
>>    step1 <- strsplit(delimited, delim)
>>    step1 <- lapply(step1, function(x) x[x > ""])
>>    step1.roman <- lapply(step1, roman)
>>    step1.character <- lapply(step1, nonnumeric)
>>    maxelem <- max(sapply(step1, length))
>>    step1.roman.t <- lapply(1:maxelem, function(i) sapply(step1.roman,
>>        function(x) x[i]))
>>    step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
>>        function(x) x[i]))
>>    rank.roman <- sapply(step1.roman.t, rank)
>>    rank.character <- sapply(step1.character.t, function(x) 
>> as.numeric(factor(x)))
>>    rank.roman[!is.na(rank.character)] <- 0
>>    rank.character <- t(t(rank.character) + apply(matrix(rank.roman),
>>        2, max, na.rm = TRUE))
>>    rank.overall <- ifelse(is.na(rank.character), rank.numeric,
>>        rank.character)
>>    order.frame <- as.data.frame(rank.overall)
>>    if (length(which.nas) > 0)
>>        order.frame[which.nas, ] <- Inf
>>    retval <- do.call("order", order.frame)
>>    return(retval)
>> }
>> 
>> y[mixedorderRoman(y)]
>> [1] "chr I"    "chr II"   "chr III"  "chr IV"   "chr IX"
>> [6] "chr V"    "chr VI"   "chr VII"  "chr VIII" "chr X"
>> [11] "chr XI"   "chr XII"
>> 
>> 
>> --
>> David.
>>> 
>>> ## DECIMAL NUMBERS
>>>> x <- sprintf("chr %d", 12:1)
>>>> x
>>> [1] "chr 12" "chr 11" "chr 10" "chr 9"  "chr 8"
>>> [6] "chr 7"  "chr 6"  "chr 5"  "chr 4"  "chr 3"
>>> [11] "chr 2"  "chr 1"
>>> 
>>>> sort(x)
>>> [1] "chr 1"  "chr 10" "chr 11" "chr 12" "chr 2"
>>> [6] "chr 3"  "chr 4"  "chr 5"  "chr 6"  "chr 7"
>>> [11] "chr 8"  "chr 9"
>>> 
>>>> gtools::mixedsort(x)
>>> [1] "chr 1"  "chr 2"  "chr 3"  "chr 4"  "chr 5"
>>> [6] "chr 6"  "chr 7"  "chr 8"  "chr 9"  "chr 10"
>>> [11] "chr 11" "chr 12"
>>> 
>>> 
>>> ## ROMAN NUMBERS
>>>> y <- sprintf("chr %s", as.roman(12:1))
>>>> y
>>> [1] "chr XII"  "chr XI"   "chr X"    "chr IX"
>>> [5] "chr VIII" "chr VII"  "chr VI"   "chr V"
>>> [9] "chr IV"   "chr III"  "chr II"   "chr I"
>>> 
>>>> sort(y)
>>> [1] "chr I"    "chr II"   "chr III"  "chr IV"
>>> [5] "chr IX"   "chr V"    "chr VI"   "chr VII"
>>> [9] "chr VIII" "chr X"    "chr XI"   "chr XII"
>>> 
>>>> mixedsortroman(y)
>>> [1] "chr I"    "chr II"   "chr III"  "chr IV"
>>> [5] "chr V"    "chr VI"   "chr VII"  "chr VIII"
>>> [9] "chr IX"   "chr X"    "chr XI"   "chr XII"
>>> 
>>> The latter is what I'm looking for.
>>> 
>>> Before hacking together something myself (e.g. identify roman numerals
>>> substrings, translate them to decimal numbers, use gtools::mixedsort()
>>> to sort them and then translate them back to roman numbers), I'd like
>>> to hear if someone already has this implemented/know of a package that
>>> does this.
>>> 
>>> Thanks,
>>> 
>>> Henrik
>>> 
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
>> David Winsemius
>> Alameda, CA, USA
>> 

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Mixed sorting/ordering of strings acknowledging roman numerals?

Reply via email to