On Sep 7, 2014, at 7:40 PM, Henrik Bengtsson wrote:
> Thank you David - it took me awhile to get back to this and dig into
> it. It's clever to imitate gtools::mixedorder() as far as possible.
> A few comments:
>
> 1. It took me a while to understand why you picked 3899 in your
> Roman-to-integer table; it's because roman(x) is NA for x > 3899.
> (BTW, in 'utils', there's utils:::.roman2numeric() which could be
> utilized, but it's currently internal.)
Yes, that was the reason. I didn't think I needed a Roman-to-numeric function
because I discovered the roman numbers were actually simple numeric vectors to
which a class had been assigned and that it was the class-facilities that did
all the work. The standard Ops functions were just acting on numeric vectors.
If one doesn't take care, their "romanity" can be lost:
> R <- as.roman(10^(0:4))
> R
[1] I X C M <NA>
> unclass(R)
[1] 1 10 100 1000 NA
> sum(R, na.rm=TRUE)
[1] 1111
> as.roman(sum(R, na.rm=TRUE))
[1] MCXI
>
> 2. I think you forgot D=500 and M=1000.
Quite possible. I suspect Greg will have corrected the omission, but if not,
this will be helpful to him.
>
> 3. There was a typo in your code; I think you meant rank.roman instead
> of rank.numeric in one place.
>
I understood Greg's intention to wrap this into the mixedorder and mixed sort
duo.
Best;
David.
> 4. The idea behind nonnumeric() is to identify non-numeric substrings
> by is.na(as.numeric()). Unfortunately, for romans that does not work.
> Instead, we need to use is.na(numeric(x)) here, i.e.
>
> nonnumeric <- function(x) {
> suppressWarnings(ifelse(is.na(numeric(x)), toupper(x), NA))
> }
>
> Actually, gtools::mixedorder() could use the same.
>
> 5. I undid your ".numeric" to ".roman" to minimize any differences to
> gtools::mixedorder().
>
>
> With the above fixes, we now have:
>
> mixedorderRoman <- function (x)
> {
> if (length(x) < 1)
> return(NULL)
> else if (length(x) == 1)
> return(1)
> if (is.numeric(x))
> return(order(x))
> delim = "\\$\\@\\$"
> # NOTE: Note that as.roman(x) is NA for x > 3899
> romanC <- as.character( as.roman(1:3899) )
> numeric <- function(x) {
> suppressWarnings(match(x, romanC))
> }
> nonnumeric <- function(x) {
> suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
> NA))
> }
> x <- as.character(x)
> which.nas <- which(is.na(x))
> which.blanks <- which(x == "")
> if (length(which.blanks) > 0)
> x[which.blanks] <- -Inf
> if (length(which.nas) > 0)
> x[which.nas] <- Inf
> delimited <- gsub("([IVXCLM]+)",
> paste(delim, "\\1", delim, sep = ""), x)
> step1 <- strsplit(delimited, delim)
> step1 <- lapply(step1, function(x) x[x > ""])
> step1.numeric <- lapply(step1, numeric)
> step1.character <- lapply(step1, nonnumeric)
> maxelem <- max(sapply(step1, length))
> step1.numeric.t <- lapply(1:maxelem, function(i) sapply(step1.numeric,
> function(x) x[i]))
> step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
> function(x) x[i]))
> rank.numeric <- sapply(step1.numeric.t, rank)
> rank.character <- sapply(step1.character.t, function(x)
> as.numeric(factor(x)))
> rank.numeric[!is.na(rank.character)] <- 0
> rank.character <- t(t(rank.character) + apply(matrix(rank.numeric),
> 2, max, na.rm = TRUE))
> rank.overall <- ifelse(is.na(rank.character), rank.numeric,
> rank.character)
> order.frame <- as.data.frame(rank.overall)
> if (length(which.nas) > 0)
> order.frame[which.nas, ] <- Inf
> retval <- do.call("order", order.frame)
> return(retval)
> }
>
>
> The difference to gtools::mixedorder() is minimal:
>
> < romanC <- as.character( as.roman(1:3899) )
> 21c11
> < suppressWarnings(match(x, romanC))
> ---
>> suppressWarnings(as.numeric(x))
> 24c14
> < suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
> ---
>> suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
> 34c24
> < delimited <- gsub("([IVXCLDM]+)",
> ---
>> delimited <-
>> gsub("([+-]{0,1}[0-9]+\\.{0,1}[0-9]*([eE][\\+\\-]{0,1}[0-9]+\\.{0,1}[0-9]*){0,1})",
> 59,62d48
>
> This difference is so small that the above could now be an option to
> mixedorder() with minimal overhead added, e.g. mixedorder(y,
> type=c("decimal", "roman")). One could even imagine adding support
> for "binary", "octal" and "hexadecimal" (not done).
>
> Greg (maintainer of gtools; cc:ed), is this something you would
> consider adding to gtools? I've modified the gtools source code
> available on CRAN (that's the only source I found), added package
> tests, updated the Rd and verified it passes R CMD check. If
> interested, please find the updates at:
>
> https://github.com/HenrikBengtsson/gtools/compare/cran:master...master
>
> Thanks
>
> Henrik
>
> On Tue, Aug 26, 2014 at 6:46 PM, David Winsemius <[email protected]>
> wrote:
>>
>> On Aug 26, 2014, at 5:24 PM, Henrik Bengtsson wrote:
>>
>>> Hi,
>>>
>>> does anyone know of an implementation/function that sorts strings that
>>> *contain* roman numerals (I, II, III, IV, V, ...) which are treated as
>>> numbers. In 'gtools' there is mixedsort() which does this for strings
>>> that contains (decimal) numbers. I'm looking for a "mixedsortroman()"
>>> function that does the same but with roman numbers, e.g.
>>
>> It's pretty easy to sort something you know to be congruent with the
>> existing roman class:
>>
>> romanC <- as.character( as.roman(1:3899) )
>> match(c("I", "II", "III","X","V"), romanC)
>> #[1] 1 2 3 10 5
>>
>> But I guess you already know that, so you want a regex approach to parsing.
>> Looking at the path taken by Warnes, it would involve doing something like
>> his regex based insertion of a delimiter for "Roman numeral" but simpler
>> because he needed to deal with decimal points and signs and exponent
>> notation, none of which you appear to need. If you only need to consider
>> character and Roman, then this hack of Warnes tools succeeds:
>>
>> mixedorderRoman <- function (x)
>> {
>> if (length(x) < 1)
>> return(NULL)
>> else if (length(x) == 1)
>> return(1)
>> if (is.numeric(x))
>> return(order(x))
>> delim = "\\$\\@\\$"
>> roman <- function(x) {
>> suppressWarnings(match(x, romanC))
>> }
>> nonnumeric <- function(x) {
>> suppressWarnings(ifelse(is.na(as.numeric(x)), toupper(x),
>> NA))
>> }
>> x <- as.character(x)
>> which.nas <- which(is.na(x))
>> which.blanks <- which(x == "")
>> if (length(which.blanks) > 0)
>> x[which.blanks] <- -Inf
>> if (length(which.nas) > 0)
>> x[which.nas] <- Inf
>> delimited <- gsub("([IVXCL]+)",
>> paste(delim, "\\1", delim, sep = ""), x)
>> step1 <- strsplit(delimited, delim)
>> step1 <- lapply(step1, function(x) x[x > ""])
>> step1.roman <- lapply(step1, roman)
>> step1.character <- lapply(step1, nonnumeric)
>> maxelem <- max(sapply(step1, length))
>> step1.roman.t <- lapply(1:maxelem, function(i) sapply(step1.roman,
>> function(x) x[i]))
>> step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
>> function(x) x[i]))
>> rank.roman <- sapply(step1.roman.t, rank)
>> rank.character <- sapply(step1.character.t, function(x)
>> as.numeric(factor(x)))
>> rank.roman[!is.na(rank.character)] <- 0
>> rank.character <- t(t(rank.character) + apply(matrix(rank.roman),
>> 2, max, na.rm = TRUE))
>> rank.overall <- ifelse(is.na(rank.character), rank.numeric,
>> rank.character)
>> order.frame <- as.data.frame(rank.overall)
>> if (length(which.nas) > 0)
>> order.frame[which.nas, ] <- Inf
>> retval <- do.call("order", order.frame)
>> return(retval)
>> }
>>
>> y[mixedorderRoman(y)]
>> [1] "chr I" "chr II" "chr III" "chr IV" "chr IX"
>> [6] "chr V" "chr VI" "chr VII" "chr VIII" "chr X"
>> [11] "chr XI" "chr XII"
>>
>>
>> --
>> David.
>>>
>>> ## DECIMAL NUMBERS
>>>> x <- sprintf("chr %d", 12:1)
>>>> x
>>> [1] "chr 12" "chr 11" "chr 10" "chr 9" "chr 8"
>>> [6] "chr 7" "chr 6" "chr 5" "chr 4" "chr 3"
>>> [11] "chr 2" "chr 1"
>>>
>>>> sort(x)
>>> [1] "chr 1" "chr 10" "chr 11" "chr 12" "chr 2"
>>> [6] "chr 3" "chr 4" "chr 5" "chr 6" "chr 7"
>>> [11] "chr 8" "chr 9"
>>>
>>>> gtools::mixedsort(x)
>>> [1] "chr 1" "chr 2" "chr 3" "chr 4" "chr 5"
>>> [6] "chr 6" "chr 7" "chr 8" "chr 9" "chr 10"
>>> [11] "chr 11" "chr 12"
>>>
>>>
>>> ## ROMAN NUMBERS
>>>> y <- sprintf("chr %s", as.roman(12:1))
>>>> y
>>> [1] "chr XII" "chr XI" "chr X" "chr IX"
>>> [5] "chr VIII" "chr VII" "chr VI" "chr V"
>>> [9] "chr IV" "chr III" "chr II" "chr I"
>>>
>>>> sort(y)
>>> [1] "chr I" "chr II" "chr III" "chr IV"
>>> [5] "chr IX" "chr V" "chr VI" "chr VII"
>>> [9] "chr VIII" "chr X" "chr XI" "chr XII"
>>>
>>>> mixedsortroman(y)
>>> [1] "chr I" "chr II" "chr III" "chr IV"
>>> [5] "chr V" "chr VI" "chr VII" "chr VIII"
>>> [9] "chr IX" "chr X" "chr XI" "chr XII"
>>>
>>> The latter is what I'm looking for.
>>>
>>> Before hacking together something myself (e.g. identify roman numerals
>>> substrings, translate them to decimal numbers, use gtools::mixedsort()
>>> to sort them and then translate them back to roman numbers), I'd like
>>> to hear if someone already has this implemented/know of a package that
>>> does this.
>>>
>>> Thanks,
>>>
>>> Henrik
>>>
>>> ______________________________________________
>>> [email protected] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> David Winsemius
>> Alameda, CA, USA
>>
David Winsemius
Alameda, CA, USA
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.