[Rd] duplicated factor labels.

Paul Johnson Wed, 14 Jun 2017 17:00:36 -0700

Dear R devel

I've been wondering about this for a while. I am sorry to ask for your
time, but can one of you help me understand this?


This concerns duplicated labels, not levels, in the factor function.

I find it hard to understand why factor() fails when the labels are
duplicated, while assigning the same labels with levels() afterwards
does not:

> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
Error in `levels<-`(`*tmp*`, value = if (nl == nL)
as.character(labels) else paste0(labels,  :
  factor level [3] is duplicated
> y <- factor(x, levels = xlevels)
> levels(y) <- xlabels
> y
[1] 1    <NA> <NA> 4    4    4
Levels: 1 4

If the latter use of levels() gives a good, expected result, couldn't
factor(..., labels = xlabels) be made to do the same thing?
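
For anyone who wants to reproduce the contrast in one go, here is a minimal sketch. The names direct and two_step are only for illustration. The direct call is wrapped in tryCatch() because whether it errors may depend on the R version in use, so no particular outcome is assumed for it.

```r
x <- 1:6
xlevels <- 1:6
xlabels <- c(1, NA, NA, 4, 4, 4)

# Route 1: pass the duplicated labels straight to factor().
# The result is either a factor or the error message, depending on the R version.
direct <- tryCatch(
    factor(x, levels = xlevels, labels = xlabels),
    error = function(e) conditionMessage(e)
)

# Route 2: build the factor first, then relabel it with levels<-.
two_step <- factor(x, levels = xlevels)
levels(two_step) <- xlabels

print(direct)
print(two_step)
```

The second route goes through the factor method of levels<-, which merges levels that share a label and drops the NA labels.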

That's the gist of it. To signal to you that I've been trying to
figure this out on my own, here is a revision I've tested in R's
factor function which "seems" to fix the matter. (Of course, it probably
causes lots of other problems I don't understand; that's why I'm
writing to you now.)

In the factor function, the class of f is assigned *after* levels(f) is
called:

    levels(f) <- ## nl == nL or 1
        if (nl == nL) as.character(labels)
        else paste0(labels, seq_along(levels))
    class(f) <- c(if(ordered) "ordered", "factor")

At that point, f is still a plain integer vector, and `levels<-` is a
primitive:

> `levels<-`
function (x, value)  .Primitive("levels<-")

That's what generates the error. I don't fully understand what
.Primitive means here, so I will set that detail aside.
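
To make the distinction concrete, here is a small sketch showing that `levels<-` is a primitive while base R also provides a separate method for objects of class "factor". The object f below is a made-up example, not taken from the code above.

```r
# `levels<-` itself is a primitive function.
is.primitive(`levels<-`)

# base R also defines a method that is used when the object has class "factor".
is.function(base::`levels<-.factor`)

# On a classed factor, duplicated replacement values merge levels.
f <- factor(c("a", "b", "c"))
levels(f) <- c("x", "x", "y")   # "a" and "b" both become "x"
levels(f)
as.integer(f)
```

So which code path runs depends on whether the object already carries the "factor" class at the time of the assignment.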

Suppose I revise the factor function to put the class(f) line before
the levels(f) assignment. Then `levels<-.factor` is called and all seems well.

factor <- function (x = character(), levels, labels = levels, exclude = NA,
    ordered = is.ordered(x), nmax = NA)
{
    if (is.null(x))
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x))
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx))
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL)))
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
            nl, nL), domain = NA)
    ## class() moved up 3 rows
    class(f) <- c(if (ordered) "ordered", "factor")
    levels(f) <- if (nl == nL)
                     as.character(labels)
                 else paste0(labels, seq_along(levels))
    f
}

> assignInNamespace("factor", factor, "base")
> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
> y
[1] 1    <NA> <NA> 4    4    4
Levels: 1 4
> attributes(y)
$class
[1] "factor"

$levels
[1] "1" "4"

That's a "good" answer for me.

But I broke your function: with this change, the check for duplicated
levels is gone.

> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
> y
[1] 1    4    <NA> <NA> <NA> <NA>
Levels: 1 4
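
The odd result above can be traced to the match() step inside factor(). As a minimal sketch (dup_levels and codes are illustrative names), match() returns the position of the first match only, so duplicated levels leave most positions unused:

```r
x <- 1:6
dup_levels <- c(1, 1, 1, 2, 2, 2)

# factor() converts x to character and matches it against the levels.
# "1" first matches at position 1, "2" at position 4, the rest match nothing.
codes <- match(as.character(x), dup_levels)
codes
```

Those codes then index into the labels, which is how the first two elements end up as 1 and 4 and the rest as NA.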

Rather than have factor() return the "factor level is duplicated" error
when there are duplicated values in labels, I wonder whether it would be
better to check for duplicated levels directly. For example, insert a
new else in this stanza:

    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    } else { ## next is new part
        levels <- unique(levels)
    }

That will cause an error when there are duplicated levels because,
once the levels are deduplicated, there are more labels than levels:

> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
Error in factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels) :
  invalid 'labels'; length 6 should be 1 or 2
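
The same idea can be tried at user level without touching base. The sketch below is only an illustration: factor_merge_labels is a hypothetical helper, not part of R, and it ignores the exclude, ordered and nmax arguments as well as length-1 labels. It rejects duplicated levels explicitly and lets duplicated labels merge through levels<-.

```r
# Hypothetical helper: explicit check on levels, permissive on labels.
factor_merge_labels <- function(x, levels, labels = levels) {
    if (anyDuplicated(levels) > 0L)
        stop("'levels' must be unique")
    if (length(labels) != length(levels))
        stop("'labels' must have the same length as 'levels'")
    f <- factor(x, levels = levels)
    levels(f) <- labels   # dispatches to the factor method, merging duplicates
    f
}

x <- 1:6
xlabels <- c(1, NA, NA, 4, 4, 4)

# Duplicated labels are accepted and merged.
y <- factor_merge_labels(x, levels = 1:6, labels = xlabels)
print(y)

# Duplicated levels are reported as such.
dup_err <- tryCatch(
    factor_merge_labels(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels),
    error = function(e) conditionMessage(e)
)
print(dup_err)
```

Checking the levels up front keeps the error message about the actual problem, instead of relying on a length mismatch between labels and deduplicated levels.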

So, in conclusion, if levels() can accept duplicated labels after a
factor has been created, I wish factor() would accept the equivalent
labels argument. What is your opinion?

pj
-- 
Paul E. Johnson   http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
