Dear R-devel,

I've been wondering about this for a while. I am sorry to ask for your time, but can one of you help me understand this?

This concerns duplicated labels, not levels, in the factor function. I find it hard to understand why factor() fails when assigning with levels() afterwards does not:

> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  factor level [3] is duplicated
> y <- factor(x, levels = xlevels)
> levels(y) <- xlabels
> y
[1] 1 <NA> <NA> 4 4 4
Levels: 1 4

If the latter use of levels() gives a good, expected result, couldn't factor(..., labels = xlabels) be made to do the same thing?

That's the gist of it. To signal to you that I've been trying to figure this out on my own, here is a revision I've tested in R's factor function which "seems" to fix the matter. (Of course, it probably causes lots of other problems I don't understand; that's why I'm writing to you now.)

In the factor function, the class of f is assigned *after* levels(f) is called:

    levels(f) <- ## nl == nL or 1
        if (nl == nL) as.character(labels)
        else paste0(labels, seq_along(levels))
    class(f) <- c(if(ordered) "ordered", "factor")

At that point, f is still a plain integer vector, and `levels<-` is a primitive:

> `levels<-`
function (x, value)  .Primitive("levels<-")

That's what generates the error. I don't understand well what .Primitive means here, so I need to walk past that detail.

Suppose I revise the factor function to put the class(f) line before the levels(f) assignment. Then `levels<-.factor` is called and all seems well.

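The dispatch difference described above can be sketched in isolation. This is only an illustration under the behaviour reported in this message, not a statement about how factor() should be written; the names labs, f_int, f_fac, and err are made up for the example.

```r
# Labels with duplicates and NAs, as character (what factor() passes on).
labs <- as.character(c(1, NA, NA, 4, 4, 4))

# Case 1: a bare integer vector. There is no class attribute, so
# `levels<-` has no method to dispatch to and the primitive default runs.
# tryCatch() captures the error message instead of stopping the script.
f_int <- 1:6
err <- tryCatch({
    levels(f_int) <- labs
    NULL
}, error = function(e) conditionMessage(e))

# Case 2: the class is set first, so `levels<-.factor` is dispatched.
# That method drops NA labels and merges the duplicated ones.
f_fac <- 1:6
class(f_fac) <- "factor"
levels(f_fac) <- labs

print(err)               # the captured message from case 1
print(levels(f_fac))     # the merged level set from case 2
print(as.integer(f_fac)) # the recoded integer codes from case 2
```

The only difference between the two cases is whether the class attribute exists when `levels<-` is called, which is the ordering the revision below changes.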
Here is the revised function:

factor <- function (x = character(), levels, labels = levels, exclude = NA,
                    ordered = is.ordered(x), nmax = NA)
{
    if (is.null(x))
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x))
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx))
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL)))
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
                      nl, nL), domain = NA)
    ## class() moved up 3 rows
    class(f) <- c(if (ordered) "ordered", "factor")
    levels(f) <- if (nl == nL) as.character(labels)
                 else paste0(labels, seq_along(levels))
    f
}

> assignInNamespace("factor", factor, "base")
> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
> y
[1] 1 <NA> <NA> 4 4 4
Levels: 1 4
> attributes(y)
$class
[1] "factor"

$levels
[1] "1" "4"

That's a "good" answer for me. But I broke your function: I eliminated the check for duplicated levels.

> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
> y
[1] 1 4 <NA> <NA> <NA> <NA>
Levels: 1 4

Rather than have factor return the "duplicated levels" error when there are duplicated values in labels, I wonder whether it would be better to check for duplicated levels directly.

For example, insert a new else in this stanza:

    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    ## next is new part
    else {
        levels <- unique(levels)
    }

That will cause an error when there are duplicated levels, because there are then more labels than levels:

> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
Error in factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels) :
  invalid 'labels'; length 6 should be 1 or 2

So, in conclusion, if levels() can work after creating a factor, I wish the equivalent labels argument would be accepted. What is your opinion?

pj

-- 
Paul E. Johnson   http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis   http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel