In R-devel, function 'factor' has been changed to allow duplicated 'labels' and 
merge them.

Issue 1: Handling of specified 'labels' without duplicates is slower than 
before.
Example:
x <- rep(1:26, 40000)
system.time(factor(x, levels=1:26, labels=letters))

Function 'factor' is already rather slow because of the conversion to character. 
Please don't slow it down further.
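
A rough way to see the cost of specifying 'labels' is to time both calls on the same data. This is only a sketch: the timings depend on the machine and the R version, and the names 't_default', 't_labels', 'f_default' and 'f_labels' are made up for the illustration.

```r
x <- rep(1:26, 40000)

# Time factor() with default labels and with explicit, duplicate-free labels.
t_default <- system.time(f_default <- factor(x, levels = 1:26))
t_labels  <- system.time(f_labels  <- factor(x, levels = 1:26, labels = letters))

print(t_default)
print(t_labels)

# Both calls should produce the same integer codes; only the labels differ.
identical(as.integer(f_default), as.integer(f_labels))
```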

Issue 2: Although the default for 'labels' is 'levels', not specifying 'labels' 
may give a result different from specifying 'labels' identical to 'levels'.

Example 1:
as.integer(factor(c(NA,2,3), levels = c(2, NA), exclude = NULL))
is different from
as.integer(factor(c(NA,2,3), levels = c(2, NA), labels = c(2, NA), exclude = NULL))
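
The two calls can be compared side by side as below. Whether they differ depends on the R version in use, so this sketch reports the comparison instead of assuming an outcome; 'codes_default' and 'codes_labels' are illustrative names.

```r
x <- c(NA, 2, 3)

# Integer codes without 'labels' and with 'labels' equal to 'levels'.
codes_default <- as.integer(factor(x, levels = c(2, NA), exclude = NULL))
codes_labels  <- as.integer(factor(x, levels = c(2, NA), labels = c(2, NA),
                                   exclude = NULL))

print(codes_default)
print(codes_labels)

# The third element (3 is not in 'levels') is where the two may disagree.
identical(codes_default, codes_labels)
```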

File reg-tests-1d.R indicates that the behavior of 'factor' with NA has changed 
slightly, for the better. An entry that becomes NA (because it is unmatched in 
the 'levels' argument or is in 'exclude') is absorbed into the NA in the "levels" 
attribute (which comes from the 'labels' argument), if there is one. The issue is 
that this happens only when 'labels' is specified.

Function 'factor' could use match(xlevs, nlevs)[f] instead. That doesn't map an 
NA in 'f' to the NA level. When 'f' is long enough, longer than 'xlevs', it is 
also faster than match(xlevs[f], nlevs).
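
The difference between the two expressions can be seen on a small hand-built case. This is a sketch, not the code of 'factor' itself: 'xlevs', 'nlevs' and 'f' are set up by hand to mimic the values in Example 1, and 'via_lookup' and 'via_rematch' are illustrative names.

```r
xlevs <- c("2", NA)      # the labels, as character
nlevs <- unique(xlevs)   # the labels after merging duplicates
f <- c(2L, 1L, NA)       # codes against 'levels'; NA marks an unmatched entry

# Build a lookup table of length(xlevs), then index it: NA codes stay NA.
via_lookup <- match(xlevs, nlevs)[f]

# Expand to length(f) first, then match: the NA code becomes the NA label,
# which then matches the NA level.
via_rematch <- match(xlevs[f], nlevs)

print(via_lookup)    # 2 1 NA
print(via_rematch)   # 2 1 2
```

The first form calls match on length(xlevs) strings, the second on length(f) strings, which is why the first is cheaper when 'f' is long.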

Example 2:
With
levs <- c("A","A")
the call
factor(levs, levels=levs)
gives an error, but
factor(levs, levels=levs, labels=levs)
doesn't.
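
To check this on a given R version without stopping the session, the two calls can be wrapped in tryCatch. The helper 'try_factor' is hypothetical, written only for this illustration; what it reports depends on the R version.

```r
levs <- c("A", "A")

# Evaluate an expression; return its value, or the message of the first
# error or warning it signals.
try_factor <- function(expr) {
  tryCatch(expr,
           error   = function(e) paste("error:", conditionMessage(e)),
           warning = function(w) paste("warning:", conditionMessage(w)))
}

res_default <- try_factor(factor(levs, levels = levs))
res_labels  <- try_factor(factor(levs, levels = levs, labels = levs))

print(res_default)
print(res_labels)
```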

Note: In theory, if function 'factor' merged duplicated 'labels' in all cases, 
at least in
factor(c(sqrt(2)^2, 2))
function 'factor' could do the matching on the original 'x' (without conversion 
to character), as in R before version 2.10.0. If function 'factor' did so,
factor(c(sqrt(2)^2, 2), levels = c(sqrt(2)^2, 2), labels = c("sqrt(2)^2", "2"))
could treat sqrt(2)^2 and 2 as distinct.
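
The point about sqrt(2)^2 and 2 can be illustrated with 'match' alone. This sketch does not call 'factor'; it only shows how matching on doubles differs from matching on their character form.

```r
a <- sqrt(2)^2

a == 2                               # FALSE: the doubles differ in the last bits
as.character(a) == as.character(2)   # TRUE: both convert to "2"

# Matching on the original doubles keeps the two values apart ...
match(c(a, 2), c(a, 2))                               # 1 2
# ... while matching after conversion to character merges them.
match(as.character(c(a, 2)), as.character(c(a, 2)))   # 1 1
```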


Another thing: Function 'factor' in R devel uses 'order' instead of 'sort.list'.

The case of as.factor(x) for
x <- as.data.frame(character(0))
in tests/isas-tests.Rout.save reveals that 'order' on a data frame behaves 
strangely.

x <- as.data.frame(character(0))
y <- unique(x)
length(y)  # 1
length(order(y))  # 0
length(as.character(y))  # 1

order(y) does not have the same length as as.character(y).

Another example:
length(mtcars)  # 11
length(order(mtcars))  # 352
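
For comparison, when a row ordering of a data frame is what is wanted, one commonly used idiom is to pass the columns to 'order' as separate arguments via do.call. The sketch below contrasts that with calling 'order' on the data frame directly. The direct call is wrapped in tryCatch because what it gives (a value, a warning, or an error) may depend on the R version; 'direct' and 'row_ord' are illustrative names.

```r
# Direct call on the data frame: report the length of the result, or the
# message of the condition it signals.
direct <- tryCatch(order(mtcars),
                   error   = function(e) e,
                   warning = function(w) w)
if (inherits(direct, "condition")) {
  print(conditionMessage(direct))
} else {
  print(length(direct))
}

# Column-wise call: one index per row, ties broken by later columns.
row_ord <- do.call(order, unname(as.list(mtcars)))
length(row_ord) == nrow(mtcars)
```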

