Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
> strncasecmp is not standard C (not even C99), but R does have a substitute
> for it. Unfortunately strncasecmp is not usable with multibyte charsets:
> Linux systems have wcsncasecmp but that is not portable. In these days of
> widespread use of UTF-8 that is a blocking issue, I am afraid.

What could help are the functions mbrtowc and towctrans and simple
long integer comparison. Are the functions mbrtowc and towctrans
available under Windows? mbrtowc seems to be available as Rmbrtowc
in src/gnuwin32/extra.c.

I did not find towctrans defined in R sources, but it is in
gnuwin32/Rdll.hide and used in do_tolower. Does this mean that tolower
is not usable with UTF-8 under Windows?

> In the case of grep I think all you need is
>
>   grep(tolower(pattern), tolower(x), fixed = TRUE)
>
> and similarly for regexpr.

Yes, this is correct, but it has disadvantages. It needs more space
and, if value = TRUE, we would have to do something like

  x[grep(tolower(pattern), tolower(x), fixed = TRUE, value = FALSE)]

This is hard to implement in src/library/base/R/grep.R, where the call
to .Internal(grep(pattern, ...)) is the last command, and I think this
should be preserved.

> > Ignore case option is not meaningful in gsub.
>
>   sub("abc", "123", c("ABCD", "abcd"), ignore.case = TRUE)
>
> is different from 'ignore.case = FALSE', and I see the meaning as clear.
> So what did you mean? (Unfortunately the tolower trick does not work for
> [g]sub.)

The meaning of ignore.case in [g]sub is problematic for the following
reason:

  sub("abc", "xyz", c("ABCD", "abcd"), ignore.case = TRUE)

produces

  [1] "xyzD" "xyzd"

but the user may in fact need

  [1] "XYZD" "xyzd"

It is correct that "xyzD" "xyzd" is produced, but the user should be
aware that several substitutions like

  x <- sub("abc", "xyz", c("ABCD", "abcd"))  # ignore.case = FALSE
  sub("ABC", "XYZ", x)                       # ignore.case = FALSE

may be more useful.

I have another question concerning the speed of grep. I expected that
the fgrep_one function is slower than calling a library routine for
regular expressions. In particular, if the pattern has a lot of long
partial matches in the target string, I expected that it may be much
slower. A short example is

  y <- "ab"
  x <- "aaab"
  grep(y, x)

which requires 110 comparisons (10 comparisons for each of 11 possible
beginnings of y in x). In general, the complexity in the worst case is
O(m*n), where m, n are the lengths of y, x respectively. I would expect
that the library function for matching regular expressions needs time
O(m+n) and so will be faster. However, the result obtained on a larger
example is

  > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
  > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
  > y <- paste(c(rep("a", times = 1), x2), collapse = "")
  > z <- rep(y, times = 100)

  > system.time(i <- grep(x1, z, fixed = T))
  [1] 1.970 0.000 1.985 0.000 0.000

  > system.time(i <- grep(x1, z, fixed = F))  # reg. expr. surprisingly slow (*)
  [1] 40.374 0.003 40.381 0.000 0.000

  > system.time(i <- grep(x2, z, fixed = T))
  [1] 0.113 0.000 0.113 0.000 0.000

  > system.time(i <- grep(x2, z, fixed = F))  # reg. expr. faster than fgrep_one
  [1] 0.019 0.000 0.019 0.000 0.000

Do you have an explanation of these results, in particular (*)?

Petr.
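(A minimal sketch of the value = TRUE workaround described above; the
helper name grep_ci_fixed is made up for illustration and is not part
of R.)

  ## Case-insensitive fixed-string matching via the tolower() trick.
  ## Indices are computed on lower-cased copies and then used to subset
  ## the original vector, so value = TRUE returns the original strings.
  grep_ci_fixed <- function(pattern, x, value = FALSE) {
    i <- grep(tolower(pattern), tolower(x), fixed = TRUE)
    if (value) x[i] else i
  }

  grep_ci_fixed("abc", c("ABCD", "xyz", "abcd"), value = TRUE)
  # [1] "ABCD" "abcd"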
Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
On Thu, 17 May 2007, Petr Savicky wrote:

>> strncasecmp is not standard C (not even C99), but R does have a substitute
>> for it. Unfortunately strncasecmp is not usable with multibyte charsets:
>> Linux systems have wcsncasecmp but that is not portable. In these days of
>> widespread use of UTF-8 that is a blocking issue, I am afraid.
>
> What could help are the functions mbrtowc and towctrans and simple
> long integer comparison. Are the functions mbrtowc and towctrans
> available under Windows? mbrtowc seems to be available as Rmbrtowc
> in src/gnuwin32/extra.c.
>
> I did not find towctrans defined in R sources, but it is in
> gnuwin32/Rdll.hide

I don't see it in Rdll.hide. It is a C99 function (see your unix man page).

> and used in do_tolower. Does this mean that tolower is not usable
> with utf-8 under Windows?

UTF-8 is not usable under Windows, but tolower works in Windows DBCS
(in so far as that makes sense: Chinese chars do not have 'case').
Rmbrtowc reflects an attempt to add UTF-8 support on Windows, but that
is not currently active.

>> In the case of grep I think all you need is
>>
>>   grep(tolower(pattern), tolower(x), fixed = TRUE)
>>
>> and similarly for regexpr.
>
> Yes, this is correct, but it has disadvantages. It needs more
> space and, if value = TRUE, we would have to do something like
>   x[grep(tolower(pattern), tolower(x), fixed = TRUE, value = FALSE)]
> This is hard to implement in src/library/base/R/grep.R,
> where the call to .Internal(grep(pattern, ...)) is the last command
> and I think this should be preserved.
>
>>> Ignore case option is not meaningful in gsub.
>>
>>   sub("abc", "123", c("ABCD", "abcd"), ignore.case = TRUE)
>>
>> is different from 'ignore.case = FALSE', and I see the meaning as clear.
>> So what did you mean? (Unfortunately the tolower trick does not work for
>> [g]sub.)
>
> The meaning of ignore.case in [g]sub is problematic due to the following.
>   sub("abc", "xyz", c("ABCD", "abcd"), ignore.case = TRUE)
> produces
>   [1] "xyzD" "xyzd"
> but the user may in fact need the following
>   [1] "XYZD" "xyzd"

He may, but that is not what 'ignore case' means, more like 'case honouring'.

> It is correct that "xyzD" "xyzd" is produced, but the user
> should be aware of the fact that several substitutions like
>   x <- sub("abc", "xyz", c("ABCD", "abcd"))  # ignore.case = FALSE
>   sub("ABC", "XYZ", x)                       # ignore.case = FALSE
> may be more useful.
>
> I have another question concerning the speed of grep. I expected that
> the fgrep_one function is slower than calling a library routine
> for regular expressions. In particular, if the pattern has a lot of
> long partial matches in the target string, I expected that it may be much
> slower. A short example is
>   y <- "ab"
>   x <- "aaab"
>   grep(y, x)
> which requires 110 comparisons (10 comparisons for each of 11 possible
> beginnings of y in x). In general, the complexity in the worst case is
> O(m*n), where m, n are the lengths of y, x respectively. I would expect
> that the library function for matching regular expressions needs
> time O(m+n) and so will be faster. However, the result obtained
> on a larger example is
>
> > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
> > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
> > y <- paste(c(rep("a", times = 1), x2), collapse = "")
> > z <- rep(y, times = 100)
>
> > system.time(i <- grep(x1, z, fixed = T))
> [1] 1.970 0.000 1.985 0.000 0.000
>
> > system.time(i <- grep(x1, z, fixed = F))  # reg. expr. surprisingly slow (*)
> [1] 40.374 0.003 40.381 0.000 0.000
>
> > system.time(i <- grep(x2, z, fixed = T))
> [1] 0.113 0.000 0.113 0.000 0.000
>
> > system.time(i <- grep(x2, z, fixed = F))  # reg. expr. faster than fgrep_one
> [1] 0.019 0.000 0.019 0.000 0.000
>
> Do you have an explanation of these results, in particular (*)?

Yes, there is a comment on the help page to that effect. But these are
highly atypical uses. Try perl = TRUE, and be aware that the locale
matters a lot in such tests (via the charset).

No one is attempting to make R a fast string-processing language and so
developers' resources are spent on performance where it matters to more
typical usage. (E.g. reducing duplication in as.double and friends
speeds up just about every R session, and speeds up some numerical
sessions dramatically.)

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
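(For reference, a re-run of the slow case with perl = TRUE, as suggested
above; this is only a sketch of the comparison, with no claimed timings,
since results depend heavily on the machine and the locale/charset.)

  x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
  x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
  z  <- rep(paste(c("a", x2), collapse = ""), times = 100)

  system.time(grep(x1, z, fixed = TRUE))  # fgrep_one
  system.time(grep(x1, z))                # default regex engine (the slow case)
  system.time(grep(x1, z, perl = TRUE))   # PCRE engine, for comparison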
[Rd] RFC: adding an 'exact' argument to [[
Hi all,

One of the things I find most problematic in R is the partial matching
of names in lists. Robert and I have discussed this and we believe
that having a mechanism that does not do partial matching would be of
significant benefit to R programmers. To that end, I have written a
patch that modifies the behavior of "[[" as follows:

   1. [[ gains an 'exact' argument with default value NA

   2. Behavior of 'exact' argument:

      exact=NA
          partial matching is performed as usual, however, a warning
          will be issued when a partial match occurs. This is the
          default.

      exact=TRUE
          no partial matching is performed.

      exact=FALSE
          partial matching is allowed and no warning issued if it
          occurs.

This change has been discussed among R-core members and there appeared
to be a general consensus that this approach was a good way to proceed.
However, we are interested in other suggestions from the broader R
developer community.

Some additional rationale for our approach:

Lists are used as the underlying data structures in many R programs
and in these cases the named elements are not a fixed set of things
with a fixed set of names. For these programs, [[ will be used with
an argument that gets evaluated at runtime and partial matching here
is almost always a disaster. Furthermore, dealing with data that has
common prefixes happens often and is not an exceptional circumstance
(a precondition for partial matching issues).

We have tested a similar patch that simply eliminated partial matching
for [[ on all CRAN and Bioconductor packages and did not see any
obvious failures.

A downside of this approach is that S4 methods on [[ will need to be
modified to accommodate the new signature. However, by adding an
argument, we are able to move more slowly towards a non-partially
matching [[ (eventually, the default could be exact=TRUE, but that is
a discussion for another day).

+ seth

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org
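(An illustration of the partial matching the proposal addresses, as [[
behaved at the time of this thread (R 2.5.x); the exact= calls shown in
comments assume the proposed patch.)

  opts <- list(fitted.values = 1:3, residuals = c(0.1, -0.2, 0.1))
  opts[["fit"]]    # silently partial-matches opts$fitted.values
  # [1] 1 2 3

  ## Under the proposed patch:
  ## opts[["fit"]]                  # would warn about the partial match (exact = NA)
  ## opts[["fit", exact = TRUE]]    # no partial matching: returns NULL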
[Rd] model.matrix bug? Nested factor yields singular design matrix.
Hi all,

I believe this is a bug in the model.matrix function. I'd like a
second opinion before filing a bug report.

If I have a nested covariate B with multiple values for just one level
of A, I can not get a non-singular design matrix out of model.matrix.

  > df <- data.frame(A = factor(c("a", "a", "x", "x"), levels = c("x", "a")),
  +                  B = factor(c("b", "x", "x", "x"), levels = c("x", "b")))
  > df
    A B
  1 a b
  2 a x
  3 x x
  4 x x

So of course the full design matrix is singular, this is expected.

  > model.matrix(~ A * B, df)
    (Intercept) Aa Bb Aa:Bb
  1           1  1  1     1
  2           1  1  0     0
  3           1  0  0     0
  4           1  0  0     0
  attr(,"assign")
  [1] 0 1 2 3
  attr(,"contrasts")
  attr(,"contrasts")$A
  [1] "contr.treatment"
  attr(,"contrasts")$B
  [1] "contr.treatment"

I'd like to drop the B main effect column, but get the unexpected
result of a column of zeroes.

  > model.matrix(~ A * B - B, df)
    (Intercept) Aa Ax:Bb Aa:Bb
  1           1  1     0     1
  2           1  1     0     0
  3           1  0     0     0
  4           1  0     0     0
  attr(,"assign")
  [1] 0 1 2 2
  attr(,"contrasts")
  attr(,"contrasts")$A
  [1] "contr.treatment"
  attr(,"contrasts")$B
  [1] "contr.treatment"

This does not happen in S-PLUS.

  > info()
  S info file C:\DOCUME~1\kilroy\LOCALS~1\Temp\S0107EB.tmp will be removed at session end
  $Sinfo:
  Enterprise Developer Version 7.0.6 for Microsoft Windows : 2005

  SHOME: C:/PROGRAMFILES/INSIGHTFUL/splus70
  prog.name: SPLUS.EXE
  load.date: Sun Dec 04 23:15:59 2005
  date: Thu May 17 07:38:16 PDT 2007

  > options(contrasts = c("contr.treatment", "contr.poly"))
  > df <- data.frame(A = factor(c("a", "a", "x", "x"), levels = c("x", "a")),
  +                  B = factor(c("b", "x", "x", "x"), levels = c("x", "b")))
  > model.matrix(~ A * B - B, df)
    (Intercept) A A:B
  1           1 1   1
  2           1 1   0
  3           1 0   0
  4           1 0   0

This is what I was expecting to get in R, but can not.

Alternate specifications in R continue to yield a singular design matrix:

  > model.matrix(~ A/B, df)
    (Intercept) Aa Ax:Bb Aa:Bb
  1           1  1     0     1
  2           1  1     0     0
  3           1  0     0     0
  4           1  0     0     0
  attr(,"assign")
  [1] 0 1 2 2
  attr(,"contrasts")
  attr(,"contrasts")$A
  [1] "contr.treatment"
  attr(,"contrasts")$B
  [1] "contr.treatment"

  > model.matrix(~ A + A:B, df)
    (Intercept) Aa Ax:Bb Aa:Bb
  1           1  1     0     1
  2           1  1     0     0
  3           1  0     0     0
  4           1  0     0     0
  attr(,"assign")
  [1] 0 1 2 2
  attr(,"contrasts")
  attr(,"contrasts")$A
  [1] "contr.treatment"
  attr(,"contrasts")$B
  [1] "contr.treatment"

Why is the Ax:Bb column being included? Have I missed a control
parameter or some other way of specifying to model.matrix not to
include this extra column?

Any feedback appreciated.

Best regards

Steven McKinney
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

email: [EMAIL PROTECTED]
tel: 604-675-8000 x7561

BCCRC Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C.
V5Z 1L3
Canada
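(One possible workaround, not from the original post: build the nested
design and then drop any columns that are identically zero. Note that
plain matrix subsetting discards the "assign" and "contrasts"
attributes.)

  df <- data.frame(A = factor(c("a", "a", "x", "x"), levels = c("x", "a")),
                   B = factor(c("b", "x", "x", "x"), levels = c("x", "b")))
  mm <- model.matrix(~ A/B, df)
  mm <- mm[, colSums(mm != 0) > 0, drop = FALSE]  # removes the empty Ax:Bb column
  mm
  #   (Intercept) Aa Aa:Bb
  # 1           1  1     1
  # 2           1  1     0
  # 3           1  0     0
  # 4           1  0     0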
Re: [Rd] model.matrix bug? Nested factor yields singular design matrix.
Apologies - I forgot the session info.

  > sessionInfo()
  R version 2.5.0 (2007-04-23)
  powerpc-apple-darwin8.9.1

  locale:
  en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

  attached base packages:
  [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"

  other attached packages:
         lme4      Matrix     lattice
  "0.99875-0" "0.99875-1"   "0.15-5"
Re: [Rd] RFC: adding an 'exact' argument to [[
On Thu, 17 May 2007, Seth Falcon wrote:

> One of the things I find most problematic in R is the partial matching
> of names in lists. Robert and I have discussed this and we believe
> that having a mechanism that does not do partial matching would be of
> significant benefit to R programmers. To that end, I have written a
> patch that modifies the behavior of "[[" as follows:
>
>    1. [[ gains an 'exact' argument with default value NA
>
>    2. Behavior of 'exact' argument:
>
>       exact=NA
>           partial matching is performed as usual, however, a warning
>           will be issued when a partial match occurs. This is the
>           default.
>
>       exact=TRUE
>           no partial matching is performed.
>
>       exact=FALSE
>           partial matching is allowed and no warning issued if it
>           occurs.
>
> This change has been discussed among R-core members and there appeared
> to be a general consensus that this approach was a good way to
> proceed. However, we are interested in other suggestions from the
> broader R developer community.
>
> Some additional rationale for our approach:
>
> Lists are used as the underlying data structures in many R programs
> and in these cases the named elements are not a fixed set of things
> with a fixed set of names. For these programs, [[ will be used with
> an argument that gets evaluated at runtime and partial matching here
> is almost always a disaster. Furthermore, dealing with data that has
> common prefixes happens often and is not an exceptional circumstance
> (a precondition for partial matching issues).

This sounds interesting. Do you intend to leave the $ operator alone,
so it will continue to do partial matching? I suspect that that is
where the majority of partial matching for list names is done.

It might be nice to have an option that made x$partial warn so we
would fix code that relied on partial matching, but that is lower
priority.

Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

"All statements in this message represent the opinions of the author and do
not necessarily reflect Insightful Corporation policy or position."
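(A small example of the $ partial matching referred to above, which the
proposal leaves untouched.)

  x <- list(partial.sums = 1:3)
  x$part     # silently partial-matches x$partial.sums
  # [1] 1 2 3
  x$parts    # not a prefix of any name: NULL, again silently
  # NULL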
Re: [Rd] RFC: adding an 'exact' argument to [[
Bill Dunlap <[EMAIL PROTECTED]> writes:

> This sounds interesting. Do you intend to leave the $
> operator alone, so it will continue to do partial
> matching? I suspect that that is where the majority
> of partial matching for list names is done.

The current proposal will not touch $. I agree that most intentional
partial matching uses $ (hopefully only during interactive sessions).
The main benefit of our proposed change is more reliable package code.
For long lists and certain patterns of use, there are also performance
benefits:

  > kk <- paste("abc", 1:(1e6), sep="")
  > vv = as.list(1:(1e6))
  > names(vv) = kk

  > system.time(vv[["fooo", exact=FALSE]])
     user  system elapsed
    0.074   0.000   0.074

  > system.time(vv[["fooo", exact=TRUE]])
     user  system elapsed
    0.042   0.000   0.042

> It might be nice to have an option that made x$partial warn so we
> would fix code that relied on partial matching, but that is lower
> priority.

I think that could be useful as well. To digress a bit further in
discussing $... I think the argument that partial matching is
desirable because it saves typing during interactive sessions now has
a lot less weight. The recent integration of the completion code
gives less typing and complete names.

+ seth

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org
Re: [Rd] RFC: adding an 'exact' argument to [[
On Thu, 17 May 2007, Seth Falcon wrote:

> Bill Dunlap <[EMAIL PROTECTED]> writes:
>> This sounds interesting. Do you intend to leave the $
>> operator alone, so it will continue to do partial
>> matching? I suspect that that is where the majority
>> of partial matching for list names is done.
>
> The current proposal will not touch $. I agree that most intentional
> partial matching uses $ (hopefully only during interactive sessions).
> The main benefit of our proposed change is more reliable package
> code. For long lists and certain patterns of use, there are also
> performance benefits:
>
>> kk <- paste("abc", 1:(1e6), sep="")
>> vv = as.list(1:(1e6))
>> names(vv) = kk
>
>> system.time(vv[["fooo", exact=FALSE]])
>    user  system elapsed
>   0.074   0.000   0.074
>
>> system.time(vv[["fooo", exact=TRUE]])
>    user  system elapsed
>   0.042   0.000   0.042
>
>
>> It might be nice to have an option that made x$partial warn so we
>> would fix code that relied on partial matching, but that is lower
>> priority.
>
> I think that could be useful as well. To digress a bit further in
> discussing $... I think the argument that partial matching is
> desirable because it saves typing during interactive sessions now has
> a lot less weight. The recent integration of the completion code
> gives less typing and complete names.

There is a similar issue with argument partial matching. Since we have
the source of R one can pretty easily build a version of R which does not
have the feature: I have been doing that in conjunction with 'codetools'
to do some checking.

In both cases there is traditional partial matching: seq(along=) or
seq(length=), and $fitted vs $fitted.values. There are not many uses of
seq(along.with=) about and vastly more of seq(along=) (although in R using
seq_along() is preferable): even in some packages which do use
seq(along.with=) there are more instances of seq(along=).

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
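(Small examples of the traditional partial matching mentioned above:
abbreviated argument names and abbreviated list names.)

  x <- c(10, 20, 30)
  seq(along = x)    # 'along' partially matches seq()'s 'along.with'
  # [1] 1 2 3
  seq_along(x)      # preferred: exact, and not dependent on partial matching

  fit <- lm(dist ~ speed, data = cars)
  identical(fit$fitted, fit$fitted.values)  # $fitted partial-matches
  # [1] TRUE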
Re: [Rd] RFC: adding an 'exact' argument to [[
On 5/17/2007 3:54 PM, Prof Brian Ripley wrote:
> On Thu, 17 May 2007, Seth Falcon wrote:
>
>> Bill Dunlap <[EMAIL PROTECTED]> writes:
>>> This sounds interesting. Do you intend to leave the $
>>> operator alone, so it will continue to do partial
>>> matching? I suspect that that is where the majority
>>> of partial matching for list names is done.
>>
>> The current proposal will not touch $. I agree that most intentional
>> partial matching uses $ (hopefully only during interactive sessions).
>> The main benefit of our proposed change is more reliable package
>> code. For long lists and certain patterns of use, there are also
>> performance benefits:
>>
>>> kk <- paste("abc", 1:(1e6), sep="")
>>> vv = as.list(1:(1e6))
>>> names(vv) = kk
>>
>>> system.time(vv[["fooo", exact=FALSE]])
>>    user  system elapsed
>>   0.074   0.000   0.074
>>
>>> system.time(vv[["fooo", exact=TRUE]])
>>    user  system elapsed
>>   0.042   0.000   0.042
>>
>>
>>> It might be nice to have an option that made x$partial warn so we
>>> would fix code that relied on partial matching, but that is lower
>>> priority.
>>
>> I think that could be useful as well. To digress a bit further in
>> discussing $... I think the argument that partial matching is
>> desirable because it saves typing during interactive sessions now has
>> a lot less weight. The recent integration of the completion code
>> gives less typing and complete names.
>
> There is a similar issue with argument partial matching. Since we have
> the source of R one can pretty easily build a version of R which does not
> have the feature: I have been doing that in conjunction with 'codetools'
> to do some checking.
>
> In both cases there is traditional partial matching: seq(along=) or
> seq(length=), and $fitted vs $fitted.values. There are not many uses of
> seq(along.with=) about and vastly more of seq(along=) (although in R using
> seq_along() is preferable): even in some packages which do use
> seq(along.with=) there are more instances of seq(along=).

Opinions, please:

In another thread I think we have agreement to add an extra arg to the
vignette() function to limit it to attached packages. By analogy with
other similar functions, the arg would be named all.available. However,
I suspect most users would abbreviate that to just "all". Should I name
it "all.available" for consistency, or "all" in anticipation of a day
when exact argument matching will be required?

Duncan Murdoch
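(The abbreviation in question, sketched with a hypothetical function;
today partial argument matching accepts it, but it would break if exact
matching of argument names were ever required.)

  f <- function(all.available = FALSE) all.available
  f(all = TRUE)    # 'all' partially matches 'all.available'
  # [1] TRUE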
Re: [Rd] Unexpected alteration of data frame column names
Hi,

Thanks to both for your answers!

Quoting Marc Schwartz <[EMAIL PROTECTED]>:

> On Thu, 2007-05-17 at 10:54 +0100, Prof Brian Ripley wrote:
> > To add to Marc's detective work. ?"[.data.frame" does say
> >
> >      If '[' returns a data frame it will have unique (and non-missing)
> >      row names, if necessary transforming the row names using
> >      'make.unique'. Similarly, column names will be transformed (if
> >      columns are selected more than once).
> >
> > Now, an 'e.g.' in the parenthetical remark might make this clearer (since
> > added), but I don't see why this was 'unexpected' (or why this is an issue

It all depends on whether you care about consistency or not.
Personally I do. Yes, documenting inconsistencies is better than
nothing, but it is not always enough to make the language predictable
(see below).

So, according to ?"[.data.frame", column names will be transformed (if
columns are selected more than once). OK.

Personally, I can see only 2 reasonable semantics for 'df[ ]' or 'df[ , ]':

  (1) either it makes an exact copy of your data frame (and this is not
      only true for data frames: unless documented otherwise one can
      expect x[] to be the same as x),

  (2) or you consider that it is equivalent to 'df[names(df)]' for the
      former and to 'df[ , names(df)]' for the latter.

So it seems that for 'df[ ]', we have semantic (1):

  > df=data.frame(aa=LETTERS[1:3],bb=3:5,aa=7:5,check.names=FALSE)
  > df
    aa bb aa
  1  A  3  7
  2  B  4  6
  3  C  5  5
  > df[]
    aa bb aa
  1  A  3  7
  2  B  4  6
  3  C  5  5

Since we have duplicated colnames, 'df[names(df)]' will select the
first column twice and rename it (as documented):

  > df[names(df)]
    aa bb aa.1
  1  A  3    A
  2  B  4    B
  3  C  5    C

Good! Now with 'df[ , ]', I still maintain that this is unexpected:

  > df[ , ]
    aa bb aa.1
  1  A  3    7
  2  B  4    6
  3  C  5    5

This is a mix of semantic (1) and semantic (2): the 3rd column has been
renamed but its data are the _original_ data. With semantic (2), you
would get this:

  > df[ , names(df)]
    aa bb aa.1
  1  A  3    A
  2  B  4    B
  3  C  5    C

Also the fact that 'df[something]' doesn't behave like 'df[,something]'
is IMHO another inconsistency...

Hope you don't mind if I put this back on R-devel which is probably the
right place to discuss the language semantics.

Cheers,
H.

> > for R-devel).
> >
> > On Tue, 15 May 2007, Marc Schwartz wrote:
> >
> > > On Mon, 2007-05-14 at 23:59 -0700, Herve Pages wrote:
> > >> Hi,
> > >>
> > >> I'm using data.frame(..., check.names=FALSE), because I want to create
> > >> a data frame with duplicated column names (in the real life you can get such
> > >> data frame as the result of an SQL query):
> >
> > That depends on the interface you are using.
> >
> > >> > df <- data.frame(aa=1:5, aa=9:5, check.names=FALSE)
> > >> > df
> > >>   aa aa
> > >> 1  1  9
> > >> 2  2  8
> > >> 3  3  7
> > >> 4  4  6
> > >> 5  5  5
> > >>
> > >> Why is [.data.frame changing my column names?
> > >>
> > >> > df[1:3, ]
> > >>   aa aa.1
> > >> 1  1    9
> > >> 2  2    8
> > >> 3  3    7
> > >>
> > >> How can this be avoided? Thanks!
> > >>
> > >> H.
> > >
> > > Herve,
> > >
> > > I had not seen a reply to your post, but you can review the code for
> > > "[.data.frame" by using:
> > >
> > >   getAnywhere("[.data.frame")
> > >
> > > and see where there are checks for duplicate column names in the
> > > function.
> > >
> > > That is going to be the default behavior for data frame
> > > subsetting/extraction and in fact is noted in the 'ONEWS' file for R
> > > version 1.8.0:
> > >
> > >     - Subsetting a data frame can no longer produce duplicate
> > >       column names.
> > >
> > > So it has been around for some time (October of 2003).
> > >
> > > In terms of avoiding it, I suspect that you would have to create your
> > > own version of the function, perhaps with an additional argument that
> > > enables/disables that duplicate column name checks.
> > >
> > > I have not however considered the broader functional implications of
> > > doing so however, so be vewwy vewwy careful here.
> >
> > Namespace issues would mean that your version would hardly ever be used.
>
> I suspected that namespaces might be an issue here, but had not pursued
> that line of thinking beyond an initial 'gut feel'.
>
> Thanks,
>
> Marc
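(One way, not suggested in the thread, to keep duplicated column names
across a row-only subset: reassign the original names afterwards.)

  df <- data.frame(aa = LETTERS[1:3], bb = 3:5, aa = 7:5, check.names = FALSE)
  res <- df[1:2, ]
  names(res) <- names(df)   # restore "aa", "bb", "aa"
  res
  #   aa bb aa
  # 1  A  3  7
  # 2  B  4  6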