Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
> strncasecmp is not standard C (not even C99), but R does have a substitute
> for it. Unfortunately strncasecmp is not usable with multibyte charsets:
> Linux systems have wcsncasecmp but that is not portable. In these days of
> widespread use of UTF-8 that is a blocking issue, I am afraid.

What could help are the functions mbrtowc and towctrans and simple
long integer comparison. Are the functions mbrtowc and towctrans
available under Windows? mbrtowc seems to be available as Rmbrtowc
in src/gnuwin32/extra.c.

I did not find towctrans defined in R sources, but it is in
gnuwin32/Rdll.hide and used in do_tolower. Does this mean that tolower
is not usable with UTF-8 under Windows?

> In the case of grep I think all you need is
>
>   grep(tolower(pattern), tolower(x), fixed = TRUE)
>
> and similarly for regexpr.

Yes, this is correct, but it has disadvantages. It needs more space
and, if value = TRUE, we would have to do something like

  x[grep(tolower(pattern), tolower(x), fixed = TRUE, value = FALSE)]

This is hard to implement in src/library/base/R/grep.R, where the call
to .Internal(grep(pattern, ...)) is the last command, and I think this
should be preserved.

> > Ignore case option is not meaningful in gsub.
>
>   sub("abc", "123", c("ABCD", "abcd"), ignore.case = TRUE)
>
> is different from 'ignore.case = FALSE', and I see the meaning as clear.
> So what did you mean? (Unfortunately the tolower trick does not work for
> [g]sub.)

The meaning of ignore.case in [g]sub is problematic for the following
reason:

  sub("abc", "xyz", c("ABCD", "abcd"), ignore.case = TRUE)

produces

  [1] "xyzD" "xyzd"

but the user may in fact need

  [1] "XYZD" "xyzd"

It is correct that "xyzD" "xyzd" is produced, but the user should be
aware that several substitutions like

  x <- sub("abc", "xyz", c("ABCD", "abcd"))  # ignore.case = FALSE
  sub("ABC", "XYZ", x)                       # ignore.case = FALSE

may be more useful.

I have another question concerning the speed of grep. I expected that
the fgrep_one function is slower than calling a library routine for
regular expressions. In particular, if the pattern has a lot of long
partial matches in the target string, I expected that it may be much
slower. A short example is

  y <- "ab"
  x <- "aaab"
  grep(y, x)

which requires 110 comparisons (10 comparisons for each of 11 possible
beginnings of y in x). In general, the complexity in the worst case is
O(m*n), where m, n are the lengths of y, x respectively. I would expect
that the library function for matching regular expressions needs time
O(m+n) and so will be faster. However, the result obtained on a larger
example is

  > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
  > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
  > y <- paste(c(rep("a", times = 1), x2), collapse = "")
  > z <- rep(y, times = 100)

  > system.time(i <- grep(x1, z, fixed = T))
  [1] 1.970 0.000 1.985 0.000 0.000

  > system.time(i <- grep(x1, z, fixed = F))  # reg. expr. surprisingly slow (*)
  [1] 40.374 0.003 40.381 0.000 0.000

  > system.time(i <- grep(x2, z, fixed = T))
  [1] 0.113 0.000 0.113 0.000 0.000

  > system.time(i <- grep(x2, z, fixed = F))  # reg. expr. faster than fgrep_one
  [1] 0.019 0.000 0.019 0.000 0.000

Do you have an explanation of these results, in particular (*)?

Petr.
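(A minimal sketch of the value = TRUE workaround described above; the
helper name grep_ci_fixed is made up for illustration and is not part
of R.)

  ## Case-insensitive fixed-string matching via the tolower() trick.
  ## Indices are computed on lower-cased copies and then used to subset
  ## the original vector, so value = TRUE returns the original strings.
  grep_ci_fixed <- function(pattern, x, value = FALSE) {
    i <- grep(tolower(pattern), tolower(x), fixed = TRUE)
    if (value) x[i] else i
  }

  grep_ci_fixed("abc", c("ABCD", "xyz", "abcd"), value = TRUE)
  # [1] "ABCD" "abcd"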
Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
On Thu, 17 May 2007, Petr Savicky wrote:

>> strncasecmp is not standard C (not even C99), but R does have a substitute
>> for it. Unfortunately strncasecmp is not usable with multibyte charsets:
>> Linux systems have wcsncasecmp but that is not portable. In these days of
>> widespread use of UTF-8 that is a blocking issue, I am afraid.
>
> What could help are the functions mbrtowc and towctrans and simple
> long integer comparison. Are the functions mbrtowc and towctrans
> available under Windows? mbrtowc seems to be available as Rmbrtowc
> in src/gnuwin32/extra.c.
>
> I did not find towctrans defined in R sources, but it is in
> gnuwin32/Rdll.hide

I don't see it in Rdll.hide. It is a C99 function (see your unix man page).

> and used in do_tolower. Does this mean that tolower is not usable
> with utf-8 under Windows?

UTF-8 is not usable under Windows, but tolower works in Windows DBCS
(in so far as that makes sense: Chinese chars do not have 'case').
Rmbrtowc reflects an attempt to add UTF-8 support on Windows, but that
is not currently active.

>> In the case of grep I think all you need is
>>
>>   grep(tolower(pattern), tolower(x), fixed = TRUE)
>>
>> and similarly for regexpr.
>
> Yes, this is correct, but it has disadvantages. It needs more
> space and, if value = TRUE, we would have to do something like
>   x[grep(tolower(pattern), tolower(x), fixed = TRUE, value = FALSE)]
> This is hard to implement in src/library/base/R/grep.R,
> where the call to .Internal(grep(pattern, ...)) is the last command
> and I think this should be preserved.
>
>>> Ignore case option is not meaningful in gsub.
>>
>>   sub("abc", "123", c("ABCD", "abcd"), ignore.case = TRUE)
>>
>> is different from 'ignore.case = FALSE', and I see the meaning as clear.
>> So what did you mean? (Unfortunately the tolower trick does not work for
>> [g]sub.)
>
> The meaning of ignore.case in [g]sub is problematic due to the following.
>   sub("abc", "xyz", c("ABCD", "abcd"), ignore.case = TRUE)
> produces
>   [1] "xyzD" "xyzd"
> but the user may in fact need the following
>   [1] "XYZD" "xyzd"

He may, but that is not what 'ignore case' means, more like 'case honouring'.

> It is correct that "xyzD" "xyzd" is produced, but the user
> should be aware of the fact that several substitutions like
>   x <- sub("abc", "xyz", c("ABCD", "abcd"))  # ignore.case = FALSE
>   sub("ABC", "XYZ", x)                       # ignore.case = FALSE
> may be more useful.
>
> I have another question concerning the speed of grep. I expected that
> the fgrep_one function is slower than calling a library routine
> for regular expressions. In particular, if the pattern has a lot of
> long partial matches in the target string, I expected that it may be much
> slower. A short example is
>   y <- "ab"
>   x <- "aaab"
>   grep(y, x)
> which requires 110 comparisons (10 comparisons for each of 11 possible
> beginnings of y in x). In general, the complexity in the worst case is
> O(m*n), where m, n are the lengths of y, x respectively. I would expect
> that the library function for matching regular expressions needs
> time O(m+n) and so will be faster. However, the result obtained
> on a larger example is
>
> > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
> > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
> > y <- paste(c(rep("a", times = 1), x2), collapse = "")
> > z <- rep(y, times = 100)
>
> > system.time(i <- grep(x1, z, fixed = T))
> [1] 1.970 0.000 1.985 0.000 0.000
>
> > system.time(i <- grep(x1, z, fixed = F))  # reg. expr. surprisingly slow (*)
> [1] 40.374 0.003 40.381 0.000 0.000
>
> > system.time(i <- grep(x2, z, fixed = T))
> [1] 0.113 0.000 0.113 0.000 0.000
>
> > system.time(i <- grep(x2, z, fixed = F))  # reg. expr. faster than fgrep_one
> [1] 0.019 0.000 0.019 0.000 0.000
>
> Do you have an explanation of these results, in particular (*)?

Yes, there is a comment on the help page to that effect. But these are
highly atypical uses. Try perl = TRUE, and be aware that the locale
matters a lot in such tests (via the charset).

No one is attempting to make R a fast string-processing language and so
developers' resources are spent on performance where it matters to more
typical usage. (E.g. reducing duplication in as.double and friends
speeds up just about every R session, and speeds up some numerical
sessions dramatically.)

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
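(For reference, a re-run of the slow case with perl = TRUE, as suggested
above; this is only a sketch of the comparison, with no claimed timings,
since results depend heavily on the machine and the locale/charset.)

  x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
  x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
  z  <- rep(paste(c("a", x2), collapse = ""), times = 100)

  system.time(grep(x1, z, fixed = TRUE))  # fgrep_one
  system.time(grep(x1, z))                # default regex engine (the slow case)
  system.time(grep(x1, z, perl = TRUE))   # PCRE engine, for comparison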
[Rd] RFC: adding an 'exact' argument to [[
Hi all,

One of the things I find most problematic in R is the partial matching
of names in lists. Robert and I have discussed this and we believe
that having a mechanism that does not do partial matching would be of
significant benefit to R programmers. To that end, I have written a
patch that modifies the behavior of "[[" as follows:

   1. [[ gains an 'exact' argument with default value NA

   2. Behavior of 'exact' argument:

      exact=NA
          partial matching is performed as usual, however, a warning
          will be issued when a partial match occurs. This is the
          default.

      exact=TRUE
          no partial matching is performed.

      exact=FALSE
          partial matching is allowed and no warning issued if it
          occurs.

This change has been discussed among R-core members and there appeared
to be a general consensus that this approach was a good way to proceed.
However, we are interested in other suggestions from the broader R
developer community.

Some additional rationale for our approach:

Lists are used as the underlying data structures in many R programs
and in these cases the named elements are not a fixed set of things
with a fixed set of names. For these programs, [[ will be used with
an argument that gets evaluated at runtime and partial matching here
is almost always a disaster. Furthermore, dealing with data that has
common prefixes happens often and is not an exceptional circumstance
(a precondition for partial matching issues).

We have tested a similar patch that simply eliminated partial matching
for [[ on all CRAN and Bioconductor packages and did not see any
obvious failures.

A downside of this approach is that S4 methods on [[ will need to be
modified to accommodate the new signature. However, by adding an
argument, we are able to move more slowly towards a non-partially
matching [[ (eventually, the default could be exact=TRUE, but that is
a discussion for another day).

+ seth

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org
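(An illustration of the partial matching the proposal addresses, as [[
behaved at the time of this thread (R 2.5.x); the exact= calls shown in
comments assume the proposed patch.)

  opts <- list(fitted.values = 1:3, residuals = c(0.1, -0.2, 0.1))
  opts[["fit"]]    # silently partial-matches opts$fitted.values
  # [1] 1 2 3

  ## Under the proposed patch:
  ## opts[["fit"]]                  # would warn about the partial match (exact = NA)
  ## opts[["fit", exact = TRUE]]    # no partial matching: returns NULL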
[Rd] model.matrix bug? Nested factor yields singular design matrix.
Hi all,

I believe this is a bug in the model.matrix function. I'd like a
second opinion before filing a bug report.

If I have a nested covariate B with multiple values for just one level
of A, I can not get a non-singular design matrix out of model.matrix.

  > df <- data.frame(A = factor(c("a", "a", "x", "x"), levels = c("x", "a")),
  +                  B = factor(c("b", "x", "x", "x"), levels = c("x", "b")))
  > df
    A B
  1 a b
  2 a x
  3 x x
  4 x x

So of course the full design matrix is singular, this is expected.

  > model.matrix(~ A * B, df)
    (Intercept) Aa Bb Aa:Bb
  1           1  1  1     1
  2           1  1  0     0
  3           1  0  0     0
  4           1  0  0     0
  attr(,"assign")
  [1] 0 1 2 3
  attr(,"contrasts")
  attr(,"contrasts")$A
  [1] "contr.treatment"
  attr(,"contrasts")$B
  [1] "contr.treatment"

I'd like to drop the B main effect column, but get the unexpected
result of a column of zeroes.

  > model.matrix(~ A * B - B, df)
    (Intercept) Aa Ax:Bb Aa:Bb
  1           1  1     0     1
  2           1  1     0     0
  3           1  0     0     0
  4           1  0     0     0
  attr(,"assign")
  [1] 0 1 2 2
  attr(,"contrasts")
  attr(,"contrasts")$A
  [1] "contr.treatment"
  attr(,"contrasts")$B
  [1] "contr.treatment"

This does not happen in S-PLUS.

  > info()
  S info file C:\DOCUME~1\kilroy\LOCALS~1\Temp\S0107EB.tmp will be removed at session end
  $Sinfo:
  Enterprise Developer Version 7.0.6 for Microsoft Windows : 2005

  SHOME: C:/PROGRAMFILES/INSIGHTFUL/splus70
  prog.name: SPLUS.EXE
  load.date: Sun Dec 04 23:15:59 2005
  date: Thu May 17 07:38:16 PDT 2007

  > options(contrasts = c("contr.treatment", "contr.poly"))
  > df <- data.frame(A = factor(c("a", "a", "x", "x"), levels = c("x", "a")),
  +                  B = factor(c("b", "x", "x", "x"), levels = c("x", "b")))
  > model.matrix(~ A * B - B, df)
    (Intercept) A A:B
  1           1 1   1
  2           1 1   0
  3           1 0   0
  4           1 0   0

This is what I was expecting to get in R, but can not.

Alternate specifications in R continue to yield a singular design matrix:

  > model.matrix(~ A/B, df)
    (Intercept) Aa Ax:Bb Aa:Bb
  1           1  1     0     1
  2           1  1     0     0
  3           1  0     0     0
  4           1  0     0     0
  attr(,"assign")
  [1] 0 1 2 2
  attr(,"contrasts")
  attr(,"contrasts")$A
  [1] "contr.treatment"
  attr(,"contrasts")$B
  [1] "contr.treatment"

  > model.matrix(~ A + A:B, df)
    (Intercept) Aa Ax:Bb Aa:Bb
  1           1  1     0     1
  2           1  1     0     0
  3           1  0     0     0
  4           1  0     0     0
  attr(,"assign")
  [1] 0 1 2 2
  attr(,"contrasts")
  attr(,"contrasts")$A
  [1] "contr.treatment"
  attr(,"contrasts")$B
  [1] "contr.treatment"

Why is the Ax:Bb column being included? Have I missed a control
parameter or some other way of specifying to model.matrix not to
include this extra column?

Any feedback appreciated.

Best regards

Steven McKinney
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

email: [EMAIL PROTECTED]
tel: 604-675-8000 x7561

BCCRC Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C.
V5Z 1L3
Canada
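(One possible workaround, not from the original post: build the nested
design and then drop any columns that are identically zero. Note that
plain matrix subsetting discards the "assign" and "contrasts"
attributes.)

  df <- data.frame(A = factor(c("a", "a", "x", "x"), levels = c("x", "a")),
                   B = factor(c("b", "x", "x", "x"), levels = c("x", "b")))
  mm <- model.matrix(~ A/B, df)
  mm <- mm[, colSums(mm != 0) > 0, drop = FALSE]  # removes the empty Ax:Bb column
  mm
  #   (Intercept) Aa Aa:Bb
  # 1           1  1     1
  # 2           1  1     0
  # 3           1  0     0
  # 4           1  0     0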
Re: [Rd] model.matrix bug? Nested factor yields singular design matrix.
Apologies - I forgot the session info.

  > sessionInfo()
  R version 2.5.0 (2007-04-23)
  powerpc-apple-darwin8.9.1

  locale:
  en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

  attached base packages:
  [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"

  other attached packages:
         lme4      Matrix     lattice
  "0.99875-0" "0.99875-1"   "0.15-5"
Re: [Rd] RFC: adding an 'exact' argument to [[
On Thu, 17 May 2007, Seth Falcon wrote:

> One of the things I find most problematic in R is the partial matching
> of names in lists. Robert and I have discussed this and we believe
> that having a mechanism that does not do partial matching would be of
> significant benefit to R programmers. To that end, I have written a
> patch that modifies the behavior of "[[" as follows:
>
>    1. [[ gains an 'exact' argument with default value NA
>
>    2. Behavior of 'exact' argument:
>
>       exact=NA
>           partial matching is performed as usual, however, a warning
>           will be issued when a partial match occurs. This is the
>           default.
>
>       exact=TRUE
>           no partial matching is performed.
>
>       exact=FALSE
>           partial matching is allowed and no warning issued if it
>           occurs.
>
> This change has been discussed among R-core members and there appeared
> to be a general consensus that this approach was a good way to
> proceed. However, we are interested in other suggestions from the
> broader R developer community.
>
> Some additional rationale for our approach:
>
> Lists are used as the underlying data structures in many R programs
> and in these cases the named elements are not a fixed set of things
> with a fixed set of names. For these programs, [[ will be used with
> an argument that gets evaluated at runtime and partial matching here
> is almost always a disaster. Furthermore, dealing with data that has
> common prefixes happens often and is not an exceptional circumstance
> (a precondition for partial matching issues).

This sounds interesting. Do you intend to leave the $ operator alone,
so it will continue to do partial matching? I suspect that that is
where the majority of partial matching for list names is done.

It might be nice to have an option that made x$partial warn so we
would fix code that relied on partial matching, but that is lower
priority.

Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

"All statements in this message represent the opinions of the author and do
not necessarily reflect Insightful Corporation policy or position."
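(A small example of the $ partial matching referred to above, which the
proposal leaves untouched.)

  x <- list(partial.sums = 1:3)
  x$part     # silently partial-matches x$partial.sums
  # [1] 1 2 3
  x$parts    # not a prefix of any name: NULL, again silently
  # NULL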
Re: [Rd] RFC: adding an 'exact' argument to [[
Bill Dunlap <[EMAIL PROTECTED]> writes:

> This sounds interesting. Do you intend to leave the $
> operator alone, so it will continue to do partial
> matching? I suspect that that is where the majority
> of partial matching for list names is done.

The current proposal will not touch $. I agree that most intentional
partial matching uses $ (hopefully only during interactive sessions).
The main benefit of our proposed change is more reliable package code.
For long lists and certain patterns of use, there are also performance
benefits:

  > kk <- paste("abc", 1:(1e6), sep="")
  > vv = as.list(1:(1e6))
  > names(vv) = kk

  > system.time(vv[["fooo", exact=FALSE]])
     user  system elapsed
    0.074   0.000   0.074

  > system.time(vv[["fooo", exact=TRUE]])
     user  system elapsed
    0.042   0.000   0.042

> It might be nice to have an option that made x$partial warn so we
> would fix code that relied on partial matching, but that is lower
> priority.

I think that could be useful as well. To digress a bit further in
discussing $... I think the argument that partial matching is
desirable because it saves typing during interactive sessions now has
a lot less weight. The recent integration of the completion code
gives less typing and complete names.

+ seth

--
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org
Re: [Rd] RFC: adding an 'exact' argument to [[
On Thu, 17 May 2007, Seth Falcon wrote:

> Bill Dunlap <[EMAIL PROTECTED]> writes:
>> This sounds interesting. Do you intend to leave the $
>> operator alone, so it will continue to do partial
>> matching? I suspect that that is where the majority
>> of partial matching for list names is done.
>
> The current proposal will not touch $. I agree that most intentional
> partial matching uses $ (hopefully only during interactive sessions).
> The main benefit of our proposed change is more reliable package
> code. For long lists and certain patterns of use, there are also
> performance benefits:
>
>> kk <- paste("abc", 1:(1e6), sep="")
>> vv = as.list(1:(1e6))
>> names(vv) = kk
>
>> system.time(vv[["fooo", exact=FALSE]])
>    user  system elapsed
>   0.074   0.000   0.074
>
>> system.time(vv[["fooo", exact=TRUE]])
>    user  system elapsed
>   0.042   0.000   0.042
>
>
>> It might be nice to have an option that made x$partial warn so we
>> would fix code that relied on partial matching, but that is lower
>> priority.
>
> I think that could be useful as well. To digress a bit further in
> discussing $... I think the argument that partial matching is
> desirable because it saves typing during interactive sessions now has
> a lot less weight. The recent integration of the completion code
> gives less typing and complete names.

There is a similar issue with argument partial matching. Since we have
the source of R one can pretty easily build a version of R which does not
have the feature: I have been doing that in conjunction with 'codetools'
to do some checking.

In both cases there is traditional partial matching: seq(along=) or
seq(length=), and $fitted vs $fitted.values. There are not many uses of
seq(along.with=) about and vastly more of seq(along=) (although in R using
seq_along() is preferable): even in some packages which do use
seq(along.with=) there are more instances of seq(along=).

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
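(Small examples of the traditional partial matching mentioned above:
abbreviated argument names and abbreviated list names.)

  x <- c(10, 20, 30)
  seq(along = x)    # 'along' partially matches seq()'s 'along.with'
  # [1] 1 2 3
  seq_along(x)      # preferred: exact, and not dependent on partial matching

  fit <- lm(dist ~ speed, data = cars)
  identical(fit$fitted, fit$fitted.values)  # $fitted partial-matches
  # [1] TRUE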
Re: [Rd] RFC: adding an 'exact' argument to [[
On 5/17/2007 3:54 PM, Prof Brian Ripley wrote:
> On Thu, 17 May 2007, Seth Falcon wrote:
>
>> Bill Dunlap <[EMAIL PROTECTED]> writes:
>>> This sounds interesting. Do you intend to leave the $
>>> operator alone, so it will continue to do partial
>>> matching? I suspect that that is where the majority
>>> of partial matching for list names is done.
>>
>> The current proposal will not touch $. I agree that most intentional
>> partial matching uses $ (hopefully only during interactive sessions).
>> The main benefit of our proposed change is more reliable package
>> code. For long lists and certain patterns of use, there are also
>> performance benefits:
>>
>>> kk <- paste("abc", 1:(1e6), sep="")
>>> vv = as.list(1:(1e6))
>>> names(vv) = kk
>>
>>> system.time(vv[["fooo", exact=FALSE]])
>>    user  system elapsed
>>   0.074   0.000   0.074
>>
>>> system.time(vv[["fooo", exact=TRUE]])
>>    user  system elapsed
>>   0.042   0.000   0.042
>>
>>
>>> It might be nice to have an option that made x$partial warn so we
>>> would fix code that relied on partial matching, but that is lower
>>> priority.
>>
>> I think that could be useful as well. To digress a bit further in
>> discussing $... I think the argument that partial matching is
>> desirable because it saves typing during interactive sessions now has
>> a lot less weight. The recent integration of the completion code
>> gives less typing and complete names.
>
> There is a similar issue with argument partial matching. Since we have
> the source of R one can pretty easily build a version of R which does not
> have the feature: I have been doing that in conjunction with 'codetools'
> to do some checking.
>
> In both cases there is traditional partial matching: seq(along=) or
> seq(length=), and $fitted vs $fitted.values. There are not many uses of
> seq(along.with=) about and vastly more of seq(along=) (although in R using
> seq_along() is preferable): even in some packages which do use
> seq(along.with=) there are more instances of seq(along=).

Opinions, please:

In another thread I think we have agreement to add an extra arg to the
vignette() function to limit it to attached packages. By analogy with
other similar functions, the arg would be named all.available. However,
I suspect most users would abbreviate that to just "all". Should I name
it "all.available" for consistency, or "all" in anticipation of a day
when exact argument matching will be required?

Duncan Murdoch
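(The abbreviation in question, sketched with a hypothetical function;
today partial argument matching accepts it, but it would break if exact
matching of argument names were ever required.)

  f <- function(all.available = FALSE) all.available
  f(all = TRUE)    # 'all' partially matches 'all.available'
  # [1] TRUE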
Re: [Rd] Unexpected alteration of data frame column names
Hi,

Thanks to both for your answers!

Quoting Marc Schwartz <[EMAIL PROTECTED]>:

> On Thu, 2007-05-17 at 10:54 +0100, Prof Brian Ripley wrote:
> > To add to Marc's detective work. ?"[.data.frame" does say
> >
> >      If '[' returns a data frame it will have unique (and non-missing)
> >      row names, if necessary transforming the row names using
> >      'make.unique'. Similarly, column names will be transformed (if
> >      columns are selected more than once).
> >
> > Now, an 'e.g.' in the parenthetical remark might make this clearer (since
> > added), but I don't see why this was 'unexpected' (or why this is an issue

It all depends on whether you care about consistency or not.
Personally I do. Yes, documenting inconsistencies is better than
nothing, but it is not always enough to make the language predictable
(see below).

So, according to ?"[.data.frame", column names will be transformed (if
columns are selected more than once). OK.

Personally, I can see only 2 reasonable semantics for 'df[ ]' or 'df[ , ]':

  (1) either it makes an exact copy of your data frame (and this is not
      only true for data frames: unless documented otherwise one can
      expect x[] to be the same as x),

  (2) or you consider that it is equivalent to 'df[names(df)]' for the
      former and to 'df[ , names(df)]' for the latter.

So it seems that for 'df[ ]', we have semantic (1):

  > df=data.frame(aa=LETTERS[1:3],bb=3:5,aa=7:5,check.names=FALSE)
  > df
    aa bb aa
  1  A  3  7
  2  B  4  6
  3  C  5  5
  > df[]
    aa bb aa
  1  A  3  7
  2  B  4  6
  3  C  5  5

Since we have duplicated colnames, 'df[names(df)]' will select the
first column twice and rename it (as documented):

  > df[names(df)]
    aa bb aa.1
  1  A  3    A
  2  B  4    B
  3  C  5    C

Good! Now with 'df[ , ]', I still maintain that this is unexpected:

  > df[ , ]
    aa bb aa.1
  1  A  3    7
  2  B  4    6
  3  C  5    5

This is a mix of semantic (1) and semantic (2): the 3rd column has been
renamed but its data are the _original_ data. With semantic (2), you
would get this:

  > df[ , names(df)]
    aa bb aa.1
  1  A  3    A
  2  B  4    B
  3  C  5    C

Also the fact that 'df[something]' doesn't behave like 'df[,something]'
is IMHO another inconsistency...

Hope you don't mind if I put this back on R-devel which is probably the
right place to discuss the language semantics.

Cheers,
H.

> > for R-devel).
> >
> > On Tue, 15 May 2007, Marc Schwartz wrote:
> >
> > > On Mon, 2007-05-14 at 23:59 -0700, Herve Pages wrote:
> > >> Hi,
> > >>
> > >> I'm using data.frame(..., check.names=FALSE), because I want to create
> > >> a data frame with duplicated column names (in the real life you can get such
> > >> data frame as the result of an SQL query):
> >
> > That depends on the interface you are using.
> >
> > >> > df <- data.frame(aa=1:5, aa=9:5, check.names=FALSE)
> > >> > df
> > >>   aa aa
> > >> 1  1  9
> > >> 2  2  8
> > >> 3  3  7
> > >> 4  4  6
> > >> 5  5  5
> > >>
> > >> Why is [.data.frame changing my column names?
> > >>
> > >> > df[1:3, ]
> > >>   aa aa.1
> > >> 1  1    9
> > >> 2  2    8
> > >> 3  3    7
> > >>
> > >> How can this be avoided? Thanks!
> > >>
> > >> H.
> > >
> > > Herve,
> > >
> > > I had not seen a reply to your post, but you can review the code for
> > > "[.data.frame" by using:
> > >
> > >   getAnywhere("[.data.frame")
> > >
> > > and see where there are checks for duplicate column names in the
> > > function.
> > >
> > > That is going to be the default behavior for data frame
> > > subsetting/extraction and in fact is noted in the 'ONEWS' file for R
> > > version 1.8.0:
> > >
> > >     - Subsetting a data frame can no longer produce duplicate
> > >       column names.
> > >
> > > So it has been around for some time (October of 2003).
> > >
> > > In terms of avoiding it, I suspect that you would have to create your
> > > own version of the function, perhaps with an additional argument that
> > > enables/disables that duplicate column name checks.
> > >
> > > I have not however considered the broader functional implications of
> > > doing so however, so be vewwy vewwy careful here.
> >
> > Namespace issues would mean that your version would hardly ever be used.
>
> I suspected that namespaces might be an issue here, but had not pursued
> that line of thinking beyond an initial 'gut feel'.
>
> Thanks,
>
> Marc
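(One way, not suggested in the thread, to keep duplicated column names
across a row-only subset: reassign the original names afterwards.)

  df <- data.frame(aa = LETTERS[1:3], bb = 3:5, aa = 7:5, check.names = FALSE)
  res <- df[1:2, ]
  names(res) <- names(df)   # restore "aa", "bb", "aa"
  res
  #   aa bb aa
  # 1  A  3  7
  # 2  B  4  6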