Re: [Rd] [R] "[.data.frame" and lapply

2009-03-28 Thread Romain Francois

Wacek Kusnierczyk wrote:

redirected to r-devel, because there are implementational details of
[.data.frame discussed here.  spoiler: at the bottom there is a fairly
interesting performance result.

Romain Francois wrote:
  

Hi,

This is a bug I think. [.data.frame treats its arguments differently
depending on the number of arguments.



you might want to hesitate a bit before you say that something in r is a
bug, if only because it drives certain people mad.  r is a carefully
tested software, and [.data.frame is such a basic function that if what
you talk about were a bug, it wouldn't have persisted until now.
  
I did hesitate, and would be prepared to look the other way if someone 
shows me proper evidence that this makes sense.


> d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> d[ j=1 ]
   x  y  z
1   1  1  1
2   2  2  2
3   3  3  3
4   4  4  4
5   5  5  5
6   6  6  6
7   7  7  7
8   8  8  8
9   9  9  9
10 10 10 10

"If a single index is supplied, it is interpreted as indexing the list 
of columns". Clearly this does not happen here, and this is because 
NextMethod gets confused.


I have not looked at your implementation in detail, but it misses array 
indexing, as in:


> d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> m <- cbind( 5:7, 1:3 )
> m
     [,1] [,2]
[1,]    5    1
[2,]    6    2
[3,]    7    3
> d[m]
[1] 5 6 7
> subdf( d, m )
Error in subdf(d, m) : undefined columns selected

"Matrix indexing using '[' is not recommended, and barely
supported.  For extraction, 'x' is first coerced to a matrix. For
replacement a logical matrix (only) can be used to select the
elements to be replaced in the same way as for a matrix."

You might also want to look at `[<-.data.frame`.

> d[j=2] <- 1:10
Error in `[<-.data.frame`(`*tmp*`, j = 2, value = 1:10) :
 element 1 is empty;
  the part of the args list of 'is.logical' being evaluated was:
  (i)
> d[2] <- 10:1
> d
   x  y  z
1   1 10  1
2   2  9  2
3   3  8  3
4   4  7  4
5   5  6  5
6   6  5  6
7   7  4  7
8   8  3  8
9   9  2  9
10 10  1 10

This is probably less of an issue, because few people are likely to use 
this construct. The first one, however, even if not used directly, still 
has a good chance of being used within some fooapply call, as in the 
original post, although it might have been preferable to use subset as 
the applied function.


Romain

treating the arguments differently depending on their number is actually
(if clearly...) documented:  if there is one index (the 'i'), it selects
columns.  if there are two, 'i' selects rows.

however, not all seems fine, there might be a design flaw:

# dummy data frame
d = structure(names=paste('col', 1:3, sep='.'),
data.frame(row.names=paste('row', 1:3, sep='.'),
   matrix(1:9, 3, 3)))

d[1:2]
# correctly selects the first two columns
# 1:2 passed to [.data.frame as i, no j given

d[,1:2]
# correctly selects the first two columns
# 1:2 passed to [.data.frame as j, i given the missing argument
# value (note the comma)

d[,i=1:2]
# correctly selects the first two rows
# 1:2 passed to [.data.frame as i, j given the missing argument
# value (note the comma)

d[j=1:2,]
# correctly selects the first two columns
# 1:2 passed to [.data.frame as j, i given the missing argument
# value (note the comma)

d[i=1:2]
# correctly (arguably) selects the first two columns
# 1:2 passed to [.data.frame as i, no j given

d[j=1:2]
# wrong: returns the whole data frame
# does not recognize the index as i because it is explicitly named 'j'
# does not recognize the index as j because there is only one index

i say this *might* be a design flaw because it's hard to judge what the
design really is.  the r language definition (!) [1, sec. 3.4.3 p. 18] says:

"   The most important example of a class method for [ is that used for
data frames. It is not
be described in detail here (see the help page for [.data.frame, but in
broad terms, if two
indices are supplied (even if one is empty) it creates matrix-like
indexing for a structure that is
basically a list of vectors of the same length. If a single index is
supplied, it is interpreted as
indexing the list of columns—in that case the drop argument is ignored,
with a warning."

it does not say what happens when only one *named* index argument is
given.  from the above, it would indeed seem that there is a *bug*
here:  in the last example above only one index is given, and yet
columns are not selected, even though the *language definition* says
they should.  (so it's not a documented feature, it's a
contra-definitional misfeature -- a bug?)
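
a toy function may make the three cases easier to tell apart.  this is only a sketch, not the actual [.data.frame source:  the name describe_call is made up, and all it does is report what a function with the signature (x, i, j) can see about its own call.  nargs() counts arguments left blank by a comma, and missing() tells which of i and j were actually supplied.

```r
# Hypothetical helper: reports what a function with the signature
# (x, i, j) can observe about the way it was called.
describe_call <- function(x, i, j) {
  n <- nargs()  # counts supplied arguments, including ones left blank
  c(nargs = n, has_i = !missing(i), has_j = !missing(j))
}

d <- data.frame(x = 1:10, y = 1:10, z = 1:10)

describe_call(d, 1)      # like d[1]:    two arguments, i supplied
describe_call(d, , 1)    # like d[, 1]:  three arguments, only j supplied
describe_call(d, j = 1)  # like d[j=1]:  two arguments, only j supplied
```

in the last call there are two arguments, yet i is missing;  that is the combination which neither the one-index nor the two-index reading seems to anticipate.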

somewhat on the side, the 'matrix-like indexing' above is fairly
misleading;  just try the same patterns of indexing -- one index, two
indices, named indices -- on a data frame and a matrix of the same shape:

m = matrix(1:9, 3, 3)
md = data.frame(m)

md[1]
# the first column
m[1]
# the first element (i.e., m[

Re: [Rd] [R] "[.data.frame" and lapply

2009-03-28 Thread Wacek Kusnierczyk
Romain Francois wrote:
> Wacek Kusnierczyk wrote:
>> redirected to r-devel, because there are implementational details of
>> [.data.frame discussed here.  spoiler: at the bottom there is a fairly
>> interesting performance result.
>>
>> Romain Francois wrote:
>>  
>>> Hi,
>>>
>>> This is a bug I think. [.data.frame treats its arguments differently
>>> depending on the number of arguments.
>>> 
>>
>> you might want to hesitate a bit before you say that something in r is a
>> bug, if only because it drives certain people mad.  r is a carefully
>> tested software, and [.data.frame is such a basic function that if what
>> you talk about were a bug, it wouldn't have persisted until now.
>>   
> I did hesitate, and would be prepared to look the other way if someone
> shows me proper evidence that this makes sense.
>
> > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> > d[ j=1 ]
>    x  y  z
> 1   1  1  1
> 2   2  2  2
> 3   3  3  3
> 4   4  4  4
> 5   5  5  5
> 6   6  6  6
> 7   7  7  7
> 8   8  8  8
> 9   9  9  9
> 10 10 10 10
>
> "If a single index is supplied, it is interpreted as indexing the list
> of columns". Clearly this does not happen here, and this is because
> NextMethod gets confused.

obviously.  it seems that there is a bug here, and that it results from
the lack of clear design specification.

>
> I have not looked at your implementation in detail, but it misses array
> indexing, as in:

yes;  i didn't take it into consideration, but (still without detailed
analysis) i guess it should not be difficult to extend the code to
handle this.



>
> > d <- data.frame( x = 1:10, y = 1:10, z = 1:10 )
> > m <- cbind( 5:7, 1:3 )
> > m
>      [,1] [,2]
> [1,]    5    1
> [2,]    6    2
> [3,]    7    3
> > d[m]
> [1] 5 6 7
> > subdf( d, m )
> Error in subdf(d, m) : undefined columns selected

this should be easy to handle by checking if i is a matrix and then
indexing by its first column as i and the second as j.

>
> "Matrix indexing using '[' is not recommended, and barely
> supported.  For extraction, 'x' is first coerced to a matrix. For
> replacement a logical matrix (only) can be used to select the
> elements to be replaced in the same way as for a matrix."

yes, here's how it's done (original comment):

if(is.matrix(i))
    return(as.matrix(x)[i])  # desperate measures

and i can easily add this to my code, at virtually no additional expense.

it's probably not a good idea to convert x to a matrix, x would often be
much more data than the index matrix m, so it's presumably much more
efficient, on average, to fiddle with i instead.
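
a minimal sketch of what "fiddling with i" could look like, assuming a two-column numeric index matrix whose first column holds row numbers and whose second column holds column numbers.  the name index_by_matrix is hypothetical;  this is neither the code of subdf nor of [.data.frame.

```r
# Hypothetical helper: element-wise extraction from a data frame by a
# two-column index matrix, without coercing the whole frame to a matrix.
index_by_matrix <- function(x, m) {
  stopifnot(is.data.frame(x), is.matrix(m), ncol(m) == 2)
  # For each row k of m, take element m[k, 1] of column m[k, 2].
  mapply(function(i, j) x[[j]][i], m[, 1], m[, 2])
}

d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(5:7, 1:3)
index_by_matrix(d, m)  # same elements as d[m] for this all-integer d
```

for a data frame with columns of different types the picked elements are combined by mapply's simplification rather than by as.matrix, so the result need not match d[m] exactly.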

there are some potentially confusing issues here:

m = cbind(8:10, 1:3)
   
d[m]
# 3-element vector, as you could expect

d[t(m)]
# 6-element vector

t(m) has dimensionality inappropriate for matrix indexing (it has 3
columns), so it gets flattened into a vector;  however, it does not work
like in the case of a single vector index where columns would be selected:

d[as.vector(t(m))]
# error: undefined columns selected

i think it would be more appropriate to raise an error in a case like
d[t(m)].
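
the three cases can be reproduced side by side.  the data frame below is the 10x3 example from earlier in the thread;  the error is caught with tryCatch only so that the snippet runs to the end.

```r
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- cbind(8:10, 1:3)

length(d[m])     # 3: one element per row of m
length(d[t(m)])  # 6: t(m) has 3 columns, so it is used as a plain vector

# The same six numbers given as a plain vector are treated as column
# indices instead, and columns 8, 9 and 10 do not exist.
res <- tryCatch(d[as.vector(t(m))], error = function(e) conditionMessage(e))
res
```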

furthermore, if a matrix is used in a two-index form, the matrix is
flattened again and is used to select rows (not elements, as in
d[t(m)]).  note also that the help page says that "for extraction, 'x'
is first coerced to a matrix".  it fails to explain that if *two*
indices are used of which at least one is a matrix, no coercion is
done.  that is, the matrix is again flattened into a vector, but here
[.data.frame forgets that it was a matrix (unlike in d[t(m)]):

is(d[m])
# a character vector, matrix indexing

is(d[t(m)])
# a character vector, vector indexing of elements, not columns

is(d[m,])
# a data frame, row indexing
   
and finally, the fact that d[m] in fact converts x (i.e., d) to a matrix
before the indexing means that the types of values in some columns in
d may get coerced to another type:

d[,2] = as.character(d[,2])
is(d[,1])
# integer vector
is(d[,2])
# character vector

is(d[1:2, 1])
# integer vector
is(d[cbind(1:2, 1)])
# character vector
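
a self-contained version of the same check, using class() instead of is() so that only the basic type is reported.  the data frame is again the 10x3 example from earlier in the thread.

```r
d <- data.frame(x = 1:10, y = 1:10, z = 1:10)
d[, 2] <- as.character(d[, 2])

class(d[1:2, 1])         # "integer": column x is extracted as it is
class(d[cbind(1:2, 1)])  # "character": d went through as.matrix() first
```

both calls address the same two cells of column x;  only the route through as.matrix changes the type.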


for what it's worth, i think matrix indexing of data frames should be
dropped:

d[m]
# error: ...

and if one needs it, it's as simple as

as.matrix(d)[m]

where the conversion of d to a matrix is explicit.

on the side, [.data.frame is able to index matrices:

'[.data.frame'(as.matrix(d), m)
# same as as.matrix(d)[m]

which is, so to speak, nonsense, since '[.data.frame' is designed
specifically to handle data frames;  i'd expect an error to be raised
here (or a warning, at the very least).

to summarize, the fact that subdf does not handle matrix indices is not
an issue.  anyway, thanks for the comment!

best,
vQ

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Recent setClass fails where previous succeeded

2009-03-28 Thread Martin Morgan
These lines of code

setClass("A", representation(x="numeric"))
setMethod(initialize, "A", function(.Object, ...) stop("oops"))
setClass("B", representation("A"))

result in

> setClass("B", representation("A"))
Error in initialize(value, ...) : oops

in

R version 2.9.0 alpha (2009-03-28 r48239)
R version 2.10.0 Under development (unstable) (2009-03-28 r48239) 

but not in r48182. 

In addition, in package code, the error above does NOT lead to removal
of the partially installed package, or of the lock on the package
directory, corrupting the user installation.

For more context, the actual code adds arguments to initialize and
expects them to be provided by calls to 'new'; 'new' is not exposed
directly to the user but via a constructor that always provides
appropriate arguments. A specific example occurs when trying to
install the package Biostrings v 2.11.44 from the Bioconductor devel
repository.
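
For illustration only, here is a sketch of a defensive pattern. It is an assumption about what would sidestep the error, not a description of the Biostrings code and not a fix for the change between revisions: every extra argument of initialize gets a default, so the method still succeeds if it is called with nothing but the object.

```r
library(methods)

setClass("A", representation(x = "numeric"))

# The extra argument x has a default, so this method also works when
# it is called with no arguments beyond .Object.
setMethod("initialize", "A", function(.Object, ..., x = numeric()) {
  .Object <- callNextMethod(.Object, ...)
  .Object@x <- x
  .Object
})

setClass("B", representation("A"))

b <- new("B", x = 1)
b@x
```

With this variant the constructor can still pass its arguments through 'new', while a bare new("B") yields an object with an empty x slot.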

Martin
-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
