Re: [Rd] Documentation examples for lm and glm

2018-12-17 Thread Heinz Tuechler

Dear All,

do you think that use of a data argument is best practice in the example 
below?


regards,

Heinz

### trivial example
plotwithline <- function(x, y) {
   plot(x, y)
   abline(lm(y~x)) ## data argument?
}

set.seed(25)
df0 <- data.frame(x=rnorm(20), y=rnorm(20))

plotwithline(df0[['x']], df0[['y']])
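
For comparison, here is a minimal sketch of a variant that does use the data argument. It is not from the thread: the name plotwithline2 and its formula/data interface are assumptions made only for illustration, for the case where the caller already holds a data frame.

```r
## hypothetical variant: takes a formula and a data frame instead of two vectors
plotwithline2 <- function(formula, data) {
    fit <- lm(formula, data = data)   # variables are looked up in 'data'
    plot(formula, data = data)        # the formula method of plot() accepts 'data' too
    abline(fit)                       # add the fitted line
    invisible(fit)                    # hand the fit back for further use
}

set.seed(25)
df0 <- data.frame(x = rnorm(20), y = rnorm(20))
fit0 <- plotwithline2(y ~ x, df0)
```

Which interface is preferable depends on what the caller has at hand; for a helper that receives two plain vectors, the original form without a data argument does the same job.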



Fox, John wrote on 17.12.2018 15:21:

Dear Martin,

I think that everyone agrees that it’s generally preferable to use the data 
argument to lm() and I have nothing significant to add to the substance of the 
discussion, but I think that it’s a mistake not to add to the current examples, 
for the following reasons:

(1) Relegating examples using the data argument to “see also” doesn’t suggest 
that using the argument is a best practice. Most users won’t bother to click 
the links.

(2) In my opinion, a new initial example using the data argument would more 
clearly suggest that this is normally the best option.

(3) I think that it would also be desirable to add a remark to the explanation 
of the data argument, something like, “Although the argument is optional, it's 
generally preferable to specify it explicitly.” And similarly on the help page 
for glm().

My two (or three) cents.

John

  -
  John Fox, Professor Emeritus
  McMaster University
  Hamilton, Ontario, Canada
  Web: http://socserv.mcmaster.ca/jfox


On Dec 17, 2018, at 3:05 AM, Martin Maechler  wrote:


David Hugh-Jones
   on Sat, 15 Dec 2018 08:47:28 +0100 writes:



I would argue examples should encourage good
practice. Beginners ought to learn to keep data in data
frames and not to overuse attach().


Note there's no attach() there in any of these examples!


otherwise at their own risk, but they have less need of
explicit examples.


The glm examples are nice insofar as they show both uses.

I agree the lm() example(s) are  "didactically misleading" by
not using data frames at all.

I disagree that only data frame examples should be shown.
If  lm()  is one of the first R functions a beginneR must use --
because they are in a basic stats class, say --  it may be
*better* didactically to focus on lm()  in the very first
example, and use data frames in a next one ...
 and instead of next one, we have the pretty clear comment

 ### less simple examples in "See Also" above

I'm not convinced (but you can try more) we should change those
examples or add more there.

Martin


On Fri, 14 Dec 2018 at 14:51, S Ellison wrote:



FWIW, before all the examples are changed to data frame
variants, I think there's fairly good reason to have at
least _one_ example that does _not_ place variables in a
data frame.

The data argument in lm() is optional. And there is more
than one way to manage data in a project. I personally
don't much like lots of stray variables lurking about,
but if those are the only variables out there and we can
be sure they aren't affected by other code, it's hardly
essential to create a data frame to hold something you
already have.  Also, attach() is still part of R, for
those folk who have a data frame but want to reference
the contents across a wider range of functions without
using with() a lot. lm() can reasonably omit the data
argument there, too.

So while there are good reasons to use data frames, there
are also good reasons to provide examples that don't.

Steve Ellison



-Original Message-
From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Ben Bolker
Sent: 13 December 2018 20:36
To: r-devel@r-project.org
Subject: Re: [Rd] Documentation examples for lm and glm

Agree.  Or just create the data frame with those variables in it
directly ...


On 2018-12-13 3:26 p.m., Thomas Yee wrote:

Hello,

something that has been on my mind for a decade or two has
been the examples for lm() and glm(). They encourage poor style
because of mismanagement of data frames. Also, having the
variables in a data frame means that predict() is more likely
to work properly.

For lm(), the variables should be put into a data frame.
As 2 vectors are assigned first in the general workspace they
should be deleted afterwards.

For the glm(), the data frame d.AD is constructed but not used.
Also, its 3 components were assigned first in the general
workspace, so they float around dangerously afterwards like in
the lm() example.

Rather than attaching improved .Rd files here, I have put them at
www.stat.auckland.ac.nz/~yee/Rdfiles
You are welcome to use them!

Best,

Thomas
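
The remark about predict() can be illustrated with a small sketch. It is not taken from the thread, and all object names are made up. When the model is fitted with a data argument, predict() can match the columns of newdata by name. When the formula refers to d$x directly, newdata is typically not used (R issues a warning) and predictions for the original observations come back instead.

```r
set.seed(1)
x <- rnorm(10)
y <- 2 * x + rnorm(10)
d <- data.frame(x = x, y = y)

fit.df  <- lm(y ~ x, data = d)   # variables taken from the data frame
fit.vec <- lm(d$y ~ d$x)         # variables passed as vectors

new <- data.frame(x = c(-1, 0, 1))

p.df  <- predict(fit.df, newdata = new)                      # one value per row of 'new'
p.vec <- suppressWarnings(predict(fit.vec, newdata = new))   # 'd$x' is not found in 'new'

length(p.df)
length(p.vec)
```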

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel






Re: [Rd] Documentation examples for lm and glm

2018-12-17 Thread Heinz Tuechler

Dear John,

fully agreed! In the global environment I always keep my 
"data-variables" in a data.frame. However, if I look in help I like 
examples that start with the particular aspects of a function. It is 
important to know, if a function offers a data argument, but in the 
first line I don't need an example for the use of a data argument each 
time I look in help.


best,
Heinz

Fox, John wrote on 17.12.2018 16:23:

Dear Heinz,

  --

On Dec 17, 2018, at 10:19 AM, Heinz Tuechler  wrote:

Dear All,

do you think that use of a data argument is best practice in the example below?


No, but it is *normally* or *usually* the best option, in my opinion.

Best,
 John



regards,

Heinz

### trivial example
plotwithline <- function(x, y) {
   plot(x, y)
   abline(lm(y~x)) ## data argument?
}

set.seed(25)
df0 <- data.frame(x=rnorm(20), y=rnorm(20))

plotwithline(df0[['x']], df0[['y']])




Re: [Rd] rbind on data.frame that contains a column that is also a data.frame

2010-08-06 Thread Heinz Tuechler
Also Surv objects are matrices and they share the same problem when 
rbind-ing data.frames.
If contained in a data.frame, Surv objects lose their class after 
rbind and therefore no longer represent Surv objects afterwards.
Using rbind with Surv objects outside of data.frames shows a similar 
problem, but not the same column names.
In conclusion, yes, matrices are common in data.frames, but not 
without problems.


Heinz

## example
library(survival)
## create example data
starttime <- rep(0,5)
stoptime  <- 1:5
event <- c(1,0,1,1,1)
group <- c(1,1,1,2,2)

## build Surv object
survobj <- Surv(starttime, stoptime, event)

## build data.frame with Surv object
df.test <- data.frame(survobj, group)
df.test

## rbind data.frames
rbind(df.test, df.test)

## rbind Surv objects
rbind(survobj, survobj)



At 06.08.2010 09:34 -0700, William Dunlap wrote:

> -Original Message-
> From: r-devel-boun...@r-project.org
> [mailto:r-devel-boun...@r-project.org] On Behalf Of Nicholas
> L Crookston
> Sent: Friday, August 06, 2010 8:35 AM
> To: Michael Lachmann
> Cc: r-devel-boun...@r-project.org; r-devel@r-project.org
> Subject: Re: [Rd] rbind on data.frame that contains a column
> that is also a data.frame
>
> OK...I'll put in my 2 cents worth.
>
> It seems to me that the problem is with this line:
>
> b$a=a , where "a" is something other than a vector with
> length equal to nrow(b).
>
> I had no idea that a dataframe could hold a dataframe. It is not just
> rbind(b,b) that fails, apply(b,1,sum) fails and so does plot(b). I'll
> bet other R commands fail as well.
>
> My point of view is that a dataframe is a list of vectors
> of equal length and various types (this is not exactly what the help
> page says, but it is what it suggests to me).
>
> Hum, I wonder how much code is based on the idea that a
> dataframe can hold
> a dataframe.

I used to think that non-vectors in data.frames were
pretty rare things but when I started looking into
the details of the modelling code I discovered that
matrices in data.frames are common.  E.g.,
  > library(splines)
  > sapply(model.frame(data=mtcars, mpg~ns(hp)+poly(disp,2)), class)
  $mpg
  [1] "numeric"

  $`ns(hp)`
  [1] "ns" "basis"  "matrix"

  $`poly(disp, 2)`
  [1] "poly"   "matrix"
You may not see these things because you don't call model.frame()
directly, but most modelling functions (e.g., lm() and glm())
do call it and use the grouping provided by the matrices to encode
how the columns of the design matrix are related to one another.

If matrices are allowed, shouldn't data.frames be allowed as well?
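
A matrix column can also be produced by hand, without going through model.frame(). The following is only a sketch with made-up object names; it shows the three usual routes and how they differ.

```r
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("u", "v")))

d1 <- data.frame(id = 1:3, m = m)      # data.frame() splits the matrix into separate columns
d2 <- data.frame(id = 1:3, m = I(m))   # I() keeps it as one matrix-valued column
d3 <- data.frame(id = 1:3)
d3$m <- m                              # direct assignment also keeps the matrix intact

ncol(d1)     # the matrix counts as two columns here
ncol(d2)     # here it counts as one
dim(d3$m)    # the column is still a 3 x 2 matrix
```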

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> 15 years of using R just isn't enough! But, I can say that not one
> line of code I've written expects a dataframe to hold a dataframe.
>
> > Hi,
>
> > The following was already a topic on r-help, but after understanding
> > what is going on, I think it fits better in r-devel.
>
> > The problem is this:
> > When a data.frame has another data.frame in it, rbind doesn't work well.
> > Here is an example:
> > --
> > > a=data.frame(x=1:10,y=1:10)
> > > b=data.frame(z=1:10)
> > > b$a=a
> > > b
> > z a.x a.y
> > 1   1   1   1
> > 2   2   2   2
> > 3   3   3   3
> > 4   4   4   4
> > 5   5   5   5
> > 6   6   6   6
> > 7   7   7   7
> > 8   8   8   8
> > 9   9   9   9
> > 10 10  10  10
> > > rbind(b,b)
> > Error in `row.names<-.data.frame`(`*tmp*`, value = c("1", "2", "3", "4",  :
> >   duplicate 'row.names' are not allowed
> > In addition: Warning message:
> > non-unique values when setting 'row.names': ‘1’, ‘10’, ‘2’, ‘3’, ‘4’, ‘5’,
> > ‘6’, ‘7’, ‘8’, ‘9’
> > --
>
> >
> > Looking at the code of rbind.data.frame, the error comes from the
> > lines:
> > --
> > xij <- xi[[j]]
> > if (has.dim[jj]) {
> >     value[[jj]][ri, ] <- xij
> >     rownames(value[[jj]])[ri] <- rownames(xij)   # <--  problem is here
> > }
> > --
> > if the rownames() line is dropped, all works well. What this line
> > tries to do is to join the rownames of internal elements of the
> > data.frames I try to rbind. So the result, in my case should have a
> > column 'a', whose rownames are the rownames of the original column 'a'.
> > It isn't totally clear to me why this is needed. When would a data.frame
> > have different rownames on the inside vs. the outside?
>
> > Notice also that rbind takes into account whether the rownames of the
> > data.frames to be joined are simply 1:n, or they are something else.
> > If they are 1:n, then the result will have rownames 1:(n+m). If not,
> > then the rownames might be kept.
>
> > I think, more consistent would be to replace the lines above with
> > something like:
> > if (has.dim[jj]) {
> >     value[[jj]][ri, ] <- xij
> >     rnj = rownames(value[[jj]])
> >     rnj[ri] = rownames(xij)
> >     rnj = make.unique(as.character(unlist(rnj)), sep = "")
> >     rownames(value[[jj]]) <- rnj
> > }
>
> > In this case, the rownames of inside elements will also be joined, but
> > in case they overlap, they will

Re: [Rd] Easily switchable factor levels

2011-02-23 Thread Heinz Tuechler
To me this is a common situation, especially to switch between two 
languages. I solve it by separating the coding of values and their 
labels. Values are coded numerically or as character, and their 
labels are attached by a value.label attribute. When needed, a 
modified factor function transforms these variables into factors 
using the value.labels as labels for the factor.
It is, however, not nice code, and a drawback is that the value.label 
attribute has to be copied on subsetting.
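
To give an idea of the approach, here is a minimal sketch. It is not the code referred to above: the helper name labelled_factor, the attribute layout (a list with one named character vector per language) and the example data are all assumptions made for illustration.

```r
## hypothetical helper: build a factor from one set of value labels
labelled_factor <- function(x, which = 1) {
    vl <- attr(x, "value.labels")[[which]]   # named vector: names = codes, values = labels
    factor(x, levels = names(vl), labels = unname(vl))
}

sex <- c("m", "f", "f", "m")
attr(sex, "value.labels") <- list(
    en = c(m = "male",  f = "female"),
    fr = c(m = "homme", f = "femme")
)

labelled_factor(sex, "en")   # factor with English labels
labelled_factor(sex, "fr")   # same codes, French labels
```

The codes are stored once and the order of the levels comes from the label vector, so switching languages cannot scramble the mapping. The drawback mentioned above remains: sex[1:2] drops the attribute unless a subsetting method copies it.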


best regards,

Heinz

At 23.02.2011 22:23 +, Barry Rowlingson wrote:

I've recently been working with some California county-level data. The
counties can be referred to as either FIPS codes, eg F060102, friendly
names such as "Del Norte County", names without 'County' on the end,
names with 'CA' on the end ("Del Norte County, CA"). Different data
sets use slightly different forms and putting them all together is a
pain.

 So I was wondering about ways to attach multiple sets of level codes
to a factor. It would work something like this:

 > foo=multifactor(sample(letters,5),levels=letters,levelname="lower")
 > foo
 [1] m u i z b
 Levels: a b c d ... y z
 > levels(foo,"upper") = LETTERS
 > uselevels(foo,"upper")
 > foo
 [1] M U I Z B
  Levels: A B C D E F ... Z
 > uselevels(foo,"lower")
 > foo
 [1] m u i z b
  Levels: a b c d ... z

In this way you could easily switch your levels from M and F to Male
and Female, or Hommes et Dames, without having to do levels(foo) =
something and hope to get the ordering right every time. Just do it
once, keep the multiple sets of level labels in the object.

I'd even throw in a function to print out all the level codes:

 > levels(foo,all=TRUE)
   upper  lower
[1] A  a
[2] B  b

etc

I can see assorted problems coding this up to cope with dropping
levels when making subsets... and possibly problems when code does
character matching of levels and expects them to be unchanged...

Has anyone bothered to write anything like this yet? Or is the
application a bit too rare to be worth it?

Barry


Re: [Rd] Suggestion: Dimension-sensitive attributes

2009-07-09 Thread Heinz Tuechler

At 10:01 09.07.2009, SIES 73 wrote:
I've also had several use cases where I needed "cell-like" attributes,
that is, attributes that have the same dimensions as the original array
and are subsetted in the same way --along all its dimensions.


So we're talking about a way to add metadata to matrices/arrays at 3
possible levels:


1) at the "whole object" level: attributes that are not dropped on
   subsetting
2) at the "dimension" level: attributes that behave like "dimnames",
   i.e. subsetted along each dimension
3) at the "cell" level: attributes that are subsetted in the same way
   as the original array


My proposal would be simpler than Tony's suggestion: like "dimnames",
just have reserved attribute names for each case, say "objdata",
"dimdata", and "celldata" (or "objattr", "dimattr" and "cellattr").


If "objattr", "dimattr" and "cellattr" are lists, they would offer safe
places for all attributes that should be kept on subsetting. In my view
this would be very useful, because that way a general solution for data
description, like variable names, variable labels, units, ... could be
reached.



On the other hand, Tony's pattern would allow as many attributes of
each type as necessary (some multiplicity is already possible with the
simpler design as dimdata or celldata could be lists of lists), at the
cost of a more complex scheme of attributes that needs to be "parsed"
each time.


On Tony's suggestion, "attr.keep.on.subset" and "attr.dimname.like"
(and possibly "attr.cell.like") could be kept on a single list with 3
elements, something like:


> attr(x, "attr.subset.with") <- list(object=..., dims=..., cells=...)

Would something like this make sense for R-core 
--either for standard arrays or as a new class-- 
or would it be better implemented in a package?


Enrique
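
To make the "dimension" level concrete, here is a rough sketch as an S3 class for matrices. Nothing in it is an existing API: the class name "annotated", the constructor and the choice of a single reserved attribute "dimattr" are assumptions for illustration, and only integer or logical indices of the form x[i, j] are handled.

```r
## hypothetical constructor: a matrix plus one metadata element per dimension
annotated <- function(x, dimattr) {
    stopifnot(is.matrix(x), length(dimattr) == 2L,
              length(dimattr[[1]]) == nrow(x),
              length(dimattr[[2]]) == ncol(x))
    structure(x, dimattr = dimattr, class = "annotated")
}

## subsetting method: subset the data, then subset "dimattr" the same way
`[.annotated` <- function(x, i, j, ...) {
    da <- attr(x, "dimattr")
    if (missing(i)) i <- seq_len(nrow(x))
    if (missing(j)) j <- seq_len(ncol(x))
    y <- unclass(x)[i, j, drop = FALSE]   # plain matrix subsetting drops "dimattr"
    attr(y, "dimattr") <- list(da[[1]][i], da[[2]][j])
    class(y) <- "annotated"
    y
}

x <- annotated(matrix(1:6, nrow = 2),
               dimattr = list(c("r1", "r2"), c("kg", "cm", "s")))
y <- x[, 2:3]
attr(y, "dimattr")   # row metadata unchanged, column metadata reduced to "cm", "s"
```

A package-level implementation would also have to cover character indices, drop = TRUE, higher-dimensional arrays and replacement, which is where most of the work lies.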

-Original Message-
From: Tony Plate [mailto:tpl...@acm.org]
Sent: Wednesday, 08 July 2009 18:01
To: r-devel@r-project.org
Cc: Bengoechea Bartolomé Enrique (SIES 73); Henrik Bengtsson
Subject: Re: [Rd] Suggestion: Dimension-sensitive attributes

There have been times when I've thought this could be useful too.

One way to go about it could be to introduce a special attribute that
controls how attributes are dealt with in subsetting, e.g.,
"attr.dimname.like".  The contents of this would be character data; on
subsetting, any attribute that had a name appearing in this vector
would be treated as a dimension.  At the same time, it might be nice to
also introduce "attr.keep.on.subset", which would specify which
attributes should be kept on the result of a subsetting operation
(could be useful for attributes that specify units).  This of course
could be a way of implementing Henrik's suggestion:
dimattr(x, "misc") <- value would add "misc" to the "attr.dimname.like"
attribute and also set the attribute "misc".  The tricky part would be
modifying the "[" methods.  However, the most useful would probably be
the one for ordinary matrices and arrays, and others could be modified
when and if their maintainers see the need.


-- Tony Plate

Bengoechea Bartolomé Enrique (SIES 73) wrote:
> Hi,
>
> I agree with Henrik that his suggestion to have "dimension vector
> attributes" working like dimnames (see below) would be an extremely
> useful infrastructure addition to R.
>
> If this is not considered for R-core, I am happy to try to implement
> this in a package, as a new class. And possibly do the same thing for
> data frames. Should you have any comments, ideas or suggestions about
> it, please share!
>
> Best,
>
> Enrique
>
> --
> ---
> Subject:
> From: Henrik Bengtsson  Date: Sun, 07 Jun 2009 14:42:08 -0700
>
> Hi,
>
> maybe this has been suggested before, but would it be possible, without
> breaking too much existing code, to add other "dimension vector
> attributes" in addition to 'dimnames'? These attributes would then be
> subsetted just like dimnames.
>
> Something like this:
>
>> x <- array(1:30, dim=c(2,3,5))
>> dimnames(x) <- list(c("a", "b"), c("a1", "a2", "a3"), NULL);
>> dimattr(x, "misc") <- list(1:2, list(x=1:5, y=letters[1:8], z=NA),
>> letters[1:5]);
>
>> y <- x[,1:2,2:3]
>> str(dimnames(y))
>
> List of 3
>  $ : chr [1:2] "a" "b"
>  $ : chr [1:2] "a1" "a2"
>  $ : NULL
>
>> str(dimattr(y, "misc"))
>
> List of 3
>  $ : int [1:2] 1 2
>  $ :List of 2
>   ..$ x: int [1:5] 1 2 3 4 5
>   ..$ y: chr [1:8] "a" "b" "c" "d" ...
>  $ : chr [1:2] "b" "c"
>
> I can imagine this needs to be added in several places and functions
> such as is.vector() need to be updated etc. It is not a quick
> migration, but is it something worth considering for the future?
>
> /Henrik

Re: [Rd] Suggestion: Dimension-sensitive attributes

2009-07-09 Thread Heinz Tuechler

At 11:14 09.07.2009, SIES 73 wrote:
> If "objattr", "dimattr" and "cellattr" are lists, they would offer
> safe places for all attributes that should be kept on subsetting.


My proposed design would be that:

* "objattr" would be a list of attributes (just preserved on
  subsetting)
* "dimattr" would be a list with as many elements as array dimensions.
  Each element can be any object whose length matches the corresponding
  array dimension's length and that can be itself subsetted with "[":
  so it could be a vector, a list, a data frame...
* "cellattr" would be any object whose dimensions match the array
  dimensions: another array, a data frame...


> In my view this would be very useful, because that way a general
> solution for data description, like variable names, variable labels,
> units, ... could be reached.


Indeed, that's the objective: attaching user-defined metadata that is
automatically synchronized with subsetting operations to the actual
data.


I've had dozens of use cases on my own R programs that needed this type
of pattern, and seen it implemented in different ways in several
classes (xts, timeSeries, AnnotatedDataFrame, etc.) As you point out,
this could offer a unified design for a common need.


Enrique



For my personal use it was sufficient to create a class called
"documented" with a corresponding subsetting method and one attribute,
also called "documented". This attribute may contain 'varlabel',
'varname', 'value.labels', 'missing.values', 'code.ordered',
'comment', ...

It is copied on subsetting.
I think attributes concerning e.g. dimensions, i.e. parts of an object,
should stay in this object-related attribute and be extracted on
subsetting. Since subsetting an object leads to a new object, this
could then have its own, new persisting attribute.

The more difficult part may be the binding of objects.
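
A minimal sketch of what such a class could look like follows. It is a guess at the general shape, not the actual code: the constructor and the example variable are hypothetical, and only the attribute name "documented" and the element names are taken from the description above.

```r
## hypothetical constructor: store all documentation in one list attribute
documented <- function(x, ...) {
    structure(x, documented = list(...), class = "documented")
}

## subsetting method: default subsetting drops the attribute, so copy it back
`[.documented` <- function(x, ...) {
    doc <- attr(x, "documented")
    y <- unclass(x)[...]
    attr(y, "documented") <- doc
    class(y) <- "documented"
    y
}

age <- documented(c(34, 51, 27), varlabel = "Age at entry", comment = "in years")
sub <- age[2:3]
attr(sub, "documented")$varlabel   # the label survives subsetting
```

Binding is indeed the harder part: c(age, age) or rbind() would need methods of their own, and it is not obvious whose documentation should win when the two objects disagree.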

Heinz









Re: [Rd] Unexpected result of as.character() and unlist() applied to a data frame

2007-03-28 Thread Heinz Tuechler
At 17:25 27.03.2007 +0200, Martin Maechler wrote:
>> "Herve" == Herve Pages <[EMAIL PROTECTED]>
>> on Mon, 26 Mar 2007 20:48:33 -0700 writes:
>
>Herve> Hi,
>Herve> > dd <- data.frame(A=c("b","c","a"), B=3:1)
>Herve> > dd
>Herve>   A B
>Herve> 1 b 3
>Herve> 2 c 2
>Herve> 3 a 1
>Herve> > unlist(dd)
>Herve> A1 A2 A3 B1 B2 B3
>Herve>  2  3  1  3  2  1
>
>Herve> Someone else might get something different. It all
>Herve> depends on the values of its 'stringsAsFactors'  option:
>
>yes, and I don't like that (last) fact either.
>IMO, an option should never be allowed to influence such a basic
>function as  data.frame().
>
>I know I would have had time earlier to start discussing this,
>but for some (probably good) reasons, I didn't get to it at the
>time. 
>As Andy comments, everything is behaving as it should / is documented,
>including the  'stringsAsFactors' option;
>but personally, I really would want to consider changing
>the default for data.frame()'s stringsAsFactors back (as
>pre-R-2.4.0) to 'TRUE' instead of  default.stringsAsFactors()
>which is a smart version of getOption("stringsAsFactors"). 
>I find it ok ("acceptable") if it's influencing  read.table()
>but feel differently for data.frame().
>
>Martin
>
Martin!

I see the problem with options influencing "such a basic function as
data.frame().", but in my view the difficulty starts earlier. In my
understanding data.frame() is _the_ basic way to store empirical source
data in R and I found the earlier default behaviour, to change character
variables to factors, problematic.
If changing character variables to factors were only an internal process,
not visible to the user, I would not mind, but to include a character
variable in a data frame and get a factor out of it, is somewhat disturbing.
A naive user like me was especially confused by the fact that I could read
an SPSS file with spss.get (default: charfactor=FALSE) and get a character
variable in a data.frame as a character variable but then putting it in a
different data.frame it changed to factor.
I would wish a data.frame() function that behaves as a "data container"
with the idea of rows(=cases) and columns(=variables) but without changing
the mode/class of the objects.
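
Whatever the default is, the conversion can be pinned down per call, which makes code independent of the option. A small sketch (object names follow the example quoted below; note that the default of stringsAsFactors was changed to FALSE in R 4.0.0, long after this thread):

```r
## explicit choice, independent of any global option or R version
dd  <- data.frame(A = c("b", "c", "a"), B = 3:1, stringsAsFactors = TRUE)
dd2 <- data.frame(A = c("b", "c", "a"), B = 3:1, stringsAsFactors = FALSE)

class(dd$A)    # factor
class(dd2$A)   # character: the column is stored as it was given
```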

Heinz

>
>Herve> > dd2 <- data.frame(A=c("b","c","a"), B=3:1,
>Herve> +   stringsAsFactors=FALSE)
>Herve> > dd2
>Herve>   A B
>Herve> 1 b 3
>Herve> 2 c 2
>Herve> 3 a 1
>Herve> > unlist(dd2)
>Herve>  A1  A2  A3  B1  B2  B3
>Herve> "b" "c" "a" "3" "2" "1"
>
>Herve> Same thing with as.character:
>
>Herve> > as.character(dd)
>Herve> [1] "c(2, 3, 1)" "c(3, 2, 1)"
>Herve> > as.character(dd2)
>Herve> [1] "c(\"b\", \"c\", \"a\")" "c(3, 2, 1)"
>
>Herve> Bug or "feature"?
>
>Herve> Note that as.character applied directly on dd$A
>Herve> doesn't have this "feature":
>
>Herve> > as.character(dd$A)
>Herve> [1] "b" "c" "a"
>Herve> > as.character(dd2$A)
>Herve> [1] "b" "c" "a"
>
>Herve> Cheers, H.


[Rd] raw data documentation

2006-07-18 Thread Heinz Tuechler
Dear Developers,

after several discussions on r-help I got the impression that the
"standard" R distribution, including the recommended packages, does not
offer much to document raw data, imported into R.
Hmisc has some functionality in this respect, and others like Richard
Heiberger solved some other aspects, but I think there could be a more
unified approach, which, of course needs the support from the core
developers to become "standard".
In particular, what I am looking for is a possibility to label variables, label
values and add other information. Of course, all this is possible by adding
attributes, but most attributes are lost when indexing/subsetting.
From many helpful suggestions of others I learned that this can be resolved
by defining a class and corresponding methods, as is done for variable
labels in Hmisc.
For now I drafted something more general for my personal use, but before
continuing on this I want to know, if there is some intention at the core
developer team to work on the question of raw data documentation.
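
The loss of attributes on subsetting is easy to reproduce with a plain vector. The attribute names and values below are only examples, not an existing convention:

```r
x <- c(1, 2, 9)
attr(x, "label") <- "Pain score"
attr(x, "value.labels") <- c(none = 1, mild = 2, missing = 9)

attributes(x)        # both attributes are present
attributes(x[1:2])   # gone after subsetting
```

Only names, dim and dimnames survive the default subsetting, which is why a class with its own "[" method is needed to keep such documentation attached.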

Greetings,

Heinz Tüchler
