Re: [Rd] Citation for R

2005-06-13 Thread Gordon Smyth

>Note also that R does have a User Guide, i.e., while there is plenty of 
>excellent documentation,
>there is no single document which is a guide to the whole project.

Oops, I meant to write "R does not have a User Guide".

Just to explain this further, the citation() function asks me to cite a 
"Manual" with the title "R: A language and environment for statistical 
computing". Although R comes with excellent documentation, including at 
least 6 manuals on different aspects of the software, no manual or document 
with that title actually exists, as far as I know.
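
For anyone wanting to check what citation() asks for, here is a minimal sketch. It assumes a reasonably recent R installation, so the fields and wording may differ from the 2005 output discussed in this thread.

```r
# Retrieve the recommended citation for R itself (utils package).
cit <- citation()

# The result is a bibentry object; individual fields are read with `$`.
print(cit$title)    # title of the cited document
print(cit$bibtype)  # BibTeX entry type reported by citation()

# BibTeX form, as it would be pasted into a .bib file.
print(toBibtex(cit))
```

The title and entry type printed here are the ones a reference manager would pick up, which is why a title that matches no actual document is awkward.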

Gordon

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Citation for R

2005-06-14 Thread Gordon Smyth
At 05:22 PM 14/06/2005, [EMAIL PROTECTED] wrote:
> > On Tue, 14 Jun 2005 08:42:59 +1000 (EST),
> > Gordon K Smyth (GKS) wrote:
>
>   > On Tue, June 14, 2005 12:49 am, Thomas Lumley said:
>   >> On Mon, 13 Jun 2005, Gordon K Smyth wrote:
>   >>
>   >>> This is just a note that R would get a lot more citations if the
>   >>> recommended citation was an article in a recognised journal or from a
>   >>> recognised publisher.
>   >>>
>   >>
>   >> This is unfortunately true, but R is *not* an article or a book, it is a
>   >> piece of software.  I don't think I'm the only person who thinks it is
>   >> counterproductive in the long run to encourage users to cite an article
>   >> that they probably haven't read instead of citing the software they
>   >> actually used.
>   >>
>   >> Jan's suggestion of the Journal of Statistical Software might provide a
>   >> solution, since JSS *does* publish software.
>   >>
>   >> -thomas
>
>   > In the biology world, it is common to publish an article
>   > announcing a software project, and to cite that.  The referees of
>   > the article are expected to try out and comment on the software.
>   > This gives the authors credit, and ensures that both the article
>   > and the software have been peer refereed, at least to a limited
>   > extent.
>
>How do you cite books in this world, or to put the question in another
>way: How do you make sure a book is peer-reviewed? After all it is
>quite easy to become a "publisher" and publish one's own books. Many
>university departments I know are registered ISBN publishers
>(including our department).  Must be hard to distinguish "real" books
>from others, I guess.
>
>Fritz

Books are cited as in the statistics literature but, naturally, there is a 
tendency to prefer references from more reputable sources. Hence a Wiley & 
Sons book would be preferred, other things being equal, to a book from a 
minor university press, which in turn would be preferred to a self-published 
book. Self-published electronic books are pretty much at the bottom of the 
pile. This doesn't mean that important references of this form can't be 
cited, but it does mean that one is pushing uphill.

Self-published software manuals are usually cited in-text, as are many 
other tools and technologies mentioned in the methods section, rather than 
in the references.

Thanks for pointing out, in a separate posting, that citation() refers to 
the "R Reference Manual".

Gordon



[Rd] boxplot() defaults {was "boxplot in extreme cases"}

2005-07-21 Thread Gordon Smyth

>[Rd] boxplot() defaults {was "boxplot in extreme cases"}
>Martin Maechler maechler at stat.math.ethz.ch
>Mon Nov 8 10:36:42 CET 2004
>
> AndyL> Try:
>
> AndyL> x <- list(x1=rep(c(0,1,2),c(10,20,40)),
> AndyL>           x2=rep(c(0,1,2),c(10,40,20)))
> AndyL> boxplot(x, pars=list(medpch=20, medcex=3))
>
> AndyL> (Cf ?bxp, pointed to from ?boxplot.)
>
>Good! Thank you, Andy.
>
>However,
>this is not the first time it had crossed my mind that R's
>default settings of drawing boxplot()s are not quite ok -- and
>that's why I've diverted to R-devel.
>
>Keeping Tufte's considerations in mind, (and me not really wanting
>to follow S-plus), shouldn't we consider to slightly change R's
>boxplot()ing such that
>
>boxplot(list(x1=rep(c(0,1,2),c(10,20,40)), x2=rep(c(0,1,2),c(10,40,20))))
>
>will *not* give two identical-looking boxplots?
>Also, the median should be emphasized more by default anyway.
>{The lattice function  bwplot() does it by only drawing a large
>  black ball as in Andy's example (and not drawing a line at all)}
>
>One possibility I'd see is to use a default 'medlwd = 3'
>either in boxplot() or in bxp(.) and hence, what you currently get by
>
>boxplot(list(x1=rep(c(0,1,2),c(10,20,40)), x2=rep(c(0,1,2),c(10,40,20))),
>medlwd=3)
>
>would become the default plotting in boxplot().
>Of course a smaller value "medlwd=2" would work too, but I'd
>prefer a bit more (3).
>
>Martin

Hi Martin,

I'm not sure this innovation (medlwd=3 default) is a good idea. Boxplots 
are designed to display many samples simultaneously on a graph, and it is 
important they be as clean and as simple as possible. To my eye, and to 
everyone in my lab, the thickened median line is rather distracting and 
makes the boxplots look more cluttered ("ugly" one of my postdocs said). 
The thickened line also goes against Tufte's principle of using minimum ink 
to represent the message.

Yours and Erich's point about distinguishing the median==1st quartile case 
from the median==3rd quartile case is well taken. How about making medlwd=3 
(or medlwd=2) the default behaviour only when the median coincides with one 
of the quartiles? That might satisfy everyone?
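
A rough sketch of this conditional rule, using Erich's two datasets. It assumes that bxp() recycles a vector of medlwd values across the boxes, as it does in current R; the names `tied` and `b` are illustrative only, and this is not a proposed patch to boxplot().

```r
# Erich Neuwirth's two samples: same box and whiskers, different medians.
x <- list(x1 = rep(c(0, 1, 2), c(10, 20, 40)),
          x2 = rep(c(0, 1, 2), c(10, 40, 20)))

# Compute the boxplot statistics without drawing anything.
b <- boxplot(x, plot = FALSE)

# Rows of b$stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
# Flag boxes whose median coincides with either hinge.
tied <- b$stats[3, ] == b$stats[2, ] | b$stats[3, ] == b$stats[4, ]

# Thicken the median line only for the flagged boxes.
bxp(b, medlwd = ifelse(tied, 3, 1))
```

For these data both boxes are flagged, so both medians are emphasized; samples whose median lies strictly inside the box would keep the thin line.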

I notice that there wasn't any follow-up discussion of this post on the 
r-devel list. Did this suggestion get any support? The boxplots have been 
so well accepted in their current form for many, many years, decades even, 
so one should be especially cautious of making changes without some sort of 
consensus.

Best
Gordon

> > From: Erich Neuwirth
> >
> > I noticed the following:
> > the 2 datasets
> > rep(c(0,1,2),c(10,20,40)) and
> > rep(c(0,1,2),c(10,40,20))
> > produce identical boxplots despite the fact that the medians are
> > different. The reason is that the median in one case
> > coincides with the
> > first quartile, and in the second case with the third quartile.
> > Is there a recommended way of displaying the median visibly in these
> > cases? Setting notch=TRUE displays the median, but does look strange.

---
Dr Gordon K Smyth, Senior Research Scientist, Bioinformatics,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3050, Australia
Tel: (03) 9345 2326, Fax (03) 9347 0852,
Email: [EMAIL PROTECTED], www: http://www.statsci.org



Re: [Rd] R 2.1.1: read.table processes C-style escapes (PR#8037)

2005-07-26 Thread Gordon Smyth

>To: [EMAIL PROTECTED]
>From: Gordon Smyth <[EMAIL PROTECTED]>
>Subject: R 2.1.1: read.table processes C-style escapes
>Date: Wed, 27 Jul 2005 12:51:45 +1000
>
>In R 2.1.1, the default behaviour of scan() was changed to process all 
>C-style escapes, even when a delimiter was specified using the 'sep' 
>argument. A new argument 'allowEscapes' was introduced to turn this 
>processing off.
>
>Because read.table() calls scan(), read.table() inherits the new default 
>behaviour of scan() but without a way to turn it off. For example, reading 
>a file 'testdata.txt' containing
>
>X
>A
>\0
>C
>
>produces
>
> > read.delim("testdata.txt")
>[1] X
><0 rows> (or 0-length row.names)
>
>It seems that all the occurrences of scan() within read.table() need to 
>have 'allowEscapes=FALSE' added to the argument string.

Or, alternatively and perhaps better, scan() could regain some of its 
earlier behaviour, to process C-style escapes by default only when 'sep' is 
NULL or empty. It seems, to me at least, that C-style escape sequences make 
sense only in some sort of source code, and delimited text can't be source 
code.
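
A minimal sketch of the behaviour being asked for, assuming an R version in which read.table() and its wrappers accept the 'allowEscapes' argument. The temporary file stands in for 'testdata.txt'.

```r
# Write the four-line test file from the report: header X, then A, \0, C.
tmp <- tempfile(fileext = ".txt")
writeLines(c("X", "A", "\\0", "C"), tmp)

# With escape processing turned off, the backslash is read verbatim.
d <- read.delim(tmp, allowEscapes = FALSE, stringsAsFactors = FALSE)
print(d$X)
```

With escapes off, all three data rows are expected to survive, including the literal backslash-zero entry.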

Gordon

>Gordon
>
> > version
>         _
>platform i386-pc-mingw32
>arch     i386
>os       mingw32
>system   i386, mingw32
>status   Patched
>major    2
>minor    1.1
>year     2005
>month    07
>day      22
>language R



Re: [Rd] R 2.1.1: read.table processes C-style escapes (PR#8037)

2005-07-27 Thread Gordon Smyth
Thanks for the reply. Apologies for checking only R patched rather than 
R-devel.

I guess that this means that someone must have a use for allowEscapes=TRUE 
when reading a file in table format into a data frame. It is hard to 
imagine.

Gordon

At 04:59 PM 27/07/2005, Prof Brian Ripley wrote:
>This seems of historical interest only.
>
>allowEscapes is currently (in R-devel, where development happens) an 
>argument to read.table.
>
>We do ask people to check the current version!



Re: [Rd] R 2.1.1: read.table processes C-style escapes (PR#8037)

2005-07-27 Thread Gordon Smyth
At 05:21 PM 27/07/2005, Prof Brian Ripley wrote:
>On Wed, 27 Jul 2005, Gordon Smyth wrote:
>
>>Thanks for the reply. Apologies for checking only R patched rather than 
>>R-devel.
>>
>>I guess that this means that someone must have a use for 
>>allowEscapes=TRUE when reading a file in table format into a data 
>>frame. It is hard to imagine.
>
>Octal and hex escapes are a common way to cope with different encodings.

Thanks.

>I would not want R limited by the imagination of a single user, let alone 
>a Windows-only user.

Octal and hex escapes are not the only uses of backslashes in the computing 
world, and not only in Windows. TeX markup is one example. In the 
bioinformatics world, a great variety of ways are used to represent gene 
annotation information in text files, some of them including backslashes. I 
find that people find ways to throw all sorts of things at my code.

If 'allowEscapes=FALSE' was the default in read.table, then all legacy code 
using read.table() would continue to behave as in the past. With 
'allowEscapes=TRUE', all code ever written which uses read.table(), and 
which has to cope with backslashes which are not octal or hex escapes, will 
now be broken. This is not in any sense a complaint, nor a wish that R 
should remain static or be limited by my imagination. Just a plea that, 
other things being equal, not breaking legacy code will be a consideration.
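
To make the concern concrete, here is a small sketch comparing the two settings on one tab-delimited line. The annotation line is made up for illustration, and the 'text' argument of scan() is assumed to be available (it is in current R). No particular result is claimed for the allowEscapes=TRUE case; it is printed only for comparison.

```r
# A made-up annotation line: TeX-style markup and a Windows-style path.
line <- "geneA\t\\alpha-tubulin\tC:\\temp"

# Backslashes read verbatim.
raw <- scan(text = line, what = "", sep = "\t",
            allowEscapes = FALSE, quiet = TRUE)

# Backslash sequences such as \a and \t processed as C-style escapes.
esc <- scan(text = line, what = "", sep = "\t",
            allowEscapes = TRUE, quiet = TRUE)

print(raw)
print(esc)
```

Code written against the verbatim behaviour sees three unchanged fields in `raw`; whatever `esc` contains, it is no longer guaranteed to match the text in the file.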

Gordon



[Rd] read.table with more cols than headers

2006-08-01 Thread Gordon Smyth
I am trying to understand the behaviour of read.table() reading 
delimited files (with header=TRUE and fill=TRUE) when there are more 
(possibly spurious) columns than headings.  I give below four small 
data files, all of which have one or two extra columns added to one 
line.  Reading the first file produces an error message, the second 
produces a column of NA, the third adds an extra row, the fourth 
ignores the extra columns with no message and no NA.  Most 
unintuitive!  Here are my attempts to understand this, with questions 
interpolated.

The behaviour on the first file seems self-explanatory.  The number 
of headings determines the number of columns, and extra data columns 
are not allowed.  (On the other hand, the help ?read.table says that 
the number of columns is determined from the first five rows, which 
suggests that the header line is not the only determiner.  If 
headers, when present, are indeed the only determiner, perhaps this 
should be mentioned in the help.  Are headers actually equivalent to 
specifying the same set of names using the col.names argument?)
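
One possible workaround, sketched under the assumption that the extra fields should be kept rather than rejected: skip the header line and supply enough names through col.names. The names "V3" and "V4" are placeholders invented for this sketch, and it does not answer whether headers and col.names are treated identically.

```r
# Recreate the first example file: two headings, one line with four fields.
f1 <- tempfile(fileext = ".txt")
writeLines(c("X,Y", "a,2", "b,4,,", "c,6"), f1)

# Widest line in the file, counted in comma-separated fields.
n_fields <- max(count.fields(f1, sep = ","))

# Skip the header line and give a name to every field instead.
d <- read.csv(f1, header = FALSE, skip = 1,
              col.names = c("X", "Y", paste0("V", 3:n_fields)))
print(d)
```

Because read.csv() uses fill=TRUE, the shorter lines are padded with NA in the extra columns.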

For the second file, the first column is being taken as row 
names.  This agrees with the help which says if "the header line has 
one less entry than the number of columns, the first column is taken 
to be the row names".  OK, perhaps not the ideal solution for this 
data file, but clearly documented behaviour.

In the third file, the extra columns are being taken to be a new 
row.  This seems wrong, because the help says that cases correspond 
to lines.  There is no suggestion in the documentation that a line of 
the file could contain multiple cases.  This is the result I have 
most trouble with.  I guess I could prevent this behaviour with flush=TRUE.

File 4 is curious.  Here the number of columns has been determined, 
using the first 5 rows of the file, to be two.  The extra column on 
line 6 can't change this, so the first column doesn't become row 
names.  But in that case, shouldn't the extra column found on line 6 
produce an error message, same as for file 1?

Specifying colClasses to be a vector of length more than 2 when 
reading file 3 will produce a result similar to file 4, but with a 
warning.  It is not clear to me why colClasses should have an 
influence, since it doesn't change the determination of the number of 
columns.  Why a warning here, but an error for file 1 and no message 
for file 4?

Any comments gratefully received.
Gordon

test1.txt:
X,Y
a,2
b,4,,
c,6

test2.txt:
X,Y
a,2
b,4,
c,6

test3.txt:
X,Y
a,2
b,4
c,6
d,8
e,10,,
f,12

test4.txt:
X,Y
a,2
b,4
c,6
d,8
e,10,
f,12

 > read.csv("test1.txt")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
 more columns than column names
 > read.csv("test2.txt")
   X  Y
a 2 NA
b 4 NA
c 6 NA
 > read.csv("test3.txt")
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6   NA
7 f 12
 > read.csv("test4.txt")
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6 f 12
 > read.csv("test3.txt",colClasses=c(NA,NA))
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6   NA
7 f 12
 > read.csv("test3.txt",colClasses=c(NA,NA,NA,NA))
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6 f 12
Warning message:
cols = 2 != length(data) = 4 in: read.table(file = file, header = 
header, sep = sep, quote = quote,
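
The four files can be recreated and inspected with the sketch below. It writes them to a temporary directory (an assumption of this sketch, not part of the original session), counts the fields on each line with count.fields(), and captures any error from read.csv() instead of stopping. Results under a current R version may differ from the 2006 output shown above.

```r
# Contents of the four example files, one character vector per file.
files <- list(
  test1 = c("X,Y", "a,2", "b,4,,", "c,6"),
  test2 = c("X,Y", "a,2", "b,4,", "c,6"),
  test3 = c("X,Y", "a,2", "b,4", "c,6", "d,8", "e,10,,", "f,12"),
  test4 = c("X,Y", "a,2", "b,4", "c,6", "d,8", "e,10,", "f,12")
)

# Write each file to the session's temporary directory.
paths <- vapply(names(files), function(nm) {
  p <- file.path(tempdir(), paste0(nm, ".txt"))
  writeLines(files[[nm]], p)
  p
}, character(1))

# Number of comma-separated fields on every line of every file.
field_counts <- lapply(paths, count.fields, sep = ",")
print(field_counts)

# Read each file, keeping the error message where read.csv() fails.
results <- lapply(paths, function(p)
  tryCatch(read.csv(p), error = function(e) conditionMessage(e)))
print(results)
```

The field counts make it easy to see whether the over-long line falls within the first five lines of a file, which is the part of the file that ?read.table says is used to determine the number of columns.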

 > sessionInfo()
R version 2.4.0 Under development (unstable) (2006-07-25 r38698)
i386-pc-mingw32

locale:
LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252

attached base packages:
[1] "methods"   "stats" "graphics"  "grDevices" 
"utils" "datasets"  "base"



Re: [Rd] S4 Classes

2006-08-10 Thread Gordon Smyth
Is there a capability that you would like for the package which could 
be achieved only if the package was transitioned to S4? If so, 
explain this to the author. If not, why ask them to change?
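
As an illustration of the kind of capability that might be named, here is a small sketch of two S4 features: validity checking when an object is created, and dispatch on more than one argument. The class "Interval" and the generic "overlaps" are invented for this example and have nothing to do with the package in question.

```r
library(methods)

# A class with typed slots and a validity check run by new().
setClass("Interval",
         representation(lo = "numeric", hi = "numeric"),
         validity = function(object)
           if (object@lo > object@hi) "lo must not exceed hi" else TRUE)

# A generic that dispatches on the classes of both arguments.
setGeneric("overlaps", function(a, b) standardGeneric("overlaps"))

setMethod("overlaps", signature("Interval", "Interval"),
          function(a, b) a@lo <= b@hi && b@lo <= a@hi)

setMethod("overlaps", signature("Interval", "numeric"),
          function(a, b) a@lo <= b && b <= a@hi)

i1 <- new("Interval", lo = 0, hi = 2)
r_interval <- overlaps(i1, new("Interval", lo = 1, hi = 3))
r_number <- overlaps(i1, 5)

# An invalid object is rejected at construction time.
bad <- tryCatch(new("Interval", lo = 3, hi = 1),
                error = function(e) conditionMessage(e))
```

If nothing of this sort is actually needed by the project, the sketch supports the point above: there is little reason to ask for the change.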

Gordon

>[Rd] S4 Classes
>Daniel Gerlanc dgerlanc at gmail.com
>Thu Aug 10 23:37:15 CEST 2006
>
>Hello All,
>
>I'm trying to convince someone that they should transition a large project
>to use S4 instead of S3 classes.  Does anyone have any good citations?
>Thanks!
>
>-- Dan Gerlanc
>
>--
>Daniel Gerlanc
>Williams College '07
