[Rd] R 2.1.1: read.table processes C-style escapes (PR#8037)

2005-07-26 Thread smyth
In R 2.1.1, the default behaviour of scan() was changed to process all 
C-style escapes, even when a delimiter was specified using the 'sep' 
argument. A new argument 'allowEscapes' was introduced to turn this 
processing off.

Because read.table() calls scan(), read.table() inherits the new default 
behaviour of scan() but without a way to turn it off. For example, reading 
a file 'testdata.txt' containing

X
A
\0
C

produces

 > read.delim("testdata.txt")
[1] X
<0 rows> (or 0-length row.names)

It seems that all the occurrences of scan() within read.table() need to have 
'allowEscapes=FALSE' added to the argument string.
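
A minimal sketch of the situation described above, assuming a version of R in which read.table() and its wrappers accept an 'allowEscapes' argument (later in this thread it is reported that R-devel already had one). The temporary file stands in for 'testdata.txt'. With allowEscapes=FALSE the backslash should be kept verbatim rather than interpreted as an escape.

```r
# Recreate the example file: a header line, then three one-field rows,
# the second of which is a literal backslash followed by a zero.
tmp <- tempfile(fileext = ".txt")
writeLines(c("X", "A", "\\0", "C"), tmp)

# allowEscapes = FALSE asks for backslashes to be read verbatim.
dat <- read.delim(tmp, allowEscapes = FALSE, stringsAsFactors = FALSE)
print(dat)
unlink(tmp)
```

Passing the argument explicitly, rather than relying on the default, keeps the behaviour the same across R versions whose defaults may differ.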

Gordon

 > version
          _
platform  i386-pc-mingw32
arch      i386
os        mingw32
system    i386, mingw32
status    Patched
major     2
minor     1.1
year      2005
month     07
day       22
language  R
-------
Dr Gordon K Smyth, Senior Research Scientist, Bioinformatics,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3050, Australia
Tel: (03) 9345 2326, Fax (03) 9347 0852,
Email: [EMAIL PROTECTED], www: http://www.statsci.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] read.table produces extra rows when file contains extra columns on (PR#9128)

2006-08-05 Thread smyth
Reading the following delimited file with read.csv() or read.table()

file1:
X,Y
1,2
2,4
3,6
4,8
5,10,,
6,12

produces a data.frame with 7 rows instead of 6 because the two extra values on 
line 6 of the file are pushed into a new row of the data.frame.  In other words, 
the extra columns on line 6 are interpreted as a second case on the same line.  
This contradicts the help ?read.table which states that cases correspond to 
lines.

A desirable behaviour might be to ignore the extra columns with a warning.  It 
would be nice though to be consistent with the behaviour reading the shorter 
file

file2:
X,Y
1,2
2,4,,
3,6

which currently produces an error.
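
One way to spot such lines before calling read.csv() is to count the fields on each line. The sketch below recreates file1 in a temporary file (the file name is arbitrary) and uses count.fields() from the utils package. It only detects the ragged line and does not change how read.csv() treats it.

```r
# Recreate 'file1' from the report: line 6 carries two extra empty fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c("X,Y", "1,2", "2,4", "3,6", "4,8", "5,10,,", "6,12"), tmp)

# count.fields() reports the number of fields found on each line.
n_fields <- count.fields(tmp, sep = ",")

# Lines whose field count differs from that of the header line.
bad_lines <- which(n_fields != n_fields[1])
print(bad_lines)
unlink(tmp)
```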

Gordon


> read.csv("file1.csv")
   X  Y
1  1  2
2  2  4
3  3  6
4  4  8
5  5 10
6 NA NA
7  6 12
> read.csv("file2.csv")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
more columns than column names
> sessionInfo()
Version 2.3.1 (2006-06-01)
i386-pc-mingw32

attached base packages:
[1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"  "base"



Re: [Rd] Citation for R

2005-06-13 Thread Gordon Smyth

>Note also that R does have a User Guide, i.e., while there is plenty of 
>excellent documentation,
>there is no single document which is a guide to the whole project.

Oops, I meant to write "R does not have a User Guide".

Just to explain this further, the citation() function asks me to cite a 
"Manual" with the title "R: A language and environment for statistical 
computing". Although R comes with excellent documentation, including at 
least 6 manuals on different aspects of the software, no manual or document 
with that title actually exists, as far as I know.

Gordon



Re: [Rd] Citation for R

2005-06-14 Thread Gordon Smyth
At 05:22 PM 14/06/2005, [EMAIL PROTECTED] wrote:
> >>>>> On Tue, 14 Jun 2005 08:42:59 +1000 (EST),
> >>>>> Gordon K Smyth (GKS) wrote:
>
>   > On Tue, June 14, 2005 12:49 am, Thomas Lumley said:
>   >> On Mon, 13 Jun 2005, Gordon K Smyth wrote:
>   >>
>   >>> This is just a note that R would get a lot more citations if the
>   >>> recommended citation was an article in a recognised journal or from a
>   >>> recognised publisher.
>   >>>
>   >>
>   >> This is unfortunately true, but R is *not* an article or a book, it is a
>   >> piece of software.  I don't think I'm the only person who thinks it is
>   >> counterproductive in the long run to encourage users to cite an article
>   >> that they probably haven't read instead of citing the software they
>   >> actually used.
>   >>
>   >> Jan's suggestion of the Journal of Statistical Software might provide a
>   >> solution, since JSS *does* publish software.
>   >>
>   >> -thomas
>
>   > In the biology world, it is common to publish an article
>   > announcing a software project, and to cite that.  The referees of
>   > the article are expected to try out and comment on the software.
>   > This gives the authors credit, and ensures that both the article
>   > and the software have been peer refereed, at least to a limited
>   > extent.
>
>How do you cite books in this world, or to put the question in another
>way: How do you make sure a book is peer-reviewed? After all it is
>quite easy to become a "publisher" and publish one's own books. Many
>university departments I know are registered ISBN publishers
>(including our department).  Must be hard to distinguish "real" books
>from others, I guess.
>
>Fritz

Books are cited as in the statistics literature but, naturally, there is a 
tendency to prefer references from more reputable sources. Hence a Wiley & 
Sons book would be preferred, other things being equal, to a book from a 
minor university press, which in turn would be preferred to a self-published 
book. Self-published electronic books are pretty much at the bottom of the 
pile. This doesn't mean that important references of this form can't be 
cited, but it does mean that one is pushing uphill.

Self-published software manuals are usually cited in-text, as are many 
other tools and technologies mentioned in the methods section, rather than 
in the references.

Thanks for pointing out, in a separate posting, that citation() refers to 
the "R Reference Manual".

Gordon



[Rd] boxplot() defaults {was "boxplot in extreme cases"}

2005-07-21 Thread Gordon Smyth

>[Rd] boxplot() defaults {was "boxplot in extreme cases"}
>Martin Maechler maechler at stat.math.ethz.ch
>Mon Nov 8 10:36:42 CET 2004
>
> AndyL> Try:
>
> AndyL> x <- list(x1=rep(c(0,1,2),c(10,20,40)), x2=rep(c(0,1,2),c(10,40,20)))
> AndyL> boxplot(x, pars=list(medpch=20, medcex=3))
>
> AndyL> (Cf ?bxp, pointed to from ?boxplot.)
>
>Good! Thank you, Andy.
>
>However,
>this is not the first time it had crossed my mind that R's
>default settings of drawing boxplot()s are not quite ok -- and
>that's why I've diverted to R-devel.
>
>Keeping Tufte's considerations in mind, (and me not really wanting
>to follow S-plus), shouldn't we consider to slightly change R's
>boxplot()ing such that
>
>boxplot(list(x1=rep(c(0,1,2),c(10,20,40)), x2=rep(c(0,1,2),c(10,40,20))))
>
>will *not* give two identical-looking boxplots?
>Also, the median should be emphasized more by default anyway.
>{The lattice function  bwplot() does it by only drawing a large
>  black ball as in Andy's example (and not drawing a line at all)}
>
>One possibility I'd see is to use a default 'medlwd = 3'
>either in boxplot() or in bxp(.) and hence, what you currently get by
>
>boxplot(list(x1=rep(c(0,1,2),c(10,20,40)), x2=rep(c(0,1,2),c(10,40,20))),
>medlwd=3)
>
>would become the default plotting in boxplot().
>Of course a smaller value "medlwd=2" would work too, but I'd
>prefer a bit more (3).
>
>Martin

Hi Martin,

I'm not sure this innovation (medlwd=3 default) is a good idea. Boxplots 
are designed to display many samples simultaneously on a graph, and it is 
important they be as clean and as simple as possible. To my eye, and to 
everyone in my lab, the thickened median line is rather distracting and 
makes the boxplots look more cluttered ("ugly" one of my postdocs said). 
The thickened line also goes against Tufte's principle of using minimum ink 
to represent the message.

Yours and Erich's point about distinguishing the median==1st quartile case 
from the median==3rd quartile case is well taken. How about making medlwd=3 
(or medlwd=2) the default behaviour only when the median coincides with one 
of the quartiles? That might satisfy everyone?

I notice that there wasn't any follow-up discussion of this post on the 
r-devel list. Did this suggestion get any support? The boxplots have been 
so well accepted in their current form for many, many years, decades even, 
so one should be especially cautious of making changes without some sort of 
consensus.

Best
Gordon

> > From: Erich Neuwirth
> >
> > I noticed the following:
> > the 2 datasets
> > rep(c(0,1,2),c(10,20,40)) and
> > rep(c(0,1,2),c(10,40,20))
> > produce identical boxplots despite the fact that the medians are
> > different. The reason is that the median in one case
> > coincides with the
> > first quartile, and in the second case with the third quartile.
> > Is there a recommended way of displaying the median visibly in these
> > cases? Setting notch=TRUE displays the median, but does look strange.

---
Dr Gordon K Smyth, Senior Research Scientist, Bioinformatics,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3050, Australia
Tel: (03) 9345 2326, Fax (03) 9347 0852,
Email: [EMAIL PROTECTED], www: http://www.statsci.org



Re: [Rd] R 2.1.1: read.table processes C-style escapes (PR#8037)

2005-07-26 Thread Gordon Smyth

>To: [EMAIL PROTECTED]
>From: Gordon Smyth <[EMAIL PROTECTED]>
>Subject: R 2.1.1: read.table processes C-style escapes
>Date: Wed, 27 Jul 2005 12:51:45 +1000
>
>In R 2.1.1, the default behaviour of scan() was changed to process all 
>C-style escapes, even when a delimiter was specified using the 'sep' 
>argument. A new argument 'allowEscapes' was introduced to turn this 
>processing off.
>
>Because read.table() calls scan(), read.table() inherits the new default 
>behaviour of scan() but without a way to turn it off. For example, reading 
>a file 'testdata.txt' containing
>
>X
>A
>\0
>C
>
>produces
>
> > read.delim("testdata.txt")
>[1] X
><0 rows> (or 0-length row.names)
>
>It seems that all the occurrences of scan() within read.table() need to 
>have 'allowEscapes=FALSE' added to the argument string.

Or, alternatively and perhaps better, scan() could regain some of its 
earlier behaviour, to process C-style escapes by default only when 'sep' is 
NULL or empty. It seems, to me at least, that C-style escape sequences make 
sense only in some sort of source code, and delimited text can't be source 
code.
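
The difference between the two settings can be seen directly with scan() on a delimited string. This is a small sketch using the 'text' argument of scan(), which is available in later versions of R than the one discussed here. The field contents are made up for illustration.

```r
# One comma-delimited record whose first field contains a backslash sequence.
# In R source, "\\t" denotes the two characters backslash and t.
line <- "alpha\\tbeta,2"

# Escapes processed: the backslash-t pair is decoded to a single tab character.
decoded <- scan(text = line, what = "", sep = ",",
                allowEscapes = TRUE, quiet = TRUE)

# Escapes read verbatim: the backslash sequence survives unchanged.
verbatim <- scan(text = line, what = "", sep = ",",
                 allowEscapes = FALSE, quiet = TRUE)

print(nchar(decoded))
print(nchar(verbatim))
```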

Gordon

>Gordon
>
> > version
>          _
>platform  i386-pc-mingw32
>arch      i386
>os        mingw32
>system    i386, mingw32
>status    Patched
>major     2
>minor     1.1
>year      2005
>month     07
>day       22
>language  R



Re: [Rd] R 2.1.1: read.table processes C-style escapes (PR#8037)

2005-07-27 Thread Gordon Smyth
Thanks for the reply. Apologies for checking only R patched rather than 
R-devel.

I guess that this means that someone must have a use for allowEscapes=TRUE 
when reading a file in table format into a data frame. It is hard to 
imagine.

Gordon

At 04:59 PM 27/07/2005, Prof Brian Ripley wrote:
>This seems of historical interest only.
>
>allowEscapes is currently (in R-devel, where development happens) an 
>argument to read.table.
>
>We do ask people to check the current version!



Re: [Rd] R 2.1.1: read.table processes C-style escapes (PR#8037)

2005-07-27 Thread Gordon Smyth
At 05:21 PM 27/07/2005, Prof Brian Ripley wrote:
>On Wed, 27 Jul 2005, Gordon Smyth wrote:
>
>>Thanks for the reply. Apologies for checking only R patched rather than 
>>R-devel.
>>
>>I guess that this means that someone must have a use for 
>>allowEscapes=TRUE when reading a file in table format into a data 
>>frame. It is hard to imagine.
>
>Octal and hex escapes are a common way to cope with different encodings.

Thanks.

>I would not want R limited by the imagination of a single user, let alone 
>a Windows-only user.

Octal and hex escapes are not the only uses of backslashes in the computing 
world, and not only in Windows. TeX markup is one example. In the 
bioinformatics world, a great variety of ways are used to represent gene 
annotation information in text files, some of them including backslashes. I 
find that people find ways to throw all sorts of things at my code.

If 'allowEscapes=FALSE' was the default in read.table, then all legacy code 
using read.table() would continue to behave as in the past. With 
'allowEscapes=TRUE', all code ever written which uses read.table(), and 
which has to cope with backslashes which are not octal or hex escapes, will 
now be broken. This is not in any sense a complaint, nor a wish that R 
should remain static or be limited by my imagination. Just a plea that, 
other things being equal, not breaking legacy code will be a consideration.

Gordon



[Rd] fuzzy c-means algorithm in e1071

2011-07-15 Thread Tim Smyth
I was wondering if it was possible to add an extra option to the cmeans
function please?  In a recent paper by Moore et al. (2009) they state:

"The fuzzy membership function is: 1 - F(Z2), where Z2 is a squared
Mahalanobis distance and F is a cumulative chi-square distribution.  The
Mahalanobis distance is the multivariate equivalent of the standardised
random variable Z = (X - M)/S, which is the distance of the univariate
random variable X from its mean M normalised by the standard deviation S. 
In other words, the Mahalanobis distance is a weighted form of the
Euclidean, and is preferable because it incorporates the shape of the
distribution of points around the cluster centre (i.e., the geometric shape
of the point cloud expressed in terms of variance). The fuzzy membership
ranges from 0 to 1 ..."

so as well as having a dist = "euclidean" option, would it be possible to
have a dist = "mahalanobis" option please?  I realise that this is quite an
ask - you could even point me in the right direction of where to start
editing code and getting it compiled on my own machine.

Tim Smyth

ref: T.S. Moore et al. Remote Sensing of Environment 113 (2009) 2424-2430 
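
The membership function quoted above, 1 - F(Z2), can be written in a few lines of base R. This is only a sketch of the formula, not of cmeans() or of the method in the cited paper. The function name fuzzy_membership() is hypothetical, and taking the chi-square degrees of freedom to be the number of variables is an assumption that the quotation does not state.

```r
# Fuzzy membership 1 - F(Z2): Z2 is the squared Mahalanobis distance to a
# cluster centre and F is a chi-square cumulative distribution function.
# ASSUMPTION: degrees of freedom = number of variables (columns of x).
fuzzy_membership <- function(x, centre, covariance) {
  z2 <- mahalanobis(x, center = centre, cov = covariance)  # squared distance
  1 - pchisq(z2, df = ncol(x))
}

# Illustrative data only: two variables, a single cluster.
set.seed(1)
pts <- cbind(rnorm(50), rnorm(50, sd = 2))
centre <- colMeans(pts)

# Membership of the centre itself (first element) and of each point.
m <- fuzzy_membership(rbind(centre, pts), centre, cov(pts))
print(range(m))
```

Points at the centre get membership 1, and membership falls towards 0 as the Mahalanobis distance grows.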



[Rd] read.table with more cols than headers

2006-08-01 Thread Gordon Smyth
I am trying to understand the behaviour of read.table() reading 
delimited files (with header=TRUE and fill=TRUE) when there are more 
(possibly spurious) columns than headings.  I give below four small 
data files, all of which have one or two extra columns added to one 
line.  Reading the first file produces an error message, the second 
produces a column of NA, the third adds an extra row, the fourth 
ignores the extra columns with no message and no NA.  Most 
unintuitive!  Here are my attempts to understand this, with questions 
interpolated.

The behaviour on the first file seems self-explanatory.  The number 
of headings determines the number of columns, and extra data columns 
are not allowed.  (On the other hand, the help ?read.table says that 
the number of columns is determined from the first five rows, which 
suggests that the header line is not the only determiner.  If 
headers, when present, are indeed the only determiner, perhaps this 
should be mentioned in the help.  Are headers actually equivalent to 
specifying the same set of names using the col.names argument?)

For the second file, the first column is being taken as row 
names.  This agrees with the help which says if "the header line has 
one less entry than the number of columns, the first column is taken 
to be the row names".  OK, perhaps not the ideal solution for this 
data file, but clearly documented behaviour.

In the third file, the extra columns are being taken to be a new 
row.  This seems wrong, because the help says that cases correspond 
to lines.  There is no suggestion in the documentation that a line of 
the file could contain multiple cases.  This is the result I have 
most trouble with.  I guess I could prevent this behaviour by flush=TRUE.
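
A short sketch of the flush=TRUE idea, using a temporary copy of the third file. It relies on the documented behaviour of scan(), where 'flush' discards the remainder of a line once the last requested field has been read. Passing 'flush' through read.csv() to read.table() is assumed to work as in current R.

```r
# Recreate the third file: line 6 has two extra empty fields.
tmp <- tempfile(fileext = ".txt")
writeLines(c("X,Y", "a,2", "b,4", "c,6", "d,8", "e,10,,", "f,12"), tmp)

# flush = TRUE discards anything left on a line once the last expected
# field has been read, so the extra fields cannot start a new row.
dat <- read.csv(tmp, flush = TRUE)
print(dat)
unlink(tmp)
```

The cost is that the extra fields are dropped silently, so this suits files where the trailing columns are known to be spurious.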

File 4 is curious.  Here the number of columns has been determined, 
using the first 5 rows of the file, to be two.  The extra column on 
line 6 can't change this, so the first column doesn't become row 
names.  But in that case, shouldn't the extra column found on line 6 
produce an error message, same as for file 1?

Specifying colClasses to be a vector of length more than 2 when 
reading file 3 will produce a result similar to file 4, but with a 
warning.  It is not clear to me why colClasses should have an 
influence, since it doesn't change the determination of the number of 
columns.  Why a warning here, but an error for file 1 and no message 
for file 4?

Any comments gratefully received.
Gordon

test1.txt:
X,Y
a,2
b,4,,
c,6

test2.txt:
X,Y
a,2
b,4,
c,6

test3.txt:
X,Y
a,2
b,4
c,6
d,8
e,10,,
f,12

test4.txt:
X,Y
a,2
b,4
c,6
d,8
e,10,
f,12

 > read.csv("test1.txt")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
 more columns than column names
 > read.csv("test2.txt")
   X  Y
a 2 NA
b 4 NA
c 6 NA
 > read.csv("test3.txt")
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6   NA
7 f 12
 > read.csv("test4.txt")
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6 f 12
 > read.csv("test3.txt",colClasses=c(NA,NA))
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6   NA
7 f 12
 > read.csv("test3.txt",colClasses=c(NA,NA,NA,NA))
   X  Y
1 a  2
2 b  4
3 c  6
4 d  8
5 e 10
6 f 12
Warning message:
cols = 2 != length(data) = 4 in: read.table(file = file, header = header, sep = sep, quote = quote,

 > sessionInfo()
R version 2.4.0 Under development (unstable) (2006-07-25 r38698)
i386-pc-mingw32

locale:
LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252

attached base packages:
[1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"  "base"



Re: [Rd] S4 Classes

2006-08-10 Thread Gordon Smyth
Is there a capability that you would like for the package which could 
be achieved only if the package was transitioned to S4? If so, 
explain this to the author. If not, why ask them to change?

Gordon

>[Rd] S4 Classes
>Daniel Gerlanc dgerlanc at gmail.com
>Thu Aug 10 23:37:15 CEST 2006
>
>Hello All,
>
>I'm trying to convince someone that they should transition a large project
>to use S4 instead of S3 classes.  Does anyone have any good citations?
>Thanks!
>
>-- Dan Gerlanc
>
>--
>Daniel Gerlanc
>Williams College '07



[Rd] Citation for R

2005-06-12 Thread Gordon K Smyth
This is just a note that R would get a lot more citations if the recommended 
citation was an article in a recognised journal or from a recognised publisher.

I use R in work leading to publications often, and I strongly want to give the 
R core team credit for their work.  However I find that I can't persuade my 
biological collaborators to include the current R citation (below) in their 
reference lists, because it is not an article in a recognised journal nor from 
a recognised publisher.  I can cite the 1996 paper by Ihaka and Gentleman, and 
sometimes this is what I do, but I'd really like to give credit to the other R 
core members as well, for example the CRAN people and those involved in the 
Windows version.

I know this is more work for the R team, like everything else, but an article 
on the story of R since the creation of the core team would be really nice to 
see.

> citation()

To cite R in publications use:

  R Development Core Team (2005). R: A language and environment for statistical
  computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN
  3-900051-07-0, URL http://www.R-project.org.


Gordon



Re: [Rd] Citation for R

2005-06-13 Thread Gordon K Smyth
On Tue, June 14, 2005 1:30 am, Ted Harding said:
> Is a journal reference necessary? I have seen many articles where
> the statistical software (S-Plus, SPSS, SAS, etc.) was "cited" as
> the User Manual, usually only available from the supplier of the
> software, sometimes with a WWW URL. Such cases provide a precedent
> for R, surely. I have also seen cases where the "citation" was
> simply the name of the company (with location, version, date etc.)
>
> As an example which, as software, is closer to home, consider the
> following quotation, and the corresponding citation:
>
> Quotation:
>   "This is essentially a four-level hierarchical model and
>is easily implemented in, say, WinBUGS (Spiegelhalter,
>Thomas and Best, 2000)." [p. 13 of Source reference below]
>
> Citation Reference:
>   Spiegelhalter, D. J., Thomas, A. and Best, N. G. (2000)
> WinBUGS Version 1.3 User Manual. Cambridge: Medical Research
> Council Biostatistics Unit.
> (Available from www.mrc-bsu.cam.ac.uk/bugs.)
>
> Source:
>   David J. Spiegelhalter, Paul Aylin, Nicola G. Best,
>   Stephen J. W. Evans, Gordon D. Murray (2002).
> Commissioned analysis of surgical performance by using
> routine data: lessons from the Bristol inquiry.
>   J. R. Statist. Soc. A (2002) 165, Part 2, pp. 1-31)
>
> Surely this would do? Does R need more justification than
> WinBUGS? Are JRSS citations less canonical than other journals?

Yes, JRSSB citations are less canonical than citations in medical biology 
journals.  Citations are treated very seriously in the medical biology world, 
where impact factors and citation counts are quoted in promotion and grant 
applications, and there is a reluctance to cite non-refereed publications.  An 
article in Nature would probably not include WinBUGS in the reference list but 
would simply say in the text that computations were done using "WinBUGS 
software (Medical Research Council Biostatistics Unit, Cambridge, 
www.mrc-bsu.cam.ac.uk/bugs)".  This might be good enough for the WinBUGS 
people, but it would not get the WinBUGS manual into the citation system.

Note also that R does have a User Guide, i.e., while there is plenty of 
excellent documentation, there is no single document which is a guide to the 
whole project.

Gordon

> Best wishes to all,
> Ted.
>
>
> 
> E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
> Fax-to-email: +44 (0)870 094 0861
> Date: 13-Jun-05   Time: 16:30:35
> -- XFMail --



Re: [Rd] Citation for R

2005-06-13 Thread Gordon K Smyth
On Tue, June 14, 2005 12:49 am, Thomas Lumley said:
> On Mon, 13 Jun 2005, Gordon K Smyth wrote:
>
>> This is just a note that R would get a lot more citations if the
>> recommended citation was an article in a recognised journal or from a
>> recognised publisher.
>>
>
> This is unfortunately true, but R is *not* an article or a book, it is a
> piece of software.  I don't think I'm the only person who thinks it is
> counterproductive in the long run to encourage users to cite an article
> that they probably haven't read instead of citing the software they
> actually used.
>
> Jan's suggestion of the Journal of Statistical Software might provide a
> solution, since JSS *does* publish software.
>
>   -thomas

In the biology world, it is common to publish an article announcing a software 
project, and to cite that.  The referees of the article are expected to try out 
and comment on the software.  This gives the authors credit, and ensures that 
both the article and the software have been peer refereed, at least to a 
limited extent.

If the issue of formal citation doesn't worry the R core team, I won't worry 
about it either.  I'm currently revising an article going to Nature which will 
simply say in the text that quantities "were calculated using R statistical 
software (http://www.r-project.org)".

Gordon



Re: [Rd] Standardized Pearson residuals (and score tests)

2011-03-16 Thread Gordon K Smyth

Hi Peter and others,

If it helps, I wrote a small function glm.scoretest() for the statmod 
package on CRAN to compute score tests from glm fits.  The score test for 
adding a covariate, or any set of covariates, can be extracted very neatly 
from the standard glm output, although you probably already know that.


Regards
Gordon

-
Professor Gordon K Smyth,
NHMRC Senior Research Fellow,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
sm...@wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth


Date: Tue, 15 Mar 2011 12:17:46 +0100
From: peter dalgaard 
To: Brett Presnell 
Cc: r-devel@r-project.org
Subject: Re: [Rd] Standardized Pearson residuals


On Mar 15, 2011, at 04:40 , Brett Presnell wrote:

Background: I'm currently teaching an undergrad/grad-service course 
from Agresti's "Introduction to Categorical Data Analysis (2nd edn)" 
and deviance residuals are not used in the text.  For now I'll just 
provide the students with a simple function to use, but I prefer to 
use R's native capabilities whenever possible.


Incidentally, chisq.test will have a stdres component in 2.13.0 for 
much the same reason.


Thank you.  That's one more thing I won't have to provide code for 
anymore.  Coincidentally, Agresti mentioned this to me a week or two 
ago as something that he felt was missing, so that's at least two 
people who will be happy to see this added.




And of course, I was teaching a course based on Agresti & Franklin: 
"Statistics, The Art and Science of Learning from Data", when I realized 
that R was missing standardized residuals.



It would also be nice for teaching purposes if glm or summary.glm had a 
"pearsonchisq" component and a corresponding extractor function, but I 
can imagine that there might be arguments against it that haven't 
occurred to me.  Plus, I doubt that anyone wants to touch glm unless 
it's to repair a bug. If I'm wrong about all that though, ...


Hmm, how would that work? If there was one, I'd worry that people would 
start subtracting them which is usually not the right thing to do. I do 
miss having a test on the residual deviance occasionally (even though it 
is only sometimes meaningful), having to fit a saturated model 
explicitly can be a bit silly. E.g. in this case (homogeneity of birth 
rates):



anova(glm(births~month,poisson,data=bb), test="Chisq")

...
      Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                     11     225.98
month 11   225.98         0       0.00 < 2.2e-16 ***

anova(glm(births~1,poisson,data=bb), test="Chisq")

...
     Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                    11     225.98

Notice that the latter version gives me the correct deviance but no 
p-value.
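
For the homogeneity example, both the residual deviance test and the Pearson chi-square can be obtained from the null fit alone, without fitting the saturated model. The sketch below uses made-up monthly counts because the 'bb' data from the thread are not available. Whether the chi-square reference distribution is appropriate depends on the data, as noted above.

```r
# Hypothetical monthly birth counts; the 'bb' data are not available.
bb <- data.frame(
  month  = factor(month.abb, levels = month.abb),
  births = c(310, 290, 330, 320, 335, 340, 360, 355, 350, 330, 300, 305)
)

# Null (homogeneous rate) model only.
fit <- glm(births ~ 1, family = poisson, data = bb)

# Test on the residual deviance against chi-square on the residual df.
dev_p <- pchisq(deviance(fit), df = df.residual(fit), lower.tail = FALSE)

# Pearson chi-square statistic from the Pearson residuals, and its p-value.
pearson_x2 <- sum(residuals(fit, type = "pearson")^2)
pearson_p  <- pchisq(pearson_x2, df = df.residual(fit), lower.tail = FALSE)

print(c(deviance = deviance(fit), pearson = pearson_x2,
        df = df.residual(fit)))
```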



A better support for generic score tests could be desirable too. I 
suspect that this would actually be the Pearson Chi-square in the 
interesting cases.


--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com




Re: [Rd] Standardized Pearson residuals (and score tests)

2011-03-17 Thread Gordon K Smyth

On Thu, 17 Mar 2011, peter dalgaard wrote:


On Mar 16, 2011, at 23:29 , Gordon K Smyth wrote:


Hi Peter and others,

If it helps, I wrote a small function glm.scoretest() for the statmod 
package on CRAN to compute score tests from glm fits.  The score test 
for adding a covariate, or any set of covariates, can be extracted very 
neatly from the standard glm output, although you probably already know 
that.


Thanks Gordon,

I'll have a look. It's the kind of thing where you _strongly suspect_ 
that a neat solution exists, but where you can't just write it down 
immediately. Looks like your code needs some elaboration to handle 
factor terms and more general model reductions, though.


Yes, the glm.scoretest() function is very basic.  At the moment it tests 
for adding a single covariate at a time, i.e., a 1 df test, or several 1 
df tests.  If you like, I could add a multiple column version that would 
work for factors etc, it would be only one more line or so.  I figure you'd 
want to pull code out of glm.scoretest() rather than call it explicitly 
anyway.
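
For readers on later versions of R, a multi-df score test for adding a factor term is available through anova() with test = "Rao" in the stats package. The sketch below uses that route rather than glm.scoretest(), whose interface is not reproduced here. The counts are illustrative only.

```r
# Illustrative counts in a 3 x 3 layout.
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)

# Smaller model, and the model with a 2-df factor term added.
fit0 <- glm(counts ~ treatment, family = poisson)
fit1 <- glm(counts ~ treatment + outcome, family = poisson)

# test = "Rao" requests the score test for the added term.
score <- anova(fit0, fit1, test = "Rao")
print(score)
```

The score statistic appears in the "Rao" column of the resulting table.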


Gordon


-pd



Regards
Gordon

-
Professor Gordon K Smyth,
NHMRC Senior Research Fellow,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
sm...@wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth



--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com






[Rd] Will there be 2016 issues of The R Journal?

2016-08-05 Thread Gordon K Smyth
The R Journal home page doesn't make any promises about how frequently the 
journal will be published.  Historically, though, there have been issues in 
June and December of each year.  The June issue has always appeared by this 
time (6 August) in previous years.


Has there been a change in the publication schedule?  Are there still plans 
for a June 2016 issue?


Thanks
Gordon

-
Professor Gordon K Smyth,
Head, Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
http://www.statsci.org/smyth


[Rd] Changing style for the Sweave vignettes

2014-11-14 Thread Gordon K Smyth

Date: Thu, 13 Nov 2014 12:09:47 +0100
From: January Weiner 
To: r-devel 
Subject: [Rd] Changing style for the Sweave vignettes

As a user, I am always annoyed beyond measure that Sweave vignettes
precede the code by a command line prompt. It makes running examples
by simple copying of the commands from the vignette to the console a
pain. I know the idea is that it is clear what is the command, and
what is the output, but I'd rather precede the output with some kind
of marking.

Is there any other solution possible / allowed in vignettes? I would
much prefer to make my vignettes easier to use for people like me.

Kind regards,
j.


There are different types of people, and some find the default Sweave 
format easier.


For a beginner, it is best that the code and output should look the same 
in the Sweave vignette as it would look on the screen during an actual 
session.


Windows users have access to "paste commands only".  For them the default 
Sweave vignette format is perfectly convenient -- one can easily cut and 
paste whole pages straight from pdf into the R session.


Perhaps a "paste commands only" app for Unix would keep everyone happy :)

Gordon
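
For vignette authors who do want prompt-free code, one possibility is to change the prompt options before the code chunks are echoed. This sketch assumes that Sweave takes its prompts from the 'prompt' and 'continue' options, which is worth checking against the Sweave documentation for the R version in use.

```r
# Typically placed in an early, hidden code chunk of the vignette.
# Replace the "> " and "+ " prompts with a single space.
old <- options(prompt = " ", continue = " ")

# ... later code chunks would be echoed with the new prompts ...

# Restore the previous prompts when done.
options(old)
```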



[Rd] Using a function from splines.c in our package

2012-07-09 Thread Gordon K Smyth

Dear all,

I'm writing to ask for advice as to best practice.  A PhD student working 
with me is writing C++ code that we hope to make public as src code in our 
Bioconductor package edgeR.  He wants to call the function fmm_spline, 
which is part of the source code for the stats package


http://svn.r-project.org/R/trunk/src/library/stats/src/splines.c,

from his C++ code.  This function isn't one of the entry points for C code 
documented in Chapter 6 of the "Writing R extensions" manual.


We haven't figured out a way to call the fmm_spline function directly from 
our C++ code.  Is there a way that we have missed?


Can we simply copy the fmm_spline function into our C++ code and declare 
where we got it from?  Should we include the license declaration from the 
header of splines.c?  Is there anything else we need to do to satisfy 
copyright and be good citizens?


Thanks a lot
Gordon

-
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
http://www.statsci.org/smyth
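
For reference, the same interpolation method is reachable from R code
through the documented interface stats::splinefun() with method = "fmm".
This is only the R-level route, not a way of calling fmm_spline from C++,
and the data below are made up for illustration.

```r
# Made-up data: five knots on a smooth curve.
x <- c(0, 1, 2, 3, 4)
y <- c(0, 1, 4, 9, 16)

# splinefun() returns a function that evaluates the interpolating spline;
# method = "fmm" selects the Forsythe, Malcolm and Moler spline.
f <- splinefun(x, y, method = "fmm")

at_knots <- f(x)    # an interpolating spline passes through the knots
between <- f(2.5)   # value between two knots
```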



Re: [Rd] Using a function from splines.c in our package

2012-07-09 Thread Gordon K Smyth
Thanks.  Our package has LGPL (>= 2) licence, so we'll copy with header 
and proper acknowledgements.


Gordon

On Mon, 9 Jul 2012, Prof Brian Ripley wrote:

> On Mon, 9 Jul 2012, Gordon K Smyth wrote:
>
>> Dear all,
>>
>> I'm writing to ask for advice as to best practice.  A PhD student working
>> with me is writing C++ code that we hope to make public as src code in our
>> Bioconductor package edgeR.  He wants to call the function fmm_spline,
>> which is part of the source code for the stats package
>>
>> http://svn.r-project.org/R/trunk/src/library/stats/src/splines.c,
>>
>> from his C++ code.  This function isn't one of the entry points for C code
>> documented in Chapter 6 of the "Writing R extensions" manual.
>
> None of those are in a package.
>
>> We haven't figured out a way to call the fmm_spline function directly from
>> our C++ code.  Is there a way that we have missed?
>
> Not a very portable way, but there are some ideas in 'Writing R Extensions'.
> I don't think they are enough for a CRAN or BioC package, though.
>
>> Can we simply copy the fmm_spline function into our C++ code and declare
>> where we got it from?  Should we include the license declaration from the
>> header of splines.c?  Is there anything else we need to do to satisfy
>> copyright and be good citizens?
>
> If your package has a compatible licence, you can (and you must include the
> whole header when you copy).   CRAN's policies say that when you copy code
> from elsewhere you must include that code's authors in the Authors field of
> your package, and that clearly is good practice.
>
>> Thanks a lot
>> Gordon
>>
>> -----
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> Tel: (03) 9345 2326, Fax (03) 9347 0852,
>> http://www.statsci.org/smyth
>
> --
> Brian D. Ripley,                  rip...@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595





[Rd] documentation cross references under R 2.10.0dev for Windows

2009-09-27 Thread Gordon K Smyth
\linkS4class{} expands to
\link[=-class]{}.  From (B) it then follows that
there must be an \alias{-class} and a \name{}.


Q1. Is that correct?  To me it looks a bit inconsistent.


No, \name{} is irrelevant for links.  It's the filename that matters in the 
3rd form.




Q2. Are there more?

Q3. Will there be more?

Q4. What about

\link[=]{}
\link[:]{}

where  can be (almost) any string?


The first is what the 2nd form refers to.  "Name" there is what is displayed 
in the file making the link.


The second is new, as of 2.10.0, and is the fallback if a filename matching 
 is not found.


Q4. Are (A) and (B) only supposed to be used for linking within a
package, or can it be used to link to "wherever"  exist?


They should work anywhere.  The difficulty arises if you link to something 
that a user doesn't have installed, or if the link is ambiguous.



Q5. It sounds that (C) and (D) should be avoided.  Is that correct?


I think good practice is to make sure that the base of the filename (less 
.Rd) is also an alias in the file, and also the \name{} of the file.  The 
system would probably be less confusing if this were forced, but there are 
lots of files out there where it's not true.


You want the filename to be an alias because links sometimes go to aliases 
and sometimes to filenames; you want the name to match because that's what is 
displayed at the top of the page, so people might remember "just go to the 
Foo man page".
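
The good practice described above can be checked mechanically. The
following is only a sketch under stated assumptions: the file foo.Rd and
its contents are invented for illustration, and the check relies on
tools::parse_Rd() tagging each top-level section with an "Rd_tag"
attribute.

```r
# Write a tiny, made-up Rd file whose base name, \name and \alias agree.
rd_file <- file.path(tempdir(), "foo.Rd")
writeLines(c("\\name{foo}", "\\alias{foo}", "\\alias{bar}",
             "\\title{Foo}", "\\description{A made-up help page.}"),
           rd_file)

# Parse the file and read off the tag of every top-level element.
rd <- unclass(tools::parse_Rd(rd_file))
tags <- vapply(rd, function(el) {
  tag <- attr(el, "Rd_tag")
  if (is.null(tag)) "" else tag
}, character(1))

# Collapse the contents of one section into a single trimmed string.
section_text <- function(el) trimws(paste(unlist(el), collapse = ""))

rd_name <- section_text(rd[tags == "\\name"][[1]])
aliases <- vapply(rd[tags == "\\alias"], section_text, character(1))
base <- tools::file_path_sans_ext(basename(rd_file))

# TRUE when the file base name is both the \name and one of the aliases.
follows_practice <- identical(rd_name, base) && base %in% aliases
```

Running such a check over every file in a package's man directory would
flag the pages for which \link[pkg]{topic} depends on the 2.10.0 fallback.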



Q6. What if  exist in two packages 'pkgA' and 'pkgB' and I want
to specify that I mean topic  of package 'pkgA', cf namespaces
and pkgA::foo()?


If you follow the good practice above, then use \link[pkgA]{topic}.  If you 
don't follow that practice, you may be out of luck, because R will look for 
the filename topic.Rd in pkgA, not \alias{topic}.  However, as of 2.10.0, it 
will fall back to the latter.



Q7. In the 1st paragraph above it says "(possibly in another package)"
and in the 3rd paragraph above it is mentioned that "The only reason to
use these forms [...] is to force a reference to a package that might
be further down the search path" - is that the answer to Q4?  Will
\link{} be *dynamically* linked to whatever comes first on the
search() path - to reflect the running environment rather than the
intention of the document?


In 2.10.0, I believe this will be the case (but I'd have to check the code to 
be sure).  I'd recommend being explicit if you are worried about this 
possibility.




Reading between the lines, the development of Rd looks exciting.
Instead of second-guessing where we are heading, could someone in charge
please give some thoughts on what the plans are and an estimate on how
long it will take before we are there - what R version?


I don't think there's really someone "in charge", but I've been closely 
involved with this, so I'll give some thoughts.


Generally speaking, we have releases on a regular schedule, we don't hold 
them up for particular features.  So I don't think it would be possible to 
figure out when development on Rd files will be done.  It depends on when 
people have the time to do what they think needs doing, and to a large 
extent, that depends on how things get used.


Some things that are not there now, but which might be there in the future 
(i.e. later than 2.10.0):


- Better support for the \Sexpr macros (which let the content of man pages 
depend on R code, executed just before rendering).  Right now there's no 
special support for that R code; it would make sense to define some functions 
to make writing such stuff easier.  (This is something that could be done in 
a contributed package, it needn't be in base R.)


- Improved prompt() and package.skeleton() functions to take advantage of 
the above.


- Graphs in man pages.

- Ways to link from man pages to vignettes.  The reverse would be nice, but 
it's not possible with the current design, so that would be far off.


- Some general rationalization of the whole help system.

Duncan Murdoch



MISC:
I understand that \link[=-class]{} is part of
standard Rd conventions, but to the best of my knowledge
\link[=.class]{} is not, correct?  I would like
to suggest to write a separate paragraph for S4 classes without
mentioning S3 classes.  The following also adds to the confusion -
there exists one Rd page with \name{terms} and one with
\name{terms.object}, so it is not really clear what
\link[=terms.object]{terms} is strictly supposed to do - is it of form
\link[=]{} or \link[=]{}.  Maybe it is
helpful to clarify what the static/dynamic link will be and what will
be displayed.

Thanks

Henrik

PS. This is related to today's (Sept 23, 2009) BioConductor posts by
Gordon Smyth - "[Bioc-devel] BioC 2.5: "suspect" interpackage links";
https://stat.ethz.ch/pipermail/bioc-devel/2009-September/001975.html



Re: [Rd] documentation cross references under R 2.10.0dev for Windows

2009-09-28 Thread Gordon K Smyth
With one exception, all warnings go away when I download the relevant 
Bioconductor packages as source code and re-build them (rcmd INSTALL 
--build) on my own machine.


The warnings re-appear if I install the Bioconductor packages in the 
normal way using biocLite("Biobase") etc.  I will follow this up with the 
Bioconductor people.


The one exception is the self-reference to limma:00Index.  This is marked as 
a missing link, under Windows only, although it works fine.


Gordon

On Mon, 28 Sep 2009, Gordon K Smyth wrote:

Rcmd check under R 2.10.0dev for Windows seems to be issuing a number of 
spurious warning messages about Rd cross-references.


The following warning messages appear when checking the latest (non-public) 
version of the Bioconductor package limma.  They appear only under Windows, 
not Unix or Mac.  All the flagged links appear to be ok, in that they 
specify a genuine html file, and should therefore not be marked as suspect 
or missing.


Regards
Gordon

* using R version 2.10.0 Under development (unstable) (2009-09-27 r49846)
* using session charset: ISO8859-1
* checking Rd cross-references ... WARNING
Missing link(s) in documentation object './man/01Introduction.Rd':
 '[limma:00Index]{LIMMA contents page}'

Suspect link(s) in documentation object './man/asmalist.Rd':
 '[marray:marrayNorm-class]{marrayNorm}'

Suspect link(s) in documentation object './man/asmatrix.Rd':
 '[Biobase]{exprs}'

Suspect link(s) in documentation object './man/dupcor.Rd':
 '[statmod]{mixedModel2Fit}'

Suspect link(s) in documentation object './man/EList.Rd':
 '[Biobase]{ExpressionSet-class}'

Suspect link(s) in documentation object './man/imageplot.Rd':
 '[marray]{maImage}'

Suspect link(s) in documentation object './man/intraspotCorrelation.Rd':
 '[statmod]{remlscore}'

Suspect link(s) in documentation object './man/limmaUsersGuide.Rd':
 '[Biobase]{openPDF}' '[Biobase]{openVignette}' '[base]{Sys.putenv}'

Suspect link(s) in documentation object './man/malist.Rd':
 '[marray:marrayNorm-class]{marrayNorm}'

Suspect link(s) in documentation object './man/normalizebetweenarrays.Rd':
 '[marray:maNormScale]{maNormScale}' '[affy:normalize]{normalize}'

Suspect link(s) in documentation object './man/normalizeWithinArrays.Rd':
 '[marray:maNorm]{maNorm}'

Suspect link(s) in documentation object './man/normexpfit.Rd':
 '[affy:bg.adjust]{bg.parameters}'

Suspect link(s) in documentation object './man/readgal.Rd':
 '[marray:read.Galfile]{read.Galfile}'

Suspect link(s) in documentation object './man/rglist.Rd':
 '[marray:marrayRaw-class]{marrayRaw}'



On Wed, 23 Sep 2009, Duncan Murdoch wrote:


On 23/09/2009 10:08 PM, Henrik Bengtsson wrote:

Hi,

in 'Writing R Extensions' of R v2.10.0, under Section
'Cross-references' (2009-09-07) it says:

1. "The markup \link{foo} (usually in the combination
\code{\link{foo}}) produces a hyperlink to the help for foo. Here foo
is a topic, that is the argument of \alias markup in another Rd file
(possibly in another package)."

2. "You can specify a link to a different topic than its name by
\link[=dest]{name} which links to topic dest with name name. This can
be used to refer to the documentation for S3/4 classes, for example
\code{"\link[=abc-class]{abc}"} would be a way to refer to the
documentation of an S4 class "abc" defined in your package, and
\code{"\link[=terms.object]{terms}"} to the S3 "terms" class (in
package stats). To make these easy to read, \code{"\linkS4class{abc}"}
expands to the form given above."

3. "There are two other forms of optional argument specified as
\link[pkg]{foo} and \link[pkg:bar]{foo} to link to the package pkg, to
files foo.html and bar.html respectively. These are rarely needed,
perhaps to refer to not-yet-installed packages (but there the HTML
help system will resolve the link at run time) or in the normally
undesirable event that more than one package offers help on a topic
(in which case the present package has precedence so this is only
needed to refer to other packages). They are only used in HTML help
(and ignored for hyperlinks in LaTeX conversions of help pages), and
link to the file rather than the topic (since there is no way to know
which topics are in which files in an uninstalled package). The *only*
reason to use these forms for base and recommended packages is to
force a reference to a package that might be further down the search
path. Because they have been frequently misused, as from R 2.10.0 the
HTML help system will look for topic foo in package pkg if it does not
find file foo.html."


Trying to summariz

[Rd] How to capture t-score and p-values from t.test

2006-03-26 Thread Gordon K Smyth
> Date: Sat, 25 Mar 2006 16:57:50 -0400
> From: Kjetil Brinchmann Halvorsen
>   <[EMAIL PROTECTED]>
> Subject: Re: [Rd] How to capture t-score and p-values from t.test
> To: "Bernzweig, Bruce \(Exchange\)" <[EMAIL PROTECTED]>
> Cc: r-devel@r-project.org
>
> Bernzweig, Bruce (Exchange) wrote:
>> When I do t.test on two distributions (see example below), it outputs
>> numerous data about the t.test.
>>
>> What I'd like to do is individually capture some of this data and assign
>> it to other variables.
>>
>> However, I am unable to find anything in the help section.
>
> t.test returns an object of class "htest", but ?htest :
>  > ?htest
> No documentation for 'htest' in specified packages and libraries:
> you could try 'help.search("htest")'
> but that does not do anything either.
>
> Some time ago, I wrote a helpfile for htest, but that was rejected,
> since "S3 classes are not usually documented".
>
> Kjetil

The "htest" object created by t.test() is fully documented in the "Value"
section of help("t.test").  In effect, this defines what a "htest" object
is understood to be.

help.search("htest") brings up a list of every function which produces an 
"htest" object.

Gordon
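
As a short illustration of the point above, the components listed in the
"Value" section of help("t.test") can be extracted from the returned object
by name. The two samples below are simulated purely for the example.

```r
# Simulated samples, for illustration only.
set.seed(1)
x <- rnorm(20)
y <- rnorm(20, mean = 0.5)

# t.test() returns a list of class "htest".
res <- t.test(x, y)

# Individual results can be assigned to other variables.
t_score <- unname(res$statistic)   # the t statistic
p_value <- res$p.value             # the p-value
conf_int <- res$conf.int           # confidence interval for the difference

# names(res) lists every component that is available.
components <- names(res)
```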
