Re: [Rd] (PR#8877) predict.lm does not have a weights argument for newdata

2006-05-24 Thread Peter Dalgaard
[EMAIL PROTECTED] writes:

> (a) case weights:  w_i = 3 means `I have three observations like (y, x)'
> 
> (b) inverse-variance weights, most often an indication that w_i = 1/3 
> means that y_i is actually the average of 3 observations at x_i.
> 
> (c) multiple imputation, where a case with missing values in x is split 
> into say 5 parts, with case weights less than and summing to one.
> 
> (d) Heteroscedasticity, where the model is rather
> 
>  y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))
> 
> And there may well be other scenarios, but those are the most common (in 
> decreasing order) in my experience.

I'd have (d) higher on the list, but never mind. There's also

(e) Inverse probability weights: Knowing that part of the population
is undersampled and wanting results that are compatible with what you
would have gotten in a balanced sample. Prototypically: You sample X,
taking only a third of those with X > c; find population mean of X,
(or univariate regression on some other variable, which is only
recorded in the subsample).
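
A minimal sketch of (e), with made-up numbers: cases with X > cutoff are
kept with probability 1/3, so they get weight 3 when estimating the
population mean.

set.seed(1)
X <- rnorm(3000)
cutoff <- 0.5
keep <- X <= cutoff | runif(3000) < 1/3   # keep all of X <= cutoff, a third of the rest
xs <- X[keep]
w <- ifelse(xs > cutoff, 3, 1)            # inverse sampling probabilities
mean(xs)                                  # biased low
weighted.mean(xs, w)                      # close to the population mean (0 here)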

(R-bugs stripped from recipients since this doesn't really have
anything to do with the purported bug.)

   
-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Suggesting changes to HELP files?

2006-05-24 Thread Friedrich Leisch
  > On 5/21/06, Spencer Graves <[EMAIL PROTECTED]> wrote:
  >> Is there a procedure for suggesting changes to HELP files of the core
  >> R distribution?  If yes, what is it?  If it would be considered a
  >> friendly gesture, I could find the relevant *.Rd file and submit a
  >> suggested modification to it someplace.  Alternatively, I could just
  >> send suggestions someplace if they would receive appropriate
  >> consideration.
  >>

As Duncan Murdoch already pointed out, the easiest way of finding out
who in R core is responsible for which part of R is to look at the
Subversion logs and see who has been working on a particular file; for
most Sweave- and vignette-related parts that would be me.
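
(An illustrative sketch; the exact file path below is my assumption. The
Subversion log for a given help file can be pulled from the public R
repository, and the log entries name the committers who have touched it:

system("svn log -l 5 https://svn.r-project.org/R/trunk/src/library/utils/man/vignette.Rd")
)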

  >> 
  >> Second, I think the description of 'vignette' could be enhanced to
  >> include some version of my 'p.s.' to
  >> 'http://finzi.psych.upenn.edu/R/Rhelp02a/archive/73494.html' and other
  >> similar posts.  In particular, I see that the 'edit' method is described
  >> there, but I didn't understand what it said until I already knew how to
  >> use it.  Also, 'edit' doesn't work for me under ESS / Emacs.  For that,
  >> I use Stangle (as Sundar Dorai-Raj taught me).

Done in r-devel, pls have a look.

Best,
Fritz

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] the computation of exact p-value for the nonparametric cor-test with ties

2006-05-24 Thread Antonietta di Salvatore
Hello,

I would like to propose my modifications of the original cor.test to you: I
tried to calculate the correct p-value for Spearman's and Kendall's tests in
the presence of ties.
Let me know what you think.
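
For context, a small illustration of the current behaviour with ties (the
data are made up): the stock cor.test falls back to approximate p-values
and warns that the exact ones cannot be computed.

x <- c(1, 2, 2, 3, 4)
y <- c(2, 1, 3, 3, 5)
cor.test(x, y, method = "kendall")    # warns: cannot compute exact p-value with ties
cor.test(x, y, method = "spearman")   # same warning for Spearman's rho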

Thank you for your time.

Antonietta di Salvatore


test <- function(x, ...) UseMethod("test")

test.default <-
function(x, y, alternative = c("two.sided", "less", "greater"),
 method = c("pearson", "newkendall", "newspearman"), exact = NULL,
 conf.level = 0.95, ...)
{
alternative <- match.arg(alternative)
method <- match.arg(method)
DNAME <- paste(deparse(substitute(x)), "and", deparse(substitute(y)))

if(length(x) != length(y))
stop("x and y must have the same length")
OK <- complete.cases(x, y)
x <- x[OK]
y <- y[OK]
n <- length(x)

PVAL <- NULL
NVAL <- 0
conf.int <- FALSE
if(method == "pearson") {
if(n < 3)
stop("not enough finite observations")
method <- "Pearson's product-moment correlation"
names(NVAL) <- "correlation"
r <- cor(x, y)
df <- n - 2
ESTIMATE <- c(cor = r)
PARAMETER <- c(df = df)
STATISTIC <- c(t = sqrt(df) * r / sqrt(1 - r^2))
p <- pt(STATISTIC, df)
if(n > 3) { ## confidence int.
if(!missing(conf.level) &&
   (length(conf.level) != 1 || !is.finite(conf.level) ||
conf.level < 0 || conf.level > 1))
stop(paste("conf.level must be a single number",
   "between 0 and 1"))
conf.int <- TRUE
z <- atanh(r)
sigma <- 1 / sqrt(n - 3)
cint <-
switch(alternative,
   less = c(-Inf, z + sigma * qnorm(conf.level)),
   greater = c(z - sigma * qnorm(conf.level), Inf),
   two.sided = z +
   c(-1, 1) * sigma * qnorm((1 + conf.level) / 2))
cint <- tanh(cint)
attr(cint, "conf.level") <- conf.level
}
}
else {
if(n < 2)
stop("not enough finite observations")
PARAMETER <- NULL

if(method == "newkendall") {
method <- "Kendall's rank correlation tau"
names(NVAL) <- "tau"
r <- cor(x,y, method = "kendall")
ESTIMATE <- c(tau = r)

if(!is.finite(ESTIMATE)) {  # all x or all y the same
ESTIMATE[] <- NA
STATISTIC <- c(T = NA)
PVAL <- NA
}
else {
if(is.null(exact))
exact <- (n < 50)
if(exact) {
xr <- rank(x)
yr <- rank(y)
Gx <- table(xr)
Gx <- Gx[Gx > 1]
Gx <- sum(Gx^2 - Gx)
# Gx is zero when the x variable has no ties
Gy <- table(yr)
Gy <- Gy[Gy > 1]
Gy <- sum(Gy^2 - Gy)
# Gy is zero when the y variable has no ties
q <- round((r + 1) * sqrt(n^2 - n - Gx) * sqrt(n^2 - n - Gy) / 4)
pkendall <- function(q, n) {
.C("pkendall",
   length(q),
   p = as.double(q),
   as.integer(n),
   PACKAGE = "stats")$p
}
PVAL <-
switch(alternative,
   "two.sided" = {
   if(q > sqrt(n^2-n-Gx)*sqrt(n^2-n-Gy)/4)
   p <- 1 - pkendall(q - 1, n)
   else
   p <- pkendall(q, n)
   min(2 * p, 1)
   },
   "greater" = 1 - pkendall(q - 1, n),
   "less" = pkendall(q, n))
STATISTIC <- c(T = q)
} else {
STATISTIC <- c(z = r / sqrt((4 * n + 10) / (9 * n * (n - 1))))
p <- pnorm(STATISTIC)
}
}
} else {
  method<- "Spearman's rank correlation rho"
names(NVAL) <- "rho"
r <- cor(x,y,method="spearman")
ESTIMATE <- c(rho = r)
if(!is.finite(ESTIMATE)) {  # all x or all y the same
ESTIMATE[] <- NA
STATISTIC <- c(S = NA)
PVAL <- NA
}

else {
## Use the test statistic S = sum(rank(x) - rank(y))^2
## and AS 89 for obtaining better p-values than via the
## simple normal approximation.
## In the case of no ties, S = (1-rho) * (n^3-n)/6.
pspearman <- function(q, n, lower.tail = TRUE) {
if(n <= 1290) # n*(n^2 - 1) does not overflow
.C("prho",
   as.in

Re: [Rd] Wishlist: Vignettes on CRAN

2006-05-24 Thread Friedrich Leisch
> On Mon, 22 May 2006 21:42:34 -0700,
> Seth Falcon (SF) wrote:

  > "Gabor Grothendieck" <[EMAIL PROTECTED]> writes:
  >> This would certainly be nice.
  >> 
  >> Note that the BioConductor package vignettes are online:
  >> http://www.bioconductor.org/docs/vignettes.html
  >> but there is nothing comparable for CRAN.

  > That page is actually out of date (we'll be updating it real soon
  > now).  However, we are producing links to package vingnettes in the
  > automatically generated package summary pages.  For example, take a
  > look at the summary page for the Biobase package:

  > http://bioconductor.org/packages/1.8/bioc/html/Biobase.html

  > This was fairly easy to setup.  I would be happy to share the code,
  > etc.

That looks excellent; I would be happy to use something like that for
CRAN. But as Kurt already said, it looks more like a summer project
(currently I am swamped with teaching).

.f

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] (PR#8877) predict.lm does not have a weights argument for newdata

2006-05-24 Thread Prof Brian Ripley
On Wed, 24 May 2006, Peter Dalgaard wrote:

> [EMAIL PROTECTED] writes:
>
>> (a) case weights:  w_i = 3 means `I have three observations like (y, x)'
>>
>> (b) inverse-variance weights, most often an indication that w_i = 1/3
>> means that y_i is actually the average of 3 observations at x_i.
>>
>> (c) multiple imputation, where a case with missing values in x is split
>> into say 5 parts, with case weights less than and summing to one.
>>
>> (d) Heteroscedasticity, where the model is rather
>>
>>  y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))
>>
>> And there may well be other scenarios, but those are the most common (in
>> decreasing order) in my experience.
>
> I'd have (d) higher on the list, but never mind. There's also

I find that if you detect heteroscedasticity, then one of the following 
applies:

- a transformation of y would be beneficial

- a non-normal model, e.g. a Poisson regression, is more appropriate

- the error variance really depends on y or Ey not x, as in most
   measurement-error scenarios (and the example in ?nls and the example
   in the addendum to the bug report).

- in analytical chemistry as in the example on the addendum to the bug
   report, there are errors in both y and x to consider, and a functional
   relationship model is better.

So I very rarely actually get as far as predicting from a heteroscedastic 
regression model.

> (e) Inverse probability weights: Knowing that part of the population
> is undersampled and wanting results that are compatible with what you
> would have gotten in a balanced sample. Prototypically: You sample X,
> taking only a third of those with X > c; find population mean of X,
> (or univariate regression on some other variable, which is only
> recorded in the subsample).

I would call this an example of case weights (you are just weighting cases 
and saying `I have 1/p like this', and in rlm there is a difference 
between (a) and (b) and you would want to use wt.method="case" for (e)).
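
A small sketch of the distinction, with made-up data and weights (see ?rlm
in MASS for the precise definitions of the two options):

library(MASS)
set.seed(1)
d <- data.frame(x = rnorm(30))
d$y <- 2 * d$x + rnorm(30)
w <- ifelse(d$x > 0.5, 3, 1)   # e.g. inverse sampling probabilities
fit.case <- rlm(y ~ x, data = d, weights = w, wt.method = "case")
fit.invvar <- rlm(y ~ x, data = d, weights = w, wt.method = "inv.var")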

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] optim "CG" bug w/patch proposal (PR#8786)

2006-05-24 Thread ripley
On Wed, 17 May 2006, Prof Brian Ripley wrote:

> On Wed, 17 May 2006, [EMAIL PROTECTED] wrote:
>
>> 
>>> "Duncan" == Duncan Murdoch <[EMAIL PROTECTED]>
>>> on Tue, 16 May 2006 08:34:06 -0400 writes:
>>
>>Duncan> On 5/16/2006 4:56 AM, [EMAIL PROTECTED]
>>Duncan> wrote:
>>>> Probably I included too much at once in my bug report. I
>>>> can live with an unfulfilled wishlist and thank you for
>>>> thinking about it. The "badly-behaved" function is just
>>>> an example to demonstrate the bug I reported. I think it
>>>> is a bug if optim returns (without any warning) an
>>>> unmatching pair of par and value: f(par) != value. And it
>>>> is easily fixed.
>>
>>>>  Andreas
>>
>>Duncan> I agree with you that on return f(par) should be
>>Duncan> value.  I agree with Brian that changes to the
>>Duncan> underlying strategy need much more thought.
>> 
>> I agree (to both).
>> However, isn't Andreas' patch just fixing the problem
>> and not changing the underlying strategy at all?
>> [No, I did not study the code in very much detail ...]
>
> The (minor) issue is that x is updated but not f(x).  I think the intended
> strategy was to update neither, so Andreas' patch was a change of strategy. In
> particular, a question is whether this should be marked as a convergence failure.
> But people really need to read the reference before commenting,
> and I at least need to find the time to do so in more detail.

Having spent some time with the reference and a debugger, as far as I can 
see what is happening here is that the second phase of the line search 
fails. That can only happen (in exactly this way) for a discontinuous 
function where the first phase has jumped over a discontinuity to an 
essentially flat region.  In those circumstances updating *Fmin seems 
sensible: ideally one should probably detect the lack of near-quadratic
behaviour and force a restart.  But I don't think it is worth much effort to detect 
examples that are a very long way from the assumptions.  So we'll just 
update *Fmin.
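
As a quick check of the invariant under discussion, a sketch using the
reporter's example function (quoted further down): whatever optim returns,
$value should agree with f($par); before the change this check fails for
this function.

f <- function(p) {
    if (!all(p > -.7)) return(2)
    if (!all(p < .7)) return(2)
    sin(p[1]^2) * sin(p[2])
}
res <- optim(c(0.1, -0.1), f, method = "CG")
stopifnot(isTRUE(all.equal(res$value, f(res$par))))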

>
>> Martin Maechler
>>
>>>> Prof Brian Ripley wrote:
>>>>
>>>>> [Sorry for the belated reply: this came in just as I was leaving for a
>>>>> trip.]
>>>>>
>>>>> I've checked the original source, and the C code in optim does
>>>>> accurately reflect the published algorithm.
>>>>>
>>>>> Since your example is a discontinuous function, I don't see why you
>>>>> expect CG to work on it.  John Nash reports on his extensive
>>>>> experience that method 3 is the worst, and I don't think we should let
>>>>> a single 2D example of a badly-behaved function override that.
>>>>>
>>>>> Note that no other optim method copes with the discontinuity here:
>>>>> had you reported that, it would have been clear that the problem was
>>>>> with the example.
>>>>>
>>>>> On Fri, 21 Apr 2006, [EMAIL PROTECTED] wrote:
>>>>>
>> Dear R team,
>>
>> when using optim with method "CG" I got the wrong $value for the
>> reported $par.
>>
>> Example:
>> f<-function(p) {
>> if (!all(p>-.7)) return(2)
>> if (!all(p<.7)) return(2)
>> sin((p[1])^2)*sin(p[2])
>> }
>> optim(c(0.1,-0.1),f,method="CG",control=list(trace=0,type=1))
>> $par 19280.68 -10622.32
>> $value -0.2346207 # should be 2!
>>
>> optim(c(0.1,-0.1),f,method="CG",control=list(trace=0,type=2))
>> $par 3834.021 -2718.958
>> $value -0.0009983175 # should be 2!
>>
>> Fix:
>> --- optim.c (Revision 37878)
>> +++ optim.c (Arbeitskopie)
>> @@ -970,7 +970,8 @@
>> if (!accpoint) {
>> steplength *= stepredn;
>> if (trace) Rprintf("*");
>> -   }
>> +   } else
>> +   *Fmin = f;
>> }
>> } while (!(count == n || accpoint));
>> if (count < n) {
>>
>> After fix:
>> optim(c(0.1,-0.1),f,method="CG",control=list(trace=0,type=1))
>> $par 0.6993467 -0.4900145
>> $value -0.2211150
>> optim(c(0.1,-0.1),f,method="CG",control=list(trace=0,type=2))
>> $par 3834.021 -2718.958
>> $value 2
>>
>> Wishlist:
>>>>>
>>>> [wishlist deleted]
>>>>
>>>>
>>
>>Duncan> __
>>Duncan> R-devel@r-project.org mailing list
>>Duncan> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
>> 
>
>

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road

[Rd] Problem with pasting formulas (PR#8897)

2006-05-24 Thread john . little
Hi,
If I create a formula with say 100 terms and then paste it:

xnam <- paste("x", 1:100, sep="")
fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+")))
paste(fmla)

The result seems to cut off everything after the first 500 characters 
and gives no warning message.

I have the most recent version of R from the R website, and the problem
occurs on both Unix and Windows machines.

Thanks

John
-- 
John Little
Department of Mathematical Sciences
Durham University
Durham
UK

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Problem with pasting formulas (PR#8897)

2006-05-24 Thread ripley
See ?as.character (which paste effectively calls on non-character objects, 
as ?paste says):

Note:

  'as.character' truncates components of language objects to 500
  characters (was about 70 before 1.3.1).

so it is working as documented.  Not then a bug.

You can (and probably should) use a construct like the one in terms.formula:

 else paste(deparse(form[[2]]), collapse = "")

to convert a formula to a complete character string.
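
For instance, applied to the example in the report (a sketch; deparse()
splits long calls over several strings, and the collapse pastes them back
into one complete string):

xnam <- paste("x", 1:100, sep = "")
fmla <- as.formula(paste("y ~ ", paste(xnam, collapse = "+")))
s <- paste(deparse(fmla), collapse = "")
nchar(s)   # > 500: nothing has been truncated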


On Wed, 24 May 2006, [EMAIL PROTECTED] wrote:

> Hi,
> If I create a formula with say 100 terms and then paste it:
>
> xnam <- paste("x", 1:100, sep="")
> fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+")))
> paste(fmla)
>
> The result seems to cut off everything after the first 500 characters
> and gives no warning message.
>
> I have the most recent version of R from the R website and the problem
> occurs on both Unix and Windows machines
>
> Thanks
>
> John
>

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Problem with pasting formulas (PR#8897)

2006-05-24 Thread François Pinard
[Brian Ripley]

>Note:

>  'as.character' truncates components of language objects to 500
>  characters (was about 70 before 1.3.1).

>so it is working as documented.  Not then a bug.

Wrong reasoning.  Bugs may well be documented :-)

-- 
François Pinard   http://pinard.progiciels-bpi.ca

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] 2.3 issues on Mac

2006-05-24 Thread Ryan Lovett
/Library/Frameworks/R.framework/Versions/2.3/Resources/etc/ppc/Renviron and
/Library/Frameworks/R.framework/Versions/2.3/Resources/etc/i386/Renviron
both use /usr/local/teTeX/bin/powerpc-apple-darwin-current for LaTeX-related
variables. Could this be updated for i386 to take the teTeX Intel binaries
into account?

Also, Makeconf is not set up to create universal binary libraries. Will this
be done at some point?

Lastly, R.mpkg crashes Apple's Remote Desktop 2.2 when installing on a
client. Has anyone else seen this?

Thanks,
Ryan

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] (PR#8877) predict.lm does not have a weights argument for newdata

2006-05-24 Thread jranke
Prof. Ripley,

thank you for being more explicit now! Thank you also for the fix to
the wish that you derived from my bug report. Here comes the rationale
for my updated patch, which I *humbly* propose as a more general
solution to the problem.

I have spent several days now on this problem, but I am no statistician,
so please excuse my ignorance of any notational or other conventions
that I might disregard below. I tried hard to do it right.

Judging from the documentation to lm, I find that lm has only *one*
perspective on weights, which I propose is reasonable and sufficient. It
minimises "sum(w*e^2)", more clearly expressed as

sum(w_i * e_i^2)

I infer that the point is to construct the weights such that 
the errors

\epsilon = sqrt(w_k) * \epsilon_k 

are from a normal distribution N(0, \sigma^2), where index k covers all
observations, past as well as future ones; the model is

y = x\beta + \epsilon_k

with 

\epsilon_k \sim N(0, \sigma_k^2)

and

\sigma_k^2 = \sigma^2 / w_k

This is my view on this, it might be naive, it might be wrong, but
if so, I can't see my mistake.

An estimator for \sigma^2 is then

sum(w_i * e_i^2) / df

which is called res.var in the R code to predict.lm. An estimator
for \sigma_k^2, the variance for observation k, is therefore

res.var / w_k

which is what my proposed patch, which can now be found in an updated
form under

 
http://www.uft.uni-bremen.de/chemie/ranke/r-patches/predict.lm.patch.r38195

implements. I removed the first version called lm.predict.patch because
it did not correctly deal with prediction intervals for old data, sorry
for the inconvenience ("Not found"). Meanwhile you disabled prediction
intervals for old data, which the above patch reverts (no offense
intended; I just think that if predict.lm gives confidence intervals for the
regression line at the locations of the old data points, it might as well give
illustrative prediction intervals at these locations. At least for the
old data, the weights are known from the model object).

I tested my solution with the script

http://www.uft.uni-bremen.de/chemie/ranke/r-patches/test.predict.lm.R
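
To make the estimator above concrete, here is a self-contained sketch with
made-up data (the by-hand calculation is mine, not part of the patch): for
a weighted fit, the estimated variance of a future observation with weight
w0 is res.var / w0, and the prediction interval widens accordingly.

set.seed(1)
x <- 1:20
w <- rep(c(1, 4), 10)                 # e.g. some y are means of 4 replicates
y <- 2 + 3 * x + rnorm(20, sd = 1 / sqrt(w))
fit <- lm(y ~ x, weights = w)
res.var <- summary(fit)$sigma^2       # sum(w_i * e_i^2) / df
x0 <- 10; w0 <- 4                     # new observation and its weight
x0v <- c(1, x0)
se.fit <- sqrt(drop(t(x0v) %*% vcov(fit) %*% x0v))
se.pred <- sqrt(se.fit^2 + res.var / w0)
pred <- sum(coef(fit) * x0v)
pred + c(-1, 1) * qt(0.975, fit$df.residual) * se.pred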

* Prof Brian Ripley <[EMAIL PROTECTED]> [060524 07:50]:
> I am more than 'a little disappointed' that you expect a detailed 
> explanation of the problems with your 'bug' report, especially as you did 
> not provide any explanation yourself as to your reasoning (nor did you 
> provide any credentials nor references).

I am sorry for this, and I worked hard to supply the information in the
followups including this one.
 
> Note that
> 
> 1) Your report did not make clear that this was only relevant to 
> prediction intervals, which are not commonly used.

I am not really sure if confidence intervals couldn't be improved
in scenario (d). 
 
> 2) Only in some rather special circumstances do weights enter into 
> prediction intervals, and definitely not necessarily the weights used for 
> fitting.  

Yes, under these circumstances R would then need weights from the user
in order to construct prediction intervals. 

> Indeed, it seems that to label the variances that do enter as 
> inverse weights would be rather misleading.

Possibly. I assume it can be correctly done and documented (see my patch
for a proposal).

> 3) In a later message you referenced Brown's book, which is dealing with a 
> different model.
> 
> The model fitted by lm is
> 
>   y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2)
> 
> (Row vector x, column vector \beta.)

I assumed that the model in Brown's book is a special case of the model
fitted by lm, the only difference being that x is a row vector (1,
x_B), where x_B is Brown's random variable X, and \beta contains
\alpha_B and \beta_B.
 
> If the observations are from the model, OLS is appropriate, but weighting 
> is used in several scenarios, including:
> 
> (a) case weights:  w_i = 3 means `I have three observations like (y, x)'

I am not sure what you mean by this. Do you mean you have three exactly
equal observations? In this case this could be regarded as a special
case of (b) and w_i = 1/3 would be used as input for lm.

What does the "'" mean in (y, x)'?

> (b) inverse-variance weights, most often an indication that w_i = 1/3 
> means that y_i is actually the average of 3 observations at x_i.

Yes, and I take the reasoning for this to be as follows: An estimator
for the variance of the means of n observations is 1/n times the
variance of the single observations. Why shouldn't this apply to the
means of multiple future observations?
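
A quick numerical check of that statement, as a sketch with simulated
values:

set.seed(1)
single <- rnorm(1e5, sd = 2)
means3 <- colMeans(matrix(rnorm(3e5, sd = 2), nrow = 3))
var(single)   # about 4
var(means3)   # about 4/3, i.e. 1/3 of the single-observation variance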

> (c) multiple imputation, where a case with missing values in x is split 
> into say 5 parts, with case weights less than and summing to one.

I don't understand this, but I suppose lm works in the same way for
weights w_i derived from such a procedure.
 
> (d) Heteroscedasticity, where the model is rather
> 
> y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))
> 
>