Re: [Rd] summary( prcomp(*, tol = .) ) -- and 'rank.'

2016-03-28 Thread Martin Maechler
> peter dalgaard 
> on Fri, 25 Mar 2016 09:41:00 +0100 writes:

> As I see it, the display showing the first p << n PCs
> adding up to 100% of the variance is plainly wrong.  I
> suspect it comes about via a mental short-circuit: If we
> try to control p using a tolerance, then that amounts to
> saying that the remaining PCs are effectively
> zero-variance, but that is (usually) not the intention at
> all.

> The common case is that the remainder terms have a roughly
> _constant_, small-ish variance and are interpreted as
> noise. Of course the magnitude of the noise is important
> information.

Thank you, Peter, Kasper, Steve.

@Kasper, I'm aware of *approximate* methods for the first few PCs.
(Are there also exact ones which are more efficient than those
 based on LAPACK's DGESDD ?)
... and so indeed, prcomp() will not be the method of choice for
really large problems.  Still, of course, we should try to "do our best".

Given your sentiments, notably Peter's, I now envisage making the
non-back-compatible change to *summary.prcomp()*: computing the
"importances" from all p proportions (which sum to 100%).
What I think would be nice is for the print.summary.prcomp()
method to show only the first 'k' (my notation), i.e., those
which were chosen by 'tol' and/or 'rank.'.
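
For concreteness, here is a minimal sketch of that proposal (my
reading of it, not actual R sources; the toy data and 'k' are made up
for illustration): importance is computed from *all* p std.deviations,
and only the first k columns are printed:

  set.seed(1)
  X <- matrix(rnorm(200 * 10), 200, 10)  # toy data, p = 10
  pX <- prcomp(X)                        # pX$sdev has all p entries
  vars <- pX$sdev^2
  imp <- rbind("Standard deviation"     = pX$sdev,
               "Proportion of Variance" = vars / sum(vars),
               "Cumulative Proportion"  = cumsum(vars) / sum(vars))
  k <- 4                           # as selected via 'tol' and/or 'rank.'
  print(imp[, 1:k], digits = 4)    # proportions still refer to all p PCs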

Martin

>> On 25 Mar 2016, at 00:02 , Steve Bronder
>>  wrote:
>> 
>> I agree with Kasper, this is a 'big' issue. Does your
>> method of taking only n PCs reduce the load on memory?
>> 
>> The new addition to the summary looks like a good idea,
>> but 'Proportion of Variance' as you describe it may be
>> confusing to new users. Am I correct in saying that
>> 'Proportion of Variance' describes the amount of variance
>> with respect to the number of components the user chooses
>> to show? So if I choose only one, it will explain 100% of
>> the variance? If that is the case, I think showing a
>> 'Total Proportion of Variance' is important.
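>> 
>> A small numeric illustration of the two readings (made-up numbers):
>> 
>>   sdev <- c(2, 1, 0.5, 0.5)    # all p = 4 standard deviations
>>   v <- sdev^2                  # 4, 1, 0.25, 0.25
>>   v[1:2] / sum(v[1:2])         # renormalised over the 2 shown PCs: 0.8 0.2
>>   v[1:2] / sum(v)              # 'total' proportions: ~0.727 ~0.182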
>> 
>> 
>> Regards,
>> 
>> Steve Bronder
>> Website: stevebronder.com
>> Phone: 412-719-1282
>> Email: sbron...@stevebronder.com
>> 
>> 
>> On Thu, Mar 24, 2016 at 2:58 PM, Kasper Daniel Hansen <
>> kasperdanielhan...@gmail.com> wrote:
>> 
>>> Martin, I fully agree.  This becomes an issue when you
>>> have big matrices.
>>> 
>>> (Note that there are awesome methods for actually only
>>> computing a small number of PCs, unlike your code, which
>>> uses svd() and therefore computes all of them; these are
>>> available in various CRAN packages.)
>>> 
>>> Best, Kasper
>>> 
>>> On Thu, Mar 24, 2016 at 1:09 PM, Martin Maechler <
>>> maech...@stat.math.ethz.ch> wrote:
>>> 
 Following from the R-help thread of March 22 on "Memory
 usage in prcomp",
 
 I've started looking into adding an optional 'rank.'
 argument to prcomp, allowing one to get only a few PCs
 more efficiently instead of the full p PCs, say when p =
 1000 and you know you only want 5 PCs.
 
 (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)
 
 As was mentioned there, we already have an optional 'tol'
 argument which allows one *not* to keep all PCs.
 
 When I do that, say
 
 C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov.matrix and its root
 all.equal(S, crossprod(C))
 set.seed(17)
 X <- matrix(rnorm(32000), 1000, 32)
 Z <- X %*% C   ## ==>  cov(Z) ~= C'C = S
 all.equal(cov(Z), S, tol = 0.08)
 pZ <- prcomp(Z, tol = 0.1)
 summary(pZ) # only ~14 PCs (out of 32)
 
 I get, for the last line (the summary.prcomp(.) call):
 
> summary(pZ) # only ~14 PCs (out of 32)
 Importance of components:
                           PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
 Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
 Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
 Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
                            PC9    PC10    PC11    PC12    PC13   PC14
 Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
 Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
 Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
 
 which computes the *proportions* as if there were only
 14 PCs in total (but there were 32 originally).
 
 I would think that the summary should, or at least could, in
 addition show the usual "proportion of variance
 explained"-like result, which involves all 32
 variances (or std.dev.s) ... these are returned from the
 svd() anyway, even in the case where I use my new
 'rank.' argument which only returns the first few PCs.
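 
 Continuing the Z example above, a sketch of what I mean: fit without
 'tol' so that all 32 std.deviations are kept, then relate the first
 14 PCs to the full total:
 
   pZ.full <- prcomp(Z)         # pZ.full$sdev has all 32 entries
   v <- pZ.full$sdev^2
   cumsum(v)[14] / sum(v)       # < 1 : PCs 15..32 still carry variance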

Re: [Rd] sys.function(0)

2016-03-28 Thread peter dalgaard
Dunno, really. Some strange things can happen with nonstandard evaluation, like 
having a function designed to evaluate something in the parent of its caller, 
but nonetheless sometimes being called from the command line. So things are 
sometimes defensively coded.
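
A toy example of the pattern (mine, just for illustration):

  f <- function() eval.parent(quote(x), n = 2)  # look in the parent of f's caller
  g <- function() f()
  h <- function() { x <- "in h"; g() }
  h()                   # "in h": f() reaches through g() into h()'s frame
  x <- "at top level"
  g()                   # one level shallower: parent.frame() quietly
                        # bottoms out at the global environment, no error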

-pd

> On 28 Mar 2016, at 00:08 , Mick Jordan  wrote:
> 
> A related question is why are sys.parent/parent.frame so permissive in their 
> error checking? E.g.:
> 
> > sys.parent(-1)
> [1] 0
> > sys.parent(-2)
> [1] 0
> > sys.parent(1)
> [1] 0
> > sys.parent(2)
> [1] 0
> > parent.frame(4)
> <environment: R_GlobalEnv>
> >
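> 
> The same clamping, seen from inside a call stack (my own toy example;
> out-of-range generations clamp to 0, the top level, instead of
> signalling an error):
> 
>   f <- function() c(sys.parent(1), sys.parent(2), sys.parent(100))
>   g <- function() f()
>   g()    # [1] 1 0 0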

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com



[Rd] documentation / warning when passing a vector as lower/upper bound in stats::integrate()

2016-03-28 Thread Baptiste Auguie
Dear R-dev list,

I wonder if stats::integrate() shouldn't warn the user when a numeric vector
of length > 1 is passed as the lower or upper bound. If a vector is passed,
only its first value is used and the others are silently ignored:

integrate(sin, lower=0, upper=pi)
integrate(sin, lower=0:10, upper=pi)

?integrate doesn't appear to mention explicitly that the function is not
vectorised over those arguments.
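
A defensive wrapper would make the mistake loud; a sketch (integrate1
is a made-up name, not a proposed API):

  integrate1 <- function(f, lower, upper, ...) {
    if (length(lower) != 1L || length(upper) != 1L)
      warning("'lower' and 'upper' should have length 1; using first elements")
    integrate(f, lower[1L], upper[1L], ...)
  }
  integrate1(sin, lower = 0:10, upper = pi)  # now warns, still returns 2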

It's probably not a common mistake, but it can have unfortunate
consequences in the iterative calculation of multiple integrals. Someone
was puzzled by this today (http://stackoverflow.com/q/36275909/471093) and
it wasn't immediately obvious what had led to incorrect results (and worse,
it could have gone unnoticed).
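
For the record, the intended iterated-integral pattern needs one call
per bound value, e.g. (a sketch with a made-up integrand):

  inner <- function(y) integrate(function(x) x * y, lower = 0, upper = 1)$value
  integrate(Vectorize(inner), lower = 0, upper = 1)$value   # 0.25, i.e. 1/4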

Best regards,

baptiste
