[Rd] summary( prcomp(*, tol = .) ) -- and 'rank.'

2016-03-24 Thread Martin Maechler
Following from the R-help thread of March 22 on "Memory usage in prcomp"
(https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html),
I've started looking into adding an optional 'rank.' argument
to prcomp, which would make it more efficient to get only a few PCs
instead of the full p PCs, say when p = 1000 and you know you
only want 5 PCs.

As was mentioned there, we already have an optional 'tol' argument
which allows *not* choosing all PCs.

When I do that, say

 C <- chol(S <- toeplitz(.9 ^ (0:31))) # cov. matrix S and its Cholesky root C
 all.equal(S, crossprod(C))            # check that C'C = S
 set.seed(17)
 X <- matrix(rnorm(32000), 1000, 32)   # 1000 obs. of 32 independent N(0,1) variables
 Z <- X %*% C  ## ==>  cov(Z) ~=  C'C = S
 all.equal(cov(Z), S, tolerance = 0.08)
 pZ <- prcomp(Z, tol = 0.1)
 summary(pZ) # only ~14 PCs (out of 32)

I get, for the last line, i.e., the summary.prcomp(.) call:

> summary(pZ) # only ~14 PCs (out of 32)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
                           PC9    PC10    PC11    PC12    PC13   PC14
Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
>

which computes the *proportions* as if there were only 14 PCs in
total (but there were 32 originally).
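To make the two denominators concrete, here is a minimal sketch in base R. It is not the summary.prcomp() code itself. It rebuilds the example data, takes all 32 standard deviations from a prcomp() call without 'tol', and applies the documented cut-off rule (keep components whose standard deviation exceeds tol times that of the first) by hand. The names sdev.all, keep, v.kept, prop.kept and prop.total are made up for this illustration.

```r
## Rebuild the example data from above (same seed and construction).
S <- toeplitz(.9 ^ (0:31))
C <- chol(S)
set.seed(17)
Z <- matrix(rnorm(32000), 1000, 32) %*% C

sdev.all <- prcomp(Z)$sdev               # all 32 standard deviations (no 'tol')
keep     <- sdev.all > 0.1 * sdev.all[1] # the cut-off rule that tol = 0.1 applies
v.kept   <- sdev.all[keep]^2             # variances of the retained PCs

prop.kept  <- v.kept / sum(v.kept)       # denominator: retained PCs only
prop.total <- v.kept / sum(sdev.all^2)   # denominator: all 32 PCs

## Cumulative proportions under both conventions, side by side
round(rbind("relative to kept PCs" = cumsum(prop.kept),
            "relative to all PCs"  = cumsum(prop.total)), 4)
```

The first row ends at 1 by construction, whatever the data; the second row ends below 1 whenever at least one component has been dropped.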

I would think that the summary should, or at least could in addition,
show the usual "proportion of variance explained" kind of result, which
does involve all 32 variances or std.dev.s ... which are
returned from the svd() anyway, even in the case when I use my
new 'rank.' argument which only returns a "few" PCs instead of
all.
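The remark about svd() can be checked directly. The following is only a sketch under assumptions, not the actual 'rank.' patch: k is a hypothetical number of wanted PCs, and the centring mimics what prcomp() does by default. Asking svd() for k right singular vectors still returns every singular value, so proportions with respect to the full variance come for free.

```r
S <- toeplitz(.9 ^ (0:31))
set.seed(17)
Z <- matrix(rnorm(32000), 1000, 32) %*% chol(S)

k <- 5                                      # hypothetical number of PCs wanted
x <- scale(Z, center = TRUE, scale = FALSE) # centre the columns, no rescaling
s <- svd(x, nu = 0, nv = k)                 # only k right singular vectors ...
length(s$d)                                 # ... but all min(n, p) singular values
dim(s$v)                                    # p x k rotation matrix

sdev <- s$d / sqrt(nrow(x) - 1)             # standard deviations of all PCs
prop.total <- sdev[1:k]^2 / sum(sdev^2)     # proportions w.r.t. the full variance
cumsum(prop.total)
```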

Do you think the current summary() output is good enough, or
rather misleading?

I think I would want to see (possibly in addition) proportions
with respect to the full variance and not just to the variance
of those few components selected.

Opinions?

Martin Maechler
ETH Zurich

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] summary( prcomp(*, tol = .) ) -- and 'rank.'

2016-03-24 Thread Kasper Daniel Hansen
Martin, I fully agree.  This becomes an issue when you have big matrices.

(Note that there are awesome methods for actually computing only a small
number of PCs, unlike your code, which uses svd and so gets all of them;
these are available in various CRAN packages.)
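One consequence for such truncated methods: they do not return all singular values, but the denominator for a "proportion of total variance" does not need them. The sum of all squared singular values of the centred data equals its squared Frobenius norm, so the total variance can be computed from the data alone. A small base-R sketch (x, n and totvar are names chosen here; the full svd() call is only there as a cross-check):

```r
S <- toeplitz(.9 ^ (0:31))
set.seed(17)
Z <- matrix(rnorm(32000), 1000, 32) %*% chol(S)

x <- scale(Z, center = TRUE, scale = FALSE) # centred data matrix
n <- nrow(x)

## Total variance from the data alone: squared Frobenius norm / (n - 1),
## which is the same as the sum of the column variances.
totvar <- sum(x^2) / (n - 1)

## Cross-checks (for comparison only; a truncated method would skip these)
all.equal(totvar, sum(svd(x, nu = 0, nv = 0)$d^2) / (n - 1))
all.equal(totvar, sum(apply(Z, 2, var)))
```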

Best,
Kasper


Re: [Rd] summary( prcomp(*, tol = .) ) -- and 'rank.'

2016-03-24 Thread Steve Bronder
I agree with Kasper, this is a 'big' issue. Does your method of taking only
n PCs reduce the load on memory?

The new addition to the summary looks like a good idea, but Proportion of
Variance as you describe it may be confusing to new users. Am I correct in
saying Proportion of variance describes the amount of variance with respect
to the number of components the user chooses to show? So if I only choose
one I will explain 100% of the variance? I think showing 'Total Proportion
of Variance' is important if that is the case.
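The single-component case can be sketched as follows, assuming proportions are computed only over the retained components as in the summary output quoted at the top of the thread. The names v.all and v.one are made up for the illustration, and the first PC is picked out by hand rather than through 'tol' or 'rank.'.

```r
S <- toeplitz(.9 ^ (0:31))
set.seed(17)
Z <- matrix(rnorm(32000), 1000, 32) %*% chol(S)

v.all <- prcomp(Z)$sdev^2   # variances of all 32 PCs
v.one <- v.all[1]           # pretend only the first PC was retained

v.one / sum(v.one)          # relative to the retained PC only: always 1
v.one / sum(v.all)          # relative to the total variance: less than 1
```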

Regards,

Steve Bronder
Website: stevebronder.com
Phone: 412-719-1282
Email: sbron...@stevebronder.com

