Re: [R] Help with PCA data file prep and R code

Albin Blaschka Wed, 28 Sep 2016 07:02:55 -0700

Hi,

maybe the package vegan with its tutorials is a good starting point,too...


http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf

http://cc.oulu.fi/~jarioksa/opetus/metodi/sessio2.pdf

all the best,
Albin

Am 22.09.2016 10:23 PM, schrieb David L Carlson:

Looking at your data there are several issues.

1. Tank is an integer, but it sounds like you intend to use it as a
categorical measure. If so that it should be a factor, but factors
cannot be used in pca. Is Tank 10 10 times more of something than Tank
1?

2. Date is a factor. That means you are not measuring time, just the
fact that 2 rows are the same time or different time. Factors cannot
be used in pca.

3. Treatment is a factor, but factors cannot be used in pca.

4. Your log transformed data has many 0's and no negative values. Did
you add 1 to each value before taking logarithms?

First line of your code after reading the data:

meso.pca <- prcomp(mesocleaned, center=TRUE, scale.=TRUE)

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

                   scale. = TRUE)


-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352


-----Original Message-----

From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of SarahStinson

Sent: Wednesday, September 21, 2016 10:25 AM
To: r-help@r-project.org
Subject: Re: [R] Help with PCA data file prep and R code

Hello DRUGs,

I'm new to R and would appreciate some expert advice on prepping filesfor,

and running, PCA...

My data set consists of aquatic invertebrate and zooplankton count dataand

physicochemical measurements from an ecotoxicology study. Four chemical

treatments were applied to mesocosm tanks, 4 replicates per treatment(16

tanks total), then data were collected weekly over a 3 month period.

I cleaned the data in excel by removing columns with all zero values,and

all rows with NA values.
All zooplankton values were volume normalized, then log normalized. All

other data was log normalized in excel prior to analysis in R. Allvectorssare numeric. I've attached the .txt file to this email rather thatusing

dput(dataframe).

My questions are:

1. Did I do the cleaning step appropriately? I know that there are waystorun PCA's using data that contain NA values (pcaMethods), but wasn'table

to get the code to work...

(I understand that this isn't strictly an R question, but any helpwould be

appreciated.)
2. Does my code look correct for the PCA and visualization (see below)?

Thanks in advance,
Sarah

#read data
mesocleaned <- read.csv("MesoCleanedforPCA.9.16.16.csv")

#run PCA
meso.pca <- prcomp(mesocleaned,
                   center = TRUE,
                   scale. = TRUE)

# print method
print(meso.pca)

#compute standard deviation of each principal component
std_dev <- meso.pca$sdev

#compute variance
pr_var <- std_dev^2

#check variance of first 10 components
pr_var[1:10]

#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]

#The first principal component explains 12.7% of the variance
#The second explains 8.1%

#visualize
biplot(meso.pca)

#for visualization, make Treatment vector a factor instead of numeric
meso.treatment <- as.factor(mesocleaned[, 3])

#ggbiplot to visualize by Treatment group

#reference:https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/


library(devtools)
install_github("ggbiplot", "vqv")
library(ggbiplot)

print(ggbiplot(meso.pca, obs.scale = 1, var.scale = 1, groups =
meso.treatment, ellipse = TRUE, circle = TRUE))
g <- ggbiplot(meso.pca, obs.scale = 1, var.scale = 1,
              groups = meso.treatment, ellipse = TRUE,
              circle = TRUE)

g <- g + scale_color_brewer(name = deparse(substitute(Treatments)),palette

= 'Dark2') #must change meso.treatment to a factor for this to work
g <- g + theme(legend.direction = 'horizontal',
               legend.position = 'top')
print(g)

#Circle plot

#plot each variables coefficients inside a unit circle to get insighton a

possible interpretation for PCs.

#reference:https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/


theta <- seq(0,2*pi,length.out = 100)
circle <- data.frame(x = cos(theta), y = sin(theta))
p <- ggplot(circle,aes(x,y)) + geom_path()

loadings <- data.frame(meso.pca$rotation,
                       .names = row.names(meso.pca$rotation))
p + geom_text(data=loadings,
              mapping=aes(x = PC1, y = PC2, label = .names, colour =
.names)) +
  coord_fixed(ratio=1) +
  labs(x = "PC1", y = "PC2")

On Tue, Sep 20, 2016 at 10:28 PM, Sarah Stinson <sastin...@ucdavis.edu>
wrote:

Hello DRUGs,
I'm new to R and would appreciate some expert advice on prepping files
for, and running, PCA...
My data set consists of aquatic invertebrate and zooplankton countdataand physicochemical measurements from an ecotoxicology study. Fourchemicaltreatments were applied to mesocosm tanks, 4 replicates per treatment(16
tanks total), then data were collected weekly over a 3 month period.
I cleaned the data in excel by removing columns with all zero values,and
all rows with NA values.
All zooplankton values were volume normalized, then log normalized.Allother data was log normalized in excel prior to analysis in R. Allvectorssare numeric. I've attached the .csv file to this email rather thatusing
dput(dataframe). I hope that's acceptable.

My questions are:
1. Did I do the cleaning step appropriately? I know that there areways torun PCA's using data that contain NA values (pcaMethods), but wasn'table
to get the code to work...
(I understand that this isn't strictly an R question, but any helpwould
be appreciated.)
2. Does my code look correct for the PCA and visualization (seebelow)?
Thanks in advance,
Sarah

#read data
mesocleaned <- read.csv("MesoCleanedforPCA.9.16.16.csv")

#run PCA
meso.pca <- prcomp(mesocleaned,
                   center = TRUE,
                   scale. = TRUE)

# print method
print(meso.pca)

#compute standard deviation of each principal component
std_dev <- meso.pca$sdev

#compute variance
pr_var <- std_dev^2

#check variance of first 10 components
pr_var[1:10]

#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]

#The first principal component explains 12.7% of the variance
#The second explains 8.1%

#visualize
biplot(meso.pca)

#for visualization, make Treatment vector a factor instead of numeric
meso.treatment <- as.factor(mesocleaned[, 3])

#ggbiplot to visualize by Treatment group
#reference:https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
library(devtools)
install_github("ggbiplot", "vqv")
library(ggbiplot)

print(ggbiplot(meso.pca, obs.scale = 1, var.scale = 1, groups =
meso.treatment, ellipse = TRUE, circle = TRUE))
g <- ggbiplot(meso.pca, obs.scale = 1, var.scale = 1,
              groups = meso.treatment, ellipse = TRUE,
              circle = TRUE)
g <- g + scale_color_brewer(name = deparse(substitute(Treatments)),
palette = 'Dark2') #must change meso.treatment to a factor for this towork
g <- g + theme(legend.direction = 'horizontal',
               legend.position = 'top')
print(g)

#Circle plot
#plot each variables coefficients inside a unit circle to get insighton a
possible interpretation for PCs.
#reference:https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
theta <- seq(0,2*pi,length.out = 100)
circle <- data.frame(x = cos(theta), y = sin(theta))
p <- ggplot(circle,aes(x,y)) + geom_path()

loadings <- data.frame(meso.pca$rotation,
                       .names = row.names(meso.pca$rotation))
p + geom_text(data=loadings,
              mapping=aes(x = PC1, y = PC2, label = .names, colour =
.names)) +
  coord_fixed(ratio=1) +
  labs(x = "PC1", y = "PC2")

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help

PLEASE do read the posting guidehttp://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.


--
| Dr.rer.nat. Albin Blaschka
| Etrichstrasse 26, A-5020 Salzburg
| * www.standortsanalyse.net *
| * www.researchgate.net/profile/Albin_Blaschka *
| - It's hard to live in the mountains, hard but not hopeless!

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Help with PCA data file prep and R code

Reply via email to