[R] Anova Type II and Contrasts

2012-07-06 Thread mails
The study design of the data I have to analyse is simple. There is one control 
group (CTRL) and two different treatment groups (TREAT_1 and TREAT_2). The data 
also include two covariates, COV1 and COV2. I have been asked to check whether 
there is a linear or quadratic treatment effect in the data.

I created a dummy data set to explain my situation: 

df1 <- data.frame(
  Observation = c(rep("CTRL", 15), rep("TREAT_1", 13), rep("TREAT_2", 12)),
  COV1 = c(rep("A1", 30), rep("A2", 10)),
  COV2 = c(rep("B1", 5), rep("B2", 5), rep("B3", 10),
           rep("B1", 5), rep("B2", 5), rep("B3", 10)),
  Variable = c(3944133, 3632461, 3351754, 3655975, 3487722, 3644783, 3491138,
               3328894, 3654507, 3465627, 3511446, 3507249, 3373233, 3432867,
               3640888,
               3677593, 3585096, 3441775, 3608574, 3669114, 4000812, 3503511,
               3423968, 3647391, 3584604, 3548256, 3505411, 3665138,
               4049955, 3425512, 3834061, 3639699, 3522208, 3711928, 3576597,
               3786781, 3591042, 3995802, 3493091, 3674475)
)

plot(Variable ~ Observation, data = df1) 

As you can see from the plot, there is a linear relationship between the control 
group and the treatment groups. To check whether this linear effect is 
statistically significant, I changed the contrasts using the contr.poly() 
function and fitted a linear model like this:

contrasts(df1$Observation) <- contr.poly(levels(df1$Observation)) 
lm1 <- lm(log(Variable) ~ Observation, data = df1) 
summary.lm(lm1) 

From the summary we can see that the linear effect is statistically significant:

Observation.L  0.029141   0.012377   2.355    0.024 *
Observation.Q  0.002233   0.012482   0.179    0.859

However, this first model does not include either of the two covariates. 
Including them results in a non-significant p-value for the linear effect:

lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1) 
summary.lm(lm2) 

Observation.L   0.04116    0.02624   1.568    0.126
Observation.Q   0.01003    0.01894   0.530    0.600
COV1A2         -0.01203    0.04202  -0.286    0.776
COV2B2         -0.02071    0.02202  -0.941    0.354
COV2B3         -0.02083    0.02066  -1.008    0.320

So far so good. However, I have been told to conduct a Type II ANOVA rather 
than a Type I. To do so I used the Anova() function from the car package.

Anova(lm2, type="II") 

Anova Table (Type II tests) 

Response: log(Variable) 
              Sum Sq Df F value Pr(>F)
Observation 0.006253  2  1.4651 0.2453
COV1        0.000175  1  0.0820 0.7763
COV2        0.002768  2  0.6485 0.5292
Residuals   0.072555 34

The problem with using Type II here is that you do not get separate p-values 
for the linear and quadratic effects, so I do not know whether the treatment 
effect is linear and/or quadratic.

I found out that the following code produces the same p-value for Observation 
as the Anova() function. However, the result also does not include 
any p-values for the linear or quadratic effect: 

lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1) 
lm3 <- lm(log(Variable) ~ COV1 + COV2, data = df1) 
anova(lm2, lm3) 
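
One way to keep the covariate adjustment while still testing the polynomial terms 
individually is to test each contrast coefficient directly. Below is a minimal 
sketch using car::linearHypothesis (the coefficient names follow from the 
contr.poly labels .L and .Q above); each single-df F test is equivalent to the 
corresponding t test in summary(lm2):

library(car)
# single-df tests of the polynomial contrasts, adjusted for COV1 and COV2
linearHypothesis(lm2, "Observation.L = 0")  # linear trend
linearHypothesis(lm2, "Observation.Q = 0")  # quadratic trend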


Does anybody know how to combine a Type II ANOVA with the contrasts function to 
obtain the p-values for the linear and quadratic effects?

Help would be very much appreciated. 

Best Peter


[R] Delete rows from data.frame matching a certain criteria

2012-03-01 Thread mails
Hello,


consider the following data.frame:

test <- data.frame(n = c(1,2,3,4,5), v = c(6,5,7,5,3),
                   pattern = c(1,1,NA,1,NA))

> test
  n v pattern
1 1 6       1
2 2 5       1
3 3 7      NA
4 4 5       1
5 5 3      NA


I tried to use apply and the adply function to set v to NA where pattern is 1, 
and to keep v unchanged where pattern is NA.


So basically the result should look like this:
> test
  n  v pattern
1 1 NA       1
2 2 NA       1
3 3  7      NA
4 4 NA       1
5 5  3      NA

So far I have solved it by creating subsets and using merge, but that turns out 
to be super slow. Is there a way to do it with the apply function?
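
A vectorised replacement avoids both merge and apply and should be much faster. 
A minimal sketch using the test data.frame above:

test$v[which(test$pattern == 1)] <- NA  # which() silently drops the NA positions
test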

Any help/hint is appreciated

Thanks




[R] Help on reshape function

2012-03-06 Thread mails
Hello,


I am trying to reshape a data.frame from wide format into long format. Although 
the reshape documentation lists some examples, I am struggling to bring my 
data.frame into long format and then transform it back into wide format. The 
data.frame I am looking at is:


df <- data.frame(ID1 = c(1,1,1,1,1,1,1,1,1),
                 ID2 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 ID3 = c("E", "E", "E", "E", "E", "E", "E", "E", "E"),
                 X1 = c(1,4,3,5,2,4,6,4,2),
                 X2 = c(6,8,9,6,7,8,9,6,7),
                 X3 = c(7,6,7,5,6,5,6,7,5),
                 X4 = c(1,2,1,2,3,1,2,1,2))

> df
  ID1 ID2 ID3 X1 X2 X3 X4
1   1   A   E  1  6  7  1
2   1   A   E  4  8  6  2
3   1   A   E  3  9  7  1
4   1   B   E  5  6  5  2
5   1   B   E  2  7  6  3
6   1   B   E  4  8  5  1
7   1   C   E  6  9  6  2
8   1   C   E  4  6  7  1
9   1   C   E  2  7  5  2

I want to use the reshape function to get the following result:

> df
  ID1 ID2 ID3 X
1   1   A   E  1  
2   1   A   E  4  
3   1   A   E  3  
4   1   B   E  5  
5   1   B   E  2  
6   1   B   E  4  
7   1   C   E  6  
8   1   C   E  4  
9   1   C   E  2  

10   1   A   E  6
11   1   A   E  8
12   1   A   E  9
13   1   B   E  6
14   1   B   E  7
15   1   B   E  8
16   1   C   E  9
17   1   C   E  6
18   1   C   E  7

19   1   A   E  7
20   1   A   E  6
21   1   A   E  7
22   1   B   E  5
23   1   B   E  6
24   1   B   E  5
25   1   C   E  6
26   1   C   E  7
27   1   C   E  5

28   1   A   E  1
29   1   A   E  2
30   1   A   E  1
31   1   B   E  2
32   1   B   E  3
33   1   B   E  1
34   1   C   E  2
35   1   C   E  1
36   1   C   E  2
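
One way to get this long format is reshape2::melt; a minimal sketch (note that 
melt also adds a variable column recording which of X1..X4 each value came from, 
which is needed to cast back to wide later):

library(reshape2)
df_long <- melt(df, id.vars = c("ID1", "ID2", "ID3"), value.name = "X")
df_long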


Can anyone help?

Cheers





Re: [R] Help on reshape function

2012-03-06 Thread mails
Hey Michael,

thanks for your help. I think the reshape2 package works much better than the 
base reshape function. However, I still cannot transform my data.frame back into 
wide format once I have used the melt function to bring it into long format. How 
do I get it back to wide format? The documentation, unfortunately, is not very 
detailed.
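
Casting back to wide needs a key that identifies each row uniquely; with repeated 
(ID1, ID2, ID3) combinations, dcast would otherwise aggregate. A sketch that adds 
a replicate counter before casting (the rep column name is an assumption for 
illustration):

library(reshape2)
df_long <- melt(df, id.vars = c("ID1", "ID2", "ID3"))
# within-key replicate index, so each wide row is identified uniquely
df_long$rep <- ave(seq_along(df_long$value),
                   df_long$ID1, df_long$ID2, df_long$ID3, df_long$variable,
                   FUN = seq_along)
df_wide <- dcast(df_long, ID1 + ID2 + ID3 + rep ~ variable, value.var = "value")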






[R] Reshape data frame with dcast and melt

2012-03-19 Thread mails
Hello,

I implemented two functions, reshape_long and reshape_wide (see full working 
example below), to reshape data frames. I created several small examples, and 
the two functions seemed to work properly. However, using the reshape_wide 
function on my real data sets (about 200,000 to 300,000 rows) failed: all values 
for X, Y and Z were set to 1. The structure of my real data looks exactly the 
same as the small example below. After working on it for two days, I think the 
problem is that the "primary key" (test_name, group_name and id) is only unique 
in the wide form; after applying the reshape_long function, the primary key is 
no longer unique. I was wondering if anyone can tell me whether the step 
d1 -> reshape_wide -> d2 can work at all, given the non-uniqueness of d1.


library(reshape2)
library(taRifx)

reshape_long <- function(data, ids) {
  # bring data into long form
  data_long <- melt(data, id.vars = ids, variable.name = "Data_Points",
                    value.name = "value")
  data_long$value <- as.numeric(data_long$value)
  # remove rows where the value is NA
  data_long <- data_long[!is.na(data_long$value), ]
  # re-sort data
  formula_sort <- as.formula(paste("~", paste(ids, collapse = "+")))
  data_long <- sort(data_long, f = formula_sort)
  return(data_long)
}

reshape_wide <- function(data, ids) {
  # bring data into wide form
  formula_wide <- as.formula(paste(paste(ids, collapse = "+"), "~ Data_Points"))
  data_wide <- dcast(data, formula_wide)
  # re-sort data
  formula_sort <- as.formula(paste("~", paste(ids, collapse = "+")))
  data_wide <- sort(data_wide, f = formula_sort)
  return(data_wide)
}




d <- data.frame(
  test_name  = c(rep("Test_A", 6), rep("Test_B", 6)),
  group_name = c(rep("Group_C", 3), rep("Group_D", 3),
                 rep("Group_C", 3), rep("Group_D", 3)),
  id = c("I1", "I2", "I3", "I4", "I5", "I6",
         "I1", "I2", "I3", "I7", "I8", "I9"),
  X = c(NA, NA, 1, 2, 3, 4, 5, 6, NA, 7, 8, 9),
  Y = as.numeric(10:21),
  Z = c(NA, 22, 23, NA, 24, NA, 25, 26, NA, 27, 28, 29)
)
d

d1 <- reshape_long(d, ids = c("test_name", "group_name", "id"))
d1

d2 <- reshape_wide(d1, ids = c("test_name", "group_name", "id"))
d2

identical(d, d2)
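
For reference: when the key on the left-hand side of the dcast formula does not 
identify a single cell, dcast falls back to aggregating (with the warning 
"Aggregation function missing: defaulting to length"), which is exactly how 
every X, Y and Z value can end up as 1. A sketch of a workaround, carrying a 
within-key replicate counter through both reshapes so the key stays unique (the 
rep column is an assumption):

d$rep <- ave(seq_len(nrow(d)), d$test_name, d$group_name, d$id, FUN = seq_along)
d1 <- reshape_long(d, ids = c("test_name", "group_name", "id", "rep"))
d2 <- reshape_wide(d1, ids = c("test_name", "group_name", "id", "rep"))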




[R] Help needed in interpreting linear models

2012-01-13 Thread mails
Dear members of the R-help list,

I sent the email below to the R-SIG-ME list to ask for help in interpreting some 
R output of fitted linear models. Unfortunately, I haven't received any answers 
yet. As I am not sure whether my email reached that mailing list, I am asking 
for help here:



Dear members of the R-SIG-ME list,


I am new to linear models and am struggling to interpret some of the R output, 
but I hope to get some advice here.

I created the following dummy data set:

scores <- c(2,6,10,12,14,20)
weight <- c(60,70,80,75,80,85)
height <- c(180,180,190,180,180,180)

The scores of a game/match should depend on the weight of the player but not on 
the height.

For me the output of the following two linear models makes sense:

> (lm1 <- summary(lm(scores ~ weight)))

Call:
lm(formula = scores ~ weight)

Residuals:
       1        2        3        4        5        6 
 1.08333 -1.41667 -3.91667  1.33333  0.08333  2.83333 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -38.0833    10.0394  -3.793  0.01921 * 
weight        0.6500     0.1331   4.885  0.00813 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.661 on 4 degrees of freedom
Multiple R-squared: 0.8564, Adjusted R-squared: 0.8205 
F-statistic: 23.86 on 1 and 4 DF,  p-value: 0.008134 

> (lm2 <- summary(lm(scores ~ height)))

Call:
lm(formula = scores ~ height)

Residuals:
 1  2  3  4  5  6 
-8.800e+00 -4.800e+00  1.377e-14  1.200e+00  3.200e+00  9.200e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.2000   139.6175   0.180    0.866
height       -0.0800     0.7684  -0.104    0.922

Residual standard error: 7.014 on 4 degrees of freedom
Multiple R-squared: 0.002703,   Adjusted R-squared: -0.2466 
F-statistic: 0.01084 on 1 and 4 DF,  p-value: 0.9221 

The p-value of the first output is 0.008134, which makes sense as scores and 
weight have a high correlation and therefore the scores "can be explained" by 
the explanatory variable weight very well. Hence, the R-squared value is close 
to 1. For the second example it also makes sense that the p-value is almost 1 
(p = 0.9221), as there is hardly any correlation between scores and height.

What is not clear to me is the output of my 3rd linear model, which includes 
both weight and height.

> (lm3 <- summary(lm(scores ~ weight + height)))

Call:
lm(formula = scores ~ weight + height)

Residuals:
 1  2  3  4  5  6 
 1.189e+00 -1.946e+00 -2.165e-15  4.865e-01 -1.081e+00  1.351e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 49.45946   33.50261   1.476  0.23635   
weight       0.71351    0.08716   8.186  0.00381 **
height      -0.50811    0.19096  -2.661  0.07628 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 1.677 on 3 degrees of freedom
Multiple R-squared: 0.9573, Adjusted R-squared: 0.9288 
F-statistic:  33.6 on 2 and 3 DF,  p-value: 0.008833 

It makes sense that the R-squared value is higher when one adds both explanatory 
variables to the linear model: the more variables are added, the more variance 
is explained, and therefore the fit of the model will be better. However, I do 
NOT understand why the p-value of height (Pr(>|t|) = 0.07628) is now almost 
significant. And I also do NOT understand why the overall p-value of 0.008833 
is less significant than the one from model lm1, which was 0.008134.

The p-value of weight being low (p = 0.00381) makes sense, as this factor 
"explains" the scores very well.


After fitting the 3 models (lm1, lm2 and lm3) I wanted to compare the 
weight-only model with the full model using the anova function, to check whether 
the factor height significantly improves the model. In other words, I wanted to 
check whether adding height to the model helps explain the scores of the 
players. The output of the anova looks as follows:

> lm1 <- lm(scores ~ weight)
> lm2 <- lm(scores ~ weight + height)
> anova(lm1, lm2)
Analysis of Variance Table

Model 1: scores ~ weight
Model 2: scores ~ weight + height
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1      4 28.3333                              
2      3  8.4324  1    19.901 7.0801 0.07628 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

In my opinion the p-value should be almost 1 and not close to significance 
(0.07), since, as we saw from the height-only model, height does not "explain" 
the scores at all. Here I thought that a significant p-value means that the 
factor height adds significant value to the model.
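
A note that may resolve the puzzle: anova(lm1, lm2) is a partial F test of 
height given weight, and it is identical to the t test for height in the full 
model, F = (-2.661)^2 = 7.0801 with the same p = 0.07628. So it does not test 
whether height explains the scores on its own (that was the p = 0.9221 model) 
but whether height adds information once weight is accounted for, a 
suppression-type pattern. An added-variable plot makes this partial relationship 
visible (a sketch using the car package):

library(car)
avPlots(lm(scores ~ weight + height))  # each predictor adjusted for the other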


I would be very grateful if anyone could help me in interpreting the R
output.

Best regards

 







Re: [R] Help needed in interpreting linear models

2012-01-13 Thread mails
Hi Petr,

thanks for your answer.

First of all, it's not homework; I am a student and need to analyse cancer data 
using linear models. I have been looking into this topic for a week now and am 
still struggling to interpret some of the R output, which is why I was asking 
for help here.

I don't quite understand your answer, because the 180/190 values belong to 
height and not to weight. What do you want to show with plot(scores, weight)? 
What I can see from the plot is that there is a correlation between the two 
variables and that therefore weight "explains" scores.

Regards



[R] Relevel dynamically

2012-01-16 Thread mails
Dear list,


I am running R code which produces a data.frame with a variable number of rows. 
For plotting purposes I need to relevel the column called "name", but since the 
dimensions of the data.frame can vary, I do not know how to relevel it 
dynamically.

Here is an example data.frame called df to illustrate what I am trying to achieve:

> df <- data.frame(cbind(c("a","b","c","d","e"), c(5,6,4,8,9)))
> colnames(df) <- c("name", "value")
> df
  name value
1    a     5
2    b     6
3    c     4
4    d     8
5    e     9

Now the releveling:
df$name <- factor(df$name, levels=c("b","d","c","e","a"))

That was easy. But the plotting goes wrong when the data.frame has fewer rows:

> df
  name value
1    a     5
4    d     8
5    e     9

Now my script runs the same line again:
df$name <- factor(df$name, levels=c("b","d","c","e","a"))

R does not throw any errors; however, lattice does not plot the results 
correctly. So I need to dynamically create a line which does the following:
df$name <- factor(df$name, levels=c("d","e","a"))
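
A one-line dynamic relevel can be built by intersecting the preferred ordering 
with the values actually present (a sketch; intersect() keeps the order of its 
first argument):

preferred <- c("b", "d", "c", "e", "a")
df$name <- factor(df$name, levels = intersect(preferred, as.character(df$name)))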


Thanks for any advice in advance!

Cheers






Re: [R] Relevel dynamically

2012-01-16 Thread mails
Hi Michael,

yes, you were right and your solution works great. Thanks a lot. I had managed 
to do it myself, but with 4 lines of code rather than only 1.


Cheers



[R] unlink: parameter "force" not available in R windows version

2012-02-04 Thread mails
Hello,

I use the R command

unlink(file, recursive = TRUE, force = TRUE)

to delete folders, subfolders and files on Mac OS X.


When I ran my script on a Windows computer I realised that unlink has no 
"force" option/parameter there. I do not know why, but without the "force" 
option R cannot delete the subfolders/files.

What can I do?
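
A version guard is one portable workaround (a sketch; per the follow-ups in this 
thread, the force argument only appeared around R 2.14.x):

if (getRversion() >= "2.14.0") {
  unlink(file, recursive = TRUE, force = TRUE)
} else {
  # older versions: try clearing read-only permissions first, then delete
  Sys.chmod(list.files(file, recursive = TRUE, full.names = TRUE), mode = "0777")
  unlink(file, recursive = TRUE)
}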



Re: [R] unlink: parameter "force" not available in R windows version

2012-02-04 Thread mails
Hello,

I am not exactly sure, but I think the version installed on that Windows 
computer is 2.13.x, so it looks like the people who develop that function 
updated it recently. I need to check that on Monday.

Thanks for your answer.







Re: [R] unlink: parameter "force" not available in R windows version

2012-02-05 Thread mails
Hi Brian,

that explains it. I will update R to 2.14.x tomorrow.

Thanks for your post.





[R] Problems reading tab-delim files using read.table and read.delim

2012-02-08 Thread mails
Hello,

I used read.xlsx to read in Excel files, but for large files it turned out to be 
not very efficient. For that reason I use a programme which writes each sheet of 
an Excel file into a tab-delimited txt file. After that I tried using read.table 
and read.delim to read in those txt files. Unfortunately, the results are not as 
expected. To show you what I mean, I created a tiny Excel sheet with a few rows 
and columns and read it in using read.xlsx. I also used my script to write that 
sheet to a tab-delimited txt file and read that one in with read.table and 
read.delim. Here is the R output:



> (test <- read.table("Sheet1.txt", header=TRUE, sep="\t"))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  line 1 did not have 5 elements

> (test <- read.delim("Sheet1.txt", header=TRUE, sep="\t"))
     c1 c2 c3  X
123 213 NA NA NA
234 asd NA NA NA

> (test <- read.xlsx(file.path(data), "Sheet1"))
   c1  c2 c3 NA. NA..1 NA..2
1 123 213
2 234 asd NA


The last output is how I would expect the file to be read in. Columns 4 to 6 do 
not have any header entries. In R1C4 I added some white space, as well as in 
R2C5 and R2C6, which is read in correctly by the read.xlsx function.

read.table and read.delim seem unable to handle such files. Is there any 
workaround for that?
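
One likely explanation and a workaround sketch: when the header line has fewer 
tab-separated fields than the data lines, read.table/read.delim treat the first 
column as row names, which matches the shifted read.delim output above. Reading 
the header and the body separately avoids that:

hdr  <- scan("Sheet1.txt", what = "", sep = "\t", nlines = 1)  # header fields only
test <- read.table("Sheet1.txt", header = FALSE, sep = "\t", skip = 1,
                   fill = TRUE, strip.white = TRUE)
names(test)[seq_along(hdr)] <- hdr  # extra columns keep their default V* names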


Cheers



[R] Outlier removal techniques

2012-02-09 Thread mails
Hello,


I need to analyse a data matrix with dimensions of 30x100. Before analysing the 
data there is, however, a need to remove outliers. I have read quite a lot about 
outlier removal already, and the most common technique for it seems to be 
Principal Component Analysis (PCA). However, I think this technique is quite 
subjective. When is an outlier an outlier? I uploaded an example PCA plot here:

http://s14.postimage.org/oknyya1ld/pca.png

Should we treat the green and red dots as outliers already, or only the blue 
one, which lies outside the 95% confidence interval? It seems very arbitrary how 
people remove outliers using PCA.

I also thought about fitting a linear model through my data and looking at the 
distribution of the residuals. However, the problem with using linear models is 
that one can never be sure that the chosen model describes the data best. Under 
model A, for instance, we might treat sample 1 as an outlier, while under a 
different model B sample 1 might not be an outlier at all.

I also had a brief look at k-means clustering, but I think it's not the right 
thing to go for. Again, how does one decide which cluster is an outlier? It is 
also known that different clustering methods lead to totally different results, 
so which one should I choose?


Is there any other way to remove outliers from data non-subjectively? I would 
really appreciate any ideas/comments you might have on this topic.
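
One comparatively objective screen is a robust z-score based on the median and 
the median absolute deviation, which depends on neither a fitted model nor a 
clustering choice (a sketch; the threshold k = 3 is still a judgment call):

flag_outliers <- function(x, k = 3) {
  # TRUE where x lies more than k robust SDs from the median
  abs(x - median(x, na.rm = TRUE)) > k * mad(x, na.rm = TRUE)
}
mat <- matrix(rnorm(30 * 100), nrow = 30)     # stand-in for the 30 x 100 matrix
outlier_mask <- apply(mat, 2, flag_outliers)  # column-wise outlier flags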


Cheers
