[R] Anova Type II and Contrasts
The study design of the data I have to analyse is simple. There is one control group (CTRL) and two different treatment groups (TREAT_1 and TREAT_2). The data also include two covariates, COV1 and COV2. I have been asked to check whether there is a linear or quadratic treatment effect in the data. I created a dummy data set to explain my situation:

df1 <- data.frame(
  Observation = c(rep("CTRL", 15), rep("TREAT_1", 13), rep("TREAT_2", 12)),
  COV1 = c(rep("A1", 30), rep("A2", 10)),
  COV2 = c(rep("B1", 5), rep("B2", 5), rep("B3", 10),
           rep("B1", 5), rep("B2", 5), rep("B3", 10)),
  Variable = c(3944133, 3632461, 3351754, 3655975, 3487722, 3644783, 3491138,
               3328894, 3654507, 3465627, 3511446, 3507249, 3373233, 3432867,
               3640888, 3677593, 3585096, 3441775, 3608574, 3669114, 4000812,
               3503511, 3423968, 3647391, 3584604, 3548256, 3505411, 3665138,
               4049955, 3425512, 3834061, 3639699, 3522208, 3711928, 3576597,
               3786781, 3591042, 3995802, 3493091, 3674475)
)

plot(Variable ~ Observation, data = df1)

As you can see from the plot, there is a linear relationship between the control and the treatment groups. To check whether this linear effect is statistically significant I change the contrasts using the contr.poly() function and fit a linear model like this:

contrasts(df1$Observation) <- contr.poly(levels(df1$Observation))
lm1 <- lm(log(Variable) ~ Observation, data = df1)
summary.lm(lm1)

From the summary we can see that the linear effect is statistically significant:

Observation.L  0.029141   0.012377   2.355    0.024 *
Observation.Q  0.002233   0.012482   0.179    0.859

However, this first model does not include either of the two covariates. Including them results in a non-significant p-value for the linear relationship:

lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1)
summary.lm(lm2)

Observation.L   0.04116    0.02624   1.568    0.126
Observation.Q   0.01003    0.01894   0.530    0.600
COV1A2         -0.01203    0.04202  -0.286    0.776
COV2B2         -0.02071    0.02202  -0.941    0.354
COV2B3         -0.02083    0.02066  -1.008    0.320

So far so good. However, I have been told to conduct a Type II ANOVA rather than a Type I. To conduct a Type II ANOVA I used the Anova() function from the car package:

Anova(lm2, type = "II")

Anova Table (Type II tests)

Response: log(Variable)
              Sum Sq Df F value Pr(>F)
Observation 0.006253  2  1.4651 0.2453
COV1        0.000175  1  0.0820 0.7763
COV2        0.002768  2  0.6485 0.5292
Residuals   0.072555 34

The problem with using Type II is that you do not get a p-value for the linear and quadratic effects, so I do not know whether the treatment effect is linear and/or quadratic. I found out that the following code produces the same p-value for Observation as the Anova() function. However, the result also does not include any p-values for the linear or quadratic effect:

lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1)
lm3 <- lm(log(Variable) ~ COV1 + COV2, data = df1)
anova(lm2, lm3)

Does anybody know how to combine a Type II ANOVA with polynomial contrasts so that I obtain p-values for the linear and quadratic effects? Help would be very much appreciated.

Best
Peter
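Because the model contains no interaction terms, testing each polynomial contrast coefficient with the covariates still in the model is in the spirit of a Type II test for that single degree of freedom. One hedged sketch (untested; it assumes the df1 and lm2 objects defined above) uses car::linearHypothesis:

library(car)

contrasts(df1$Observation) <- contr.poly(levels(df1$Observation))
lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1)

# single-df tests of the polynomial contrasts, adjusted for the covariates
linearHypothesis(lm2, "Observation.L = 0")   # linear component
linearHypothesis(lm2, "Observation.Q = 0")   # quadratic component

Each call gives a one-degree-of-freedom F test, so the resulting p-values should match the t tests for Observation.L and Observation.Q already shown in summary(lm2) above.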
[R] Delete rows from data.frame matching a certain criteria
Hello,

consider the following data.frame:

test <- data.frame(n = c(1,2,3,4,5),
                   v = c(6,5,7,5,3),
                   pattern = c(1,1,NA,1,NA))

> test
  n v pattern
1 1 6       1
2 2 5       1
3 3 7      NA
4 4 5       1
5 5 3      NA

I tried to use apply and the adply function to set v to NA where pattern is 1 and to leave v unchanged where pattern is NA. So basically the result should look like this:

> test
  n  v pattern
1 1 NA       1
2 2 NA       1
3 3  7      NA
4 4 NA       1
5 5  3      NA

So far I have solved it by creating subsets and using merge, but it turns out to be super slow. Is there a way to do that with the apply function?

Any help/hint is appreciated.

Thanks
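A vectorised replacement, sketched on the test data above and assuming pattern is either 1 or NA as in the example, avoids apply entirely:

# blank out v wherever pattern is 1; rows with pattern == NA are untouched
test$v[!is.na(test$pattern) & test$pattern == 1] <- NA

# equivalent formulation with ifelse()
test$v <- ifelse(is.na(test$pattern), test$v, NA)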
[R] Help on reshape function
Hello,

I am trying to reshape a data.frame from wide format into long format. Although the reshape R documentation lists some examples, I am struggling to bring my data.frame into long format and then transform it back into wide format. The data.frame I am looking at is:

df <- data.frame(ID1 = c(1,1,1,1,1,1,1,1,1),
                 ID2 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                 ID3 = c("E", "E", "E", "E", "E", "E", "E", "E", "E"),
                 X1 = c(1,4,3,5,2,4,6,4,2),
                 X2 = c(6,8,9,6,7,8,9,6,7),
                 X3 = c(7,6,7,5,6,5,6,7,5),
                 X4 = c(1,2,1,2,3,1,2,1,2))

> df
  ID1 ID2 ID3 X1 X2 X3 X4
1   1   A   E  1  6  7  1
2   1   A   E  4  8  6  2
3   1   A   E  3  9  7  1
4   1   B   E  5  6  5  2
5   1   B   E  2  7  6  3
6   1   B   E  4  8  5  1
7   1   C   E  6  9  6  2
8   1   C   E  4  6  7  1
9   1   C   E  2  7  5  2

I want to use the reshape function to get the following result:

> df
   ID1 ID2 ID3 X
1    1   A   E 1
2    1   A   E 4
3    1   A   E 3
4    1   B   E 5
5    1   B   E 2
6    1   B   E 4
7    1   C   E 6
8    1   C   E 4
9    1   C   E 2
10   1   A   E 6
11   1   A   E 8
12   1   A   E 9
13   1   B   E 6
14   1   B   E 7
15   1   B   E 8
16   1   C   E 9
17   1   C   E 6
18   1   C   E 7
19   1   A   E 7
20   1   A   E 6
21   1   A   E 7
22   1   B   E 5
23   1   B   E 6
24   1   B   E 5
25   1   C   E 6
26   1   C   E 7
27   1   C   E 5
28   1   A   E 1
29   1   A   E 2
30   1   A   E 1
31   1   B   E 2
32   1   B   E 3
33   1   B   E 1
34   1   C   E 2
35   1   C   E 1
36   1   C   E 2

Can anyone help?

Cheers
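A sketch (untested) of two ways to get that long layout from the df above; both add a bookkeeping column recording which of X1..X4 each value came from, which can simply be dropped afterwards:

# base R reshape(): stack X1..X4 into a single column X
df_long <- reshape(df, direction = "long",
                   varying = c("X1", "X2", "X3", "X4"),
                   v.names = "X", timevar = "source",
                   times = c("X1", "X2", "X3", "X4"))

# or with reshape2, as in the follow-up below
library(reshape2)
df_long2 <- melt(df, id.vars = c("ID1", "ID2", "ID3"), value.name = "X")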
Re: [R] Help on reshape function
Hey Michael,

thanks for your help. I think the reshape2 library works much better than the base reshape function. However, I still cannot transform my data.frame back into wide format once I have used the melt function to bring it into long format. How do I get it back to wide format? The documentation, unfortunately, is not very detailed.
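One possible sketch (untested, continuing from the df defined in the first message): dcast() needs every row of the molten data to be uniquely identified, so adding a within-group counter before casting keeps it from falling back to aggregation:

library(reshape2)

long <- melt(df, id.vars = c("ID1", "ID2", "ID3"))

# number the replicate rows within each (ID1, ID2, ID3, variable) cell
long$rep <- ave(seq_len(nrow(long)),
                long$ID1, long$ID2, long$ID3, long$variable,
                FUN = seq_along)

wide <- dcast(long, ID1 + ID2 + ID3 + rep ~ variable, value.var = "value")
wide$rep <- NULL   # drop the helper column again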
[R] Reshape data frame with dcast and melt
Hello,

I implemented two functions, reshape_long and reshape_wide (see the full working example below), to reshape data frames. I created several small examples and the two functions seemed to work properly. However, using the reshape_wide function on my real data sets (about 200,000 to 300,000 rows) failed: all values for X, Y and Z were set to 1. The structure of my real data looks exactly the same as the small example below. After working on it for 2 days I think the problem is that the "primary key" (test_name, group_name and id) is only unique in the wide form; after applying the reshape_long function the primary key is no longer unique. I was wondering if anyone can tell me whether the step d1 -> reshape_wide -> d2 can work at all because of the non-uniqueness of the key in d1.

library(reshape2)
library(taRifx)

reshape_long <- function(data, ids) {
  # Bring data into long form
  data_long <- melt(data, id.vars = ids,
                    variable.name = "Data_Points", value.name = "value")
  data_long$value <- as.numeric(data_long$value)

  # Remove rows where the analyte value is NA
  data_long <- data_long[!is.na(data_long$value), ]

  # Resort data
  formula_sort <- as.formula(paste("~", paste(ids, collapse = "+")))
  data_long <- sort(data_long, f = formula_sort)

  return(data_long)
}

reshape_wide <- function(data, ids) {
  # Bring data into wide form
  formula_wide <- as.formula(paste(paste(ids, collapse = "+"), "~ Data_Points"))
  data_wide <- dcast(data, formula_wide)

  # Resort data
  formula_sort <- as.formula(paste("~", paste(ids, collapse = "+")))
  data_wide <- sort(data_wide, f = formula_sort)

  return(data_wide)
}

d <- data.frame(
  test_name = c(rep("Test_A", 6), rep("Test_B", 6)),
  group_name = c(rep("Group_C", 3), rep("Group_D", 3),
                 rep("Group_C", 3), rep("Group_D", 3)),
  id = c("I1", "I2", "I3", "I4", "I5", "I6",
         "I1", "I2", "I3", "I7", "I8", "I9"),
  X = c(NA, NA, 1, 2, 3, 4, 5, 6, NA, 7, 8, 9),
  Y = as.numeric(10:21),
  Z = c(NA, 22, 23, NA, 24, NA, 25, 26, NA, 27, 28, 29)
)
d

d1 <- reshape_long(d, ids = c("test_name", "group_name", "id"))
d1

d2 <- reshape_wide(d1, ids = c("test_name", "group_name", "id"))
d2

identical(d, d2)
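If the id columns do not uniquely identify the rows of the wide data, dcast() aggregates each cell, and its default aggregation function is length, which would produce exactly the all-1 values described above. A possible sketch, untested on the real data, is to carry an explicit observation counter (a hypothetical helper column obs) through the round trip so the key stays unique:

# make the key unique before going long
d$obs <- ave(seq_len(nrow(d)), d$test_name, d$group_name, d$id, FUN = seq_along)

d1 <- reshape_long(d, ids = c("test_name", "group_name", "id", "obs"))
d2 <- reshape_wide(d1, ids = c("test_name", "group_name", "id", "obs"))

d2$obs <- NULL   # drop the helper column before comparing with the original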
[R] Help needed in interpreting linear models
Dear members of the R-help list,

I have sent the email below to the R-SIG-ME list to ask for help in interpreting some R output of fitted linear models. Unfortunately, I haven't yet received any answers. As I am not sure whether my email was sent successfully to that mailing list, I am asking for help here:

Dear members of the R-SIG-ME list,

I am new to linear models and am struggling with interpreting some of the R output, but hope to get some advice here. I created the following dummy data set:

scores <- c(2,6,10,12,14,20)
weight <- c(60,70,80,75,80,85)
height <- c(180,180,190,180,180,180)

The scores of a game/match should depend on the weight of the player but not on the height. For me the output of the following two linear models makes sense:

> (lm1 <- summary(lm(scores ~ weight)))

Call:
lm(formula = scores ~ weight)

Residuals:
       1        2        3        4        5        6
 1.08333 -1.41667 -3.91667  1.33333  0.08333  2.83333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.0833    10.0394  -3.793  0.01921 *
weight        0.6500     0.1331   4.885  0.00813 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.661 on 4 degrees of freedom
Multiple R-squared: 0.8564, Adjusted R-squared: 0.8205
F-statistic: 23.86 on 1 and 4 DF,  p-value: 0.008134

> (lm2 <- summary(lm(scores ~ height)))

Call:
lm(formula = scores ~ height)

Residuals:
         1          2          3          4          5          6
-8.800e+00 -4.800e+00  1.377e-14  1.200e+00  3.200e+00  9.200e+00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.2000   139.6175   0.180    0.866
height       -0.0800     0.7684  -0.104    0.922

Residual standard error: 7.014 on 4 degrees of freedom
Multiple R-squared: 0.002703, Adjusted R-squared: -0.2466
F-statistic: 0.01084 on 1 and 4 DF,  p-value: 0.9221

The p-value of the first output is 0.008134, which makes sense because scores and weight are highly correlated and therefore the scores "can be explained" by the explanatory variable weight very well. Hence the R-squared value is close to 1. For the second example it also makes sense that the p-value is close to 1 (p = 0.9221), as there is hardly any correlation between scores and height.

What is not clear to me is shown in my third linear model, which includes both weight and height:

> (lm3 <- summary(lm(scores ~ weight + height)))

Call:
lm(formula = scores ~ weight + height)

Residuals:
         1          2          3          4          5          6
 1.189e+00 -1.946e+00 -2.165e-15  4.865e-01 -1.081e+00  1.351e+00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.45946   33.50261   1.476  0.23635
weight       0.71351    0.08716   8.186  0.00381 **
height      -0.50811    0.19096  -2.661  0.07628 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.677 on 3 degrees of freedom
Multiple R-squared: 0.9573, Adjusted R-squared: 0.9288
F-statistic: 33.6 on 2 and 3 DF,  p-value: 0.008833

It makes sense that the R-squared value is higher when both explanatory variables are added to the linear model: the more variables are added, the more variance is explained, and therefore the fit of the model will be better. However, I do NOT understand why the p-value of height (Pr(>|t|) = 0.07628) is now almost significant. And I also do NOT understand why the overall p-value of 0.008833 is less significant than the one from model lm1, which was 0.008134. The low p-value of weight (p = 0.00381) makes sense, as this factor "explains" the scores very well.

After fitting the 3 models (lm1, lm2 and lm3) I wanted to compare the model with weight only against the model with weight and height using the anova function, to check whether the factor height significantly improves the model. In other words, I wanted to check whether adding height to the model helps to explain the scores of the players. The output of the anova looks as follows:

> lm1 <- lm(scores ~ weight)
> lm2 <- lm(scores ~ weight + height)
> anova(lm1, lm2)

Analysis of Variance Table

Model 1: scores ~ weight
Model 2: scores ~ weight + height
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)
1      4 28.3333
2      3  8.4324  1    19.901 7.0801 0.07628 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In my opinion the p-value should be close to 1 and not close to significance (0.07), because, as we have seen from model lm2, height does not "explain" the scores at all. My understanding was that a significant p-value here means that the factor height adds significant value to the model.

I would be very grateful if anyone could help me in interpreting this R output.

Best regards
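A small sketch (using only the vectors defined above) may help to see what is going on: the anova() comparison tests height after adjusting for weight, so its F statistic is just the square of the t value for height in the two-variable model, and an added-variable view shows that scores and height are in fact related once both are adjusted for weight:

scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

lm3 <- lm(scores ~ weight + height)

# the partial F test for height equals the squared t statistic from lm3
coef(summary(lm3))["height", "t value"]^2    # approx. 7.08, p = 0.07628

# added-variable view: scores and height, each adjusted for weight
r_scores <- resid(lm(scores ~ weight))
r_height <- resid(lm(height ~ weight))
cor(r_scores, r_height)                      # the partial correlation is far from 0
plot(r_height, r_scores)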
Re: [R] Help needed in interpreting linear models
Hi Petr,

thanks for your answer. First of all, it is not homework; I am a student and need to analyse cancer data using linear models. I have been looking into the topic for a week now and am still struggling to interpret some of the R output, which is why I was asking for help here. I don't quite understand your answer, because the 180/190 values belong to height and not to weight. What do you want to show with plot(scores, weight)? What I can see from the plot is that there is a correlation between the two variables and therefore weight "explains" scores.

Regards
[R] Relevel dynamically
Dear list,

I am running R code which produces a data.frame with a variable number of rows. For plotting purposes I need to relevel the column called "name", but since the dimensions of the data.frame can vary I do not know how to relevel it dynamically. Here is an example data.frame called df to illustrate what I am trying to achieve:

> df <- data.frame(cbind(c("a","b","c","d","e"), c(5,6,4,8,9)))
> colnames(df) <- c("name", "value")
> df
  name value
1    a     5
2    b     6
3    c     4
4    d     8
5    e     9

Now the releveling:

df$name <- factor(df$name, levels=c("b","d","c","e","a"))

That was easy. But the plotting goes wrong when the data.frame has fewer rows:

> df
  name value
1    a     5
4    d     8
5    e     9

Now my script runs the same line again:

df$name <- factor(df$name, levels=c("b","d","c","e","a"))

R does not throw any errors; however, lattice does not plot the results correctly. So I need to dynamically create a line which does the following:

df$name <- factor(df$name, levels=c("d","e","a"))

Thanks for any advice in advance!

Cheers
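One possible one-liner, sketched under the assumption that the full desired ordering is known in advance, keeps only the levels that actually occur, in the desired order:

desired <- c("b", "d", "c", "e", "a")

# intersect() preserves the order of its first argument, so only levels
# present in the current data survive, in the desired order
df$name <- factor(as.character(df$name), levels = intersect(desired, df$name))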
Re: [R] Relevel dynamically
Hi Michael,

yes, you were right and your solution works great. Thanks a lot. I had managed to do it myself, but with 4 lines of code rather than only 1.

Cheers
[R] unlink: parameter "force" not available in R windows version
Hello,

I use the R command

unlink(file, recursive = TRUE, force = TRUE)

to delete folders, subfolders and files on Mac OS X. When I was running my script on a Windows computer I realised that the version of unlink there has no "force" argument. I do not know why, but without that "force" option R cannot delete the subfolders/files. What can I do?
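The follow-ups below point to the force argument only being available from R 2.14.x onwards, so on an older Windows installation one possible workaround, sketched here and untested (path stands for the folder to be removed), is to clear the read-only flag on everything first and then delete without force:

# make everything writable, then delete the whole tree
files <- list.files(path, recursive = TRUE, full.names = TRUE, all.files = TRUE)
Sys.chmod(files, mode = "0777")
unlink(path, recursive = TRUE)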
Re: [R] unlink: parameter "force" not available in R windows version
Hello,

I am not exactly sure, but I think the version installed on that Windows computer is 2.13.x, so it looks like the people who develop that function updated it recently. I need to check that on Monday. Thanks for your answer.
Re: [R] unlink: parameter "force" not available in R windows version
Hi Brian,

that explains it. I will update R to 2.14.x tomorrow. Thanks for your post.
[R] Problems reading tab-delim files using read.table and read.delim
Hello,

I used read.xlsx to read in Excel files, but for large files it turned out not to be very efficient. For that reason I now use a programme which writes each sheet of an Excel file into a tab-delimited txt file. After that I tried using read.table and read.delim to read in those txt files. Unfortunately, the results are not as expected. To show you what I mean I created a tiny Excel sheet with a few rows and columns and read it in using read.xlsx. I also used my script to write that sheet to a tab-delimited txt file and read that one in with read.table and read.delim. Here is the R output:

> (test <- read.table("Sheet1.txt", header=TRUE, sep="\t"))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 1 did not have 5 elements

> (test <- read.delim("Sheet1.txt", header=TRUE, sep="\t"))
   c1  c2 c3  X
123 213 NA NA NA
234 asd NA NA NA

> (test <- read.xlsx(file.path(data), "Sheet1"))
   c1  c2  c3 NA. NA..1 NA..2
1 123 213
2 234 asd  NA

The last output is how I would expect the file to be read in. Columns 4 to 6 do not have any header cells. Into R1C4 I added some white spaces, as well as into R2C5 and R2C6, and these are read in correctly by the read.xlsx function. read.table and read.delim seem not to be able to handle such files. Is there any workaround for that?

Cheers
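A possible workaround, sketched and untested against the actual export (the file name Sheet1.txt is taken from the example above): determine the maximum number of tab-separated fields first, read the data without treating any line as malformed, and re-attach the header names afterwards.

# how many tab-separated fields does the widest line have?
n <- max(count.fields("Sheet1.txt", sep = "\t", quote = "", comment.char = ""))

# read everything as plain columns, padding short lines with NA
test <- read.table("Sheet1.txt", header = FALSE, sep = "\t", skip = 1,
                   fill = TRUE, quote = "", comment.char = "",
                   col.names = paste0("V", seq_len(n)),
                   stringsAsFactors = FALSE)

# re-attach the original header, leaving the unnamed trailing columns as V5, V6, ...
hdr <- scan("Sheet1.txt", what = character(), sep = "\t", nlines = 1, quiet = TRUE)
names(test)[seq_along(hdr)] <- make.names(hdr, unique = TRUE)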
[R] Outlier removal techniques
Hello,

I need to analyse a data matrix with dimensions 30x100. Before analysing the data there is, however, a need to remove outliers. I have read quite a lot about outlier removal already, and the most common technique seems to be Principal Component Analysis (PCA). However, I think that this technique is quite subjective. When is an outlier an outlier? I uploaded an example PCA plot here:

http://s14.postimage.org/oknyya1ld/pca.png

Should we treat the green and red dots as outliers already, or only the blue one which lies outside the 95% confidence interval? It seems very arbitrary how people remove outliers using PCA.

I also thought about fitting a linear model through my data and looking at the distribution of the residuals. However, the problem with using linear models is that one can never be sure that the model used is the one which describes the data best. Under model A, for instance, we might treat sample 1 as an outlier, while under a different model B sample 1 might not be an outlier at all.

I had a brief look at k-means clustering as well, but I think it's not the right thing to go for. Again, how does one decide which cluster is an outlier? And it is also known that different cluster analyses lead to totally different results, so which one should I choose?

Is there any other way to remove outliers from data non-subjectively? I would really appreciate any ideas/comments you might have on that topic.

Cheers
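One rule-based criterion that is sometimes used, sketched here only (X stands for the hypothetical 30x100 matrix; because there are far more variables than samples, the distances are computed on a few principal component scores rather than on the raw matrix): flag points whose robust Mahalanobis distance exceeds a chi-square cutoff.

library(MASS)

# scores on the first few principal components of the 30 x 100 matrix X
pcs <- prcomp(X, scale. = TRUE)$x[, 1:3]

# robust centre and covariance (MCD), so the outliers themselves do not
# distort the distance measure
rob <- cov.rob(pcs, method = "mcd")
d2  <- mahalanobis(pcs, center = rob$center, cov = rob$cov)

# flag observations beyond the 97.5% quantile of the chi-square distribution
flagged <- which(d2 > qchisq(0.975, df = ncol(pcs)))
flagged

The 97.5% quantile is itself a convention, so this makes the criterion explicit and reproducible rather than truly objective.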