[R] identify duplicate entries in data frame and calculate mean
I have a data frame with 10 columns. The last column holds an alphanumeric identifier. For most rows this identifier is unique within the file, but some identifiers occur in duplicate, triplicate or more. When they occur more than once, they are in consecutive rows; so when there is a duplicate, triplicate or quadruplicate (let's call them multiplicates), the entries are in consecutive rows. Column 7 holds an integer (which may or may not be unique; it does not matter).

I want to identify each set of multiple entries (multiplicates) in column 10 and then, for each multiplicate, calculate the mean of the integers in column 7.

As an example, I will show just two columns:

Length  Identifier
321     A234
350     A234
340     A234
180     B123
198     B225

What I want to do (in the above example) is collapse all the A234's and report the mean, to get this:

Length  Identifier
337     A234
180     B123
198     B225

Matthew

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] identify duplicate entries in data frame and calculate mean
Thank you very much, Tom. This gets me thinking in the right direction. One thing I should have mentioned is that the data frame will have a little over 40,000 rows.

On 5/24/2016 4:08 PM, Tom Wright wrote:
> Using dplyr:
>
> library(dplyr)
> x <- data.frame(Length = c(321, 350, 340, 180, 198),
>                 ID = c(rep('A234', 3), 'B123', 'B225'))
> x %>% group_by(ID) %>% summarise(m = mean(Length))
Re: [R] identify duplicate entries in data frame and calculate mean
Thanks, Tom. I was making a mistake looking at your example, and that was my problem. Cool answer; it works great. Thank you very much.

Matthew

On 5/24/2016 4:23 PM, Tom Wright wrote:
> Don't see that as being a big problem. If your data grows, then dplyr
> supports connections to external databases. Alternatively, if you just
> want a mean, most databases can do that directly in SQL.
Re: [R] identify duplicate entries in data frame and calculate mean
Thank you very much, Dan. These work great. Two more great answers to my question.

Matthew

On 5/24/2016 4:15 PM, Nordlund, Dan (DSHS/RDA) wrote:
You have several options.

1. You could use the aggregate function. If your data frame is called DF, you could do something like

   with(DF, aggregate(Length, list(Identifier), mean))

2. You could use the dplyr package like this

   library(dplyr)
   summarize(group_by(DF, Identifier), mean(Length))

Hope this is helpful,

Dan

Daniel Nordlund, PhD
Research and Data Analysis Division
Services & Enterprise Support Administration
Washington State Department of Social and Health Services
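Both answers in this thread collapse every row sharing an identifier, wherever it appears in the frame. Since the question states that multiplicates always occupy consecutive rows, that is equivalent here; but if identical identifiers could ever occur in separate consecutive runs that should stay apart, grouping on run boundaries is safer. A base-R sketch using the example data from the thread (the run-id trick is an assumption, not taken from the replies):

```r
# Example data from the thread
DF <- data.frame(Length = c(321, 350, 340, 180, 198),
                 Identifier = c("A234", "A234", "A234", "B123", "B225"))

# Collapse each identifier and take the mean Length
aggregate(Length ~ Identifier, data = DF, FUN = mean)
# A234 -> 337, B123 -> 180, B225 -> 198

# If only *consecutive* runs should be collapsed, number the runs first
run <- cumsum(c(TRUE, DF$Identifier[-1] != DF$Identifier[-nrow(DF)]))
aggregate(Length ~ run + Identifier, data = DF, FUN = mean)
```

With the example data both calls give the same result, because each identifier forms a single run.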
[R] use value in variable to be name of another variable
I want to get a value that has been assigned to a variable, and then use that value as the name of another variable. For example:

tTargTFS[1,1]
# returns:
#        V1
# "AT1G01010"

Now, I want to make AT1G01010 the name of a variable:

AT1G01010 <- tTargTFS[-1,1]

Then, go to the next element, tTargTFS[1,2], which produces

#        V1
# "AT1G01030"

and then

AT1G01030 <- tTargTFS[-1,2]

I want to do this up to tTargTFS[1, 2666], so I want to do this in a script and not manually. tTargTFS is a list of 2: chr [1:265, 1:2666], but I also have the data in a data frame of 265 observations of 2666 variables, if that data structure makes things easier.

My initial attempts are not working. Starting with a test data structure that is a little simpler, I have tried:

for (i in 1:4) {
  ATG <- tTargTFS[1, i]
  assign(cat(ATG), tTargTFS[-1, i])
}

Matthew
Re: [R] use value in variable to be name of another variable
Hi Jim,

Wow! And it does exactly what I was looking for. Thank you very much. That assign function is pretty nice. I should become more familiar with it.

Matthew

On 7/11/2016 5:59 PM, Jim Lemon wrote:
Hi Matthew,

This question is a bit mysterious, as we don't know what the object "chr" is. However, have a look at this and see if it is close to what you want to do.

# set up a little matrix of character values
tTargTFS <- matrix(paste("A", rep(1:4, each=4), "B", rep(1:4, 4), sep=""), ncol=4)
# try the assignment on the first row and column
assign(tTargTFS[1,1], tTargTFS[-1,1])
# see what it looks like - okay
A1B1
# run the assignment over the matrix
for(i in 1:4) assign(tTargTFS[1,i], tTargTFS[-1,i])
# see what the variables look like
A1B1
A2B1
A3B1
A4B1

It does what I would expect.

Jim
Re: [R] use value in variable to be name of another variable
Hi Rolf,

Thanks for the warning. I think it was because my initial efforts used the assign function that Jim provided his solution using it. Any suggestions for how it could be done without assign()?

Matthew

On 7/11/2016 6:31 PM, Rolf Turner wrote:
On 12/07/16 10:13, Matthew wrote:

> Hi Jim,
> Wow! And it does exactly what I was looking for. Thank you very much.
> That assign function is pretty nice. I should become more familiar with it.

Indeed you should, and assign() is indeed nice and useful and handy. But it should be used with care and circumspection. It *alters the global environment*, which is fraught with peril. Generally speaking, most things that can be done with assign() (and its companion function get()) are better and more safely done using lists and functions and other "natural" R-ish constructs. Resist the temptation to turn R into a macro language.

cheers,

Rolf Turner
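One way to follow Rolf's advice without assign() is to keep the columns in a single named list rather than creating free-standing global variables. A sketch using Jim's test matrix (the structure is taken from the thread; the list-based approach itself is an assumption, not code anyone posted):

```r
# Jim's test matrix: first row holds the names, remaining rows the values
tTargTFS <- matrix(paste("A", rep(1:4, each = 4), "B", rep(1:4, 4), sep = ""), ncol = 4)

# One named list instead of many variables in the global environment
vals <- setNames(
  lapply(seq_len(ncol(tTargTFS)), function(i) tTargTFS[-1, i]),
  tTargTFS[1, ]
)

vals[["A1B1"]]  # same data that assign() would have put in variable A1B1
```

Looking elements up with `vals[[name]]` replaces get(), and the whole collection can be passed to functions or iterated over with lapply, which is what Rolf means by "natural" R-ish constructs.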
[R] tcltk2 entry box
Is anyone familiar enough with the tcltk2 package to know whether it is possible to have an entry box where a user can enter information (such as a path to a file or a number) and then use the entered information downstream in an R script?

The idea is for someone unfamiliar with R to just start an R script that takes care of all the commands for them, so all they have to do is get the script started. However, there are always a couple of pieces of information that change each time the script is used (for example, a different file will be processed by the script). So I would like a way for the user to input that information as the script runs.

Matthew McCormack
Re: [R] tcltk2 entry box
Thank you very much, Greg, for the tkwait commands. I am just starting to try out examples on the SciViews web page to get a feel for tcltk in R, and tkwait.variable and tkwait.window seem like they could be very useful to me. I will add these to my practice scripts and see what I can do with them.

Matthew

On 7/9/2015 5:31 PM, Greg Snow wrote:
If you want your script to wait until a value has been entered, then you can use the tkwait.variable or tkwait.window commands to make the script wait before continuing (or you can bind the code to a button, so that you enter the value and then click on the button to run the code).

On Wed, Jul 8, 2015 at 7:58 PM, Matthew McCormack wrote:
Wow! Very nice. Thank you very much, John. This is very helpful and just what I need. Yes, I can see that I should have paid attention to tcltk before going to tcltk2.

Matthew

On 7/8/2015 8:37 PM, John Fox wrote:
Dear Matthew,

For file selection, see ?tcltk::tk_choose.files or ?tcltk::tkgetOpenFile. You could enter a number in a tk entry widget, but, depending upon the nature of the number, a slider or other widget might be a better choice. For a variety of helpful tcltk examples see <http://www.sciviews.org/_rgui/tcltk/>, originally by James Wettenhall but now maintained by Philippe Grosjean (the author of the tcltk2 package). (You probably don't need tcltk2 for the simple operations that you mention, but see ?tk2spinbox for an alternative to a slider.)

Best,

John

---
John Fox, Professor
McMaster University
Hamilton, Ontario, Canada
http://socserv.socsci.mcmaster.ca/jfox/
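A minimal sketch of the pattern Greg describes, using only the base tcltk package (the widget names, layout, and variable names here are illustrative, not taken from the thread):

```r
library(tcltk)

tt <- tktoplevel()
path_var <- tclVar("")   # will hold whatever the user types
done <- tclVar(0)        # flipped when the user clicks OK

entry <- tkentry(tt, textvariable = path_var, width = 50)
button <- tkbutton(tt, text = "OK",
                   command = function() tclvalue(done) <- 1)
tkpack(entry, button)

tkwait.variable(done)    # the script pauses here until OK is clicked
infile <- tclvalue(path_var)  # entered value, usable downstream
tkdestroy(tt)
```

After tkwait.variable returns, `infile` is an ordinary R string, so the rest of the script can pass it to read.csv or any other function.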
Re: [R] need help with excel data
Try ASAP Utilities (Home and Student edition), http://www.asap-utilities.com/index.php. When installed, it appears in Excel; use Select > Columns & Rows, and then option #18. If that is not helpful, then try DigDB, http://www.digdb.com/, but that one requires a subscription. It will also split columns. You may have to do some 'cleaning' of individual cells, such as removing leading and/or trailing spaces. A lot of this can be done with the ASAP Utilities 'Text' pull-down menu.

Matthew

On 1/21/2015 3:31 PM, Dr Polanski wrote:
Hi all!

Sorry to bother you. I am trying to learn some R via Coursera courses and other internet sources, yet I haven't managed to get far, and now I need to do some, I hope, not too difficult things which I think R can do, yet I have no idea how to make it do so.

I have a big set of empirical data, obtained by my colleagues, stored in an inconvenient way: all of the data are in two cells of an Excel table. An example of the data is in the attached file (the link):

https://drive.google.com/file/d/0B64YMbf_hh5BS2tzVE9WVmV3bFU/view?usp=sharing

The first column has a number, and the second has a whole vector (I guess), which looks like "some words in Cyrillic (the length varies)" followed by a set of numbers, "12*23 34*45" (another problem is that sometimes it is "12*23, 34*56"). There are about 3000 rows, so it is impossible to do this manually.

What I need at the end is to have the parts in separate Excel cells: what is written in words, then | 12 | 23 | 34 | 45 |.

Do you think it is possible to do this using R (or something else)? Thank you very much in advance, and sorry for asking such a stupid question; the problem is, I am trying and haven't yet even managed to install openSUSE onto my laptop, only Ubuntu! :)

Thank you very much!
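The split Dr Polanski describes can also be done in R itself. A sketch, assuming each cell is "text, then digit groups separated by `*`, spaces, or commas" as described (the sample strings and the assumption that every row yields the same number of digit groups are mine, not from the thread):

```r
# Two example cells in the shape described (Cyrillic text, then numbers)
x <- c("какие-то слова 12*23 34*45", "другие слова 12*23, 34*56")

# The text part: everything before the first digit, trimmed
words <- trimws(sub("[0-9].*$", "", x))

# The numeric part: every run of digits in the cell, in order
nums <- lapply(x, function(s)
  as.numeric(regmatches(s, gregexpr("[0-9]+", s))[[1]]))

# Assemble one row per cell; do.call(rbind, ...) assumes every cell
# has the same count of numbers, as in the | 12 | 23 | 34 | 45 | example
out <- data.frame(words, do.call(rbind, nums))
write.csv(out, "split.csv", row.names = FALSE)  # openable in Excel
```

gregexpr finds all digit runs regardless of whether they are separated by `*`, a space, or a comma, which handles both "12*23 34*45" and "12*23, 34*56".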
[R] creating a dataframe with full_join and looping over a list of lists
I have been trying to create a dataframe by looping through a list of lists, using dplyr's full_join so as to keep common elements on the same row. But I have a couple of problems: 1) the lists have different numbers of elements, and 2) in the final dataframe I would like the column names to be the names of the lists. Is this possible?

for(j in avector){
  mydf3 <- data.frame(myenter)         # Start with a list, myenter, as a dataframe; mydf3 now has 1 column.
                                       # This first column will be the longest column in the final mydf3.
  atglsts <- as.data.frame(comatgs[j]) # Loop through a list of lists, comatgs; each pass turns one list
                                       # into a one-column dataframe, atglsts.
                                       # The column name is the name of the list.
                                       # Each atglsts dataframe has a different number of elements.
  mydf3 <- full_join(mydf3, atglsts)   # What I want: add the newly made dataframe, atglsts, as a
}                                      # new column of mydf3, keeping common elements on the same row.
# I could rename the column to 'AGI' so that I can join by 'AGI',
# but then I would lose the name of the list. In the final dataframe,
# I want to know the name of the original list each column was made from.

Matthew
Re: [R] creating a dataframe with full_join and looping over a list of lists.
This is fantastic! It is exactly what I was looking for. It is part of a larger Shiny app, so it was difficult to provide a working example as part of the post, and after figuring out how your code works (I am an R novice), I made a couple of small tweaks and it works great! Thank you very much, Jim, for the work you put into this.

Matthew

On 3/21/2019 11:01 PM, Jim Lemon wrote:
External Email - Use Caution

Hi Matthew,

Remember, keep it on the list so that people know the status of the request. I couldn't get this to work with the "_source_info_" variable. It seems to be unreadable as a variable name. So, this _may_ be what you want. I don't know if it can be done with "merge", and I don't know the function "full_join".

WRKY8_colamp_a <- as.character(
 c("AT1G02920","AT1G06135","AT1G07160","AT1G11925","AT1G14540","AT1G16150",
   "AT1G21120","AT1G26380","AT1G26410","AT1G35210","AT1G49000","AT1G51920",
   "AT1G56250","AT1G66090","AT1G72520","AT1G80840","AT2G02010","AT2G18690",
   "AT2G30750","AT2G39200","AT2G43620","AT3G01830","AT3G54150","AT3G55840",
   "AT4G03460","AT4G11470","AT4G11890","AT4G14370","AT4G15417","AT4G15975",
   "AT4G31940","AT4G35180","AT5G01540","AT5G05300","AT5G11140","AT5G24110",
   "AT5G25250","AT5G36925","AT5G46295","AT5G64750","AT5G64905","AT5G66020"))

bHLH10_col_a <- as.character(c("AT1G72520","AT3G55840","AT5G20230","AT5G64750"))

bHLH10_colamp_a <- as.character(
 c("AT1G01560","AT1G02920","AT1G16420","AT1G17147","AT1G35210","AT1G51620",
   "AT1G57630","AT1G72520","AT2G18690","AT2G19190","AT2G40180","AT2G44370",
   "AT3G23250","AT3G55840","AT4G03460","AT4G04480","AT4G04540","AT4G08555",
   "AT4G11470","AT4G11890","AT4G16820","AT4G23280","AT4G35180","AT5G01540",
   "AT5G05300","AT5G20230","AT5G22530","AT5G24110","AT5G56960","AT5G57010",
   "AT5G57220","AT5G64750","AT5G66020"))

# let myenter be the sorted superset
myenter <- sort(unique(c(WRKY8_colamp_a, bHLH10_col_a, bHLH10_colamp_a)))

splice <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  newy <- rep(NA, nx)
  if (ny) {
    yi <- 1
    for (xi in 1:nx) {
      if (x[xi] == y[yi]) {
        newy[xi] <- y[yi]
        yi <- yi + 1
      }
      if (yi > ny) break()
    }
  }
  return(newy)
}

comatgs <- list(WRKY8_colamp_a = WRKY8_colamp_a,
                bHLH10_col_a = bHLH10_col_a,
                bHLH10_colamp_a = bHLH10_colamp_a)
mydf3 <- data.frame(myenter, stringsAsFactors = FALSE)
for (j in 1:length(comatgs)) {
  tmp <- data.frame(splice(myenter, sort(comatgs[[j]])))
  names(tmp) <- names(comatgs)[j]
  mydf3 <- cbind(mydf3, tmp)
}

Jim

On Fri, Mar 22, 2019 at 10:29 AM Matthew wrote:
Hi Jim,

Thanks for the reply. That was pretty dumb of me. I took that out of the loop. comatgs is longer than this, but here is a sample of 4 of its 569 elements:

$WRKY8_colamp_a
 [1] "AT1G02920" "AT1G06135" "AT1G07160" "AT1G11925" "AT1G14540" "AT1G16150" "AT1G21120"
 [8] "AT1G26380" "AT1G26410" "AT1G35210" "AT1G49000" "AT1G51920" "AT1G56250" "AT1G66090"
[15] "AT1G72520" "AT1G80840" "AT2G02010" "AT2G18690" "AT2G30750" "AT2G39200" "AT2G43620"
[22] "AT3G01830" "AT3G54150" "AT3G55840" "AT4G03460" "AT4G11470" "AT4G11890" "AT4G14370"
[29] "AT4G15417" "AT4G15975" "AT4G31940" "AT4G35180" "AT5G01540" "AT5G05300" "AT5G11140"
[36] "AT5G24110" "AT5G25250" "AT5G36925" "AT5G46295" "AT5G64750" "AT5G64905" "AT5G66020"

$`_source_info_`
character(0)

$bHLH10_col_a
[1] "AT1G72520" "AT3G55840" "AT5G20230" "AT5G64750"

$bHLH10_colamp_a
 [1] "AT1G01560" "AT1G02920" "AT1G16420" "AT1G17147" "AT1G35210" "AT1G51620" "AT1G57630"
 [8] "AT1G72520" "AT2G18690" "AT2G19190" "AT2G40180" "AT2G44370"
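Jim's splice() walks two sorted vectors in parallel. When every list in comatgs is a subset of myenter (as in the data above), the same per-column result can be sketched with %in%; this is an alternative under that assumption, not code anyone posted in the thread:

```r
# comatgs: the named list of ID vectors; myenter: the sorted superset of IDs
mydf3 <- data.frame(myenter, stringsAsFactors = FALSE)
for (nm in names(comatgs)) {
  # put the ID in rows where this list contains it, NA elsewhere;
  # assigning with [[nm]] carries the list's name onto the column
  mydf3[[nm]] <- ifelse(myenter %in% comatgs[[nm]], myenter, NA)
}
```

Because the names are used as list/column names rather than variable names, awkward entries like `_source_info_` are no problem here; an empty list simply yields an all-NA column.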
[R] working on a data frame
I am coming from the perspective of Excel and VBA scripts, but I would like to do the following in R.

I have a data frame with 14 columns and 32,795 rows. I want to check the value in column 8 (row 1) to see if it is a 0. If it is not a zero, proceed to the next row and check the value in column 8. If it is a zero, then a) change the zero to a 1, b) divide the value in column 9 (row 1) by 1, c) place the result in column 10 (row 1), and d) repeat this for each of the other 32,794 rows.

Is this possible with an R script, and is this the way to go about it? If it is, could anyone get me started?

Matthew
Re: [R] working on a data frame
Thank you for your comments, Peter. A couple of questions. Can I do something like the following?

if yourData[,8]==0, then yourData[,8]==1,
yourData[,10] <- yourData[,9]/yourData[,8]

I think I am just going to have to learn more about R. I thought getting into R would be like going from Perl to Python or Java, but it seems that R programming works differently.

Matthew

On 7/25/2014 12:06 AM, Peter Alspach wrote:

Tena koe Matthew

"Column 10 contains the result of the value in column 9 divided by the value in column 8. If the value in column 8==0, then the division can not be done, so I want to change the zero to a one in order to do the division." That being the case, think in terms of vectors, as Sarah says. Try:

yourData[,10] <- yourData[,9]/yourData[,8]
yourData[yourData[,8]==0,10] <- yourData[yourData[,8]==0,9]

This doesn't change the 0 to 1 in column 8, but it doesn't appear you actually need to do that.

HTH

Peter Alspach

-Original Message-
From: Matthew McCormack
Sent: Friday, 25 July 2014 3:16 p.m.
To: Sarah Goslee
Cc: r-help@r-project.org
Subject: Re: [R] working on a data frame

On 7/24/2014 8:52 PM, Sarah Goslee wrote:

> Hi, your description isn't clear. Row 1, or the row in which column 8 == 0?

All rows in which the value in column 8==0.

> Why do you want to divide by 1?

Column 10 contains the result of the value in column 9 divided by the value in column 8. If the value in column 8==0, then the division can not be done, so I want to change the zero to a one in order to do the division. This is a fairly standard thing to do with this data. (The data are measurements of amounts at two time points. Sometimes a thing will not be present at the beginning (0) but very present at the later time. Column 10 is the log2 of the change. Infinity is not an easy number to work with, so it is common to change the 0 to a 1. On the other hand, something may be present at time 1 but not at the later time; in this case column 10 would be taking the log2 of a number divided by 0, so again the zero is commonly changed to a one in order to get a usable value in column 10. In both of the preceding cases there was a real change, but Inf and NaN are not helpful.)

> Ditto on the row 1 question.

I want to work on all rows where column 8 (and column 9) contain a zero. Column 10 contains the result of the value in column 9 divided by the value in column 8. So, for row 1, column 10 contains the ratio of column 9 row 1 divided by column 8 row 1, and so on through the whole 32,000 or so rows. Most rows do not have a zero in columns 8 or 9; some rows have a zero in column 8 only, and some rows have a zero in column 9 only. I want to get rid of the zeros in these two columns and then do the division to get a manageable value in column 10. Division by zero and Inf are not considered 'manageable' by me.

> What do you want column 10 to be if column 8 isn't 0? Does it already have a value? I suppose it must.

Yes, column 10 does have something, but this something can be Inf or NaN, which I want to get rid of.

> Assuming you want to put the new values in the rows where column 8 == 0, you can do it in two steps:
>
> mydata[,10] <- ifelse(mydata[,8] == 0, mydata[,9]/whatever, mydata[,10]) # where whatever is the thing you want to divide by that probably isn't 1
> mydata[,8] <- ifelse(mydata[,8] == 0, 1, mydata[,8])
>
> R programming is best done by thinking about vectorizing things, rather than doing them in loops. Reading the Intro to R that comes with your installation is a good place to start.

Would it be better to change the data frame into a matrix, or something else?

Thanks for your help.

Matthew

--
Sarah Goslee
http://www.stringpage.com
http://www.sarahgoslee.com
http://www.functionaldiversity.org
Re: [R] working on a data frame
Thank you very much Peter, Bill and Petr for some great and quite elegant solutions. There is a lot I can learn from these.

Yes to your question, Bill, about the raw numbers: they are counts and they cannot be negative. The data are RNA sequencing data, where approximately 32,000 genes are being measured for changes between two conditions. Some genes are not present (cannot be measured) initially but are present in the second condition, and the reverse is also true: some genes are present initially and then not present in the second condition (these are often the most interesting genes). This makes it difficult to compare the changes of all genes mathematically, so it is common practice to change the 0's to 1's and then redo the log2. 1 is considered sufficiently small; actually anything up to 3 or 5 could be just due to 'background noise' in the measurement process, but it is somewhat arbitrary.

Matthew

On 7/28/2014 2:43 AM, PIKAL Petr wrote:
> Hi
>
> I like to use logical values directly in computations if possible.
>
> yourData[,10] <- yourData[,9]/(yourData[,8]+(yourData[,8]==0))
>
> Logical values are automagically considered FALSE=0 and TRUE=1 and can be used in computations. If you really want to change 0 to 1 in column 8 you can use
>
> yourData[,8] <- yourData[,8]+(yourData[,8]==0)
>
> without ifelse stuff.
>
> Regards
> Petr
>
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of William Dunlap
> Sent: Friday, July 25, 2014 8:07 PM
> To: Matthew
> Cc: r-help@r-project.org
> Subject: Re: [R] working on a data frame
>
>> if yourData[,8]==0, then yourData[,8]==1,
>> yourData[,10] <- yourData[,9]/yourData[,8]
>
> You could express this in R as
>
> is8Zero <- yourData[,8] == 0
> yourData[is8Zero, 8] <- 1
> yourData[is8Zero, 10] <- yourData[is8Zero,9] / yourData[is8Zero,8]
>
> Note how logical (Boolean) values are used as subscripts - read the '[' as 'such that' when using logical subscripts.
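Petr's logical-arithmetic trick can be tried on a small made-up data frame (the numbers and column layout below are invented for illustration, not taken from the real data):

```r
# Minimal sketch of the logical-arithmetic trick on toy data.
yourData <- data.frame(matrix(0, nrow = 4, ncol = 10))
yourData[, 8] <- c(2, 0, 4, 0)   # denominator counts, some zero
yourData[, 9] <- c(6, 5, 0, 8)   # numerator counts

# (yourData[,8] == 0) is TRUE/FALSE, which arithmetic treats as 1/0,
# so zeros in the denominator are bumped to 1 before dividing.
yourData[, 10] <- yourData[, 9] / (yourData[, 8] + (yourData[, 8] == 0))
yourData[, 10]
# 3 5 0 8
```

The appeal of this form is that it is a single vectorized expression: no loop, no ifelse(), and the original zeros in column 8 are left untouched unless you explicitly overwrite them.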
There are many more ways to express the same thing. (I am tempted to change the algorithm to avoid the divide-by-zero problem by making the quotient (numerator + epsilon)/(denominator + epsilon), where epsilon is a very small number. I am assuming that the raw numbers are counts, or at least cannot be negative.)

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, Jul 25, 2014 at 10:44 AM, Matthew wrote:
> Thank you for your comments, Peter. A couple of questions.
>
> Can I do something like the following?
>
> if yourData[,8]==0, then yourData[,8]==1,
> yourData[,10] <- yourData[,9]/yourData[,8]
>
> I think I am just going to have to learn more about R. I thought getting into R would be like going from Perl to Python or Java etc., but it seems like R programming works differently.
>
> Matthew
>
> On 7/25/2014 12:06 AM, Peter Alspach wrote:
>> Tena koe Matthew
>>
>> "Column 10 contains the result of the value in column 9 divided by the value in column 8. If the value in column 8==0, then the division cannot be done, so I want to change the zero to a one in order to do the division."
>>
>> That being the case, think in terms of vectors, as Sarah says. Try:
>>
>> yourData[,10] <- yourData[,9]/yourData[,8]
>> yourData[yourData[,8]==0,10] <- yourData[yourData[,8]==0,9]
>>
>> This doesn't change the 0 to 1 in column 8, but it doesn't appear you actually need to do that.
>>
>> HTH
>> Peter Alspach
>>
>> -----Original Message-----
>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Matthew McCormack
>> Sent: Friday, 25 July 2014 3:16 p.m.
>> To: Sarah Goslee
>> Cc: r-help@r-project.org
>> Subject: Re: [R] working on a data frame
>>
>> On 7/24/2014 8:52 PM, Sarah Goslee wrote:
>>> Hi,
>>>
>>> Your description isn't clear:
>>>
>>> On Thursday, July 24, 2014, Matthew wrote:
>>>> I am coming from the perspective of Excel and VBA scripts, but I would like to do the following in R.
>>>>
>>>> I have a data frame with 14 columns and 32,795 rows.
>>>>
>>>> I want to check the value in column 8 (row 1) to see if it is a 0. If it is not a zero, proceed to the next row and check the value for column 8. If it is a zero, then a) change the zero to a 1, b) divide the value in column 9 (row 1) by 1,
>>>
>>> Row 1, or the row in which column 8 == 0?
>>
>> All rows in which the value in column 8==0.
>>
>>> Why do you want to divide by 1?
>>
>> Column 10 contains the result of the value in column 9 divided by the value in column 8. If the value in column 8==0, then the division cannot be done, so I want to change the zero to a one in order to do the division.
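Bill's pseudocount suggestion can be sketched in the same style; `eps`, the column layout, and the numbers are illustrative assumptions, not code from the thread:

```r
# Hypothetical example of the (numerator + eps)/(denominator + eps) approach.
yourData <- data.frame(matrix(0, nrow = 3, ncol = 10))
yourData[, 8] <- c(10, 0, 5)    # counts at time 1
yourData[, 9] <- c(20, 8, 0)    # counts at time 2

eps <- 0.5  # small pseudocount; the exact choice is arbitrary
yourData[, 10] <- log2((yourData[, 9] + eps) / (yourData[, 8] + eps))
yourData[, 10]  # finite values even where a raw count is zero
```

Because both counts get the same offset, a gene with identical counts at both time points still maps to log2(1) = 0, which is one reason this variant is sometimes preferred over replacing zeros with ones.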
[R] find the data frames in list of objects and make a list of them
Hi everyone,

I would like to find which objects are data frames among all the objects I have created (in other words, in what you get when you type ls()), and then I would like to make a list of these data frames.

Explained in other words: after typing ls(), you get the names of objects. Which objects are data frames? How do you then make a list of these data frames?

A second question: is this the best way to make a list of data frames without having to manually type c(dataframe1, dataframe2, ...)?

Matthew
Re: [R] find the data frames in list of objects and make a list of them
Jim,

Wow, that was cool! This function is *really* useful. Thank you very much! (It is also way beyond my capability.)

I need to make a list of data frames because then I am going to bind them with dplyr using 'dplyr::rbind_all(listOfDataFrames)'. This will make a single data frame, and from that single data frame I can make a heat map of all the data. For example, when I use your fantastic function, my.ls(), I get:

my.ls()
                                                             Size      Class Length     Dim
.Random.seed                                                2,544    integer    626
cpl                                                        28,664  character    512
filenames                                                   2,120  character     19
filepath                                                      216  character      1
i                                                             152  character      1
Mer7_1-1_160-226A_1_gene_exp_diff_filt_hc_log2.txt         81,152 data.frame      3  529 x 3
Mer7_1-1_Mer7_1-2_gene_exp_diff_filt_hc_log2.txt           31,624 data.frame      3  199 x 3
Mer7_1-1_S150-160-226A_1_gene_exp_diff_filt_hc_log2.txt    81,152 data.frame      3  529 x 3
Mer7_1-1_W29_1_gene_exp_diff_filt_hc_log2.txt             129,376 data.frame      3  849 x 3
Mer7_1-1_W29_S150-226A_1_gene_exp_diff_filt_hc_log2.txt   126,816 data.frame      3  835 x 3
Mer7_1-1_W29_S160-162A_1_gene_exp_diff_filt_hc_log2.txt    82,792 data.frame      3  537 x 3
Mer7_1-1_W29_S226A_1_gene_exp_diff_filt_hc_log2.txt       115,008 data.frame      3  756 x 3
Mer7_1-2_160-226A_1_gene_exp_diff_filt_hc_log2.txt         79,936 data.frame      3  519 x 3
Mer7_1-2_S150-160-226A_1_gene_exp_diff_filt_hc_log2.txt    84,512 data.frame      3  548 x 3
Mer7_1-2_W29_1_gene_exp_diff_filt_hc_log2.txt             130,568 data.frame      3  857 x 3
Mer7_1-2_W29_S160-162A_1_gene_exp_diff_filt_hc_log2.txt    83,768 data.frame      3  542 x 3
Mer7_1-2_W29_S226A_1_gene_exp_diff_filt_hc_log2.txt       119,008 data.frame      3  783 x 3
Mer7_2-1_160-226A_2_gene_exp_diff_filt_hc_log2.txt        105,344 data.frame      3  685 x 3
Mer7_2-1_Mer7_2-2_gene_exp_diff_filt_hc_log2.txt           26,216 data.frame      3  166 x 3
Mer7_2-1_S150-160-226A_2_gene_exp_diff_filt_hc_log2.txt   106,368 data.frame      3  693 x 3
Mer7_2-1_W29_2_gene_exp_diff_filt_hc_log2.txt             160,200 data.frame      3 1053 x 3
Mer7_2-1_W29_S150-226A_2_gene_exp_diff_filt_hc_log2.txt   152,696 data.frame      3 1005 x 3
Mer7_2-1_W29_S160-162A_2_gene_exp_diff_filt_hc_log2.txt   113,992 data.frame      3  743 x 3
Mer7_2-1_W29_S226A_2_gene_exp_diff_filt_hc_log2.txt       138,944 data.frame      3  914 x 3
my.ls                                                      35,624   function      1
myfiles                                                     2,120  character     19
names                                                       2,424       list     19
test                                                          680  character      5
whatisthis                                                  2,424       list     19
**Total                                                 2,026,440        ---    ---     ---

What I need is to make the list of data frames for the dplyr command, dplyr::rbind_all(listOfDataFrames). Ideally, this would also be a specific subset of all the data frames, say the data frames with W29 in the name. This is something we, our lab, would be doing routinely and at various times of the day, so I want to automate the process so it does not need anyone to manually sit at the computer and type the list of data frames.

Matthew

On 8/13/2014 3:06 PM, jim holtman wrote:
> Here is a function that I use that might give you the results you want:
>
> =
>> my.ls()
>                        Size      Class Length       Dim
> .Random.seed          2,544    integer    626
> .remapHeaderFile     40,440 data.frame      2   373 x 2
> colID                   216  character      3
> delDate                 104  character      1
> deliv                15,752 data.table      7   164 x 7
> f_drawPallet         36,896   function      1
> i                        96  character      1
> indx                168,816  character   1782
> pallet              172,696 data.table      3  1782 x 3
> pallets             405,736 data.table     14 1782 x 14
> picks            26,572,856 data.table     19 154247 x 19
> wb                      656   Workbook      1
> wSplit           68,043,136       list   1782
> x                        56    numeric      2
> **Total          95,460,000        ---    ---       ---
>
>> my.ls
> function (pos = 1, sorted = FALSE, envir = as.environment(pos))
> {
>     .result
Re: [R] find the data frames in list of objects and make a list of them
Hi Richard,

Thank you very much for your reply and your code. Your code is doing just what I asked for, but does not seem to be what I need. I will need to review some basic R before I can continue.

I am trying to list data frames in order to bind them into one single data frame with something like dplyr::rbind_all(list of data frames), but when I try dplyr::rbind_all(lsDataFrame(ls())), I get the error: object at index 1 not a data.frame. So, I am going to have to learn some more about lists in R before proceeding.

Thank you for your help and code.

Matthew

On 8/13/2014 3:12 PM, Richard M. Heiberger wrote:
> I would do something like this
>
> lsDataFrame <- function(xx=ls()) xx[sapply(xx, function(x) is.data.frame(get(x)))]
>
> ls("package:datasets")
> lsDataFrame(ls("package:datasets"))
Re: [R] find the data frames in list of objects and make a list of them
Thank you very much, Bill! It has taken me a while to figure out, but yes, what I need is a list (the R object, list) of data frames and not a character vector containing the names of the data frames. This works well and is getting me in the direction I want to go.

Matthew

On 8/13/2014 7:40 PM, William Dunlap wrote:
> Previously you asked
>> A second question: is this the best way to make a list of data frames without having to manually type c(dataframe1, dataframe2, ...)?
>
> If you use 'c' there you will not get a list of data.frames - you will get a list of all the columns in the data.frames you supplied. Use 'list' instead of 'c' if you are taking that route.
>
> The *apply functions are helpful here. To make a list of all data.frames in an environment you can use the following function, which takes the environment to search as an argument.
>
> f <- function(envir = globalenv()) {
>     tmp <- eapply(envir,
>                   all.names=TRUE,
>                   FUN=function(obj) if (is.data.frame(obj)) obj else NULL)
>     # remove NULL's now
>     tmp[!vapply(tmp, is.null, TRUE)]
> }
>
> Use it as
>     allDataFrames <- f(globalenv())  # or just f()
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
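Putting the pieces of this thread together, here is a self-contained sketch of collecting data frames by name and row-binding them. The object names are invented, and base `do.call(rbind, ...)` stands in for `dplyr::rbind_all` so the example has no package dependency (in current dplyr, `bind_rows()` is the corresponding function):

```r
# Run at the top level: collect every data.frame in the current workspace
# into a named list, optionally subset by name, then row-bind them.
df_a     <- data.frame(x = 1:2, y = c("a", "b"))   # invented example objects
df_W29_b <- data.frame(x = 3:4, y = c("c", "d"))
not_a_df <- 1:10

dfs <- Filter(is.data.frame, mget(ls()))   # named list of only the data frames
w29 <- dfs[grepl("W29", names(dfs))]       # subset whose names contain "W29"
combined <- do.call(rbind, dfs)            # one data frame with all the rows
```

`do.call(rbind, ...)` requires the data frames to share column names; for ragged inputs, `dplyr::bind_rows()` fills missing columns with NA instead of failing.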
[R] change default installation of R
I have R version 2.15.0 installed in /usr/local/bin, and this is the default; in other words, when I type 'which R' this is the path I get. I have also installed R into /usr/local/R-3.1.1/. I used ./configure and then make to install this version. After make, I get the following error messages:

../unix/sys-std.o: In function `initialize_rlcompletion':
/usr/local/R-3.1.1/src/unix/sys-std.c:689: undefined reference to `rl_sort_completion_matches'
collect2: ld returned 1 exit status
make[3]: *** [R.bin] Error 1
make[3]: Leaving directory `/usr/local/R-3.1.1/src/main'
make[2]: *** [R] Error 2
make[2]: Leaving directory `/usr/local/R-3.1.1/src/main'
make[1]: *** [R] Error 1
make[1]: Leaving directory `/usr/local/R-3.1.1/src'
make: *** [R] Error 1

I want to make R-3.1.1 the default, so that when I type 'which R', I get /usr/local/R-3.1.1. To do this I first cd'd into /usr/local/bin and renamed R to R-old_10-30-14, then created a symlink with 'ln -s /usr/local/R-3.1.1/bin R', but when I type 'which R', I get 'no R in ...', where '...' is my PATH variable. If I remove the symlink and then create another one with 'ln -s /usr/local/R-3.1.1/bin/R R', then after typing 'which R', I get:

/usr/local/bin/R: line 259: /usr/local/R-3.1.1/bin/exec/R: No such file or directory
/usr/local/bin/R: line 259: exec: /usr/local/R-3.1.1/bin/exec/R: cannot execute: No such file or directory

This is the same message I get if I just type /usr/local/R-3.1.1/bin/R at the command line.

Matthew
[R] transpose and split dataframe
I have a data frame that is a lot bigger, but for simplicity's sake we can say it looks like this:

Regulator   hits
AT1G69490   AT4G31950,AT5G24110,AT1G26380,AT1G05675
AT2G55980   AT2G85403,AT4G89223

In other words:

data.frame: 2 obs. of 2 variables
 $ Regulator: Factor w/ 2 levels
 $ hits     : Factor w/ 6 levels

I want to transpose it so that Regulator supplies the column headings and each of the AGI numbers now separated by commas is a row. So, AT1G69490 is now the header of the first column, and AT4G31950 is row 1 of column 1, AT5G24110 is row 2 of column 1, etc.; AT2G55980 is the header of column 2, and AT2G85403 is row 1 of column 2, etc.

I have tried playing around with strsplit(TF2list[2:2]) and strsplit(as.character(TF2list[2:2])), but I am getting nowhere.

Matthew
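For the simplified two-row example above, one strsplit-based sketch (this anticipates the fuller solutions later in the thread) is:

```r
# Sketch using the simplified two-row example from the post; the data
# frame is rebuilt here so the snippet is self-contained.
TF2list <- data.frame(
  Regulator = c("AT1G69490", "AT2G55980"),
  hits = c("AT4G31950,AT5G24110,AT1G26380,AT1G05675",
           "AT2G85403,AT4G89223"),
  stringsAsFactors = TRUE)

# Split each comma-separated 'hits' string into a character vector ...
hitsplit <- strsplit(as.character(TF2list$hits), ",")
names(hitsplit) <- as.character(TF2list$Regulator)

# ... then pad the shorter vectors with NA so they can share a data frame.
n <- max(lengths(hitsplit))
out <- data.frame(lapply(hitsplit, function(x) c(x, rep(NA, n - length(x)))))
out
#   AT1G69490 AT2G55980
# 1 AT4G31950 AT2G85403
# 2 AT5G24110 AT4G89223
# 3 AT1G26380      <NA>
# 4 AT1G05675      <NA>
```

The key point is that `strsplit()` must be given a character vector, not a one-column data frame, which is why `TF2list[2:2]` on its own gets nowhere.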
[R] Fwd: Re: transpose and split dataframe
Thanks for your reply. I was trying to simplify it a little, but must have got it wrong. Here is the real dataframe, TF2list:

> str(TF2list)
'data.frame': 152 obs. of 2 variables:
 $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54 54 82 82 82 82 82 ...
 $ hits     : Factor w/ 97 levels "AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"| __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...

And the first few lines resulting from dput(head(TF2list)):

> dput(head(TF2list))
structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L, 82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380", "AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300", "AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600", "AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ...

This is another way of looking at the first 4 entries (Regulator is tab-separated from hits):

Regulator hits
1 AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
2 AT1G29860
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135 3 AT1G2986 AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830 So, the goal would be to first: Transpose the existing dataframe so that the factor Regulator becomes a column name (column 1 name = AT1G69490, column2 name AT1G29860, etc.) and the hits associated with each Regulator become rows. Hits is a comma separated 'list' ( I do not not know if technically it is an R list.), so it would have to be comma 'unseparated' with each entry becoming a row (col 1 row 1 = AT4G31950, col 1 row 2 - AT5G24410, etc); like this : AT1G69490 AT4G31950 AT5G24110 AT1G05675 AT5G64905 ... 
(I did not include all the rows.)

I think it would be best to actually make the first entry a separate dataframe (one column with name = AT1G69490 and a number of rows depending on the number of hits), then make the second column (column name = AT1G29860, and a number of rows depending on the number of hits) into a new dataframe and do a full join of the two dataframes; continue by making the third column (column name = AT1G2986) into a dataframe and full join it with the previous; and continue for the 152 observations, so that the end result is a dataframe with 152 columns and a number of rows depending on the entry with the greatest number of hits. The full joins I can do with dplyr, but getting up to that point seems rather difficult.

This would get me to my ultimate goal: each Regulator is a column name (152 columns) and a given row has either NA or the same hit. This seems very difficult to me, but I appreciate any attempt.

Matthew

On 4/30/2019 4:34 PM, David L Carlson wrote:
> External Email - Use Caution
>
> I think we need more information. Can you give us the structure of the data with str(YourDataFrame)? Alternatively you could copy a small piece into your email message by copying and pasting the results of the following code:
>
> dput(head(YourDataFrame))
>
> The data frame you present could not be a data frame since you say "hits" is a factor with a variable number of elements. If each value of "hits" was a single character string, it would only have 2 factor levels, not 6, and your efforts to parse the string would make more sense. Transposing to a data frame would only be possi
Re: [R] Fwd: Re: transpose and split dataframe
Thank you very much, David and Jim, for your work and solutions. I have been working through both of them to better learn R. They both proceed through a similar logic, except that David's starts with a character matrix and Jim's with a dataframe, and both end with equivalent dataframes (identical(tmmdf, TF2list2) returns TRUE). They have both been very helpful.

However, there is one attribute of my intended final dataframe that is missing. Looking at part of the final dataframe:

head(tmmdf)
  AT1G69490 AT1G29860 AT1G29860.1 AT4G18170 AT4G18170.1 AT5G46350
1 AT4G31950 AT4G31950   AT5G64905 AT4G31950   AT5G64905 AT4G31950
2 AT5G24110 AT5G24110   AT1G21120 AT5G24110   AT1G14540 AT5G24110
3 AT1G26380 AT1G05675   AT1G07160 AT1G05675   AT1G21120 AT1G05675

Row 1 has AT4G31950 in columns 1, 2, 4 and 6, but in columns 3 and 5 AT4G31950 turns up in a different row. What I was aiming at is that each row would have a unique entry, so that AT4G31950 is in row 1 of columns 1, 2, 4 and 6 and NA is in row 1 of columns 3 and 5, while AT4G31950 is in row 2 of columns 3 and 5 and NA is in row 2 of columns 1, 2, 4 and 6. So, it would look like this:

head(intended_df)
  AT1G69490 AT1G29860 AT1G29860.1 AT4G18170 AT4G18170.1 AT5G46350
1 AT4G31950 AT4G31950        NA   AT4G31950        NA   AT4G31950
2      NA        NA    AT4G31950        NA   AT4G31950        NA

I have been trying to adjust the code to get my intended result, basically by trying to build a dataframe one column at a time from each entry in the character matrix, but have not got anything near working yet.

Matthew

On 4/30/2019 6:29 PM, David L Carlson wrote:
> If you read the data frame with read.csv() or one of the other read() functions, use the as.is=TRUE argument to prevent conversion to factors. If not, do the conversion first:
>
> # Convert factors to characters
> DataMatrix <- sapply(TF2list, as.character)
> # Split the vector of hits
> DataList <- sapply(DataMatrix[, 2], strsplit, split=",")
> # Use the values in Regulator to name the parts of the list
> names(DataList) <- DataMatrix[,"Regulator"]
>
> # Now create a data frame
> # How long is the longest list of hits?
> mx <- max(sapply(DataList, length))
> # Now add NAs to vectors shorter than mx
> DataList2 <- lapply(DataList, function(x) c(x, rep(NA, mx-length(x))))
> # Finally convert back to a data frame
> TF2list2 <- do.call(data.frame, DataList2)
>
> Try this on a portion of the list, say 25 lines, and print each object to see what is happening.
>
> David L Carlson
> Department of Anthropology
> Texas A&M University
> College Station, TX 77843-4352
Re: [R] Fwd: Re: transpose and split dataframe
Thank you very much, Jim and David, for your scripts and accompanying explanations.

I was intrigued at the results that came from David's script. As seen below, where I have taken a small piece of his DataTable:

          AT1G69490 AT1G29860 AT4G18170 AT5G46350
AT1G01560         0         0         0         1
AT1G02920         1         2         2         4
AT1G02930         1         2         2         4
AT1G05675         1         1         1         2

there are numbers other than 1 or 0, which was not what I was expecting. The data I am working with come from downloading results of an analysis done at a particular web site. I looked at Jim's solution, and the equivalent of the above would be:

          AT1G69490 AT1G29860 AT1G29860 AT4G18170 AT4G18170 AT5G46350 AT5G46350 AT5G46350 AT5G46350 AT5G46350
AT1G01560        NA        NA        NA        NA        NA        NA        NA        NA AT1G01560        NA
AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920 AT1G02920        NA
AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930 AT1G02930        NA
AT1G05675 AT1G05675 AT1G05675        NA AT1G05675        NA AT1G05675 AT1G05675        NA        NA        NA

The above is the format that I was desiring, but I was not expecting that a single ATG number would be the name of multiple columns. As shown above, AT1G29860 is the name of two columns and AT5G46350 is the name of 5 columns. When a single ATG number, such as AT5G46350, names multiple columns, the contents of each of those columns may or may not be the same.

For example, going across a single row looking at AT1G02920: it occurs in the first column, hence the 1 in David's DataTable. It occurs in both AT1G29860 columns, hence the 2 in the DataTable. It again occurs in both AT4G18170 columns, so another 2 in the DataTable, and finally it occurs in only 4 of the 5 AT5G46350 columns, so the 4 in the DataTable. When the same ATG number names multiple columns, it is because different methods were used to determine the content of each column.

So, if an ATG number such as AT1G05675 occurs in all columns with the same name, I then know that this association was shown by multiple methods, and if it only occurs in some of the columns, I know that not all methods associated it with the column-name ATG. David's result complements Jim's, and both end up being very helpful to me.

Thanks again to both of you for your time and help.

Matthew

On 5/2/2019 8:40 PM, Jim Lemon wrote:
> External Email - Use Caution
>
> Hi again,
> Just noticed that the NA fill in the original solution is unnecessary, thus:
>
> # split the second column at the commas
> hitsplit<-strsplit(mmdf$hits,",")
> # get all the sorted hits
> allhits<-sort(unique(unlist(hitsplit)))
> tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
> # change the names of the list
> names(tmmdf)<-mmdf$Regulator
> for(column in 1:length(hitsplit)) {
>   hitmatches<-match(hitsplit[[column]],allhits)
>   hitmatches<-hitmatches[!is.na(hitmatches)]
>   tmmdf[hitmatches,column]<-allhits[hitmatches]
> }
>
> Jim
>
> On Fri, May 3, 2019 at 10:32 AM Jim Lemon wrote:
>> Hi Matthew,
>> I'm not sure whether you want something like your initial request or David's solution.
>> The result of this can be transformed into the latter:
>>
>> mmdf<-read.table(text="Regulator hits
>> AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
>> AT1G29860 AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
>> AT1G2986 AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G
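Jim's row-alignment idea can be reduced to a self-contained sketch on toy data (the split lists below are made up, standing in for the result of `strsplit(mmdf$hits, ",")` on the real data):

```r
# Toy version of the approach: each distinct hit gets its own fixed row,
# so the same identifier always lands in the same row of every column.
hitsplit <- list(AT1G69490 = c("AT4G31950", "AT5G24110"),
                 AT1G29860 = c("AT5G24110", "AT1G05675"))

allhits <- sort(unique(unlist(hitsplit)))   # one row per distinct hit
tmmdf <- as.data.frame(matrix(NA_character_,
                              nrow = length(allhits),
                              ncol = length(hitsplit)))
names(tmmdf) <- names(hitsplit)
rownames(tmmdf) <- allhits

for (column in seq_along(hitsplit)) {
  hitmatches <- match(hitsplit[[column]], allhits)
  tmmdf[hitmatches, column] <- allhits[hitmatches]
}
tmmdf
#           AT1G69490 AT1G29860
# AT1G05675      <NA> AT1G05675
# AT4G31950 AT4G31950      <NA>
# AT5G24110 AT5G24110 AT5G24110
```

Because `match()` indexes into the fixed, sorted `allhits` vector, a hit shared by two regulators appears in the same row of both columns, which is the "unique entry per row" property Matthew asked for.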
Re: [R] Contour lines in a persp plot
Thanks a lot, that is all I want. If someone is interested, see the code below.

library(lattice)  # added: wireframe(), trellis.par.get(), etc. come from lattice

panel.3d.contour <- function(x, y, z, rot.mat, distance, nlevels = 20,
                             zlim.scaled, ...)
# the three dots pass along the remaining parameters, which keep their defaults
{
    add.line <- trellis.par.get("add.line")
    panel.3dwire(x, y, z, rot.mat, distance, zlim.scaled = zlim.scaled, ...)
    clines <- contourLines(x, y, matrix(z, nrow = length(x), byrow = TRUE),
                           nlevels = nlevels)
    for (ll in clines) {
        m <- ltransform3dto3d(rbind(ll$x, ll$y, zlim.scaled[2]), rot.mat,
                              distance)
        panel.lines(m[1,], m[2,], col = add.line$col, lty = add.line$lty,
                    lwd = add.line$lwd)
    }
}

fn <- function(x,y){sin(x)+2*y}  # this looks like a corrugated tin roof
x <- seq(from=1,to=100,by=2)     # generates a list of x values to sample
y <- seq(from=1,to=100,by=2)     # generates a list of y values to sample
z <- outer(x,y,FUN=fn)           # applies the funct. across the combos of x and y

wireframe(z, zlim = c(1, 300), nlevels = 10, aspect = c(1, 0.5),
          panel.aspect = 0.6, panel.3d.wireframe = panel.3d.contour,
          shade = FALSE, screen = list(z = 20, x = -60))

--
View this message in context: http://r.789695.n4.nabble.com/Contour-lines-in-a-persp-plot-tp4667220p4667309.html
Sent from the R help mailing list archive at Nabble.com.
[R] Save intermediate result in a same file
Hello everybody, I have to save a 100-iteration computation to a file every 5 iterations until the end. I first create a vector A of 100 elements for the 100 iterations, and I want to update A in the file every 5 iterations. I use "save" but it doesn't work. Does someone have an idea? I need help. Cheers. -- View this message in context: http://r.789695.n4.nabble.com/Save-intermediate-result-in-a-same-file-tp4677350.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
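One way to do this with save() — a minimal sketch, assuming the results accumulate in a pre-allocated vector A and the checkpoint file is simply overwritten each time (the file name is illustrative). A common reason save() "doesn't work" is calling it without an explicit file= argument:

```r
# Pre-allocate the results vector for the 100 iterations
A <- numeric(100)

for (i in 1:100) {
  A[i] <- i^2  # placeholder for the real computation

  # Every 5th iteration, rewrite the checkpoint file with the current A;
  # save() needs the object name(s) and an explicit file= argument
  if (i %% 5 == 0) {
    save(A, file = "checkpoint.RData")
  }
}

# Later (or after a crash), restore A with:
# load("checkpoint.RData")
```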
[R] fwrite() not found in data.table package
Hi all, I used to use the fwrite() function in data.table but I cannot get it to work now. The function is not in the data.table package, even though a help page exists for it. My session info is below. Any ideas on how to get fwrite() to work would be much appreciated. Thanks!

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8
 [8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] tools_3.2.0 chron_2.3-47 tcltk_3.2.0

-- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] fwrite() not found in data.table package
Thanks Jeff! It turns out that my problem was that I tried to install the newest data.table package while the old data.table package was loaded in R. Full instructions for installing data.table are here: https://github.com/Rdatatable/data.table/wiki/Installation On Mon, Oct 2, 2017 at 10:55 AM, Jeff Newmiller wrote: > You are asking about (a) a contributed package (b) for a package version > that is not in CRAN and (c) an R version that is outdated, which stretches > the definition of "on topic" here. Since that function does not appear to > have been removed from that package (I am not installing a development > version to test if it is broken for your benefit), I will throw out a guess > that if you update R to 3.4.1 or 3.4.2 then things might start working. If > not, I suggest you use the CRAN version of the package and create a > reproducible example (check it with package reprex) and try again here, or > ask one of the maintainers of that package. > -- > Sent from my phone. Please excuse my brevity. > > On October 2, 2017 8:56:46 AM PDT, Matthew Keller > wrote: > >Hi all, > > > >I used to use fwrite() function in data.table but I cannot get it to > >work > >now. The function is not in the data.table package, even though a help > >page > >exists for it. My session info is below. Any ideas on how to get > >fwrite() > >to work would be much appreciated. Thanks! 
> > > >> sessionInfo() > >R version 3.2.0 (2015-04-16) > >Platform: x86_64-unknown-linux-gnu (64-bit) > >Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago) > > > >locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 > >LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 > >LC_PAPER=en_US.UTF-8 > > [8] LC_NAME=C LC_ADDRESS=C > >LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 > >LC_IDENTIFICATION=C > > > >attached base packages: > >[1] stats graphics grDevices utils datasets methods base > > > >other attached packages: > >[1] data.table_1.10.5 > > > >loaded via a namespace (and not attached): > >[1] tools_3.2.0 chron_2.3-47 tcltk_3.2.0 > -- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
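A quick diagnostic, after restarting R so that no stale copy of data.table is still loaded: check the installed version and whether it actually exports fwrite() (fwrite() first appeared in the 1.9.8 development series, so any recent CRAN release provides it):

```r
library(data.table)
packageVersion("data.table")
"fwrite" %in% getNamespaceExports("data.table")  # TRUE when the installed build has it
```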
Re: [R] tcltk2 entry box
Wow ! Very nice. Thank you very much, John. This is very helpful and just what I need. Yes, I can see that I should have paid attention to tcltk before going to tcltk2. Matthew On 7/8/2015 8:37 PM, John Fox wrote: Dear Matthew, For file selection, see ?tcltk::tk_choose.files or ?tcltk::tkgetOpenFile . You could enter a number in a tk entry widget, but, depending upon the nature of the number, a slider or other widget might be a better choice. For a variety of helpful tcltk examples see <http://www.sciviews.org/_rgui/tcltk/>, originally by James Wettenhall but now maintained by Philippe Grosjean (the author of the tcltk2 package). (You probably don't need tcltk2 for the simple operations that you mention, but see ?tk2spinbox for an alternative to a slider.) Best, John --- John Fox, Professor McMaster University Hamilton, Ontario, Canada http://socserv.socsci.mcmaster.ca/jfox/ -Original Message- From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Matthew Sent: July-08-15 8:01 PM To: r-help Subject: [R] tcltk2 entry box Is anyone familiar enough with the tcltk2 package to know if it is possible to have an entry box where a user can enter information (such as a path to a file or a number) and then be able to use the entered information downstream in a R script ? The idea is for someone unfamiliar with R to just start an R script that would take care of all the commands for them so all they have to do is get the script started. However, there is always a couple of pieces of information that will change each time the script is used (for example, a different file will be processed by the script). So, I would like a way for the user to input that information as the script ran. Matthew McCormack __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code. 
--- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
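Following John Fox's pointers, a minimal sketch of collecting user input at the top of a script — tk_choose.files() for a file path plus a simple entry widget for a number (the widget layout and variable names are illustrative):

```r
library(tcltk)

# Native file-chooser dialog; returns the selected path(s) as a character vector
path <- tk_choose.files(caption = "Select the file to process")

# A small dialog with an entry box for a numeric parameter
tt <- tktoplevel()
num <- tclVar("10")  # default value shown in the box
tkgrid(tklabel(tt, text = "Threshold:"), tkentry(tt, textvariable = num))
tkgrid(tkbutton(tt, text = "OK", command = function() tkdestroy(tt)))
tkwait.window(tt)    # block until the user clicks OK

threshold <- as.numeric(tclvalue(num))  # usable downstream in the script
dat <- read.csv(path)                   # e.g. process the chosen file
```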
[R] setting up R -- VM Fusion, WIndows7
Hi, As i need R to speak to Bloomberg (and big only runs on windows), i'm running windows 7 via VM Fusion on my mac. I think i am having permission problems, as i cannot use install.packages, and cannot change .libPaths via either a .Rprofile, or Profile.site. I've posted more detail in this super-user question -- http://superuser.com/questions/948083/how-to-set-environment-variables-in-vm-fusion-windows-7 Throwing it over to this list as well, as I've spent about half the time i had allowed for my project on (not getting) set up. I realise this is a very niche problem - hoping that someone else has had a similar problem, and can offer pointers. best mj [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
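For the library-path half of the problem, one hedged workaround is to point .libPaths() at a directory the Windows account can definitely write to, from a startup file such as ~/.Rprofile (the path below is hypothetical):

```r
# In ~/.Rprofile (or Rprofile.site, if etc/ is writable):
userlib <- "C:/Users/yourname/Documents/R/library"  # hypothetical writable location
if (!dir.exists(userlib)) dir.create(userlib, recursive = TRUE)
.libPaths(c(userlib, .libPaths()))

# install.packages() then defaults to the first (writable) library, or force it:
# install.packages("zoo", lib = userlib)
```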
[R] simple question - mean of a row of a data.frame
Hi all, Simple question I should know: I'm unclear on the logic of why the sum of a row of a data.frame returns a valid sum but the mean of a row of a data.frame returns NA: sum(rock[2,]) [1] 10901.05 mean(rock[2,],trim=0) [1] NA Warning message: In mean.default(rock[2, ], trim = 0) : argument is not numeric or logical: returning NA I get that rock[2,] is itself a data.frame of mode list, but why the inconsistency between functions? How can you figure this out from, e.g., ?mean ?sum Thanks in advance, Matt -- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
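The asymmetry comes from method dispatch (a sketch of the usual explanation): sum() belongs to the Summary group generic, for which a data.frame method exists, while mean() falls through to mean.default(), which insists on a single numeric vector. Two ways around it:

```r
mean(unlist(rock[2, ]))  # flatten the one-row data frame into a numeric vector
rowMeans(rock[2, ])      # or use the dedicated row-wise helper
```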
[R] Defining Variables from a Matrix for 10-Fold Cross Validation
Good afternoon, I am trying to run a 10-fold CV, using a matrix as my data set. Essentially, I want "y" to be the first column of the matrix, and my "x" to be all remaining columns (2-257). I've posted some of the code I used below, and the data set (called "zip.train") is in the "ElemStatLearn" package. The error message is highlighted in red, and the corresponding section of code is bolded. (I am not concerned with the warning message, just the error message). The issue I am experiencing is the error message below the code: I haven't come across that specific message before, and am not exactly sure how to interpret its meaning. What exactly is this error message trying to tell me? Any suggestions or insights are appreciated! Thank you all, Matthew Campbell > library (ElemStatLearn) > library(kknn) > data(zip.train) > train=zip.train[which(zip.train[,1] %in% c(2,3)),] > test=zip.test[which(zip.test[,1] %in% c(2,3)),] > nfold = 10 > infold = sample(rep(1:10, length.out = (x))) Warning message: In rep(1:10, length.out = (x)) : first element used of 'length.out' argument > *> mydata = data.frame(x = train[ , c(2,257)] , y = train[ , 1])* > > K = 20 > errorMatrix = matrix(NA, K, 10) > > for (l in nfold) + { + for (k in 1:20) + { + knn.fit = kknn(y ~ x, train = mydata[infold != l, ], test = mydata[infold == l, ], k = k) + errorMatrix[k, l] = mean((knn.fit$fitted.values - mydata$y[infold == l])^2) + } + } Error in model.frame.default(formula, data = train) : variable lengths differ (found for 'x') [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
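Three slips in the posted code seem to line up with the error: train[, c(2,257)] keeps only two predictor columns (2:257 keeps them all); length.out = (x) refers to a leftover object x in the workspace rather than the number of rows; and the formula y ~ x then finds that stray x (of a different length) instead of the data-frame columns, which is exactly what "variable lengths differ (found for 'x')" reports. A corrected sketch (my reconstruction, not tested against the original data):

```r
library(ElemStatLearn)  # provides zip.train / zip.test
library(kknn)

data(zip.train)
train <- zip.train[zip.train[, 1] %in% c(2, 3), ]

nfold <- 10
# length.out must be the number of observations, not a stray object 'x'
infold <- sample(rep(1:nfold, length.out = nrow(train)))

# 2:257 takes all predictor columns; c(2, 257) takes only two of them
mydata <- data.frame(x = train[, 2:257], y = train[, 1])

K <- 20
errorMatrix <- matrix(NA, K, nfold)

for (l in 1:nfold) {  # 'for (l in nfold)' would run the outer loop only once
  for (k in 1:K) {
    fit <- kknn(y ~ ., train = mydata[infold != l, ],  # y ~ . uses every x column
                test = mydata[infold == l, ], k = k)
    errorMatrix[k, l] <- mean((fit$fitted.values - mydata$y[infold == l])^2)
  }
}
```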
[R] Define pch and color based on two different columns
I am making a lattice plot and I would like to use the value in one column to define the pch and another column to define color of points. Something like: xyplot(mpg ~ wt | cyl, data=mtcars, col = gear, pch = carb ) There are unique pch points in the second and third panels, but these points are only unique within the plots, not among all the plots (as they should be). You can see this if you use the following code: xyplot(mpg ~ wt | cyl, data=mtcars, groups = carb ) This plot looks great for one group, but if you try to invoke two groups using c(gear, carb) I think it simply takes unique combinations of those two variables and plots them as unique colors. Another solution given by a StackExchange user: mypch <- 1:6 mycol <- 1:3 xyplot(mpg ~ wt | cyl, panel = function(x, y, ..., groups, subscripts) { pch <- mypch[factor(carb[subscripts])] col <- mycol[factor(gear[subscripts])] grp <- c(gear,carb) panel.xyplot(x, y, pch = pch, col = col) } ) This solution has the same problems as the code at the top. I think the issue causing problems with both solutions is that not every value for each group is present in each panel, and they are almost never in the same order. I think R is just interpreting the appearance of unique values as a signal to change to the next pch or color. My actual data file is very large, and it's not possible to sort my way out of this mess. It would be best if I could just use the value in two columns to actually define a color or pch for each point on an entire plot. Is there a way to do this? Ps, I had to post this via email because the Nabble site kept sending me an error message: "Message rejected by filter rule match" Thanks, Matt *Matthew R. Snyder* *~* PhD Candidate University Fellow University of Toledo Computational biologist, ecologist, and bioinformatician Sponsored Guest Researcher at NOAA PMEL, Seattle, WA. 
matthew.snyd...@rockets.utoledo.edu msnyder...@gmail.com [image: Mailtrack] <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&;> Sender notified by Mailtrack <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&;> 04/09/19, 1:49:27 PM [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Define pch and color based on two different columns
Thanks, Jim. I appreciate your contributed answer, but neither of those make the desired plot either. I'm actually kind of shocked this isn't an easier more straightforward thing. It seems like this would be something that a user would want to do frequently. I can actually do this for single plots in ggplot. Maybe I should contact the authors of lattice and see if this is something they can help me with or if they would like to add this as a feature in the future... Matt *Matthew R. Snyder* *~* PhD Candidate University Fellow University of Toledo Computational biologist, ecologist, and bioinformatician Sponsored Guest Researcher at NOAA PMEL, Seattle, WA. matthew.snyd...@rockets.utoledo.edu msnyder...@gmail.com [image: Mailtrack] <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&;> Sender notified by Mailtrack <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&;> 04/09/19, 7:52:27 PM On Tue, Apr 9, 2019 at 4:53 PM Jim Lemon wrote: > Hi Matthew, > How about this? > > library(lattice) > xyplot(mpg ~ wt | cyl, >data=mtcars, >col = mtcars$gear, >pch = mtcars$carb > ) > library(plotrix) > grange<-range(mtcars$gear) > xyplot(mpg ~ wt | cyl, >data=mtcars, >col = > color.scale(mtcars$gear,extremes=c("blue","red"),xrange=grange), >pch = as.character(mtcars$carb) > ) > > Jim > > On Wed, Apr 10, 2019 at 7:43 AM Matthew Snyder > wrote: > > > > I am making a lattice plot and I would like to use the value in one > column > > to define the pch and another column to define color of points. Something > > like: > > > > xyplot(mpg ~ wt | cyl, > >data=mtcars, > >col = gear, > >pch = carb > > ) > > > > There are unique pch points in the second and third panels, but these > > points are only unique within the plots, not among all the plots (as they > > should be). 
You can see this if you use the following code: > > > > xyplot(mpg ~ wt | cyl, > >data=mtcars, > >groups = carb > > ) > > > > This plot looks great for one group, but if you try to invoke two groups > > using c(gear, carb) I think it simply takes unique combinations of those > > two variables and plots them as unique colors. > > > > Another solution given by a StackExchange user: > > > > mypch <- 1:6 > > mycol <- 1:3 > > > > xyplot(mpg ~ wt | cyl, > > panel = function(x, y, ..., groups, subscripts) { > > pch <- mypch[factor(carb[subscripts])] > > col <- mycol[factor(gear[subscripts])] > > grp <- c(gear,carb) > > panel.xyplot(x, y, pch = pch, col = col) > > } > > ) > > > > This solution has the same problems as the code at the top. I think the > > issue causing problems with both solutions is that not every value for > each > > group is present in each panel, and they are almost never in the same > > order. I think R is just interpreting the appearance of unique values as > a > > signal to change to the next pch or color. My actual data file is very > > large, and it's not possible to sort my way out of this mess. It would be > > best if I could just use the value in two columns to actually define a > > color or pch for each point on an entire plot. Is there a way to do this? > > > > Ps, I had to post this via email because the Nabble site kept sending me > an > > error message: "Message rejected by filter rule match" > > > > Thanks, > > Matt > > > > > > > > *Matthew R. Snyder* > > *~* > > PhD Candidate > > University Fellow > > University of Toledo > > Computational biologist, ecologist, and bioinformatician > > Sponsored Guest Researcher at NOAA PMEL, Seattle, WA. 
> > matthew.snyd...@rockets.utoledo.edu > > msnyder...@gmail.com > > > > > > > > [image: Mailtrack] > > < > https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&; > > > > Sender > > notified by > > Mailtrack > > < > https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&; > > > > 04/09/19, > > 1:49:27 PM > > > > [[alternative HTML version deleted]] > > > > __ > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Define pch and color based on two different columns
I want to have one column in a dataframe define the color and another define the pch. This can be done easily with a single panel: xyplot(mpg ~ wt, data=mtcars, col = mtcars$gear, pch = mtcars$carb ) This produces the expected result: two pch that are the same color are unique in the whole plot. But when you add cyl as a factor. Those two points are only unique within their respective panels, and not across the whole plot. Matt *Matthew R. Snyder* *~* PhD Candidate University Fellow University of Toledo Computational biologist, ecologist, and bioinformatician Sponsored Guest Researcher at NOAA PMEL, Seattle, WA. matthew.snyd...@rockets.utoledo.edu msnyder...@gmail.com [image: Mailtrack] <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&;> Sender notified by Mailtrack <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&;> 04/09/19, 9:26:09 PM On Tue, Apr 9, 2019 at 9:23 PM Bert Gunter wrote: > 1. I am quite sure that whatever it is that you want to do can be done. > Probably straightforwardly. The various R graphics systems are mature and > extensive. > > 2. But I, for one, do not understand from your post what it is that you > want to do. Nor does anyone else apparently. > > Cheers, > > Bert Gunter > > "The trouble with having an open mind is that people keep coming along and > sticking things into it." > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > On Tue, Apr 9, 2019 at 8:10 PM Matthew Snyder > wrote: > >> Thanks, Jim. >> >> I appreciate your contributed answer, but neither of those make the >> desired >> plot either. I'm actually kind of shocked this isn't an easier more >> straightforward thing. It seems like this would be something that a user >> would want to do frequently. I can actually do this for single plots in >> ggplot. 
Maybe I should contact the authors of lattice and see if this is >> something they can help me with or if they would like to add this as a >> feature in the future... >> >> Matt >> >> >> >> *Matthew R. Snyder* >> *~* >> PhD Candidate >> University Fellow >> University of Toledo >> Computational biologist, ecologist, and bioinformatician >> Sponsored Guest Researcher at NOAA PMEL, Seattle, WA. >> matthew.snyd...@rockets.utoledo.edu >> msnyder...@gmail.com >> >> >> >> [image: Mailtrack] >> < >> https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&; >> > >> Sender >> notified by >> Mailtrack >> < >> https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&; >> > >> 04/09/19, >> 7:52:27 PM >> >> On Tue, Apr 9, 2019 at 4:53 PM Jim Lemon wrote: >> >> > Hi Matthew, >> > How about this? >> > >> > library(lattice) >> > xyplot(mpg ~ wt | cyl, >> >data=mtcars, >> >col = mtcars$gear, >> >pch = mtcars$carb >> > ) >> > library(plotrix) >> > grange<-range(mtcars$gear) >> > xyplot(mpg ~ wt | cyl, >> >data=mtcars, >> >col = >> > color.scale(mtcars$gear,extremes=c("blue","red"),xrange=grange), >> >pch = as.character(mtcars$carb) >> > ) >> > >> > Jim >> > >> > On Wed, Apr 10, 2019 at 7:43 AM Matthew Snyder >> > wrote: >> > > >> > > I am making a lattice plot and I would like to use the value in one >> > column >> > > to define the pch and another column to define color of points. >> Something >> > > like: >> > > >> > > xyplot(mpg ~ wt | cyl, >> > >data=mtcars, >> > >col = gear, >> > >pch = carb >> > > ) >> > > >> > > There are unique pch points in the second and third panels, but these >> > > points are only unique within the plots, not among all the plots (as >> they >> > > should be). 
You can see this if you use the following code: >> > > >> > > xyplot(mpg ~ wt | cyl, >> > >data=mtcars, >> > >groups = carb >> > > ) >> > > >> > > This plot looks great for one group, but if you try to invoke two >> groups >> > > using c(gear, carb) I think it simply tak
Re: [R] Define pch and color based on two different columns
I tried this too: xyplot(mpg ~ wt | cyl, data=mtcars, # groups = carb, subscripts = TRUE, col = as.factor(mtcars$gear), pch = as.factor(mtcars$carb) ) Same problem... *Matthew R. Snyder* *~* PhD Candidate University Fellow University of Toledo Computational biologist, ecologist, and bioinformatician Sponsored Guest Researcher at NOAA PMEL, Seattle, WA. matthew.snyd...@rockets.utoledo.edu msnyder...@gmail.com [image: Mailtrack] <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&;> Sender notified by Mailtrack <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&;> 04/09/19, 9:28:11 PM On Tue, Apr 9, 2019 at 8:18 PM Jeff Newmiller wrote: > Maybe you should use factors rather than character columns. > > On April 9, 2019 8:09:43 PM PDT, Matthew Snyder > wrote: > >Thanks, Jim. > > > >I appreciate your contributed answer, but neither of those make the > >desired > >plot either. I'm actually kind of shocked this isn't an easier more > >straightforward thing. It seems like this would be something that a > >user > >would want to do frequently. I can actually do this for single plots in > >ggplot. Maybe I should contact the authors of lattice and see if this > >is > >something they can help me with or if they would like to add this as a > >feature in the future... > > > >Matt > > > > > > > >*Matthew R. Snyder* > >*~* > >PhD Candidate > >University Fellow > >University of Toledo > >Computational biologist, ecologist, and bioinformatician > >Sponsored Guest Researcher at NOAA PMEL, Seattle, WA. 
> >matthew.snyd...@rockets.utoledo.edu > >msnyder...@gmail.com > > > > > > > >[image: Mailtrack] > >< > https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&; > > > >Sender > >notified by > >Mailtrack > >< > https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&; > > > >04/09/19, > >7:52:27 PM > > > >On Tue, Apr 9, 2019 at 4:53 PM Jim Lemon wrote: > > > >> Hi Matthew, > >> How about this? > >> > >> library(lattice) > >> xyplot(mpg ~ wt | cyl, > >>data=mtcars, > >>col = mtcars$gear, > >>pch = mtcars$carb > >> ) > >> library(plotrix) > >> grange<-range(mtcars$gear) > >> xyplot(mpg ~ wt | cyl, > >>data=mtcars, > >>col = > >> color.scale(mtcars$gear,extremes=c("blue","red"),xrange=grange), > >>pch = as.character(mtcars$carb) > >> ) > >> > >> Jim > >> > >> On Wed, Apr 10, 2019 at 7:43 AM Matthew Snyder > >> wrote: > >> > > >> > I am making a lattice plot and I would like to use the value in one > >> column > >> > to define the pch and another column to define color of points. > >Something > >> > like: > >> > > >> > xyplot(mpg ~ wt | cyl, > >> >data=mtcars, > >> >col = gear, > >> >pch = carb > >> > ) > >> > > >> > There are unique pch points in the second and third panels, but > >these > >> > points are only unique within the plots, not among all the plots > >(as they > >> > should be). You can see this if you use the following code: > >> > > >> > xyplot(mpg ~ wt | cyl, > >> >data=mtcars, > >> >groups = carb > >> > ) > >> > > >> > This plot looks great for one group, but if you try to invoke two > >groups > >> > using c(gear, carb) I think it simply takes unique combinations of > >those > >> > two variables and plots them as unique colors. 
> >> > > >> > Another solution given by a StackExchange user: > >> > > >> > mypch <- 1:6 > >> > mycol <- 1:3 > >> > > >> > xyplot(mpg ~ wt | cyl, > >> > panel = function(x, y, ..., groups, subscripts) { > >> > pch <- mypch[factor(carb[subscripts])] > >> > col <- mycol[factor(gear[subscripts])] > >> > grp <- c(gear,carb) > >> > panel.xyplot(x, y, pch = pch, col = col) > >> > }
Re: [R] Define pch and color based on two different columns
You are not late to the party. And you solved it! Thank you very much. You just made my PhD a little closer to reality! Matt *Matthew R. Snyder* *~* PhD Candidate University Fellow University of Toledo Computational biologist, ecologist, and bioinformatician Sponsored Guest Researcher at NOAA PMEL, Seattle, WA. matthew.snyd...@rockets.utoledo.edu msnyder...@gmail.com [image: Mailtrack] <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&;> Sender notified by Mailtrack <https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&;> 04/09/19, 10:01:53 PM On Tue, Apr 9, 2019 at 9:37 PM Peter Langfelder wrote: > Sorry for being late to the party, but has anyone suggested a minor > but important modification of the code from stack exchange? > > xyplot(mpg ~ wt | cyl, > panel = function(x, y, ..., groups, subscripts) { > pch <- mypch[factor(carb)[subscripts]] > col <- mycol[factor(gear)[subscripts]] > grp <- c(gear,carb) > panel.xyplot(x, y, pch = pch, col = col) > } > ) > > From the little I understand about what you're trying to do, this may > just do the trick. > > Peter > > On Tue, Apr 9, 2019 at 2:43 PM Matthew Snyder > wrote: > > > > I am making a lattice plot and I would like to use the value in one > column > > to define the pch and another column to define color of points. Something > > like: > > > > xyplot(mpg ~ wt | cyl, > >data=mtcars, > >col = gear, > >pch = carb > > ) > > > > There are unique pch points in the second and third panels, but these > > points are only unique within the plots, not among all the plots (as they > > should be). You can see this if you use the following code: > > > > xyplot(mpg ~ wt | cyl, > >data=mtcars, > >groups = carb > > ) > > > > This plot looks great for one group, but if you try to invoke two groups > > using c(gear, carb) I think it simply takes unique combinations of those > > two variables and plots them as unique colors. 
> > > > Another solution given by a StackExchange user: > > > > mypch <- 1:6 > > mycol <- 1:3 > > > > xyplot(mpg ~ wt | cyl, > > panel = function(x, y, ..., groups, subscripts) { > > pch <- mypch[factor(carb[subscripts])] > > col <- mycol[factor(gear[subscripts])] > > grp <- c(gear,carb) > > panel.xyplot(x, y, pch = pch, col = col) > > } > > ) > > > > This solution has the same problems as the code at the top. I think the > > issue causing problems with both solutions is that not every value for > each > > group is present in each panel, and they are almost never in the same > > order. I think R is just interpreting the appearance of unique values as > a > > signal to change to the next pch or color. My actual data file is very > > large, and it's not possible to sort my way out of this mess. It would be > > best if I could just use the value in two columns to actually define a > > color or pch for each point on an entire plot. Is there a way to do this? > > > > Ps, I had to post this via email because the Nabble site kept sending me > an > > error message: "Message rejected by filter rule match" > > > > Thanks, > > Matt > > > > > > > > *Matthew R. Snyder* > > *~* > > PhD Candidate > > University Fellow > > University of Toledo > > Computational biologist, ecologist, and bioinformatician > > Sponsored Guest Researcher at NOAA PMEL, Seattle, WA. 
> > matthew.snyd...@rockets.utoledo.edu > > msnyder...@gmail.com > > > > > > > > [image: Mailtrack] > > < > https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&; > > > > Sender > > notified by > > Mailtrack > > < > https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality5&; > > > > 04/09/19, > > 1:49:27 PM > > > > [[alternative HTML version deleted]] > > > > __ > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
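Assembling Peter Langfelder's fix into a self-contained sketch: the decisive change is building each factor over the whole column first and only then subscripting (factor(carb)[subscripts]), so a given carb or gear value maps to the same pch/col in every panel. By contrast, factor(carb[subscripts]) re-levels within each panel, which is why the earlier attempts drifted:

```r
library(lattice)

mypch <- 1:6
mycol <- 1:3

# Factors built once over the full columns: the level -> symbol/colour
# mapping is global, not per panel
carb.f <- factor(mtcars$carb)
gear.f <- factor(mtcars$gear)

xyplot(mpg ~ wt | cyl, data = mtcars,
       panel = function(x, y, ..., subscripts) {
         panel.xyplot(x, y,
                      pch = mypch[carb.f[subscripts]],
                      col = mycol[gear.f[subscripts]])
       })
```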
Re: [R] working on a data frame
On 7/24/2014 8:52 PM, Sarah Goslee wrote: > Hi, > > Your description isn't clear: > > On Thursday, July 24, 2014, Matthew <mailto:mccorm...@molbio.mgh.harvard.edu>> wrote: > > I am coming from the perspective of Excel and VBA scripts, but I > would like to do the following in R. > > I have a data frame with 14 columns and 32,795 rows. > > I want to check the value in column 8 (row 1) to see if it is a 0. > If it is not a zero, proceed to the next row and check the value > for column 8. > If it is a zero, then > a) change the zero to a 1, > b) divide the value in column 9 (row 1) by 1, > > > Row 1, or the row in which column 8 == 0? All rows in which the value in column 8==0. > Why do you want to divide by 1? Column 10 contains the result of the value in column 9 divided by the value in column 8. If the value in column 8==0, then the division can not be done, so I want to change the zero to a one in order to do the division. This is a fairly standard thing to do with this data. (The data are measurements of amounts at two time points. Sometimes a thing will not be present in the beginning (0), but very present at the later time. Column 10 is the log2 of the change. Infinite is not an easy number to work with, so it is common to change the 0 to a 1. On the other hand, something may be present at time 1, but not at the later time. In this case column 10 would be taking the log2 of a number divided by 0, so again the zero is commonly changed to a one in order to get a useable value in column 10. In both the preceding cases there was a real change, but Inf and NaN are not helpful.) > > c) place the result in column 10 (row 1) and > > > Ditto on the row 1 question. I want to work on all rows where column 8 (and column 9) contain a zero. Column 10 contains the result of the value in column 9 divided by the value in column 8. So, for row 1, column 10 row 1 contains the ratio column 9 row 1 divided by column 8 row 1, and so on through the whole 32,000 or so rows. 
Most rows do not have a zero in columns 8 or 9. Some rows have a zero in column 8 only, and some rows have a zero in column 9 only. I want to get rid of the zeros in these two columns and then do the division to get a manageable value in column 10. Division by zero and Inf are not considered 'manageable' by me.

> What do you want column 10 to be if column 8 isn't 0? Does it already
> have a value? I suppose it must.

Yes, column 10 does have something, but this something can be Inf or NaN, which I want to get rid of.

> > d) repeat this for each of the other 32,794 rows.
> >
> > Is this possible with an R script, and is this the way to go about
> > it? If it is, could anyone get me started?
>
> Assuming you want to put the new values in the rows where column 8 ==
> 0, you can do it in two steps:
>
> mydata[,10] <- ifelse(mydata[,8] == 0, mydata[,9]/whatever, mydata[,10])
> # where whatever is the thing you want to divide by that probably isn't 1
> mydata[,8] <- ifelse(mydata[,8] == 0, 1, mydata[,8])
>
> R programming is best done by thinking about vectorizing things,
> rather than doing them in loops. Reading the Intro to R that comes
> with your installation is a good place to start.

Would it be better to change the data frame into a matrix, or something else?

Thanks for your help.

> Sarah
>
> --
> Sarah Goslee
> http://www.stringpage.com
> http://www.sarahgoslee.com
> http://www.functionaldiversity.org

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
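Sarah's two-step ifelse() approach can be sketched end to end on a toy frame. The 14-column layout and the zero-to-one convention are from the thread; the values, and replacing zeros in both columns 8 and 9 before taking the log2 ratio, are assumptions for illustration:

```r
# Toy stand-in for the 32,795-row frame: only columns 8-10 matter here.
set.seed(1)
mydata <- as.data.frame(matrix(sample(0:5, 140, replace = TRUE), nrow = 10))

# Replace zeros in columns 8 and 9 by 1 (the thread's convention), then
# recompute the log2 ratio in column 10 -- all vectorized, no row loop.
mydata[, 8] <- ifelse(mydata[, 8] == 0, 1, mydata[, 8])
mydata[, 9] <- ifelse(mydata[, 9] == 0, 1, mydata[, 9])
mydata[, 10] <- log2(mydata[, 9] / mydata[, 8])

all(is.finite(mydata[, 10]))   # no Inf or NaN remains
```

Because everything is a whole-column operation, this also answers the follow-up question: there is no need to convert the data frame to a matrix for speed at this size.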
Re: [R] reshape: melt and cast
Yep, that works. Thanks, Stephen. I should have drawn the parallel with Excel Pivot tables sooner. On Tue, Sep 1, 2015 at 9:36 AM, stephen sefick wrote: > I would make this minimal. In other words, use an example data set, dput, > and use output of dput in a block of reproducible code. I don't understand > exactly what you want, but does sum work? If there is more than one record > for a given set of factors the sum is the sum of the counts. If only one > record, then the sum is the same as the original number. > > On Tue, Sep 1, 2015 at 10:00 AM, Matthew Pickard < > matthew.david.pick...@gmail.com> wrote: > >> Thanks, Stephen. I've looked into the fun.aggregate argument. I don't >> want to aggregate, so I thought leaving it blank (allowing it to default to >> NULL) would do that. >> >> >> Here's a corrected post (with further explanation): >> >> Hi, >> >> I have data that looks like this: >> >> >dput(head(ratings)) >> structure(list(QCode = structure(c(5L, 7L, 5L, 7L, 5L, 7L), .Label = >> c("APPEAR", >> "FEAR", "FUN", "GRAT", "GUILT", "Joy", "LOVE", "UNGRAT"), class = >> "factor"), >> PID = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("1123", >> "1136", "1137", "1142", "1146", "1147", "1148", "1149", "1152", >> "1153", "1154", "1156", "1158", "1161", "1164", "1179", "1182", >> "1183", "1191", "1196", "1197", "1198", "1199", "1200", "1201", >> "1203", "1205", "1207", "1208", "1209", "1214", "1216", "1219", >> "1220", "1222", "1223", "1224", "1225", "1226", "1229", "1236", >> "1237", "1238", "1240", "1241", "1243", "1245", "1246", "1248", >> "1254", "1255", "1256", "1257", "1260", "1262", "1264", "1268", >> "1270", "1272", "1278", "1279", "1280", "1282", "1283", "1287", >> "1288", "1292", "1293", "1297", "1310", "1311", "1315", "1329", >> "1332", "1333", "1343", "1346", "1347", "1352", "1354", "1355", >> "1356", "1360", "1368", "1369", "1370", "1378", "1398", "1400", >> "1403", "1404", "1411", "1412", "1420", "1421", "1423", "1424", >> "1426", "1428", "1432", 
"1433", "1435", "1436", "1438", "1439", >> "1440", "1441", "1443", "1444", "1446", "1447", "1448", "1449", >> "1450", "1453", "1454", "1456", "1459", "1460", "1461", "1462", >> "1463", "1468", "1471", "1475", "1478", "1481", "1482", "1487", >> "1488", "1490", "1493", "1495", "1497", "1503", "1504", "1508", >> "1509", "1511", "1513", "1514", "1515", "1522", "1524", "1525", >> "1526", "1527", "1528", "1529", "1532", "1534", "1536", "1538", >> "1539", "1540", "1543", "1550", "1551", "1552", "1554", "1555", >> "1556", "1558", "1559"), class = "factor"), RaterName = >> structure(c(1L, >> 1L, 1L, 1L, 1L, 1L), .Label = c("cwormhoudt", "zspeidel"), class = >> "factor"), >> SI1 = c(2L, 1L, 1L, 1L, 2L, 1L), SI2 = c(2L, 2L, 2L, 2L, >> 2L, 3L), SI3 = c(3L, 3L, 3L, 3L, 2L, 4L), SI4 = c(1L, 2L, >> 1L, 1
[R] fast way to create composite matrix based on mixed indices?
Hi all,

Sorry for the title here but I find this difficult to describe succinctly. Here's the problem.

I want to create a new matrix where each row is a composite of an old matrix, but where the row & column indexes of the old matrix change for different parts of the new matrix. For example, the second row of the new matrix (which has, e.g., 10 columns) might be columns 1 to 3 of row 2 of the old matrix, columns 4 to 8 of row 1 of the old matrix, and columns 9 to 10 of row 3 of the old matrix.

Here's an example in code:

#The old matrix
(old.mat <- matrix(1:30,nrow=3,byrow=TRUE))

#matrix of indices to create the new matrix from the old one.
#The 1st column gives the row number of the new matrix
#the 2nd gives the row of the old matrix that we're going to copy into the new matrix
#the 3rd gives the starting column of the old matrix for the row in col 2
#the 4th gives the end column of the old matrix for the row in col 2
index <- matrix(c(1,1,1,4,
                  1,3,5,10,
                  2,2,1,3,
                  2,1,4,8,
                  2,3,9,10),
                nrow=5,byrow=TRUE,
                dimnames=list(NULL,c('new.mat.row','old.mat.row','old.mat.col.start','old.mat.col.end')))

I will be given old.mat and index and want to create new.mat from them. I want to create a new matrix of two rows that looks like this:

new.mat <- matrix(c(1:4,25:30,11:13,4:8,29:30),byrow=TRUE,nrow=2)

So here, the first row of new.mat is columns 1 to 4 of row 1 of old.mat and columns 5 to 10 of row 3 of old.mat.

new.mat and old.mat will always have the same number of columns but the number of rows could differ.

I could accomplish this in a loop, but the real problem is quite large (new.mat might have 1e8 elements), and so a for loop would be prohibitively slow. I may resort to unix tools and use a shell script, but wanted to first see if this is doable in R in a fast way.

Thanks in advance!

Matt

--
Matthew C Keller
Asst.
Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com
Re: [R] fast way to create composite matrix based on mixed indices?
Brilliant, Denes. Thank you for your help. This worked and is obviously much faster than a loop...

On Thu, Sep 17, 2015 at 3:22 PM, Dénes Tóth wrote:
> Hi Matt,
>
> you could use matrix indexing. Here is a possible solution, which could be
> optimized further (probably).
>
> # The old matrix
> (old.mat <- matrix(1:30,nrow=3,byrow=TRUE))
>
> # matrix of indices
> index <- matrix(c(1,1,1,4,
>                   1,3,5,10,
>                   2,2,1,3,
>                   2,1,4,8,
>                   2,3,9,10),
>                 nrow=5,byrow=TRUE,
>                 dimnames=list(NULL,
>                               c('new.mat.row','old.mat.row',
>                                 'old.mat.col.start','old.mat.col.end')))
>
> # expected result
> new.mat <- matrix(c(1:4,25:30,11:13,4:8,29:30),
>                   byrow=TRUE, nrow=2)
>
> # column indices
> ind <- mapply(seq, index[, 3], index[, 4],
>               SIMPLIFY = FALSE, USE.NAMES = FALSE)
> ind_len <- vapply(ind, length, integer(1))
> ind <- unlist(ind)
>
> # old indices
> old.ind <- cbind(rep(index[, 2], ind_len), ind)
>
> # new indices
> new.ind <- cbind(rep(index[, 1], ind_len), ind)
>
> # create the new matrix
> result <- matrix(NA_integer_, max(index[, 1]), max(index[, 4]))
>
> # fill the new matrix
> result[new.ind] <- old.mat[old.ind]
>
> # check the results
> identical(result, new.mat)
>
> HTH,
> Denes

--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com
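The key idiom in Denes' answer is worth isolating: indexing a matrix with a two-column matrix of (row, col) pairs reads or writes many scattered cells in one vectorized step, which is what removes the need for a loop. A minimal demonstration:

```r
# Matrix indexing: each row of idx names one (row, col) cell of m,
# so a single assignment fills arbitrary scattered positions.
m <- matrix(0, nrow = 2, ncol = 3)
idx <- cbind(c(1, 2, 2),   # target rows
             c(3, 1, 2))   # target columns
m[idx] <- c(7, 8, 9)       # writes cells (1,3), (2,1), (2,2) at once
m
```

The same idiom reads cells: `old.mat[old.ind]` pulls every source cell as one flat vector, and `result[new.ind] <- ...` scatters that vector into the destination, so the whole rearrangement is two vectorized operations regardless of how many runs the index table describes.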
[R] Why does residuals.coxph use naive.var?
Hi all,

I noticed that the scaled Schoenfeld residuals produced by residuals.coxph(fit, type="scaledsch") were different from those returned by cox.zph for a model where robust standard errors have been estimated. Looking at the source code for both functions suggests this is because residuals.coxph uses the naive variance to scale the Schoenfeld residuals, whereas cox.zph uses the robust version when it is available.

Lines 20-21 of the version of residuals.coxph currently on github:

vv <- drop(object$naive.var)
if (is.null(vv)) vv <- drop(object$var)

i.e. the naive variance is used even when a robust version is available. Why is this the case? Have I missed something? Am I right in thinking that using the robust variance is the better choice if the intention is to check the proportional hazards assumption?

Here is a reproducible example using the heart data:

data(heart)
fit <- coxph(Surv(start, stop, event) ~ year + age + surgery + cluster(id), data=jasa1)
# Should return TRUE since both produce the scaled Schoenfeld residuals
all(residuals(fit, type='scaledsch') == cox.zph(fit)$y)

Thanks for your help.
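One way to see the two scalers diverge on the poster's own model, assuming the documented `var` and `naive.var` components of a coxph fit (this only illustrates why the residuals differ; it does not settle which scaling is preferable):

```r
library(survival)

# The poster's model: cluster(id) requests robust (sandwich) standard errors.
fit <- coxph(Surv(start, stop, event) ~ year + age + surgery + cluster(id),
             data = jasa1)

# With cluster(), coxph() keeps both variance estimates: fit$var holds the
# robust sandwich estimate and fit$naive.var the model-based one.  Since
# residuals.coxph scales by naive.var while cox.zph prefers the robust var
# (per the poster's reading of the source), any difference between the two
# matrices propagates into the scaled Schoenfeld residuals.
max(abs(fit$var - fit$naive.var)) > 0
```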
Re: [R] 64-bit R on Mac OS X 10.4.5
"gfortran -arch x86_64" \
> FC="gfortran -arch x86_64" \
> --with-system-zlib \
> --with-blas='-framework vecLib' --with-lapack && \
> make -j4 && \
> make check && \
> make install
> cd ..
>
> When I try to run it by typing R, it gives me the following error:
> -bash: R: command not found
>
> Can anybody help me to solve this problem or direct me to better
> step-by-step instructions?
> Thanks
> Joseph

--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com
[R] tryCatch
Hi All R-Gurus,

I am trying to debug a program, and I think tryCatch will help. The functions involved process through so many times before I encounter the error that things are a bit slow to use debug and browser(). I've read the help file and postings on conditions, and am still having trouble.

While running my program I am getting a NaN from a called function, and I want to know the input parameters that generate this, so I have included the following in the code of the main function (calling function):

tryCatch(delM > S,
         exception=function(e) print(list(S=S, Si=Si, D=D, theta=S/N, incr=del.t)),
         finally=print("finally"))

This is actually part of an "if" statement, where delM > S is the condition. If delM is NaN, an error results. The above tryCatch does not work in the way I wish. What sort of condition does this little expression throw when it encounters delM = NaN? Is it an exception? What is wrong with the above handler, etc.?

Kind regards,
Matt Redding

DISCLAIMER: The information contained in the above e-mail message or messages (which includes any attachments) is confidential and may be legally privileged. It is intended only for the use of the person or entity to which it is addressed. If you are not the addressee any form of disclosure, copying, modification, distribution or any action taken or omitted in reliance on the information is unauthorised. Opinions contained in the message(s) do not necessarily reflect the opinions of the Queensland Government and its authorities. If you received this communication in error, please notify the sender immediately and delete it from your computer system network.
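For what it's worth, a sketch of what seems to be happening (the variable values below are invented): NaN > S evaluates to NA, and if (NA) signals an ordinary error of class simpleError. There is no built-in condition class called "exception" in R, so a handler named exception= never fires; catching "error" does:

```r
# Invented stand-ins for the poster's variables.
delM <- NaN; S <- 5; Si <- 1; D <- 2; N <- 10; del.t <- 0.1

res <- tryCatch(
  # NaN > 5 is NA, and if(NA) raises "missing value where TRUE/FALSE needed"
  if (delM > S) "bigger" else "smaller",
  error = function(e) {
    # dump the inputs that produced the NaN, as the poster intended
    print(list(S = S, Si = Si, D = D, theta = S / N, incr = del.t))
    "caught"
  },
  finally = print("finally")
)
res   # "caught"
```

A cheaper alternative when the bad value is known to be NaN is to test for it directly before the comparison, e.g. `if (is.nan(delM)) { ...dump inputs... }`, which avoids wrapping the hot loop in a handler at all.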
[R] RGoogleDocs: getDocs() - "problems connecting to get the list of documents"
Hi,

I have been using RGoogleDocs successfully for some time now, but something seems to have happened which is preventing me from accessing my data in Google spreadsheets. I get the message "problems connecting to get the list of documents" when I use getDocs, despite being logged in, e.g.:

sheets.con = getGoogleDocsConnection(getGoogleAuth("username", "password", service = "wise"))
ts = getWorksheets("formname", sheets.con)
=> Error in getDocs(con) : problems connecting to get the list of documents

Does anyone know what might be causing this? Is it maybe a problem at the Google end?

Matthew Blackett
Researcher
King's College London
http://geography.kcl.ac.uk/micromet/MBlackett/
[R] Video demo of using svSocket with data.table
Dear r-help,

If you haven't already seen this: http://www.youtube.com/watch?v=rvT8XThGA8o

The video consists of typing at the console and graphics; there is no audio and no slides. Please press the HD button and maximise. It's about 8 minutes.

Regards,
Matthew
[R] Plotting point text-labels with lattice splom
I have read the thread re: "Plotting text with lattice" but can't seem to get from there to what I need... would appreciate any advice.

I have used splom to plot data of the first three principal components from a PCA analysis. Here is the code I have thus far:

> mydata.pr <- prcomp(mydata)
> grps <- substr(rownames(mydata), 1, 4)
> super.sym = trellis.par.get("superpose.symbol")
> splom(data.frame(mydata.pr$x[,1:3]), groups = grps, panel = panel.superpose,
      key = list(title = "Four Items in PCA space",
                 text = list(c("G", "H", "N", "Il")),
                 points = list(pch = super.sym$pch[1:4], col = super.sym$col[1:4])))

I would now like to append text labels to each point in the plot that will identify the item based on its rowname in the original data set. So, something like this gets me the labels I want:

> labs <- substr(rownames(mydata), 1, 6)

My trouble then comes in figuring out how to get these labels to "attach" to the corresponding points in the plot.

Thanks.
Matt

--
Matthew Jockers
Stanford University
http://www.stanford.edu/~mjockers
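A hedged sketch of one way to attach the labels, on invented data: a custom panel function draws the points and then calls panel.text() at the same coordinates, so the labels follow the points in every splom panel. Combining this with groups/panel.superpose is left aside for clarity:

```r
library(lattice)

# Invented stand-in for mydata: 10 named rows, 4 numeric columns.
set.seed(42)
mydata <- as.data.frame(matrix(rnorm(40), nrow = 10,
                               dimnames = list(paste0("item", 1:10),
                                               paste0("V", 1:4))))
mydata.pr <- prcomp(mydata)
labs <- substr(rownames(mydata), 1, 6)

# Each splom subpanel receives the (x, y) pair for all points in row order,
# so labs lines up with the points without any extra bookkeeping.
p <- splom(data.frame(mydata.pr$x[, 1:3]),
           panel = function(x, y, ...) {
             panel.xyplot(x, y, ...)
             panel.text(x, y, labels = labs, pos = 3, cex = 0.6)
           })
print(p)
```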
[R] R help:
Hi,

I have written code to do some averaging of data over uneven intervals. The for loop keeps missing particular depths, and I once got an error message reading:

*** caught segfault ***
address 0xc023, cause 'memory not mapped'

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

The portion of the code that is giving me problems is:

if(length(which( interp.depth == highres.depth[i] )) > 0 ) {
  print(paste("depth = ",highres.depth[i],sep=""))
  depth.tracker <- c(highres.depth[i],depth.tracker)
  caco3.interp.vector <- c(mean(caco3.interp),caco3.interp.vector)
  caco3.interp <- numeric(0)
}

When the routine misses a depth, it returns a length of zero for (say) depth = 1.4, or highres.depth[141], but when I type in the value 1.4, I get the proper answer. Any idea what is going on here?

thanks
Matt

______
Matthew S. Fantle
Assistant Professor
Department of Geosciences
Penn State University
212 Deike Bldg.
University Park, PA 16802
Phone: 814-863-9968
mfan...@psu.edu
Departmental Homepage: http://www.geosc.psu.edu/people/faculty/personalpages/mfantle/index.html
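A guess at the "missed depth" symptom (it does not explain the segfault, which is worth reporting separately): depth grids built by seq() often fail exact == tests against typed literals, even though both values print identically, because the accumulated multiples of 0.1 are not exactly the double nearest to the literal:

```r
# A grid like the poster's: 14 * 0.1 is not bit-identical to the literal 1.4.
highres.depth <- seq(0, 2, by = 0.1)

highres.depth[15]                          # prints 1.4, but...
highres.depth[15] == 1.4                   # FALSE: exact comparison misses it

# Compare with a tolerance instead of ==
isTRUE(all.equal(highres.depth[15], 1.4))  # TRUE
which(abs(highres.depth - 1.4) < 1e-8)     # tolerant replacement for which(==)
```

So a robust version of the test in the loop is `length(which(abs(interp.depth - highres.depth[i]) < tol)) > 0` for a tolerance smaller than the grid spacing, rather than `interp.depth == highres.depth[i]`.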
Re: [R] data.table evaluating columns
I'd go a bit further and remind that the r-help posting guide is clear:

"For questions about functions in standard packages distributed with R (see the FAQ Add-on packages in R), ask questions on R-help. If the question relates to a contributed package, e.g., one downloaded from CRAN, try contacting the package maintainer first. You can also use find("functionname") and packageDescription("packagename") to find this information. ONLY send such questions to R-help or R-devel if you get no reply or need further assistance. This applies to both requests for help and to bug reports."

The "ONLY" is in bold in the posting guide. I changed the bold to capitals above for people reading this in text only.

Since Tom and I are friendly and responsive, users of data.table don't usually make it to r-help. We'll follow up this one off-list. Please note that Rob's question is very good by the rest of the posting guide, so no complaints there, only that it was sent to the wrong place. Please keep the questions coming, but send them to us, not r-help.

You do sometimes see messages to r-help starting something like "I have contacted the authors/maintainers but didn't hear back, does anyone know ...". To not state that they had would be an implicit request for further work by the community (for free) to ask if they had. So it's not enough to contact the maintainer first; you also have to say that you have, and perhaps how long ago would be helpful too. For r-forge projects I usually send any question to everyone on the project (easy to find) or, if they have a list, to that.

HTH
Matthew

"Tom Short" wrote in message news:fd27013a1003021718w409acb32r1281dfeca5593...@mail.gmail.com...
On Tue, Mar 2, 2010 at 7:09 PM, Rob Forler wrote:
> Hi everyone,
>
> I have the following code that works on data frames that I would like to
> work on data.tables. However, I'm not really sure how to go about it.
>
> I basically have the following:
>
> names = c("data1", "data2")
> frame = data.frame(list(key1=as.integer(c(1,2,3,4,5,6)),
>     key2=as.integer(c(1,2,3,2,5,6)), data1 = c(3,3,2,3,5,2),
>     data2 = c(3,3,2,3,5,2)))
>
> for(i in 1:length(names)){
>     frame[, paste(names[i], "flag")] = frame[,names[i]] < 3
> }
>
> Now I try with data.table code:
>
> names = c("data1", "data2")
> frame = data.table(list(key1=as.integer(c(1,2,3,4,5,6)),
>     key2=as.integer(c(1,2,3,2,5,6)), data1 = c(3,3,2,3,5,2),
>     data2 = c(3,3,2,3,5,2)))
>
> for(i in 1:length(names)){
>     frame[, paste(names[i], "flag"), with=F] = as.matrix(frame[,names[i], with=F]) < 3
> }

Rob, this type of question is better for the package maintainer(s) directly rather than R-help. That said, one answer is to use list addressing:

for(i in 1:length(names)){
    frame[[paste(names[i], "flag")]] = frame[[names[i]]] < 3
}

Another option is to manipulate frame as a data frame and convert to data.table when you need that functionality (conversion is quick). In the data.table version, frame[,names[i], with=F] is the same as frame[,names[i], drop=FALSE] (the answer is a list, not a vector). Normally, it's easier to use [[]] or $ indexing to get this. Also, frame[i,j] <- something assignment is still a bit buggy for data.tables.

- Tom

Tom Short
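Tom's [[ ]] suggestion collected into a self-contained sketch, using a plain data.frame so it runs without the package; the same [[ ]] addressing works on a data.table, which is the point of his answer:

```r
# Rob's data, as a data.frame; [[ ]] addressing is class-agnostic here.
nms <- c("data1", "data2")
frame <- data.frame(key1 = 1:6,
                    key2 = c(1L, 2L, 3L, 2L, 5L, 6L),
                    data1 = c(3, 3, 2, 3, 5, 2),
                    data2 = c(3, 3, 2, 3, 5, 2))

# Add one logical flag column per data column, named e.g. "data1 flag".
for (nm in nms) {
  frame[[paste(nm, "flag")]] <- frame[[nm]] < 3
}

names(frame)   # key1, key2, data1, data2, "data1 flag", "data2 flag"
```

(Using `nms` rather than `names` as the variable also avoids shadowing the base function `names()`, which the original code does.)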
Re: [R] data.table evaluating columns
That in itself is a question for the maintainer, off r-help. When the posting guide says "contact the package maintainer first" it means it literally, and applies even to questions about the existence of a mailing list for the package. So what I'm supposed to do now is tell you how the posting guide works, and tell you that I'll reply off list. Then hopefully the community will be happy with me too. So I'll reply off list :-)

"Rob Forler" wrote in message news:eb472fec1003030502s4996511ap8dfd329a3...@mail.gmail.com...
> Okay, I appreciate the help, and I appreciate the FAQ reminder. I will read
> the r-help posting guide. I'm relatively new to using the support systems
> around R. So far everyone has been really helpful.
>
> I'm confused as to which data.table "list" I should be using.
> http://lists.r-forge.r-project.org/pipermail/datatable-commits/ doesn't
> appear to be correct. Or just directly sending an email to all of you?
>
> Thanks again,
> Rob
>
> On Wed, Mar 3, 2010 at 6:05 AM, Matthew Dowle wrote:
>
>> I'd go a bit further and remind that the r-help posting guide is clear:
>> [earlier message, quoted here in full, trimmed; see above]
Re: [R] Three most useful R package
Dieter,

One way to check if a package is active is by looking on r-forge. If you are referring to data.table, you would have found it is actually very active at the moment and is far from abandoned. What you may be referring to is a warning, not an error, with v1.2 on R 2.10+. That was fixed many moons ago. The r-forge version is where it's at.

Rather than commenting in public about a warning on a package, making a conclusion about its abandonment, and doing this without copying the maintainer, perhaps you could have contacted the maintainer to let him know you had found a problem. That would have been a more community-spirited action to take. Doing that at the time you found out would have been helpful too, rather than saving it up for now. Or you can always check the svn logs yourself, as the r-forge guys have made that trivial to do.

All,

Can we please now stop this thread? The crantastic people worked hard to provide a better solution. If the community refuses to use crantastic, that's up to the community, but to start now filling up r-help with votes on packages when so much effort was put into a much, much better solution ages ago? It's as quick to put your votes into crantastic as it is to write to r-help. What's your problem, folks, with crantastic? The second reply mentioned crantastic but you all chose to ignore it, it seems. If you want to vote, use crantastic. If you don't want to vote, don't vote. But using r-help to vote?! The better solution is right there: http://crantastic.org/

Matthew

"Dieter Menne" wrote in message news:1267626882999-1576618.p...@n4.nabble.com...
>
> Rob Forler wrote:
>>
>> And data.table because it does aggregation about 50 times faster than plyr
>> (which I used to use a lot).
>>
> This is correct; from the error message it spits out, one has to conclude
> that it was abandoned at R version 2.4.x
>
> Dieter
Re: [R] ifthen() question
This post breaks the posting guide in multiple ways. Please read it again (and then again), in particular the first 3 paragraphs. You will help yourself by following it.

The solution is right there in the help page for ?data.frame and other places, including An Introduction to R. I think it's more helpful *not* to tell you what it is, so that you discover it for yourself, learn how to learn, and google. I hope that you appreciate that I've been helpful by simply (and quickly) telling you the answer *is* there.

Having said that, you don't appear to be aware of the many packages around that do this task; you appear to be re-inventing the wheel. I suggest you briefly investigate each and every one of the top 30 packages ranked by crantastic before writing any more R code. A little time invested doing that will pay you dividends in the long run. That is not a complaint of you, though, as that advice is not in the posting guide.

Matthew

"AC Del Re" wrote in message news:85cf8f8d1003040735k2b076142jc99b7ec34da87...@mail.gmail.com...
> Hi All,
>
> I am using a specialized aggregation function to reduce a dataset with
> multiple rows per id down to 1 row per id. My function works perfectly when
> there are >1 id but alters the 'var.g' in undesirable ways when this
> condition is not met. Therefore, I have been trying ifthen() statements to
> keep the original value when the length of unique id == 1, but I cannot get
> it to work, e.g.:
>
> # function to aggregate effect sizes:
> aggs <- function(g, n.1, n.2, cor = .50) {
>   n.1 <- mean(n.1)
>   n.2 <- mean(n.2)
>   N_ES <- length(g)
>   corr.mat <- matrix(rep(cor, N_ES^2), nrow=N_ES)
>   diag(corr.mat) <- 1
>   g1g2 <- cbind(g) %*% g
>   PSI <- (8*corr.mat + g1g2*corr.mat^2)/(2*(n.1+n.2))
>   PSI.inv <- solve(PSI)
>   a <- rowSums(PSI.inv)/sum(PSI.inv)
>   var.g <- 1/sum(PSI.inv)
>   g <- sum(g*a)
>   out <- cbind(g, var.g, n.1, n.2)
>   return(out)
> }
>
> # automating this procedure for all rows of df.
> This format works perfectly when there is >1 id per row only:
>
> agg_g <- function(id, g, n.1, n.2, cor = .50) {
>   st <- unique(id)
>   out <- data.frame(id=rep(NA,length(st)))
>   for(i in 1:length(st)) {
>     out$id[i] <- st[i]
>     out$g[i] <- aggs(g=g[id==st[i]], n.1=n.1[id==st[i]],
>                      n.2=n.2[id==st[i]], cor)[1]
>     out$var.g[i] <- aggs(g=g[id==st[i]], n.1=n.1[id==st[i]],
>                          n.2=n.2[id==st[i]], cor)[2]
>     out$n.1[i] <- round(mean(n.1[id==st[i]]),0)
>     out$n.2[i] <- round(mean(n.2[id==st[i]]),0)
>   }
>   return(out)
> }
>
> # The attempted solution using ifthen() and minor changes to the function,
> # but it's not working properly:
> agg_g <- function(df, var.g, id, g, n.1, n.2, cor = .50) {
>   df$var.g <- var.g
>   st <- unique(id)
>   out <- data.frame(id=rep(NA,length(st)))
>   for(i in 1:length(st)) {
>     out$id[i] <- st[i]
>     out$g[i] <- aggs(g=g[id==st[i]], n.1=n.1[id==st[i]],
>                      n.2=n.2[id==st[i]], cor)[1]
>     out$var.g[i] <- ifelse(length(st[i])==1, df$var.g[id==st[i]],
>                            aggs(g=g[id==st[i]], n.1=n.1[id==st[i]],
>                                 n.2=n.2[id==st[i]], cor)[2])
>     out$n.1[i] <- round(mean(n.1[id==st[i]]),0)
>     out$n.2[i] <- round(mean(n.2[id==st[i]]),0)
>   }
>   return(out)
> }
>
> # sample data:
> id <- c(1, rep(1:19))
> n.1 <- c(10,20,13,22,28,12,12,36,19,12,36,75,33,121,37,14,40,16,14,20)
> n.2 <- c(11,22,10,20,25,12,12,36,19,11,34,75,33,120,37,14,40,16,10,21)
> g <- c(.68,.56,.23,.64,.49,-.04,1.49,1.33,.58,1.18,-.11,1.27,.26,.40,.49,
>        .51,.40,.34,.42,1.16)
> var.g <- c(.08,.06,.03,.04,.09,.04,.009,.033,.0058,.018,.011,.027,.026,
>            .0040,.049,.0051,.040,.034,.0042,.016)
> df <- data.frame(id, n.1, n.2, g, var.g)
>
> Any help is much appreciated,
>
> AC
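For what it's worth, one thing to check in the attempted solution (an observation, not a tested fix for the whole function): st[i] is a single element, so length(st[i]) is always 1 and the ifelse() condition can never be FALSE. The intended test is presumably how many rows share that id:

```r
# Toy id vector with one duplicated id.
id <- c(1, 1, 2, 3)
st <- unique(id)

length(st[1])       # 1 -- a single element always has length 1
sum(id == st[1])    # 2 -- the number of rows carrying this id

# So a per-id "is this a singleton?" test would be sum(id == st[i]) == 1,
# and since that condition is a single scalar, plain if () ... else ...
# is clearer inside the loop than the vectorized ifelse().
```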
Re: [R] Nonparametric generalization of ANOVA
Frank, I respect your views but I agree with Gabor. The posting guide does not support your views. It is not any of our views that are important but we are following the posting guide. It covers affiliation. It says only that "some" consider it "good manners to include a concise signature specifying affiliation". It does not agree that it is bad manners not to. It is therefore going too far to urge R-gurus, whoever they might be, to ignore such postings on that basis alone. It is up to responders (I think that is the better word, which is the one used by the posting guide) whether they reply. Missing affiliation is OK by the posting guide. Users shouldn't be put off from posting because of that alone. Sending from an anonymous email address such as "BioStudent" is also fine by the posting guide as far as my eyes read it. It says only that the email address should work. I would also answer such anonymous posts, providing they demonstrate they made best efforts to follow the posting guide, as usual for all requests for help. It's so easy to send from a false, but apparently real, name, so why worry about that? If you disagree with the posting guide then you could make a suggestion to get the posting guide changed with respect to these points. But, currently, good practice is defined by the posting guide, and I can't see that your view is backed up by it. In fact it seems to me that these points were carefully considered, and the wording is careful on these points. As far as I know you are wrong that there is no moderator. There are in fact an uncountable number of people who are empowered to moderate, i.e. all of us. In other words it's up to the responders to moderate. The posting guide is our guide. As a last resort we can alert the list administrator (which I believe is the correct name for him in that role), who has powers to remove an email address from the list if he thinks that is appropriate, or act otherwise, or not at all. It is actually up to responders (i.e. 
all of us) to ensure the posting guide is followed. My view is that the problems started with some responders on some occasions. They sometimes forgot, a little bit, to encourage and remind posters to follow the posting guide when it was not followed. This then may have encouraged more posters to think it was ok not to follow the posting guide. That is my own personal view, not a statistical one backed up by any evidence. Matthew "Frank E Harrell Jr" wrote in message news:4b913880.9020...@vanderbilt.edu... > Gabor Grothendieck wrote: >> I am happy to answer posts to r-help regardless of the name and email >> address of the poster but would draw the line at someone excessively >> posting without a reasonable effort to find the answer first or using >> it for homework since such requests could flood the list making it >> useless for everyone. > > Gabor I respectfully disagree. It is bad practice to allow anonymous > postings. We need to see real names and real affiliations. > > r-help is starting to border on uselessness because of the age old problem > of the same question being asked every two days, a high frequency of > specialty questions, and answers given with the best of intentions in > incremental or contradictory e-mail pieces (as opposed to a cumulative > wiki or hierarchically designed discussion web forum), as there is no > moderator for the list. We don't need even more traffic from anonymous > postings. > > Frank > >> >> On Fri, Mar 5, 2010 at 10:55 AM, Ravi Varadhan >> wrote: >>> David, >>> >>> I agree with your sentiments. I also think that it is bad posting >>> etiquette not to sign one's genuine name and affiliation when asking for >>> help, which "blue sky" seems to do a lot. Bert Gunter has already >>> raised this issue, and I completely agree with him. I would also like to >>> urge the R-gurus to ignore such postings. >>> >>> Best, >>> Ravi. >>> >>> >>> Ravi Varadhan, Ph.D. 
>>> Assistant Professor, >>> Division of Geriatric Medicine and Gerontology >>> School of Medicine >>> Johns Hopkins University >>> >>> Ph. (410) 502-2619 >>> email: rvarad...@jhmi.edu >>> >>> >>> - Original Message - >>> From: David Winsemius >>> Date: Friday, March 5, 2010 9:25 am >>> Subject: Re: [R] Nonparametric generalization of ANOVA >>> To: blue sky >>> Cc: r-h...@stat.math.ethz.ch >>> >>> >>>> On Mar 5, 2010, at 8:19 AM, blue sky wrote: >>>> >&
Re: [R] Nonparametric generalization of ANOVA
John, So you want BlueSky to change their name to "Paul Smith" at "New York University", just to give a totally random example of a false name, and then you will be happy? I just picked a popular, real name at a real, big place. Are you, or is anyone else, going to check that it's real? We want BlueSky to ask great questions, which haven't been asked before, and to follow the posting guide. If BlueSky improves the knowledge base, what's the problem? This person may well be breaking the posting guide for many other reasons (I haven't looked), and if they are then you could take issue with them on those points, but not for simply writing as "BlueSky". David W has got it right when he replied to "ManInMoon". Shall we stop this thread now, and follow his lead? I would have picked "ManOnMoon" myself but maybe that one was taken. It's rather difficult to be on a moon, let alone inside it. Matthew "John Sorkin" wrote in message news:4b91068702cb00064...@medicine.umaryland.edu... > The sad part of this interchange is that Blue Sky does not seem to be > amenable to suggestion. He, or she, has not taken note of, or responded to, the > fact that a number of people believe it is good manners to give a real > name and affiliation. My mother taught me that when two people tell you > that you are drunk you should lie down until the inebriation goes away. > Blue Sky, several people have noted that you would do well to give us your > name and affiliation. Is this too much to ask given that people are good > enough to help you? > John > > > > > John David Sorkin M.D., Ph.D. > Chief, Biostatistics and Informatics > University of Maryland School of Medicine Division of Gerontology > Baltimore VA Medical Center > 10 North Greene Street > GRECC (BT/18/GR) > Baltimore, MD 21201-1524 > (Phone) 410-605-7119 > (Fax) 410-605-7913 (Please call phone number above prior to faxing)>>> > "Matthew Dowle" 3/5/2010 12:58 PM >>> > Frank, I respect your views but I agree with Gabor. 
The posting guide > does > not support your views. > > It is not any of our views that are important but we are following the > posting guide. It covers affiliation. It says only that "some" consider > it > "good manners to include a concise signature specifying affiliation". It > does not agree that it is bad manners not to. It is therefore going too > far > to urge R-gurus, whoever they might be, to ignore such postings on that > basis alone. It is up to responders (I think that is the better word > which > is the one used by the posting guide) whether they reply. Missing > affiliation is ok by the posting guide. Users shouldn't be put off from > posting because of that alone. > > Sending from an anonymous email address such as "BioStudent" is also fine > by > the posting guide as far as my eyes read it. It says only that the email > address should work. I would also answer such anonymous posts, providing > they demonstrate they made best efforts to follow the posting guide, as > usual for all requests for help. Its so easy to send from a false, but > apparently real name, why worry about that? > > If you disagree with the posting guide then you could make a suggestion to > get the posting guide changed with respect to these points. But, > currently, > good and practice is defined by the posting guide, and I can't see that > your > view is backed up by it. In fact it seems to me that these points were > carefully considered, and the wording is careful on these points. > > As far as I know you are wrong that there is no moderator. There are in > fact an uncountable number of people who are empowered to moderate i.e. > all > of us. In other words its up to the responders to moderate. The posting > guide is our guide. As a last resort we can alert the list administrator > (which I believe is the correct name for him in that role), who has powers > to remove an email address from the list if he thinks that is appropriate, > or act otherwise, or not at all. 
It is actually up to responders (i.e. > all > of us) to ensure the posting guide is followed. > > My view is that the problems started with some responders on some > occasions. > They sometimes forgot, a little bit, to encourage and remind posters to > follow the posting guide when it was not followed. This then may have > encouraged more posters to think it was ok not to follow the posting > guide. > That is my own personal view, not a statistical one backed up by any > evidence. > > Matthew > > > "Frank E Harrell Jr" wrote in message > news:4b913880.9020...@vanderbil
Re: [R] fit a gamma pdf using Residual Sum-of-Squares
Thanks for making it quickly reproducible - I was able to see that message in English within a few seconds. The start has x=86, but the data is also called x. Remove x=86 from start and you get a different error. P.S. - please do include the R version information. It saves time for us, and we like it if you save us time. "vincent laperriere" wrote in message news:883644.16455...@web24106.mail.ird.yahoo.com... Hi all, I would like to fit a gamma pdf to my data using the method of RSS (Residual Sum-of-Squares). Here are the data: x <- c(86, 90, 94, 98, 102, 106, 110, 114, 118, 122, 126, 130, 134, 138, 142, 146, 150, 154, 158, 162, 166, 170, 174) y <- c(2, 5, 10, 17, 26, 60, 94, 128, 137, 128, 77, 68, 65, 60, 51, 26, 17, 9, 5, 2, 3, 7, 3) I have typed the following code, using the nls method: fit <- nls(y ~ (1/((s^a)*gamma(a))*x^(a-1)*exp(-x/s)), start = c(s=3, a=75)) But I get the following error message (sorry, this is in German): Fehler in qr(.swts * attr(rhs, "gradient")) : Dimensionen [Produkt 3] passen nicht zur Länge des Objektes [23] Zusätzlich: Warnmeldung: In .swts * attr(rhs, "gradient") : Länge des längeren Objektes ist kein Vielfaches der Länge des kürzeren Objektes [In English: Error in qr(.swts * attr(rhs, "gradient")): dimensions [product 3] do not match the length of the object [23]. In addition, warning message: longer object length is not a multiple of shorter object length.] Could anyone help me with the code? I would greatly appreciate it. Sincerely yours, Vincent Laperrière. [[alternative HTML version deleted]] > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
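A hedged sketch of the fix Matthew describes: `start` should name only the parameters to be estimated (s and a), never the predictor x, which nls() takes from the environment. Note that, as his reply says, removing x=86 from start merely produces "a different error" - the model may still fail to converge for other reasons (e.g. the density is not scaled to the count data) - so the call is wrapped in try():

```r
# Data from the post:
x <- c(86, 90, 94, 98, 102, 106, 110, 114, 118, 122, 126, 130, 134,
       138, 142, 146, 150, 154, 158, 162, 166, 170, 174)
y <- c(2, 5, 10, 17, 26, 60, 94, 128, 137, 128, 77, 68, 65, 60, 51,
       26, 17, 9, 5, 2, 3, 7, 3)
# `start` lists only the parameters -- not the data vector x:
fit <- try(nls(y ~ (1/((s^a)*gamma(a))*x^(a-1)*exp(-x/s)),
               start = c(s = 3, a = 75)),
           silent = TRUE)
```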
Re: [R] IMPORTANT - To remove the null elements from a vector
Welcome to R, Barbara. It's quite an incredible community from all walks of life. Your beginner questions are answered in the manual. See An Introduction to R. Please read the posting guide again because it contains lots of good advice for you. Some people read it three times before posting because they have so much respect for the community. Sometimes they trip up over themselves to show they have read it. Btw - just to let you know that starting your subject lines with "IMPORTANT" is considered by some people to be a demanding tone when asking for free help. Not everyone, but some people. Two posts starting with IMPORTANT within 5 minutes is another thing that a very large number of people around the world may have just seen you do. I'm just letting you know, in case you were not aware of this. You received answers from four people who clearly don't mind, and you have your answers. Was that your only goal in posting? Did you consider there might be downsides? This is a public list read by many people and one thing the posting guide says is that your questions are saved in the archives forever. Just checking you knew that. I wouldn't want you to reduce your reputation accidentally. A future employer (it might be a company, or it might be a university) anywhere in the world might do a simple search on your name, and that might be why you don't get an interview: you would have shown (in their minds) that you didn't have respect for guidelines. I would hate for something like that to happen, all just because you didn't know you were supposed to read the posting guide; it wouldn't be fair on you. So it would be very unfair of me to know that, and suspect that you don't, but not tell you about the posting guide, wouldn't it? I hope this information helps you. It is entirely up to you. r-help is a great way to increase your reputation, but it can reduce your reputation too. 
By asking great questions, or even contributing, you can proudly put that on your CV and increase your chances of getting that interview, or getting that position. I have seen on several CVs from students the text "please search for my name on r-help". r-help is just like everything else you do in public. What you write, you write in the public domain, and you write it free of charge, and free of restriction. All this applies to all of us. When asking for help, and when giving help. Matthew wrote in message news:of1a8063a1.fc14f5ff-onc12576e1.00466053-c12576e1.00466...@uniroma1.it... > > I have a vector that has null elements. How do I remove these elements? > For example: > x=[10 0 30 40 0 0] I want the vector y=[10 30 40] > Thanks > [[alternative HTML version deleted]] > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
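For completeness of the archive, the base-R answer the four responders presumably gave (assuming, as the example implies, that "null elements" means zero values; genuinely missing values would instead be dropped with `x[!is.na(x)]`):

```r
x <- c(10, 0, 30, 40, 0, 0)
y <- x[x != 0]   # logical indexing keeps only the non-zero elements
y
# [1] 10 30 40
```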
Re: [R] speed
Your choice of subject line alone shows some people that you missed some small details from the posting guide. The ability to notice small details may be important for you to demonstrate in future. Any answer in this thread is unlikely to be found by a topic search on subject lines alone since "speed" is a single word. One fast way to increase your reputation is to contribute. You now have an opportunity. If you follow Jim's good advice, discover the answer for yourself, and post it back to the group, changing the subject line so that it's easier for others to find in future, that's one way you can contribute and increase your reputation. If you don't do that, that's your choice. It is entirely up to you. Whatever action you take next (even doing nothing is an action), it is visible in public for everyone to search back and find within seconds. HTH "Adam Majewski" wrote in message news:hn6fp4$2g...@dough.gmane.org... > Hi, > > I have found some example R code: > http://commons.wikimedia.org/wiki/File:Mandelbrot_Creation_Animation_%28800x600%29.gif > > When I run this code on my computer it takes a few seconds. > > I wanted to make a similar program in Maxima CAS: > > http://thread.gmane.org/gmane.comp.mathematics.maxima.general/29949/focus=29968 > > for example: > > f(x,y,n) := > block([i:0, c:x+y*%i,ER:4,iMax:n,z:0], > while abs(z)<ER and i<iMax > do (z:z*z + c,i:i+1), > min(ER,abs(z)))$ > > wxanimate_draw3d( >n, 5, >enhanced3d=true, >user_preamble="set pm3d at b; set view map", >xu_grid=70, >yv_grid=70, >explicit('f(x,y,n), x, -2, 0.7, y, -1.2, 1.2))$ > > > But it takes so long to make even one image (hours) > > What makes the difference, and why is R so fast? > > Regards > > Adam > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
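Editor's note on the technical question buried in the thread (why does the R version take seconds while the Maxima version takes hours): the usual explanation is vectorization - idiomatic R Mandelbrot code updates the entire grid of points per iteration in compiled C-level operations, rather than looping point by point as the quoted Maxima function does. A minimal sketch of that idea (grid size and iteration count are illustrative only, not taken from the linked script):

```r
# Vectorized Mandelbrot iteration counts: the whole 70x70 grid of
# complex points is updated at once on each pass of the loop.
C <- outer(seq(-2, 0.7, length.out = 70),
           1i * seq(-1.2, 1.2, length.out = 70), "+")
Z <- matrix(0 + 0i, nrow(C), ncol(C))
n <- matrix(0L, nrow(C), ncol(C))   # iterations before escape
for (k in 1:20) {
  alive <- Mod(Z) < 2               # points that have not escaped yet
  Z[alive] <- Z[alive]^2 + C[alive]
  n[alive] <- n[alive] + 1L
}
```

Twenty passes over 4,900 points is 20 vector operations here, versus roughly 98,000 interpreted loop-body evaluations in the point-by-point version.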
Re: [R] Strange result in survey package: svyvar
This list is the wrong place for that question. The posting guide tells you, in bold, to contact the package maintainer first. If you had already done that, and didn't hear back from him, then you should tell us, so that we know you followed the guide. "Corey Sparks" wrote in message news:c7bd3ca5.206a%corey.spa...@utsa.edu... > Hi R users, > I'm using the survey package to calculate summary statistics for a large > health survey (the Demographic and Health Survey for Honduras, 2006), and > when I try to calculate the variances for several variables, I get > negative > numbers. I thought it may be my data, so I ran the example on the help > page: > > data(api) > ## one-stage cluster sample > dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) > > svyvar(~api00+enroll+api.stu+api99, dclus1) > variance SE > api00 11182.8 1386.4 > api00 11516.3 1412.9 > api.stu -4547.1 3164.9 > api99 12735.2 1450.1 > > If I look at the full matrix for the variances (and covariances): > test<-svyvar(~api00+enroll+api.stu+api99, dclus1) > > print(test, covariance=T) > variance SE > api00:api00 11182.8 1386.4 > enroll:api00 -5492.4 3458.1 > api.stu:api00 -4547.1 3164.9 > api99:api00 11516.3 1412.9 > api00:enroll -5492.4 3458.1 > enroll:enroll 136424.3 41377.2 > api.stu:enroll 114035.7 34153.9 > api99:enroll -3922.3 3589.9 > api00:api.stu -4547.1 3164.9 > enroll:api.stu 114035.7 34153.9 > api.stu:api.stu 96218.9 28413.7 > api99:api.stu -3060.0 3260.9 > api00:api99 11516.3 1412.9 > enroll:api99 -3922.3 3589.9 > api.stu:api99 -3060.0 3260.9 > api99:api99 12735.2 1450.1 > > > I see that the function is actually returning the covariance for the > api.stu > with the api00 variable. > > I can get the correct variances if I just take > diag(test) > > But I just was wondering if anyone else was having this problem. 
I'm > using > : >> sessionInfo() > R version 2.10.1 Patched (2009-12-20 r50794) > x86_64-apple-darwin9.8.0 > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] survey_3.19 > > loaded via a namespace (and not attached): > [1] tools_2.10.1 > > And have the same error on a linux server. > > Thanks, > Corey > -- > Corey Sparks > Assistant Professor > Department of Demography and Organization Studies > University of Texas at San Antonio > 501 West Durango Blvd > Monterey Building 2.270C > San Antonio, TX 78207 > 210-458-3166 > corey.sparks 'at' utsa.edu > https://rowdyspace.utsa.edu/users/ozd504/www/index.htm > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] leaps error
R version: 2.10.0 platform: i486-pc-linux-gnu I'm trying to perform model selection from a data.frame object (creatively named "data") using the leaps function, and I run across the following error: > leaps(data[,3:7], data[,1], nbest = 10) Error in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = NCOL(x) + int, : character variables must be duplicated in .C/.Fortran Here is a sample of what 'data' looks like:

   ethanol flask batch delabso delgluc delglyc ph
1     0.00     1     0    1.41     0.0     0.7  1
2     0.00     2     0    1.33     0.0     0.6  9
3     0.00     2     0    1.18     0.0     1.1  1
4     0.00     3     1    1.58     0.0     3.5  1
5     0.00     4     0    1.25     0.0     5.0  1
6     0.00     4     0   -0.01     0.0     5.0  1
7     0.32     5     0   -0.08     0.0     1.5  1
8     0.00     6     1    1.22     0.1     3.0  1
9     0.00     6     1    1.30     0.3     0.4  1
10    0.13     7     0    1.48     0.3     1.4  1

where flask, batch, and ph are factors and the rest of the variables are of class numeric. There are NAs in this data set. Does anyone understand this? Thanks Matt Scholz PhD Candidate Department of Ag. and Biosystems Engineering University of Arizona (520) 626-6947 [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
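No answer to this question appears in this chunk of the archive, so a hedged editorial diagnosis: leaps() hands its predictor matrix straight to Fortran, which is where the "character variables must be duplicated in .C/.Fortran" message comes from, so every column must be numeric - but the data frame here contains factors (flask, batch, ph) plus NAs. A sketch of the usual workaround via model.matrix(), on invented stand-in data (variable names are hypothetical, not from the post):

```r
set.seed(1)
df <- data.frame(y    = rnorm(10),
                 num1 = rnorm(10),
                 num2 = rnorm(10),
                 fac  = factor(sample(letters[1:3], 10, replace = TRUE),
                               levels = letters[1:3]))
df <- na.omit(df)                      # drop rows with NAs first
# Expand the factor into numeric dummy columns and drop the intercept:
X <- model.matrix(~ num1 + num2 + fac, data = df)[, -1]
# leaps::leaps(X, df$y, nbest = 10)    # now every column of X is numeric
```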
Re: [R] Forecasting with Panel Data
Ricardo, I see you got no public answer so far, on either of the two lists you posted to at the same time yesterday. You are therefore unlikely to ever get a reply. I also see you've been having trouble getting answers in the past, back to Nov 09, at least. For example no reply to "Credit Migration Matrix" (Jan 2010) and no reply to "Help with a Loop in function" (Nov 2009). For your information, this is a public place and it took me about 10 seconds to assess you. Anyone else on the planet can do this too. Please read the posting guide AND the links from it, especially the last link. I suggest you read it fully, and slowly. I think it's just that you didn't know about it, or somehow missed it by accident. You were told to read it, though, at the time you subscribed to this list, at least. Don't worry, this is not a huge problem. You can build up your reputation again very quickly. With the kindest of regards, Matthew "Ricardo Gonçalves Silva" wrote in message news:df406bd9dbe644a9b8c0642a3c3f8...@ricardopc... > Dear Users, > > Can I perform out-of-sample forecasts from panel data (fixed effects model) > using R? > > Thanks in advance, > > Ricardo. > [[alternative HTML version deleted]] > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] [R-SIG-Mac] How to interrupt an R process that hangs
Hi all, Thanks Simon and Duncan for the help. Sorry to be dense, but I'm still unsure how to interrupt such processes. Here's an example: for (i in 1:10){ a <- matrix(rnorm(10^5*10^5),ncol=10^5) b <- svd(a) } If you run this, R will hang (i.e., it's a legitimate execution, it will just take a really long time to execute). The most obvious solution is to write code that doesn't do unintended things, but that's not always possible. Is there a way to interrupt it? I tried: kill -s INT <PID> and at least on Mac it had no effect. Thanks again, Matt On Mon, Mar 15, 2010 at 1:19 PM, Simon Urbanek wrote: > > On Mar 15, 2010, at 14:42 , Adam D. I. Kramer wrote: > >> +1--this is the single most-annoying issue with R that I know of. >> >> My usual solution, after accomplishing nothing as R spins idly for a >> couple >> hours, is to kill the process and lose any un-saved work. save.history() >> is >> my friend, but is a big delay when you work with big data sets as I do, so >> I >> don't run it after every command. >> >> I have cc'd r-help here, however, because I experience this problem with >> non-OSX R as well...when I run it in Linux or from the OSX command-line (I >> compile R for Darwin without aqua/R-framework), the same thing happens. >> >> Is there some way around this? Is this a known problem? >> > > "Hanging" for a long period of time is usually caused by poorly written > C/Fortran code. You can always interrupt R as long as it is in the R code. > Once you load a package that uses native code (C/Fortran/..) you have to > rely on the sanity of the developer to call R_CheckUserInterrupt() or > rchkusr() often enough (see 6.12 in R-ext). If you have some particular > package that does not do that, I would suggest alerting the author. By > definition this requires cooperation from authors, because interrupting > random code forcefully (as it was possible many years ago) creates leaks and > unstable states. 
> > Cheers, > Simon > > > >> Google searching suggests no solution, timeline, or anything, but the >> problem has been annoying users for at least twelve years: >> http://tolstoy.newcastle.edu.au/R/help/9704/0151.html >> >> Cordially, >> Adam >> >> On Mon, 15 Mar 2010, Matthew Keller wrote: >> >>> HI all, >>> >>> Apologies for this question. I'm sure it's been asked many times, but >>> despite 20 minutes of looking, I can't find the answer. I never use >>> the GUI, I use emacs, but my postdoc does, so I don't know what to >>> tell her about the following: >>> >>> Occasionally she'll mess up in her code and cause R to hang >>> indefinitely (e.g., R is trying to do something that will take days). >>> In these situations, is there an option other than killing R (and the >>> work you've done on your script to that point)? >>> >>> Thank you, >>> >>> Matthew Keller >>> >>> >>> -- >>> Matthew C Keller >>> Asst. Professor of Psychology >>> University of Colorado at Boulder >>> www.matthewckeller.com >>> >>> ___ >>> R-SIG-Mac mailing list >>> r-sig-...@stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac >>> >> >> ___ >> R-SIG-Mac mailing list >> r-sig-...@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/r-sig-mac >> >> > > -- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] [R-SIG-Mac] How to interrupt an R process that hangs
Hi all, Thanks for the responses. Ted - thank you for your help. I had to laugh. I'm no computer guru, but I do know unix well enough to know not to type "<PID>". But then again, my original code did contain a matrix with > 2^31-1 elements, so maybe your assumption was reasonable ;) Anyway, all your kill statements merely kill R, script included, which doesn't really do what I'd like. Thus, summary of responses: Question: "How do I interrupt an R process that's taking too long?" Answer: "You don't. Kill R. And don't make mistakes." Matthew On Mon, Mar 15, 2010 at 2:49 PM, Ted Harding wrote: > [Though I'm not using a Mac, OS X is a Unix variant and should > have the commands used below installed] > > Did you *literally* do > kill -s INT <PID> > without substituting the R PID for "<PID>"? In a Mac console, do > > ps aux | grep R > > On my Linux machine this currently responds with (amongst some > irrelevant lines): > > ted 8625 0.0 3.2 41568 34096 pts/6 S+ Mar13 0:07 > /usr/lib/R/bin/exec/R --no-save > > showing that the PID of the R process is 8625. Then you can do > whatever corresponds to > > kill -s INT 8625 > > (replacing "8625" with what you get from ps). However, when I > just tried it, it didn't work for me either. So I changed the > Signal from "INT" to "HUP", and this time it did work. Maybe > try this instead? > > Other ways of using 'kill' include > (a) Use the signal number (1 for HUP, 2 for INT) like > > kill -1 8625 or kill -2 8625 > > (b) Don't search for the numeric Process ID (PID) but kill it > by name ('killall' command): > > killall -1 R or killall -2 R > > However, this will kill every running instance of R (if you have > two or more running simultaneously), and you may not want that! > > Hoping this helps, > Ted. > > > > On 15-Mar-10 20:20:29, Matthew Keller wrote: >> Hi all, >> >> Thanks Simon and Duncan for the help. Sorry to be dense, but I'm still >> unsure how to interrupt such processes. 
Here's an example: >> >> for (i in 1:10){ >> a <- matrix(rnorm(10^5*10^5),ncol=10^5) >> b <- svd(a) } >> >> If you run this, R will hang (i.e., it's a legitimate execution, it >> will just take a really long time to execute). The most obvious >> solution is to write code that doesn't do unintended things, but >> that's not always possible. Is there a way to interrupt it? I tried: >> >> kill -s INT <PID> >> >> and at least on Mac it had no effect. Thanks again, >> >> Matt >> >> >> >> On Mon, Mar 15, 2010 at 1:19 PM, Simon Urbanek >> wrote: >>> >>> On Mar 15, 2010, at 14:42 , Adam D. I. Kramer wrote: >>> >>>> +1--this is the single most-annoying issue with R that I know of. >>>> >>>> My usual solution, after accomplishing nothing as R spins idly for a >>>> couple >>>> hours, is to kill the process and lose any un-saved work. >>>> save.history() >>>> is >>>> my friend, but is a big delay when you work with big data sets as I >>>> do, so >>>> I >>>> don't run it after every command. >>>> >>>> I have cc'd r-help here, however, because I experience this problem >>>> with >>>> non-OSX R as well...when I run it in Linux or from the OSX >>>> command-line (I >>>> compile R for Darwin without aqua/R-framework), the same thing >>>> happens. >>>> >>>> Is there some way around this? Is this a known problem? >>>> >>> >>> "Hanging" for a long period of time is usually caused by poorly >>> written >>> C/Fortran code. You can always interrupt R as long as it is in the R >>> code. >>> Once you load a package that uses native code (C/Fortran/..) you have >>> to >>> rely on the sanity of the developer to call R_CheckUserInterrupt() or >>> rchkusr() often enough (see 6.12 in R-ext). If you have some >>> particular >>> package that does not do that, I would suggest alerting the author. By >>> definition this requires cooperation from authors, because >>> interrupting >>> random code forcefully (as it was possible many years ago) creates >>> leaks and >>> unstable states. 
>>> >>> Cheers, >>> Simon >>> >>> >>> >>>> Google searching suggests no solut
[R] Problem specifying Gamma distribution in lme4/glmer
Dear R and lme4 users- I am trying to fit a mixed-effects model, with the glmer function in lme4, to right-skewed, zero-inflated, non-normal data representing understory grass and forb biomass (continuous) as a function of tree density (indicated by leaf-area). Thus, I have tried to specify a Gamma distribution with a log-link function but consistently receive an error as follows: > total=glmer(total~gla4+(1|plot)+(1|year/month),data=veg,family=Gamma(link=log)) > summary(total) Error in asMethod(object) : matrix is not symmetric [1,2] I have also tried fitting glmm's with lme4 and glmer to other Gamma-distributed data but receive the same error. Has anyone had similar problems and found any solutions? Thank you for your input. Best regards, ___ Matt Giovanni, Ph.D. NSERC Visiting Research Fellow Canadian Wildlife Service 2365 Albert St., Room 300 Regina, SK S4P 4K1 306-780-6121 work 402-617-3764 mobile http://sites.google.com/site/matthewgiovanni/ __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Merge data frame and keep unmatched
Or if you need it to be fast, try data.table. X[Y] is a join when X and Y are both data.tables. X[Y] is a left join, Y[X] is a right join. 'nomatch' controls the inner/outer join, i.e. what happens for unmatched rows. This is much faster than merge(). "Gabor Grothendieck" wrote in message news:971536df0906100704q433f5f99ld3f9c23e69d95...@mail.gmail.com... Try: merge(completedf, partdf, all.x = TRUE) or library(sqldf) # see http://sqldf.googlecode.com sqldf("select * from completedf left join partdf using(beta, alpha)") On Wed, Jun 10, 2009 at 9:56 AM, Etienne B. Racine wrote: > > Hi, > > With two data sets, one complete and another one partial, I would like to > merge them and keep the unmatched lines. The problem is that merge() > doesn't > keep the unmatched lines. Is there another function that I could use to > merge the data frames? > > Example: > > completedf <- expand.grid(alpha=letters[1:3],beta=1:3) > partdf <- data.frame( > alpha= c('a','a','c'), > beta = c(1,3,2), > val = c(2,6,4)) > > mergedf <- merge(x=completedf, y=partdf, by=c('alpha','beta')) > # it only kept the common rows > nrow(mergedf) > > Thanks, > Etienne > -- > View this message in context: > http://www.nabble.com/Merge-data-frame-and-keep-unmatched-tp23962874p23962874.html > Sent from the R help mailing list archive at Nabble.com. > > __ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
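A short sketch of the X[Y] join syntax described above, applied to Etienne's example data (data.table package required; exact output formatting may differ across versions of the package, and `nomatch=0` has since gained `nomatch=NULL` as a synonym):

```r
library(data.table)
# Keyed data.tables; the key columns are what X[Y] joins on.
completedf <- data.table(expand.grid(alpha = letters[1:3], beta = 1:3,
                                     stringsAsFactors = FALSE),
                         key = c("alpha", "beta"))
partdf <- data.table(alpha = c("a", "a", "c"), beta = c(1L, 3L, 2L),
                     val = c(2, 6, 4), key = c("alpha", "beta"))
partdf[completedf]                # all 9 rows kept; unmatched val is NA
partdf[completedf, nomatch = 0]   # inner join: only the 3 matched rows
```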
Re: [R] How to order an data.table by values of an column?
If the question really meant to say "data.table" (i.e. package "data.table") then it's easier than the data.frame answer. dt = data.table(Categ=c(468,351,0,234,117),Perc=c(31.52,27.52,0.77,22.55,15.99)) dt[order(Categ)] Notice there is no dt$ required before Categ. Also note the comma is optional. See help("[.data.table") Another example : dt[Categ>300,cumsum(Perc+Categ)] [1] 499.52 878.04 That's it. The i and the j are evaluated within the data.table, i.e. you can use column names as variables in expressions, like a built-in with() and subset(). A join between 2 data.tables X and Y is just X[Y]. This is much faster than merge(). "Allan Engelhardt" wrote in message news:4a309f8e.4000...@cybaea.com... > See help("order") and help("[.data.frame"). > > > df <- > data.frame(Categ=c(468,351,0,234,117),Perc=c(31.52,27.52,0.77,22.55,15.99)) > df[order(df$Categ),] > # Categ Perc > # 3 0 0.77 > # 5 117 15.99 > # 4 234 22.55 > # 2 351 27.52 > # 1 468 31.52 > > > Lesandro wrote: >> Hello! >> >> Can you help me? How do I order a data.table by the values of a column? >> >> For example: >> >> Initial table >> >> Categ Perc >> 468 31.52 >> 351 27.52 >> 0 0.77 >> 234 22.55 >> 117 15.99 >> >> Final table >> >> Categ Perc >> 0 0.77 >> 117 15.99 >> 234 22.55 >> 351 27.52 >> 468 31.52 >> >> Lesandro >> >> [[alternative HTML version deleted]] >> >> >> >> __ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > __ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. 
[R] Memory errors when using QCA package
Hi, I have been using the QCA package, in particular the "eqmcc" function and I am having some issues when trying to use this to minimise a particular boolean function. The boolean function in question has 16 variables, and I am providing the full truth table for the function (65536 with 256 true entries), in the following way : library(QCA) func_tt = read.table("func.tt",header=TRUE) eqmcc(func_tt, outcome="O", expl.0=TRUE) However, after calculating for a little while, the system throws up a memory error : Error in vector("double", length) : cannot allocate vector of length 2130706560 However, looking at the memory usage, I seem to have far more than 2GB free. Is there some kind of built-in limit on the size of the heap in R? If so, is there some way I can extend this? Does anyone have any insight into this? Perhaps I am doing something stupid? Thanks Matthew __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
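A note on the error itself (my arithmetic, not from the thread): the message reports the failure of a single allocation, not that total memory is exhausted, so having more than 2 GB free does not help:

```r
# a double vector of the reported length would need roughly:
2130706560 * 8 / 1024^3   # about 15.9 GiB for one allocation
```

No heap setting rescues that on a 32-bit build; the computation itself has to be made smaller.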
Re: [R] If else statements
Here are some references. Please read these first and post again if you are still stuck after reading them. If you do post again, we will need x and y. 1. Introduction to R : 9.2.1 Conditional execution: if statements. 2. R Language Definition : 3.2 Control structures. 3. R for beginners by E Paradis : 6.1 Loops and vectorization 4. Eric Raymond's essay "How to Ask Questions The Smart Way" http://www.catb.org/~esr/faqs/smart-questions.html. HTH Matthew "tj" wrote in message news:1269325933723-1678705.p...@n4.nabble.com... > > Hi everyone! > May I request again for your help? > I need to make some codes using if else statements... > Can I do an "if-else statement" inside an "if-else statement"? Is this the > correct form of writing it? > Thank you.=) > > Example: > > for (v in 1:6) { > for (i in 2:200) { > if (v==1) > (if max(x*v-y*v)>1 break()) > > if (v==2) > (if max(x*v-y*v)>1.8 break()) > > if (v==3) > (if max(x*v-y*v)>2 break()) > } > } > -- > View this message in context: > http://n4.nabble.com/If-else-statements-tp1678705p1678705.html > Sent from the R help mailing list archive at Nabble.com. > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
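For the record, tj's loop as posted will not parse: in R the condition of `if` must be in parentheses. A minimal corrected sketch of nested if/else with `break` — the vectors `x` and `y` are invented here, since none were supplied in the original post:

```r
# x and y are placeholders -- the original post did not define them
x <- seq(0.1, 2, length.out = 200)
y <- seq(0.05, 1, length.out = 200)

for (v in 1:3) {
  for (i in 2:200) {
    if (v == 1) {
      if (max(x * v - y * v) > 1) break      # condition needs parentheses
    } else if (v == 2) {
      if (max(x * v - y * v) > 1.8) break    # 'break' needs no parentheses
    } else if (v == 3) {
      if (max(x * v - y * v) > 2) break
    }
  }
}
```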
Re: [R] Mosaic
When you click search on the R homepage, type "mosaic" into the box, and click the button, do the top 3 links seem relevant ? Your previous 2 requests for help : 26 Feb : Response was SuppDists. Yet that is the first hit returned by the subject line you posted : "Hartleys table" 22 Feb : Response was shapiro.test. Yet that is in the second hit returned by the subject line you posted : "normality in split plot design" Spot the pattern ? "Silvano" wrote in message news:a9322645c4f846a3a6a9daaa8b5a2...@ccepc... Hi, I have this data set: obitoss = c( 5.8,17.4,5.9,17.6,5.8,17.5,4.7,15.8, 3.8,13.4,3.8,13.5,3.7,13.4,3.4,13.6, 4.4,17.3,4.3,17.4,4.2,17.5,4.3,17.0, 4.4,13.6,5.1,14.6,5.7,13.5,3.6,13.3, 6.5,19.6,6.4,19.4,6.3,19.5,6.0,19.7) (dados = data.frame( regiao = factor(rep(c('Norte', 'Nordeste', 'Sudeste', 'Sul', 'Centro-Oeste'), each=8)), ano = factor(rep(c('2000','2001','2002','2003'), each=2)), sexo = factor(rep(c('F','M'), 4)), resp=obitoss)) I would like to make a mosaic to represent the numeric variable depending on 3 variables. Does anyone know how to do? -- Silvano Cesar da Costa Departamento de Estatística Universidade Estadual de Londrina Fone: 3371-4346 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
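One possible answer, as a hedged sketch: sum `resp` over the three factors with `xtabs()` and pass the 3-way table to base R's `mosaicplot()`. Whether this is the display Silvano wanted is an assumption:

```r
obitoss <- c(5.8,17.4,5.9,17.6,5.8,17.5,4.7,15.8,
             3.8,13.4,3.8,13.5,3.7,13.4,3.4,13.6,
             4.4,17.3,4.3,17.4,4.2,17.5,4.3,17.0,
             4.4,13.6,5.1,14.6,5.7,13.5,3.6,13.3,
             6.5,19.6,6.4,19.4,6.3,19.5,6.0,19.7)
dados <- data.frame(
  regiao = rep(c('Norte','Nordeste','Sudeste','Sul','Centro-Oeste'), each = 8),
  ano    = rep(c('2000','2001','2002','2003'), each = 2),
  sexo   = c('F','M'),
  resp   = obitoss)

# 3-way table of summed resp, drawn as a mosaic
mosaicplot(xtabs(resp ~ regiao + ano + sexo, data = dados),
           main = "resp by regiao, ano and sexo", color = TRUE)
```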
Re: [R] translating SQL statements into data.table operations
Nick, Good question, but just sent to the wrong place. The posting guide asks you to contact the package maintainer first before posting to r-help only if you don't hear back. I guess one reason for that is that if questions about all 2000+ packages were sent to r-help, then r-help's traffic could go through the roof. Another reason could be that some (i.e. maybe many, maybe few) package maintainers don't actually monitor r-help and might miss any messages you post here. I only saw this one thanks to google alerts. Since I'm writing anyway ... are you using the latest version on r-forge which has the very fast grouping? Have you set multi-column keys on both edt and cdt and tried edt[cdt,roll=TRUE] syntax ? We'll help you off list to climb the learning curve quickly. We are working on FAQs and a vignette and they should be ready soon too. Please do follow up with us (myself and Tom Short cc'd are the main developers) off list and one of us will be happy to help further. Matthew "Nick Switanek" wrote in message news:772ec1011003241351v6a3f36efqb0b0787564691...@mail.gmail.com... > I've recently stumbled across data.table, Matthew Dowle's package. I'm > impressed by the speed of the package in handling operations with large > data.frames, but am a bit overwhelmed with the syntax. I'd like to express > the SQL statement below using data.table operations rather than sqldf > (which > was incredibly slow for a small subset of my financial data) or > import/export with a DBMS, but I haven't been able to figure out how to do > it. I would be grateful for your suggestions. > > nick > > > > My aim is to join events (trades) from two datasets ("edt" and "cdt") > where, > for the same stock, the events in one dataset occur between 15 and 75 days > before the other, and within the same time window. I can only see how to > express the "WHERE e.SYMBOL = c.SYMBOL" part in data.table syntax. 
I'm > also > at a loss at whether I can express the remainder using data.table's > %between% operator or not. > > ctqm <- sqldf("SELECT e.*, > c.DATE 'DATEctrl', > c.TIME 'TIMEctrl', > c.PRICE 'PRICEctrl', > c.SIZE 'SIZEctrl' > > FROM edt e, ctq c > > WHERE e.SYMBOL = c.SYMBOL AND > julianday(e.DATE) - julianday(c.DATE) BETWEEN 15 AND > 75 AND > strftime('%H:%M:%S',c.TIME) BETWEEN > strftime('%H:%M:%S',e.BEGTIME) AND strftime('%H:%M:%S',e.ENDTIME)") > > [[alternative HTML version deleted]] > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
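For readers finding this thread later: data.table gained non-equi joins in version 1.9.8 (2016), long after this exchange, which express exactly this kind of WHERE clause. A hypothetical sketch on toy stand-ins for edt and ctq (column names assumed from the SQL above; not the 2010 syntax the thread discusses):

```r
library(data.table)

# toy stand-ins -- the real column set is assumed from the SQL above
edt <- data.table(SYMBOL = "IBM", DATE = as.IDate("2009-06-01"),
                  BEGTIME = as.ITime("09:30:00"), ENDTIME = as.ITime("10:00:00"))
ctq <- data.table(SYMBOL = "IBM",
                  DATE   = as.IDate(c("2009-04-01", "2009-05-25")),
                  TIME   = as.ITime(c("09:45:00", "11:00:00")),
                  PRICE  = c(100, 101), SIZE = c(10, 20))

# e.DATE - c.DATE between 15 and 75  =>  c.DATE in [e.DATE-75, e.DATE-15]
edt[, `:=`(lodate = DATE - 75L, hidate = DATE - 15L)]
ctqm <- ctq[edt,
            on = .(SYMBOL, DATE >= lodate, DATE <= hidate,
                   TIME >= BEGTIME, TIME <= ENDTIME),
            nomatch = 0]
```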
Re: [R] NA values in indexing
The type of 'NA' is logical. So x[NA] behaves more like x[TRUE] i.e. silent recycling. > class(NA) [1] "logical" > x=101:108 > x[NA] [1] NA NA NA NA NA NA NA NA > x[c(TRUE,NA)] [1] 101 NA 103 NA 105 NA 107 NA > x[as.integer(NA)] [1] NA HTH Matthew "Barry Rowlingson" wrote in message news:d8ad40b51003260509y6b671e53o9f79142d2b52c...@mail.gmail.com... If you index a vector with a vector that has NA in it, you get NA back: > x=101:107 > x[c(NA,4,NA)] [1] NA 104 NA > x[c(4,NA)] [1] 104 NA All well and good. ?"[" says, under NAs in indexing: When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA in the corresponding element of a logical, integer, numeric, complex or character result, and NULL for a list. (It returns 00 for a raw result.] But if the indexing vector is all NA, you get back a vector of length of your original vector rather than of your index vector: > x[c(NA,NA)] [1] NA NA NA NA NA NA NA Maybe it's just me, but I find this surprising, and I can't see it documented. Bug or undocumented feature? Apologies if I've missed something obvious. Barry sessionInfo() R version 2.11.0 alpha (2010-03-25 r51407) i686-pc-linux-gnu locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
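A small addendum (mine, not from the thread): if the length-of-index behaviour is what you want, coerce the NA index away from logical before subsetting:

```r
x <- 101:107
x[as.integer(c(NA, NA))]  # NA NA -- length 2, not length(x)
x[which(c(NA, NA))]       # integer(0) -- which() drops NA/FALSE positions
```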
Re: [R] Combing
Val, Type "combine two data sets" (text you wrote in your post) into www.rseek.org. The first two links are: "Quick-R: Merge" and "Merging data: A tutorial". Isn't it quicker for you to use rseek, rather than the time it takes to write a post and wait for a reply ? Don't you also get more detailed information that way too ? You already received advice from others on this list to look at www.rseek.org on 26 Oct, package 'sos' on 27 Oct, and to 'read the manuals and FAQs before posting' on 5 Nov. This month you have posted 3 times : "Loop", "Renumbering" and "Combing". References : 1. Posting Guide headings : "Do your homework before posting" and "Further resources" 2. Contributed Documentation e.g. 'R Reference Card' by Tom Short http://cran.r-project.org/doc/contrib/Short-refcard.pdf. 3. Eric Raymond's essay http://www.catb.org/~esr/faqs/smart-questions.html. e.g. you posted to r-help 10 times so far, 9 of the 10 subjects were either a single word, or a single function name. HTH Matthew "Val" wrote in message news:cdc083ac1003290413s7e047e25lc4202568af119...@mail.gmail.com... > Hi all, > > I want to combine two data sets (ZA and ZB to get ZAB). > The common variable between the two data sets is ID. > > Data ZA > ID F M > 1 0 0 > 2 0 0 > 3 1 2 > 4 1 0 > 5 3 2 > 6 5 4 > > Data ZB > > ID v1 v2 v3 > 3 2.5 3.4 302 > 4 8.6 2.9 317 > 5 9.7 4.0 325 > 6 7.5 1.9 296 > > Output (ZAB) > > ID F M v1 v2 v3 > 1 0 0 -9 -9 -9 > 2 0 0 -9 -9 -9 > 3 1 2 2.5 3.4 302 > 4 1 0 8.6 2.9 317 > 5 3 2 9.7 4.0 325 > 6 5 4 7.5 1.9 296 > > Any help is highly appreciated in advance, > > Val > > [[alternative HTML version deleted]] > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
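Rseek aside, the answer itself is a one-liner plus a fill step. A sketch using Val's data — the -9 fill value is taken from his expected output:

```r
ZA <- data.frame(ID = 1:6,
                 F  = c(0, 0, 1, 1, 3, 5),
                 M  = c(0, 0, 2, 0, 2, 4))
ZB <- data.frame(ID = 3:6,
                 v1 = c(2.5, 8.6, 9.7, 7.5),
                 v2 = c(3.4, 2.9, 4.0, 1.9),
                 v3 = c(302, 317, 325, 296))

ZAB <- merge(ZA, ZB, by = "ID", all.x = TRUE)  # keep unmatched ZA rows
ZAB[is.na(ZAB)] <- -9                          # recode the gaps as -9
```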
[R] single quotes and double quotes in a system() command. What to do?
Hi all, I would like to run the following from within R: awk '{$3=$4="";gsub(" ","");print}' myfile > outfile However, this obviously won't work: system("awk '{$3=$4="";gsub(" ","");print}' myfile > outfile") and this won't either: system("awk '{$3=$4='';gsub(' ','');print}' myfile > outfile") Can anyone help me out? I'm sure there's an obvious solution. Thanks, Matt -- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
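For what it's worth, the usual fix is to escape the inner double quotes with backslashes, or to let `shQuote()` add the single quotes around the awk program (both untested against Matt's actual file):

```r
# escape the double quotes inside the double-quoted R string
system("awk '{$3=$4=\"\"; gsub(\" \",\"\"); print}' myfile > outfile")

# or keep the awk program readable and let shQuote() add the single quotes
prog <- '{$3=$4="";gsub(" ","");print}'
system(paste("awk", shQuote(prog), "myfile > outfile"))
```

The second form works because the program contains no single quotes, so `shQuote()` can wrap it in them unchanged.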
Re: [R] Error "grid must have equal distances in each direction"
M Joshi, I don't know but I guess that some might have looked at your previous thread on 14 March (also about the geoR package). You received help and good advice then, but it doesn't appear that you are following it. It appears to be a similar problem this time. Also, this list is the wrong place for that question. Please read the posting guide to find out the correct place. Its a question about a package. HTH, Matthew "maddy" wrote in message news:1269974076132-1745651.p...@n4.nabble.com... > > Hello All, > > Can anyone please help me on this error? > > Error in FUN(X[[1L]], ...) : > different grid distances detected, but the grid must have equal distances > in each direction -- try gridtriple=TRUE that avoids numerical errors. > > The program that I am trying to run posted in the previous post of this > thread.After the rows 1021 of my matrix of size 1024*1024, I start getting > all the values as 0s. > How to set the gridtriple as I am using the grf function which does not > take > this parameter as input. > > The maximum vector limit that can be reached in 'R' is 2^30, why does it > not > allow me to create arrays of length even of size 2^17? > > Thanks, > M Joshi > -- > View this message in context: > http://n4.nabble.com/Error-grid-must-have-equal-distances-in-each-direction-tp1695189p1745651.html > Sent from the R help mailing list archive at Nabble.com. > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Question about 'logit' and 'mlogit' in Zelig
Abraham, This appears to be your 3rd unanswered post to r-help in March, all 3 have been about the Zelig package. Please read the posting guide and find out the correct place to send questions about packages. Then you might get an answer. HTH Matthew "Mathew, Abraham T" wrote in message news:281f7a5fdfef844696011cb21185f8ac0be...@mailbox-11.home.ku.edu... I'm running a multinomial logit in R using the Zelig packages. According to str(trade962a), my dependent variable is a factor with three levels. When I run the multinomial logit I get an error message. However, when I run 'model=logit' it works fine. any ideas on whats wrong? ## MULTINOMIAL LOGIT anes96two <- zelig(trade962a ~ age962 + education962 + personal962 + economy962 + partisan962 + employment962 + union962 + home962 + market962 + race962 + income962, model="logit", data=data96) summary(anes96two) #Error in attr(tt, "depFactors")$depFactorVar : # $ operator is invalid for atomic vectors ## LOGIT Call: zelig(formula = trade962a ~ age962 + education962 + personal962 + economy962 + partisan962 + employment962 + union962 + home962 + market962 + race962 + income962, model = "logit", data = data96) Deviance Residuals: Min 1Q Median 3Q Max -2.021 -1.179 0.764 1.032 1.648 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.697675 0.600991 -1.161 0.2457 age962 0.003235 0.004126 0.784 0.4330 education962 -0.065198 0.038002 -1.716 0.0862 . personal9620.006827 0.072421 0.094 0.9249 economy962-0.200535 0.084554 -2.372 0.0177 * partisan9620.092361 0.079005 1.169 0.2424 employment962 -0.009346 0.044106 -0.212 0.8322 union962 -0.016293 0.149887 -0.109 0.9134 home962 -0.150221 0.133685 -1.124 0.2611 market962 0.292320 0.128636 2.272 0.0231 * race9620.205828 0.094890 2.169 0.0301 * income962 0.263363 0.048275 5.455 4.89e-08 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1841.2 on 1348 degrees of freedom Residual deviance: 1746.3 on 1337 degrees of freedom (365 observations deleted due to missingness) AIC: 1770.3 Number of Fisher Scoring iterations: 4 Thanks Abraham __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] zero standard errors with geeglm in geepack
You may not have got an answer because you posted to the wrong place. Its a question about a package. Please read the posting guide. "miriza" wrote in message news:1269886286228-1695430.p...@n4.nabble.com... > > Hi! > > I am using geeglm to fit a Poisson model to a timeseries of count data as > follows. Since there are no clusters I use 73 values of 1 for the ids. > The > problem I have is that I am getting standard errors of zero for the > parameters. What am I doing wrong? > Thanks, Michelle >> N_Base > [1] 95 85 104 88 102 104 91 88 85 115 96 83 91 107 96 116 118 > 103 > 89 88 101 117 82 80 83 103 115 119 95 90 82 91 108 115 93 96 72 > [38] 98 95 98 97 104 86 107 92 94 95 100 107 76 104 101 80 102 > 100 > 91 96 89 71 109 97 113 99 127 115 91 81 73 69 92 90 78 57 >> Year > [1] 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 > 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 > 1961 > [31] 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 > 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 > 1991 > [61] 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2006 > > tes=geese(formula = N_Base ~ Year, id = rep(1, 73), family = "poisson", > corstr = "ar1") >> summary(tes) > > Call: > geese(formula = N_Base ~ Year, id = rep(1, 73), family = "poisson", >corstr = "ar1") > > Mean Model: > Mean Link: log > Variance to Mean Relation: poisson > > Coefficients: >estimate san.se wald p > (Intercept) 7.1131 0 Inf 0 > Year -0.0013 0 Inf 0 > > Scale Model: > Scale Link:identity > > Estimated Scale Parameters: >estimate san.se wald p > (Intercept) 1.79 0 Inf 0 > > Correlation Model: > Correlation Structure: ar1 > Correlation Link: identity > > Estimated Correlation Parameters: > estimate san.se wald p > alpha0.187 0 Inf 0 > > Returned Error Value:0 > Number of clusters: 1 Maximum cluster size: 73 > > -- > View this message in context: > 
http://n4.nabble.com/zero-standard-errors-with-geeglm-in-geepack-tp1695430p1695430.html > Sent from the R help mailing list archive at Nabble.com. > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] GEE for a timeseries of count (one cluster)
Contact the authors of those packages ? "miriza" wrote in message news:1269981675252-1745896.p...@n4.nabble.com... > > Hi! > > I was wondering if there were any packages that would allow me to fit a > GEE > to a single timeseries of counts so that I could account for > autocorrelation > in the data. I tried gee, geepack and yags packages, but I do not get > standard errors for the parameters when using a single cluster. Any tips? > > Thanks, Michelle > -- > View this message in context: > http://n4.nabble.com/GEE-for-a-timeseries-of-count-one-cluster-tp1745896p1745896.html > Sent from the R help mailing list archive at Nabble.com. > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] mcmcglmm starting value example
Apparently not, since this your 3rd unanswered thread to r-help this month about this package. Please read the posting guide and find out where you should send questions about packages. Then you might get an answer. "ping chen" wrote in message news:975148.47160...@web15304.mail.cnb.yahoo.com... > Hi R-users: > > Can anyone give an example of giving starting values for MCMCglmm? > I can't find any anywhere. > I have 1 random effect (physicians, and there are 50 of them) > and family="ordinal". > > How can I specify starting values for my fixed effects? It doesn't seem to > have the option to do so. > > Thanks, Ping > > > > > __ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] GLM / large dataset question
Geelman, This appears to be your first post to this list. Welcome to R. Nearly 2 days is quite a long time to wait though, so you are unlikely to get a reply now. Feedback : the question seems quite vague and imprecise. It depends on which R you mean (32bit/64bit) and how much ram you have. It also depends on your data and what you want to do with it. Did you mean 100.000 (i.e. one hundred) or 100,000. Also, '8000 explanatory variables' seems a lot, especially to be stored in 'a factor'. There is no R code in your post so we can't tell if you're using glm correctly or not. You could provide the result of object.size(), and dim() on your data rather than explaining it in words. No reply often, but not always, means you haven't followed some detail of the posting guide or haven't followed this : http://www.catb.org/~esr/faqs/smart-questions.html. HTH Matthew "geelman" wrote in message news:mkedkcmimcmgohidffmbieklcaaa.geel...@zonnet.nl... > LS, > > How large a dataset can glm fit with a binomial link function? I have a > set > of about 100.000 observations and about 8000 explanatory variables (a > factor > with 8000 levels). > > Is there a way to find out how large datasets R can handle in general? > > > > Thanks in advance, > > > geelman > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
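To make the feedback concrete: a factor with 8000 levels expands to roughly 8000 dummy columns in glm's dense model matrix, so the size question answers itself with a little arithmetic (a rough estimate of mine, ignoring the extra copies glm makes along the way):

```r
n <- 100000   # observations (assuming 100,000 was meant, not 100)
p <- 8000     # dummy columns from the 8000-level factor
n * p * 8 / 1024^3   # ~5.96 GiB of doubles for the model matrix alone
```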
Re: [R] Adding RcppFrame to RcppResultSet causes segmentation fault
Rob, Please look again at Romain's reply to you on 19th March. He informed you then that Rcpp has its own dedicated mailing list and he gave you the link. Matthew "R_help Help" wrote in message news:ad1ead5f1003291753p68d6ed52q572940f13e1c0...@mail.gmail.com... > Hi, > > I'm a bit puzzled. I uses exactly the same code in RcppExamples > package to try adding RcppFrame object to RcppResultSet. When running > it gives me segmentation fault problem. I'm using gcc 4.1.2 on redhat > 64bit. I'm not sure if this is the cause of the problem. Any advice > would be greatly appreciated. Thank you. > > Rob. > > > int numCol=4; > std::vector colNames(numCol); > colNames[0] = "alpha"; // column of strings > colNames[1] = "beta"; // column of reals > colNames[2] = "gamma"; // factor column > colNames[3] = "delta"; // column of Dates > RcppFrame frame(colNames); > > // Third column will be a factor. In the current implementation the > // level names are copied to every factor value (and factors > // in the same column must have the same level names). The level names > // for a particular column will be factored out (pardon the pun) in > // a future release. > int numLevels = 2; > std::string *levelNames = new std::string[2]; > levelNames[0] = std::string("pass"); // level 1 > levelNames[1] = std::string("fail"); // level 2 > > // First row (this one determines column types). > std::vector row1(numCol); > row1[0].setStringValue("a"); > row1[1].setDoubleValue(3.14); > row1[2].setFactorValue(levelNames, numLevels, 1); > row1[3].setDateValue(RcppDate(7,4,2006)); > frame.addRow(row1); > > // Second row. 
> std::vector row2(numCol); > row2[0].setStringValue("b"); > row2[1].setDoubleValue(6.28); > row2[2].setFactorValue(levelNames, numLevels, 1); > row2[3].setDateValue(RcppDate(12,25,2006)); > frame.addRow(row2); > > RcppResultSet rs; > rs.add("PreDF", frame); > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Adding RcppFrame to RcppResultSet causes segmentation fault
He could have posted into this thread then at the time to say that. Otherwise it appears like its open. "Romain Francois" wrote in message news:4bb4c4b8.2030...@dbmail.com... The thread has been handled in Rcpp-devel. Rob posted there 7 minutes after posting on r-help. FWIW, I think the problem is fixed on the Rcpp 0.7.11 version (on cran incoming) Romain Le 01/04/10 17:47, Matthew Dowle a écrit : > > Rob, > Please look again at Romain's reply to you on 19th March. He informed you > then that Rcpp has its own dedicated mailing list and he gave you the > link. > Matthew > > "R_help Help" wrote in message > news:ad1ead5f1003291753p68d6ed52q572940f13e1c0...@mail.gmail.com... >> Hi, >> >> I'm a bit puzzled. I uses exactly the same code in RcppExamples >> package to try adding RcppFrame object to RcppResultSet. When running >> it gives me segmentation fault problem. I'm using gcc 4.1.2 on redhat >> 64bit. I'm not sure if this is the cause of the problem. Any advice >> would be greatly appreciated. Thank you. >> >> Rob. >> >> >> int numCol=4; >> std::vector colNames(numCol); >> colNames[0] = "alpha"; // column of strings >> colNames[1] = "beta"; // column of reals >> colNames[2] = "gamma"; // factor column >> colNames[3] = "delta"; // column of Dates >> RcppFrame frame(colNames); >> >> // Third column will be a factor. In the current implementation the >> // level names are copied to every factor value (and factors >> // in the same column must have the same level names). The level names >> // for a particular column will be factored out (pardon the pun) in >> // a future release. >> int numLevels = 2; >> std::string *levelNames = new std::string[2]; >> levelNames[0] = std::string("pass"); // level 1 >> levelNames[1] = std::string("fail"); // level 2 >> >> // First row (this one determines column types). 
>> std::vector row1(numCol); >> row1[0].setStringValue("a"); >> row1[1].setDoubleValue(3.14); >> row1[2].setFactorValue(levelNames, numLevels, 1); >> row1[3].setDateValue(RcppDate(7,4,2006)); >> frame.addRow(row1); >> >> // Second row. >> std::vector row2(numCol); >> row2[0].setStringValue("b"); >> row2[1].setDoubleValue(6.28); >> row2[2].setFactorValue(levelNames, numLevels, 1); >> row2[3].setDateValue(RcppDate(12,25,2006)); >> frame.addRow(row2); >> >> RcppResultSet rs; >> rs.add("PreDF", frame); -- Romain Francois Professional R Enthusiast +33(0) 6 28 91 30 30 http://romainfrancois.blog.free.fr |- http://tr.im/OIXN : raster images and RImageJ |- http://tr.im/OcQe : Rcpp 0.7.7 `- http://tr.im/O1wO : highlight 0.1-5 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] nlrq parameter bounds
Ashley, This appears to be your first post to this list. Welcome to R. Over 2 days is quite a long time to wait though, so you are unlikely to get a reply now. Feedback: since nlrq is in package quantreg, its a question about a package and should be sent to the package maintainer. Some packages though, over 40 of the 664 on r-forge, have dedicated help/devel/forum lists hosted on r-forge. No reply from r-help often, but not always, means you haven't followed some detail of the posting guide or haven't followed this : http://www.catb.org/~esr/faqs/smart-questions.html. HTH Matthew "Ashley Greenwood" wrote in message news:45708.131.217.6.9.1269916052.squir...@webmail.student.unimelb.edu.au... > Hi there, > Can anyone please tell me if it is possible to limit parameters in nlrq() > to 'upper' and 'lower' bounds as per nls()? If so how?? > > Many thanks in advance > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] memory error
> someone else on this list may be able to give you a ballpark estimate > of how much RAM this merge would require. I don't have an absolute estimate, but try data.table::merge, as it needs less working memory than base::merge. 20 million rows of 5 columns isn't beyond 32bit : (1*4 + 4*8)*19758564/1024^3 = 0.662GB Also try sqldf to do the join. Matthew "Sharpie" wrote in message news:1270102758449-1747733.p...@n4.nabble.com... > > > Janet Choate-2 wrote: >> >> Thanx for clarification on stating my problem, Charlie. >> >> I am attempting to merge to files, i.e.: >> hi39 = merge(comb[,c("hillID","geo")], hi.h39, by=c("hillID")) >> >> if this is relevant or helps to explain: >> the file 'comb' is 3 columns and 1127 rows >> the file 'hi.h39' is 5 columns and 19758564 rows >> >> i started a new clean R session in which i was able to read those 2 files >> in, but get the following error when i try to merge them: >> >> R(2175) malloc: *** mmap(size=79036416) failed (error code=12) >> *** error: can't allocate region >> *** set a breakpoint in malloc_error_break to debug >> R(2175) malloc: *** mmap(size=79036416) failed (error code=12) >> *** error: can't allocate region >> *** set a breakpoint in malloc_error_break to debug >> R(2175) malloc: *** mmap(size=158068736) failed (error code=12) >> *** error: can't allocate region >> *** set a breakpoint in malloc_error_break to debug >> R(2175) malloc: *** mmap(size=158068736) failed (error code=12) >> *** error: can't allocate region >> *** set a breakpoint in malloc_error_break to debug >> R(2175) malloc: *** mmap(size=158068736) failed (error code=12) >> *** error: can't allocate region >> *** set a breakpoint in malloc_error_break to debug >> Error: cannot allocate vector of size 150.7 Mb >> >> so the final error is "Cannot allocate vector of size 150.7 Mb", as >> suggested when R runs out of memory. >> >> i am running R version 2.9.2, on mac os X 10.5 - leopard. >> >> any suggestion on how to increase R's memory on a mac? 
>> thanx for any much needed help! >> Janet >> > > Ah, so it is indeed a shortage of memory problem. With R 2.9.2, you are > likely running a 32 bit version of R which will be limited to accessing at > most 4 GB of RAM. You may want to try the newest version of R, 2.10.1, as > it includes a 64 bit version that will allow you to access significantly > more memory- provided you have the RAM installed on your system. > > I'm not too hot on memory usage calculation, but someone else on this list > may be able to give you a ballpark estimate of how much RAM this merge > would > require. If it turns out to be a ridiculous amount, you will need to > consider breaking the merge up into chunks or finding an out-of-core (i.e. > not dependent on RAM for storage) merge tool. > > Hope this helps! > > -Charlie > > - > Charlie Sharpsteen > Undergraduate-- Environmental Resources Engineering > Humboldt State University > -- > View this message in context: > http://n4.nabble.com/memory-error-tp1747357p1747733.html > Sent from the R help mailing list archive at Nabble.com. > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] match function or "=="
Please install v1.3 from R-forge : install.packages("data.table",repos="http://R-Forge.R-project.org";) It will be ready for CRAN soon. Please follow up on datatable-h...@lists.r-forge.r-project.org Matthew "bo" wrote in message news:1270689586866-1755876.p...@n4.nabble.com... > > Thank you very much for the help. > > I installed data.table package, but I keep getting the following warnings: > >> setkey(DT,id,date) > Warning messages: > 1: In `[.data.table`(deref(x), o) : > This R session is < 2.4.0. Please upgrade to 2.4.0+. > > I'm using R 2.10, but why I keep getting warnings on upgrades. Thanks > again. > > > -- > View this message in context: > http://n4.nabble.com/match-function-or-tp1754505p1755876.html > Sent from the R help mailing list archive at Nabble.com. > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Code is too slow: mean-centering variables in a data frame by subgroup
Hi Dimitri, A start has been made at explaining .SD in FAQ 2.1. This was previously on a webpage, but its just been moved to a vignette : https://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/*checkout*/branch2/inst/doc/faq.pdf?rev=68&root=datatable Please note: that vignette is part of a development branch on r-forge, and as such isn't even released to the r-forge repository yet. Please also see FAQ 4.5 in that vignette and follow up on datatable-h...@lists.r-forge.r-project.org An introduction vignette is taking shape too (again, in the development branch i.e. bleeding edge) : https://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/*checkout*/branch2/inst/doc/intro.pdf?rev=68&root=datatable HTH Matthew "Dimitri Liakhovitski" wrote in message news:r2rdae9a2a61004071314xc03ae851n4c9027b28df5a...@mail.gmail.com... Yes, Tom's solution is indeed the fastest! On my PC it took .17-.22 seconds while using ave() took .23-.27 seconds. And of course - the last two methods I mentioned took 1.3 SECONDS, not MINUTES (it was a typo). All that is left to me is to understand what .SD stands for. :-) Dimitri On Wed, Apr 7, 2010 at 4:04 PM, Rob Forler wrote: > Leave it up to Tom to solve things wickedly fast :) > > Just as an fyi Dimitri, Tom is one of the developers of data.table. > > -Rob > > On Wed, Apr 7, 2010 at 2:51 PM, Dimitri Liakhovitski > wrote: >> >> Wow, thank you, Tom! >> >> On Wed, Apr 7, 2010 at 3:46 PM, Tom Short >> wrote: >> > Here's how I would have done the data.table method. 
It's a bit faster than the ave approach on my machine:

>>> # install.packages("data.table", repos="http://R-Forge.R-project.org")
>>> library(data.table)
>>>
>>> f3 <- function(frame) {
>>> +   frame <- as.data.table(frame)
>>> +   frame[, lapply(.SD[, 2:ncol(.SD), with = FALSE],
>>> +                  function(x) x / mean(x, na.rm = TRUE)),
>>> +         by = "group"]
>>> + }
>>>
>>> system.time(new.frame2 <- f2(frame))  # ave
>>>    user  system elapsed
>>>    0.50    0.08    1.24
>>> system.time(new.frame3 <- f3(frame))  # data.table
>>>    user  system elapsed
>>>    0.25    0.01    0.30
>>>
>>> - Tom
>>>
>>> Tom Short
>>>
>>> On Wed, Apr 7, 2010 at 12:46 PM, Dimitri Liakhovitski wrote:
>>>> I would like to thank once more everyone who helped me with this question.
>>>> I compared the speed for different approaches. Below are the results
>>>> of my comparisons - in case anyone is interested:
>>>>
>>>> ### Building an EXAMPLE FRAME with N rows - with groups and a lot of NAs:
>>>> N <- 10
>>>> set.seed(1234)
>>>> frame <- data.frame(group=rep(paste("group", 1:10), N/10),
>>>>                     a=rnorm(1:N), b=rnorm(1:N), c=rnorm(1:N), d=rnorm(1:N),
>>>>                     e=rnorm(1:N), f=rnorm(1:N), g=rnorm(1:N))
>>>> frame <- frame[order(frame$group), ]
>>>>
>>>> ## Introducing 60% NAs:
>>>> names.used <- names(frame)[2:length(frame)]
>>>> set.seed(1234)
>>>> for(i in names.used){
>>>>   i.for.NA <- sample(1:N, round((N*.6), 0))
>>>>   frame[[i]][i.for.NA] <- NA
>>>> }
>>>> lapply(frame[2:8], function(x) length(x[is.na(x)]))  # Checking that it worked
>>>> ORIGframe <- frame  ## placeholder for the unchanged original frame
>>>>
>>>> ### Objective of the code - divide each value by its group mean
>>>>
>>>> ### METHOD 1 - the FASTEST - using ave(): ##
>>>> frame <- ORIGframe
>>>> f2 <- function(frame) {
>>>>   for(i in 2:ncol(frame)) {
>>>>     frame[,i] <- ave(frame[,i], frame[,1], FUN=function(x) x/mean(x, na.rm=TRUE))
>>>>   }
>>>>   frame
>>>> }
>>>> system.time({new.frame <- f2(frame)})  # Took me 0.23-0.27 sec
>>>> ###
>>>>
>>>> ### METHOD 2 - fast, just a bit slower - using data.table: ##
>>>> # If you don't have it - install the package - NOT from CRAN:
>>>> install.packages("data.table", repos="http://R-Forge.R-project.org")
>>>> library(data.table)
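The `.SD` idiom used in `f3` above generalizes to any per-group computation. A minimal sketch (the toy data and column names here are invented for illustration; this assumes a current data.table release, where `.SD` is simply the subset of non-grouping columns for each group):

```r
library(data.table)

# Toy data: three groups, two numeric columns (invented for illustration)
DT <- data.table(group = rep(c("a", "b", "c"), each = 4),
                 x = 1:12, y = 12:1)

# Apply a function to every non-grouping column at once via .SD
res <- DT[, lapply(.SD, mean), by = group]
print(res)
```

In modern data.table the `.SDcols` argument restricts which columns enter `.SD`, replacing the `.SD[, 2:ncol(.SD), with = FALSE]` workaround shown in the thread.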
[R] Any chance R will ever get beyond the 2^31-1 vector size limit?
Hi all,

My institute will hopefully be working on cutting-edge genetic sequencing data by the Fall of 2010. The datasets will be 10's of GB large and growing. I'd like to use R to do primary analyses. This is OK, because we can just throw $ at the problem and get lots of RAM running on 64 bit R. However, we are still running up against the fact that vectors in R cannot contain more than 2^31-1 elements. I know there are "ways around" this issue, and trust me, I think I've tried them all (e.g., bringing in portions of the data at a time; using large-dataset packages in R; using SQL databases, etc). But all these 'solutions' are, at the end of the day, much much more cumbersome, programming-wise, than just doing things in native R. Maybe that's just the cost of doing what I'm doing. But my questions, which may well be naive (I'm not a computer programmer), are:

1) Is there an *inherent* limit to vectors being < 2^31-1 long? I.e., in an alternative history of R's development, would it have been feasible for R to not have had this limitation?

2) Is there any possibility that this limit will be overcome in future revisions of R?

I'm very very grateful to the people who have spent important parts of their professional lives developing R. I don't think anyone back in, say, 1995, could have foreseen that datasets would be >> 2^31-1 in size. For better or worse, however, in many fields of science, that is routinely the case today. *If* it's possible to get around this limit, then I'd like to know whether the R Development Team takes seriously the needs of large data users, or if they feel that (perhaps not mutually exclusively) developing such capacity is best left up to ad hoc R packages and alternative analysis programs.

Best,

Matt

-- Matthew C Keller Asst.
Professor of Psychology University of Colorado at Boulder www.matthewckeller.com __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
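The 2^31-1 figure in the thread above comes from R's historical use of a signed 32-bit integer as the vector index, and it can be checked directly. A small sketch (note: since this 2010 thread, R 3.0.0 added "long vectors", so on a 64-bit build atomic vectors can now exceed this limit):

```r
# The pre-R-3.0.0 vector-length ceiling is the largest signed 32-bit integer
.Machine$integer.max == 2^31 - 1   # TRUE

# Integer arithmetic that would pass that ceiling overflows to NA (with a warning)
is.na(.Machine$integer.max + 1L)   # TRUE
```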
[R] Interpreting factor*numeric interaction coefficients
Dear all,

I am a relative novice with R, so please forgive any terrible errors...

I am working with a GLM that describes a response variable as a function of a categorical variable with three levels and a continuous variable. These two predictor variables are believed to interact. An example of such a model follows at the bottom of this message, but here is a section of its summary table:

            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.220186   0.539475   2.262   0.0237 *
var1         0.028182   0.050850   0.554   0.5794
cat2        -0.112454   0.781137  -0.144   0.8855
cat3         0.339589   0.672828   0.505   0.6138
var1:cat2    0.007091   0.068072   0.104   0.9170
var1:cat3   -0.027248   0.064468  -0.423   0.6725

I am having trouble interpreting this output. I think I understand that:
# the 'var1' value refers to the slope of the relationship within the first factor level
# the 'cat2' and 'cat3' values refer to the difference in intercept from 'cat1'
# the interaction terms describe the difference in slope between the relationship in 'cat1' and that in 'cat2' and 'cat3' respectively

Therefore, if I wanted a single value to describe the slope in either cat2 or cat3, I would sum the interaction value with that of var1. However, if I wanted to report a standard error for the slope in 'cat2', how would I go about doing this? Is the reported standard error that for the overall slope for that factor level, or is the actual standard error a function of the standard error of var1 and that of the interaction?

Any help with this would be much appreciated,
Matthew Carroll

### example code
resp <- rpois(30, 5)
cat <- factor(rep(c(1:3), 10))
var1 <- rnorm(30, 10, 3)
mod <- glm(resp ~ var1 * cat, family="poisson")
summary(mod)

Call:
glm(formula = resp ~ var1 * cat, family = "poisson")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.80269  -0.54107  -0.06169   0.51819   1.58169

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.220186   0.539475   2.262   0.0237 *
var1         0.028182   0.050850   0.554   0.5794
cat2        -0.112454   0.781137  -0.144   0.8855
cat3         0.339589   0.672828   0.505   0.6138
var1:cat2    0.007091   0.068072   0.104   0.9170
var1:cat3   -0.027248   0.064468  -0.423   0.6725
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 23.222 on 29 degrees of freedom
Residual deviance: 22.192 on 24 degrees of freedom
AIC: 133.75

Number of Fisher Scoring iterations: 5

--
Matthew Carroll
E-mail: mjc...@york.ac.uk

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
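To answer the question in the post above: the reported interaction standard error is for the *difference* in slopes, not for the cat2 slope itself. The SE of the cat2 slope (var1 + var1:cat2) must be built from the coefficient covariance matrix, using Var(b1 + b2) = Var(b1) + Var(b2) + 2*Cov(b1, b2). A sketch using the example model from the post (the `set.seed` call is added so the simulation is reproducible):

```r
set.seed(1)
resp <- rpois(30, 5)
cat  <- factor(rep(c(1:3), 10))
var1 <- rnorm(30, 10, 3)
mod  <- glm(resp ~ var1 * cat, family = "poisson")

# Slope within cat2 is the sum of the two coefficients
b <- coef(mod)
slope.cat2 <- b["var1"] + b["var1:cat2"]

# Its standard error combines both variances and their covariance
V <- vcov(mod)
se.cat2 <- sqrt(V["var1", "var1"] + V["var1:cat2", "var1:cat2"] +
                2 * V["var1", "var1:cat2"])
```

Equivalently, `se.cat2` is `sqrt(t(c) %*% V %*% c)` for the contrast vector c = (0, 1, 0, 0, 1, 0); packages such as car (`deltaMethod`) wrap this computation.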
Re: [R] Any chance R will ever get beyond the 2^31-1 vector size limit?
HI Duncan and R users, Duncan, thank you for taking the time to respond. I've had several other comments off the list, and I'd like to summarize what these have to say, although I won't give sources since I assume there was a reason why people chose not to respond to the whole list. The long and short of it is that there is hope for people who want R to get beyond the 2^31-1 vector size limit. First off, I received a couple of responses from people who wanted to commiserate and me to summarize what I learned. Here you go. Second, the package bigmemory and ff can both help with memory issues. I've had success using bigmemory before, and found it to be quite intuitive. Third, one knowledgeable responder doubted that changing the 2^31-1 limit would 'break' old datasets. He says, "This might be true for isolated cases of objects stored in binary formats or in workspaces, but I don't see that as anywhere near as important as the change you (and we) would like to see." Fourth, another knowledgeable responder felt it was likely that, given the demand driven by the huge increases in dataset sizes, this limitation would likely be overcome within the next few years. Best, Matt On Fri, Apr 9, 2010 at 6:36 PM, Duncan Murdoch wrote: > On 09/04/2010 7:38 PM, Matthew Keller wrote: >> >> Hi all, >> >> My institute will hopefully be working on cutting-edge genetic >> sequencing data by the Fall of 2010. The datasets will be 10's of GB >> large and growing. I'd like to use R to do primary analyses. This is >> OK, because we can just throw $ at the problem and get lots of RAM >> running on 64 bit R. However, we are still running up against the fact >> that vectors in R cannot contain more than 2^31-1. I know there are >> "ways around" this issue, and trust me, I think I've tried them all >> (e.g., bringing in portions of the data at a time; using large-dataset >> packages in R; using SQL databases, etc). 
But all these 'solutions' >> are, at the end of the day, much much more cumbersome, >> programming-wise, than just doing things in native R. Maybe that's >> just the cost of doing what I'm doing. But my questions, which may >> well be naive (I'm not a computer programmer), are:
>>
>> 1) Is there an *inherent* limit to vectors being < 2^31-1 long? I.e., >> in an alternative history of R's development, would it have been >> feasible for R to not have had this limitation?
>
> The problem is that we use "int" as a vector index. On most platforms, > that's a signed 32 bit integer, with max value 2^31-1.
>
>> 2) Is there any possibility that this limit will be overcome in future >> revisions of R?
>
> Of course, R is open source. You could rewrite all of the internal code > tomorrow to use 64 bit indexing.
>
> Will someone else do it for you? Even that is possible. One problem is > that this will make all of your data incompatible with older versions of R. > And back to the original question: are you willing to pay for the > development? Then go ahead, you can have it tomorrow (or later, if your > budget is limited). Are you waiting for someone else to do it for free? > Then you need to wait for someone who knows how to do it to want to do it.
>
>> I'm very very grateful to the people who have spent important parts of >> their professional lives developing R. I don't think anyone back in, >> say, 1995, could have foreseen that datasets would be >>2^32-1 in >> size. For better or worse, however, in many fields of science, that is >> routinely the case today. *If* it's possible to get around this limit, >> then I'd like to know whether the R Development Team takes seriously >> the needs of large data users, or if they feel that (perhaps not >> mutually exclusively) developing such capacity is best left up to ad >> hoc R packages and alternative analysis programs.
>
> There are many ways around the limit today.
Put your data in a dataframe > with many columns each of length 2^31-1 or less. Put your data in a > database, and process it a block at a time. Etc. > > Duncan Murdoch > >> >> Best, >> >> Matt >> >> >> > > -- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] sum specific rows in a data frame
Or try data.table 1.4 on r-forge, its grouping is faster than aggregate:

         agg datatable
X10    0.012     0.008
X100   0.020     0.008
X1000  0.172     0.020
X1     1.164     0.144
X1e.05 9.397     1.180

install.packages("data.table", repos="http://R-Forge.R-project.org")
require(data.table)
dt = as.data.table(df)
t3 <- system.time(zz3 <- dt[, list(sumflt=sum(fltval), sumint=sum(intval)), by=id])

Matthew

On Thu, 15 Apr 2010 13:09:17 +, hadley wickham wrote:
> On Thu, Apr 15, 2010 at 1:16 AM, Chuck wrote:
>> Depending on the size of the dataframe and the operations you are
>> trying to perform, aggregate or ddply may be better. In the function
>> below, df has the same structure as your dataframe.
>
> Current version of plyr:
>
>          agg  ddply
> X10    0.005  0.007
> X100   0.007  0.026
> X1000  0.086  0.248
> X1     0.577  3.136
> X1e.05 4.493 44.147
>
> Development version of plyr:
>
>          agg  ddply
> X10    0.003  0.005
> X100   0.007  0.007
> X1000  0.042  0.044
> X1     0.410  0.443
> X1e.05 4.479  4.237
>
> So there are some big speed improvements in the works.
>
> Hadley

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
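The two approaches being timed in the message above can be reproduced on a toy frame (the size and column values here are invented; current CRAN data.table no longer needs the R-Forge repository):

```r
library(data.table)

# Toy stand-in for df: grouped numeric columns (invented for illustration)
set.seed(42)
df <- data.frame(id     = sample(letters, 1e4, replace = TRUE),
                 fltval = runif(1e4),
                 intval = sample(1:100, 1e4, replace = TRUE))

# Base R: formula interface to aggregate
agg <- aggregate(cbind(fltval, intval) ~ id, data = df, FUN = sum)

# data.table: grouped sums in one pass
dt  <- as.data.table(df)
res <- dt[, list(sumflt = sum(fltval), sumint = sum(intval)), by = id]
```

Both return one row per `id`; the data.table form is the one that scales to the row counts benchmarked above.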
[R] multiple paired t-tests without loops
I am new to R and I suspect my problem is easily solved, but I haven't been able to figure it out without using loops. I am trying to implement Blair & Karniski's (1993) permutation test. I've included a sample data frame below. This data frame represents the conditional means (C1, C2) for 3 subjects in 2 consecutive samples of a continuous data set (e.g. ERP waveform). Each sample includes all possible permutations of the subject means (2^N), which is 8 in this case.

The problem: I need to run a paired t-test on each Sample X Permutation set and save the maximum t-value obtained for each sample. The real data set has 16 subjects (2^16 permutations) and 500 samples, which leads to more than 32 million t-tests. I have a loop version of the program working, but it would take a few weeks to complete the job and I was hoping that someone could tell me how to do it faster?

thank you kindly,
Matthew Finkbeiner

"Sample" "C1" "C2" "PermN"
1 5 8 perm1
1 4 3 perm1
1 6 4 perm1
2 2 6 perm1
2 3 1 perm1
2 7 4 perm1
1 8 5 perm2
1 3 4 perm2
1 6 4 perm2
2 6 2 perm2
2 1 3 perm2
2 7 4 perm2
1 5 8 perm3
1 3 4 perm3
1 6 4 perm3
2 2 6 perm3
2 1 3 perm3
2 7 4 perm3
1 8 5 perm4
1 4 3 perm4
1 4 6 perm4
2 6 2 perm4
2 3 1 perm4
2 4 7 perm4
1 5 8 perm5
1 4 3 perm5
1 4 6 perm5
2 2 6 perm5
2 3 1 perm5
2 4 7 perm5
1 8 5 perm6
1 3 4 perm6
1 4 6 perm6
2 6 2 perm6
2 1 3 perm6
2 4 7 perm6
1 5 8 perm7
1 3 4 perm7
1 4 6 perm7
2 2 6 perm7
2 1 3 perm7
2 4 7 perm7
1 8 5 perm8
1 4 3 perm8
1 6 4 perm8
2 6 2 perm8
2 3 1 perm8
2 7 4 perm8

--
Dr. Matthew Finkbeiner
Senior Lecturer & ARC Australian Research Fellow
Macquarie Centre for Cognitive Science (MACCS)
Macquarie University, Sydney, NSW 2109
Phone: +61 2 9850-6718 Fax: +61 2 9850-6059
Homepage: http://www.maccs.mq.edu.au/~mfinkbei
Lab Homepage: http://www.maccs.mq.edu.au/laboratories/action/

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] how to make read in a vector of 0s and 1s with no space between them
Hi all,

Probably a rudimentary question. I have a flat file that looks like this (the real one has ~10e6 elements):

10110100101001011101011

and I want to pull that into R as a vector, but with each digit being its own element. There are no separators between the digits. How can I accomplish this? Thanks in advance!

Matt

-- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to make read in a vector of 0s and 1s with no space between them
Hi all,

Quickly received an answer off the list. To do this is easy. Pull it in using e.g., scan(). Then use strsplit:

z <- '10001011010010'
strsplit(z,'')

On Sun, Apr 25, 2010 at 10:52 AM, Matthew Keller wrote:
> Hi all,
>
> Probably a rudimentary question. I have a flat file that looks like
> this (the real one has ~10e6 elements):
>
> 10110100101001011101011
>
> and I want to pull that into R as a vector, but with each digit being
> its own element. There are no separators between the digits. How can
> I accomplish this? Thanks in advance!
>
> Matt
>
> --
> Matthew C Keller
> Asst. Professor of Psychology
> University of Colorado at Boulder
> www.matthewckeller.com

-- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
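Completing the strsplit() answer above: the result is a one-element list of single-character strings, so it still needs unwrapping and conversion to get a numeric vector. A sketch (the literal string stands in for the line you would read with `scan(file, what = "character")`):

```r
z <- "10110100101001011101011"   # stand-in for the scanned line

# strsplit on "" splits into single characters; [[1]] unwraps the list
v <- as.integer(strsplit(z, "")[[1]])

head(v)     # 1 0 1 1 0 1
length(v)   # 23
```

For a ~10e6-digit file this remains fast, since the split and coercion are both single vectorized calls.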
Re: [R] multiple paired t-tests without loops
Yes, I suspect that I will end up using a sampling approach, but I'd like to use an exact test if it's at all feasible. Here are two samples of data from 5 subjects:

Sample Subj     C1     C2
    44    1 0.0093 0.0077
    44    2 0.0089 0.0069
    44    3 0.0510 0.0432
    44    4 0.0140 0.0147
    44    5 0.0161 0.0117
    45    1 0.0103 0.0086
    45    2 0.0099 0.0078
    45    3 0.0542 0.0458
    45    4 0.0154 0.0163
    45    5 0.0175 0.0129

and then here is the script I've pieced together from things I've found on the web (sorry for not citing the snippets!). any pointers on how to speed it up would be greatly appreciated.

#--
# Utility function that returns the binary representation of 1:(2^n) X SubjN
binary.v <- function(n) {
  x <- 1:(2^n)
  mx <- max(x)
  digits <- floor(log2(mx))
  ans <- 0:(digits-1)
  lx <- length(x)
  x <- matrix(rep(x, rep(digits, lx)), ncol=lx)
  (x %/% 2^ans) %% 2
}

library(plyr)

# first some global variables
TotalSubjects <- 5
TotalSamples <- 2
StartSample <- 44
EndSample <- ((StartSample + TotalSamples) - 1)
maxTs <- NULL
obsTs <- NULL

# create index array that drives the permutations for all samples
ind <- binary.v(TotalSubjects)
# transpose ind so that the first 2^N items correspond to S1,
# the second 2^N correspond to S2 and so on...
transind <- t(ind)

# get data file that is organized first by sample then by subj
# (e.g. sample1 subject1, sample1 subject2 ... sample1 subjectN)
# sampledatafile <- file.choose()
samples <- read.table(sampledatafile, header=T)

# this is the progress bar
pb <- txtProgressBar(min = StartSample, max = EndSample, style = 3)
setTxtProgressBar(pb, 1)
start.t <- proc.time()

# begin loop that analyzes data sample by sample
for (s in StartSample:EndSample) {
  S <- samples[samples$Sample==s, ]   # pick up data for current sample

  # reproduce data frame rows once for each permutation to be done
  expanddata <- S[rep(1:nrow(S), each = 2^TotalSubjects), ]
  # create new array to hold the flipped (permuted) data
  permdata <- expanddata
  # permute the data
  permdata[transind==1, 3] <- expanddata[transind==1, 4]   # Cnd1 <- Cnd2
  permdata[transind==1, 4] <- expanddata[transind==1, 3]   # Cnd2 <- Cnd1

  # create permutation number as a factor in the data frame
  PermN <- rep(rep(1:2^TotalSubjects, TotalSubjects), 2)
  # create Sample number as a factor
  Sample <- rep(permdata[, 1], 2)   # Sample number is in the 1st column
  # create subject IDs as a factor
  Subj <- rep(permdata[, 2], 2)     # Subject ID is in the 2nd column

  # stack the permuted data
  StackedPermData <- stack(permdata[, 3:4])
  # bind all the factors together
  StackedPermData <- as.data.frame(cbind(Sample, Subj, PermN, StackedPermData))
  # sort by perm
  sortedstack <- as.data.frame(StackedPermData[order(StackedPermData$PermN, StackedPermData$Sample), ])
  # clear up some memory
  rm(expanddata, permdata, StackedPermData)

  # pull out data 1 perm at a time
  res <- ddply(sortedstack, c("Sample", "PermN"), function(.data){
    # type combinations by class
    combs <- t(combn(sort(unique(.data[, 5])), 2))
    # applying the t-test for them
    aaply(combs, 1, function(.r){
      x1 <- .data[.data[, 5]==.r[1], 4]
      x2 <- .data[.data[, 5]==.r[2], 4]
      tvalue <- t.test(x1, x2, paired = T)
      res <- c(tvalue$statistic, tvalue$parameter, tvalue$p.value)
      names(res) <- c('stat', 'df', 'pvalue')
      res
    })
  })

  # update progress bar
  setTxtProgressBar(pb, s)
  # get max T vals
  maxTs <- c(maxTs, tapply(res$stat, list(res$Sample), max))
  # get observed T vals
  obsTs <- c(obsTs, res$stat[length(res$stat)])
  # here we need to save res to a binary file
}

# close out the progress bar
close(pb)
end.t <- proc.time() - start.t
print(end.t)

# get cutoffs: these are the 2-tailed t-vals that maintain
# experimentwise error at the 0.05 level
lowerT <- quantile(maxTs, .025)
upperT <- quantile(maxTs, .975)

On 4/27/2010 6:53 AM, Greg Snow wrote:
The usual way to speed up permutation testing is to sample from the set of possible permutations rather than looking at all possible ones. If you show some code then we may be able to find some inefficiencies for you, but there is no general solution; poorly written uses of apply will be slower than well-written for loops. In some cases rewriting critical pieces in C or Fortran will help quite a bit, but we need to see what you are already doing to know if that will help or not.

__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
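The per-permutation t.test() calls in the script above dominate its run time. A paired t statistic is just mean(d)/(sd(d)/sqrt(n)) of the subject differences d = C1 - C2, and swapping a subject's condition labels only flips the sign of that subject's difference, so all 2^n permutation t values for one sample can be computed in a single matrix operation. A sketch (not the poster's code; the function name is invented, and the `d` below comes from sample 44 of the data shown above):

```r
# All sign-flip paired t statistics for one sample, fully vectorized
perm.tvals <- function(d) {
  n <- length(d)
  # every +1/-1 sign pattern: a 2^n-by-n matrix, one permutation per row
  signs <- as.matrix(expand.grid(rep(list(c(1, -1)), n)))
  D <- t(signs) * d   # n x 2^n: each column is one sign-flipped copy of d
  m <- colMeans(D)
  s <- sqrt((colSums(D^2) - n * m^2) / (n - 1))   # per-column sample sd
  m / (s / sqrt(n))
}

# Subject differences C1 - C2 for sample 44 in the data above
d  <- c(0.0016, 0.0020, 0.0078, -0.0007, 0.0044)
tv <- perm.tvals(d)   # 2^5 = 32 t values; tv[1] is the unpermuted statistic
```

For 16 subjects this is one 16 x 65536 matrix per sample instead of 65536 t.test() calls, which is the main speed-up the thread is asking for.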
Re: [R] Confusing concept of vector and matrix in R
Rolf: "Well then, why don't you go away and design and build your own statistics and data analysis language/package to replace R?" What a nice reply! The fellow is just trying to understand R. That response reminds me of citizens of my own country who cannot abide by any criticism of the USA: "If you don't like it, why don't you leave?" Classy. I have sympathies with the author. When I first began using R (migrating from Matlab), I also found the vector concept strange, especially because I was doing a lot of matrix algebra back then and didn't like the concept of conflating a row vector with a column vector. But I've since gotten used to it and can hardly remember why I struggled with this early on. Perhaps your experience will be similar. Best of luck! Matt On Mon, Apr 26, 2010 at 7:40 PM, Charles C. Berry wrote: > On Mon, 26 Apr 2010, Stu wrote: > >> Hi all, >> >> One subtlety is that the drop argument only works if you specify 2 or >> more indices e.g. [i, j, ..., drop=F]; but not for a single index e.g >> [i, drop=F]. > > Wrong. > >> a <- structure(1:5,dim=5) >> dim(a) > > [1] 5 >> >> dim(a[2:3,drop=F]) # don't drop regardless > > [1] 2 >> >> dim(a[2,drop=F]) # dont' drop regardless > > [1] 1 >> >> dim(a[2:3,drop=T]) # no extent of length 1 > > [1] 2 >> >> dim(a[2,drop=T]) # drop, extent of length 1 > > NULL > > >> >> Why doesn't R complain about the unused "drop=F" argument in the >> single index case? > > In the example you give (one index for a two-dimension array), vector > indexing is assumed. For vector indexing, drop is irrelevant. > > HTH, > > Chuck >> >> Cheers, >> - Stu >> >> a = matrix(1:10, nrow=1) >> b = matrix(10:1, ncol=1) >> >> # a1 is an vector w/o dim attribute (i.e. drop=F is ignored silently) >> (a1 = a[2:5, drop=F]) >> dim(a1) >> >> # a2 is an vector WITH dim attribute: a row matrix (drop=F works) >> (a2 = a[, 2:5, drop=F]) >> dim(a2) >> >> # b1 is an vector w/o dim attribute (i.e. 
drop=F is ignored silently) >> (b1 = b[2:5, drop=F]) >> dim(b1) >> >> # b2 is an vector WITH dim attribute: a column matrix (drop=F works) >> (b2 = b[2:5, , drop=F]) >> dim(b2) >> >> >> On Mar 30, 4:08 pm, lith wrote: >>>> >>>> Reframe the problem. Rethink why you need to keep dimensions. I never >>>> ever had to use drop. >>> >>> The problem is that the type of the return value changes if you happen >>> to forget to use drop = FALSE, which can easily turn into a nightmare: >>> >>> m <-matrix(1:20, ncol=4) >>> for (i in seq(3, 1, -1)) { >>> print(class(m[1:i, ]))} >>> >>> [1] "matrix" >>> [1] "matrix" >>> [1] "integer" >>> >>> __ >>> r-h...@r-project.org mailing >>> listhttps://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting >>> guidehttp://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >> >> __ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > Charles C. Berry (858) 534-2098 > Dept of Family/Preventive > Medicine > E mailto:cbe...@tajo.ucsd.edu UC San Diego > http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901 > > > __ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > -- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
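The drop behavior debated in the thread above can be made concrete in a few lines; this sketch restates Stu's and Chuck's examples rather than adding anything new:

```r
m <- matrix(1:20, ncol = 4)

# Two-index form: drop = FALSE keeps the length-1 dimension
dim(m[1, ])                 # NULL -- result fell back to a plain vector
dim(m[1, , drop = FALSE])   # 1 4 -- still a matrix

# One-index form on a 1-d array, as in Chuck's example: drop only
# matters when a dim attribute is present and an extent has length 1
a <- structure(1:5, dim = 5)
dim(a[2:3, drop = FALSE])   # 2
dim(a[2, drop = TRUE])      # NULL -- the length-1 extent is dropped
```

On a plain dim-less vector, `[i, drop = FALSE]` is silently ignored, which is the behavior Stu found surprising.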
Re: [R] Size limitations for model.matrix?
Hi Gerald, A matrix and an array *are* vectors that can be indexed by 2+ indices. Thus, matrices and arrays are also limited to 2^31-1 elements. You might check out the bigmemory package, which can help with these issues... Matt On Wed, Apr 28, 2010 at 11:01 AM, wrote: > > Hello, > > I am running: > > R version 2.10.0 (2009-10-26) > Copyright (C) 2009 The R Foundation for Statistical Computing > ISBN 3-900051-07-0 > > on a RedHat Linux box with 48Gb of memory. > > I am trying to create a model.matrix for a big model on a moderately large > data set. It seems there is a size limitation to this model.matrix. > >> dim(coll.train) > [1] 677236 128 >> coll.1st.model.mat <- model.matrix(coll.1st.formula, data = coll.train) >> dim(coll.1st.model.mat) > [1] 581618 169 > > One I saw the resulting model.matrix had fewer rows than the original > data.frame I played with the number of input variables in the model: > >> ttt <- model.matrix(~kmpleasure + vehage + age + gender + marital.status > + > + license.category + minor.conviction + driver.training.certificate + > + admhybrid + anpol + anveh + cie + dblct + faq13c + faq20 + faq27 + > faq43 + > + faq5a + fra2 + frb2 + frb3 + kmaff + kmannuel + kmtravai + lima + > maison + > + nacp + nap + nbcond + nbcondpo + nbvt + rabmlt06 + rabmtve + > rabperprg + > + rabretrai + statnuit + tarcl06 + utilusa + sexeocc + ageocc + napocc, > + data = coll.train) > dim(ttt) > [1] 677236 109 > > ## OK so far, but if I had one more variable there will be missing rows. 
> >> ttt <- model.matrix(~kmpleasure + vehage + age + gender + marital.status > + > + license.category + minor.conviction + driver.training.certificate + > + admhybrid + anpol + anveh + cie + dblct + faq13c + faq20 + faq27 + > faq43 + > + faq5a + fra2 + frb2 + frb3 + kmaff + kmannuel + kmtravai + lima + > maison + > + nacp + nap + nbcond + nbcondpo + nbvt + rabmlt06 + rabmtve + > rabperprg + > + rabretrai + statnuit + tarcl06 + utilusa + sexeocc + ageocc + napocc > + > + prof.b2, data = coll.train) > dim(ttt) > [1] 676379 110 > > Is there a limit to the size of a matrix and of a data.frame. I know the > limit for the length of a vector to be 2^31, but we are very far from that > here. Am I missing something? > > Thanks for any support, > > Gérald Jean > Conseiller senior en statistiques, > VP Actuariat et Solutions d'assurances, > Desjardins Groupe d'Assurances Générales > télephone : (418) 835-4900 poste (7639) > télecopieur : (418) 835-6657 > courrier électronique: gerald.j...@dgag.ca > > "We believe in God, others must bring Data." > > W. Edwards Deming > > > Le message ci-dessus, ainsi que les documents l'accompagnant, sont destinés > uniquement aux personnes identifiées et peuvent contenir des informations > privilégiées, confidentielles ou ne pouvant être divulguées. Si vous avez > reçu ce message par erreur, veuillez le détruire. > > This communication ( and/or the attachments ) is intended for named > recipients only and may contain privileged or confidential information which > is > not to be disclosed. If you received this communication by mistake please > destroy all copies. > __ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Matthew C Keller Asst. 
Professor of Psychology University of Colorado at Boulder www.matthewckeller.com __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
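The likely explanation for Gérald's missing rows above is not a size limit at all: when given a formula, model.matrix() first builds a model frame under the default na.action (na.omit), so any row with an NA in any variable of the formula is silently dropped, and each added variable (such as prof.b2) can remove further rows where it is missing. A small sketch with toy data (column names invented):

```r
# Toy frame: one NA in a, one NA in b (invented for illustration)
df <- data.frame(y = 1:6,
                 a = c(1, 2, NA, 4, 5, 6),
                 b = c("u", "v", "u", NA, "v", "u"))

nrow(model.matrix(~ a, data = df))      # 5: the row with NA in a is dropped
nrow(model.matrix(~ a + b, data = df))  # 4: adding b drops its NA row too
```

Checking `colSums(is.na(coll.train))` for the variables in the formula would confirm whether the lost rows match the NA counts.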
Re: [R] Using plyr::dply more (memory) efficiently?
I don't know about that, but try this:

install.packages("data.table", repos="http://R-Forge.R-project.org")
require(data.table)
summaries = data.table(summaries)
summaries[, sum(counts), by=symbol]

Please let us know if that returns the correct result, and if its memory/speed is ok?

Matthew

"Steve Lianoglou" wrote in message news:w2kbbdc7ed01004290606lc425e47cs95b36f6bf0a...@mail.gmail.com...
> Hi all,
>
> In short:
>
> I'm running ddply on an admittedly (somehow) large data.frame (not
> that large). It runs fine until it finishes and gets to the
> "collating" part where all subsets of my data.frame have been
> summarized and they are being reassembled into the final summary
> data.frame (sorry, don't know the correct plyr terminology). During
> collation, my R workspace RAM usage goes from about 1.5 GB up to 20GB
> until I kill it.
>
> Running a similar piece of code that iterates manually w/o ddply by
> using a combo of lapply and a do.call(rbind, ...) uses considerably
> less RAM (tops out at about 8GB).
>
> How can I use ddply more efficiently?
>
> Longer:
>
> Here's more info:
>
> * The data.frame itself ~ 15.8 MB when loaded.
> * ~400,000 rows, 8 columns
>
> It looks like so:
>
>    exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
> 1        4225        468                 0       utr      0 WASH5P     WASH5P chr1
> 2        4833         69                 0       utr      1 WASH5P     WASH5P chr1
> 3        5659        152                38       utr      1 WASH5P     WASH5P chr1
> 4        6470        159                 0       utr      0 WASH5P     WASH5P chr1
> 5        6721        198                 0       utr      0 WASH5P     WASH5P chr1
> 6        7096        136                 0       utr      0 WASH5P     WASH5P chr1
> 7        7469        137                 0       utr      0 WASH5P     WASH5P chr1
> 8        7778        147                 0       utr      0 WASH5P     WASH5P chr1
> 9        8131         99                 0       utr      0 WASH5P     WASH5P chr1
> 10      14601        154                 0       utr      0 WASH5P     WASH5P chr1
> 11      19184         50                 0       utr      0 WASH5P     WASH5P chr1
> 12       4693        140                36    intron      2 WASH5P     WASH5P chr1
> 13       4902        757                36    intron      1 WASH5P     WASH5P chr1
> 14       5811        659               144    intron     47 WASH5P     WASH5P chr1
> 15       6629         92                21    intron      1 WASH5P     WASH5P chr1
> 16       6919        177                 0    intron      0 WASH5P     WASH5P chr1
> 17       7232        237                35    intron      2 WASH5P     WASH5P chr1
> 18       7606        172                 0    intron      0 WASH5P     WASH5P chr1
> 19       7925        206                 0    intron      0 WASH5P     WASH5P chr1
> 20       8230       6371               109    intron     67 WASH5P     WASH5P chr1
> 21      14755       4429                55    intron     12 WASH5P     WASH5P chr1
> ...
>
> I'm "ply"-ing over the "transcript" column and the function transforms
> each such subset of the data.frame into a new data.frame that is just
> 1 row per transcript, which basically has the sum of the "counts" for
> each transcript.
>
> The code would look something like this (`summaries` is the data.frame
> I'm referring to):
>
> rpkm <- ddply(summaries, .(transcript), function(df) {
>   data.frame(symbol=df$symbol[1], counts=sum(df$counts))
> })
>
> (It actually calculates 2 more columns that are returned in the
> data.frame, but I'm not sure that's really important here.)
>
> To test some things out, I've written another function to manually
> iterate/create subsets of my data.frame to summarize.
>
> I'm using sqldf to dump the data.frame into a db, then I lapply over
> subsets of the db (`where transcript=x`) to summarize each subset of my
> data into a list of single-row data.frames (like ddply is doing), and
> finish with a `do.call(rbind, the.dfs)` on this list.
>
> This returns the exact same result ddply would return, and by the time
> `do.call` finishes, my RAM usage hits about 8 GB.
>
> So, what am I doing wrong with ddply that makes the RAM usage in the
> last step ("collation" -- the equivalent of my final
> `do.call(rbind, my.dfs)`) more than 12 GB higher?
>
> Thanks,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
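The manual split/summarize/rbind approach described above can be sketched in base R. The toy `summaries` data.frame below is hypothetical (the real one has ~400,000 rows and more columns), but the shape of the computation is the same:

```r
# Toy stand-in for the real `summaries` data.frame (hypothetical values)
summaries <- data.frame(
  counts     = c(0, 1, 1, 2, 47, 12),
  symbol     = c("WASH5P", "WASH5P", "WASH5P", "GENE2", "GENE2", "GENE2"),
  transcript = c("WASH5P", "WASH5P", "WASH5P", "GENE2", "GENE2", "GENE2"),
  stringsAsFactors = FALSE
)

# Manual equivalent of the ddply call: split by transcript,
# summarize each chunk to one row, then rbind the pieces back together
chunks  <- split(summaries, summaries$transcript)
the.dfs <- lapply(chunks, function(df) {
  data.frame(symbol = df$symbol[1], counts = sum(df$counts),
             stringsAsFactors = FALSE)
})
rpkm <- do.call(rbind, the.dfs)
rpkm
#        symbol counts
# GENE2   GENE2     61
# WASH5P WASH5P      2
```

Note that `do.call(rbind, ...)` still copies every piece once at the end, which is where both this and the ddply version pay their memory cost.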
Re: [R] Using plyr::ddply more (memory) efficiently?
"Steve Lianoglou" wrote in message news:t2ybbdc7ed01004290812n433515b5vb15b49c170f5a...@mail.gmail.com...

> Thanks for directing me to the data.table package. I read through some
> of the vignettes, and it looks quite nice.
>
> While your sample code would provide the answer if I wanted to just
> compute some summary statistic/function over groups of my data.frame
> (using `by=symbol`), what's the best way to produce several pieces of
> info per subset?
>
> For instance, I see that I can do something like this:
>
> summaries[, list(counts=sum(counts), width=sum(exon.width)), by=symbol]

Yes, that's it.

> But what if I need to do some more complex processing within the
> subsets defined by `by=symbol` -- like several lines of programming
> logic for one result, say?
>
> I guess I can open a new block that just returns a data.table? Like:
>
> summaries[, {
>   cnts <- sum(counts)
>   ew <- sum(exon.width)
>   # ... some complex things
>   complex <- # .. result of complex things
>   data.table(counts=cnts, width=ew, cplx=complex)
> }, by=symbol]
>
> Is that right? (I mean, it looks like it's working, but maybe there's
> a more idiomatic way?)

Yes, you've got it. Rather than a data.table at the end, though, just return a list; it's faster. Shorter vectors will still be recycled to match any longer ones. Or just this:

summaries[, list(
    counts = sum(counts),
    width  = sum(exon.width),
    cplx   = # .. result of complex things
), by=symbol]

Sounds like it's working, but could you give us an idea whether it is quick and memory efficient?
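For comparison, the same several-results-per-group pattern can be written in base R with `split()` and `do.call(rbind, ...)`; the toy data and the `density` column below are hypothetical stand-ins for the "complex things" in the thread:

```r
# Hypothetical toy data with the two columns used in the thread
summaries <- data.frame(
  counts     = c(1, 2, 3, 10, 20),
  exon.width = c(100, 150, 50, 300, 400),
  symbol     = c("A", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Several computed columns per group: any amount of logic fits in the
# per-group function body, as long as it returns a one-row data.frame
res <- do.call(rbind, lapply(split(summaries, summaries$symbol), function(df) {
  cnts <- sum(df$counts)
  ew   <- sum(df$exon.width)
  data.frame(symbol  = df$symbol[1],
             counts  = cnts,
             width   = ew,
             density = cnts / ew,   # stand-in for the "complex" result
             stringsAsFactors = FALSE)
}))
res
```

The data.table `{ ... }` block in the quoted code does the same thing, but groups in place without materializing a list of intermediate data.frames, which is why it tends to be faster and lighter on memory.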
[R] Calculating Random Effects Coefficients from lmer
Hello all,

I am new to the listserv but hope to contribute what I can. For now, however, I hope to tap into all your knowledge about mixed-effects models.

Currently, I am running a mixed-effects model on time-series panel data. For this model I want to find the fixed and random effect for two discrete variables.

My model:

m1 <- lmer(mghegdp_who ~ govdisgdp_up_net + pridisgdp_up_net + mgdppc_usd06_imf +
           drdisgdp + mggegdpwb + hiv_prevalence +
           (0 + govdisgdp_up_net|country) + (0 + pridisgdp_up_net|country), data)

To find the overall effect *with confidence intervals* of govdisgdp_up_net and pridisgdp_up_net, I need to account for the fixed and random effects. Has anyone calculated this? Or does anyone have suggestions?

Also, does anyone know how to calculate the postVar for a model with multiple random effects?

Thank you, and I look forward to hearing your insights.

Matt
'Lost in Seattle'
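On combining fixed and random effects: in lme4, `coef()` on a fitted model returns, for each grouping level, the fixed effect plus that level's random effect (i.e. `fixef()` + `ranef()`). A minimal sketch on lme4's built-in `sleepstudy` data, since the poster's data is not available; the random-slope term mirrors the `(0 + x | country)` terms above:

```r
library(lme4)

# Random slope for Days by Subject, analogous to (0 + x | country)
m <- lmer(Reaction ~ Days + (0 + Days | Subject), sleepstudy)

fe      <- fixef(m)["Days"]            # overall (fixed) slope
re      <- ranef(m)$Subject[, "Days"]  # per-subject deviations from it
percoef <- coef(m)$Subject             # fixed + random, per subject

# coef() is exactly fixef() + ranef(), level by level:
all.equal(unname(fe + re), percoef[, "Days"])  # TRUE

# Approximate (Wald) confidence interval for the fixed effect:
confint(m, parm = "Days", method = "Wald")
```

Intervals for the combined (fixed + random) per-group effects are trickier, since they need the conditional variances of the random effects (`ranef(m, condVar = TRUE)`, the successor to the older `postVar` argument).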
[R] ARMAtoMA
Hello R users!

I have a question about the output of ARMAtoMA when used to calculate the variance of a model. I have a mixed model of the form ARMA(1,1). The actual model takes the form:

X(t) = 0.75*X(t-1) + a(t) - 0.4*a(t-1)

Given that gamma(0) takes the form [(1 + theta^2 - 2*theta*phi)/(1 - phi^2)]*sigma_a^2, I would expect a process variance of 4.02*sigma_a^2 when I substitute 0.75 for phi and -0.4 for theta.

When I run ARMAtoMA,

result <- ARMAtoMA(ar=c(0.75), ma=c(-0.4), lag.max=40)
sum(result^2) + 1

I get 1.28. If I input 0.4 instead of -0.4 in ARMAtoMA, I get the result I expected. Is there a sign dependence in the R function that I am overlooking?

Thanks in advance.

Matt
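The sign dependence comes from R's parameterisation: `arima()`/`ARMAtoMA()` write the MA part with a plus sign, X(t) = phi*X(t-1) + a(t) + theta*a(t-1), whereas the Box-Jenkins formula quoted above uses a minus sign, so the textbook theta = 0.4 corresponds to `ma = -0.4` in R. A quick numeric check (assuming unit innovation variance):

```r
phi   <- 0.75
theta <- -0.4   # R's convention: X(t) = phi*X(t-1) + a(t) + theta*a(t-1)

# Closed-form ARMA(1,1) variance under R's sign convention:
# gamma(0)/sigma_a^2 = (1 + theta^2 + 2*phi*theta) / (1 - phi^2)
v_formula <- (1 + theta^2 + 2 * phi * theta) / (1 - phi^2)

# Same quantity from the psi-weights; a generous lag.max ensures the
# truncated sum has converged (psi_j decays like phi^j)
psi   <- ARMAtoMA(ar = phi, ma = theta, lag.max = 200)
v_psi <- 1 + sum(psi^2)

c(v_formula, v_psi)   # both 1.28
```

So ARMAtoMA is consistent with its own sign convention; plugging theta = -0.4 into the minus-sign textbook formula is what produces the 4.02 figure.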