I have been through the help file archives a number of times and still cannot figure out what is wrong. I have a tab-delimited text file. It is 76 MB, so while it's large, it's not that large. I'm running Win7 x64 with 4 GB RAM and R 2.10.1.
When I open this data in Excel, I have 27 columns and 450932 rows, excluding the first row containing variable names. I am trying to get this into R as a dataset for analysis.

zz <- "Data/media1y.txt"
f <- file(zz, 'r')            # open the file
rl <- readLines(f, 1)         # read the first line
colnames <- strsplit(rl, '\t')
p <- length(colnames[[1]])    # count the number of columns
nobs <- 450932
close(f)

Using:

d1 <- matrix(scan(zz, skip=1, sep="\t", fill=TRUE, what=rep("character",p), nlines=nobs),
             ncol=p, nrow=nobs, byrow=TRUE, dimnames=list(NULL, colnames[[1]]))

produces the error

Read 5761719 items
Warning message:
In matrix(scan(zz, skip = 1, sep = "\t", fill = TRUE, what = rep("character", :
  data length [5761719] is not a sub-multiple or multiple of the number of rows [10]

Now, 5761719/27 = 213397. If I change nobs <- 213397, it reads in the file with no errors. It produces a matrix that I can work with from here. But the file obviously is not complete.

At first I thought it might be reading only the first x rows. So I sorted by the first variable alphabetically in Excel before saving it as a txt file and reading it into R. head(d1) shows the correct first 6 rows, but when I ask for tail(d1), the entry for the first variable in the last row is

[213397,] "WSAH"

The 213397th row in Excel starts with "MM1", and the actual last row starts with "YE". The "WSA" in question can be found on Excel row # 397548.

That confuses the heck out of me. There are no blank lines. Since there are >1000 categories for that first variable, I'm not going to manually match all of the frequencies, but the first 10 were exact, "MM1" was correct, and the last few before "WSA" were also correct. "WSA" itself had 3001 observations in R, whereas Excel has 3093. That also makes it seem that R stops reading the table at some point.

It shouldn't be a memory issue.... right?
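One possibility worth checking, though nothing in the post confirms it, is that the file contains apostrophes or double quotes. scan()'s default quote set is "'\"" whenever sep is not "\n", so a stray quote character can swallow tabs and line breaks until the next matching quote, which would yield fewer items than rows x columns. The sketch below uses a small made-up file (the names tmp, fields_raw, items_default and items_noquote are hypothetical) to compare the item count with and without quoting; on the real file the same calls could be pointed at zz.

```r
# Build a tiny tab-delimited file with stray apostrophes (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tvalue",
             "1\tO'Brien\t10",
             "2\tSmith\t20",
             "3\tD'Arcy\t30",
             "4\tJones\t40"), tmp)

# Number of physical lines, independent of any quoting rules.
n_lines <- length(readLines(tmp))

# Fields per line with quoting and comment handling switched off.
# Every line of a well-formed file should report the same count.
fields_raw <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")
table(fields_raw)

# Items read with the default quote set versus quoting disabled.
items_default <- scan(tmp, what = "character", sep = "\t", skip = 1,
                      quiet = TRUE)
items_noquote <- scan(tmp, what = "character", sep = "\t", skip = 1,
                      quote = "", quiet = TRUE)

# If quoting is the culprit, the default read returns fewer items.
length(items_default)
length(items_noquote)
```

If the two lengths differ on the real file, adding quote = "" to the original scan() call would be the thing to try next. If they agree, the cause lies elsewhere, for example an embedded end-of-file character.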
Here is the memory usage:

> object.size(d1)
56328480 bytes
> memory.size(max=TRUE)
[1] 444.06
> memory.size(max=NA)
[1] 3583.88
> memory.size(max=FALSE)
[1] 251.09

As a side question, I'm reading it all in as characters for now because when I tried to define a vector of column types

wht <- list(rep("character",7),0,"logical",0,"character")

to use in scan(), it still read everything in as character. I'm also not sure about the quotation marks; I had to put them in to get list() to even accept that. Or c(). Any ideas with this?

Thanks!

--
Robin Jeffries
Dr.P.H. Candidate
Department of Biostatistics
UCLA School of Public Health
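On the side question about column types: scan() takes the type of each column from the type of the matching element of `what`, not from a type name written as a string. So "logical" is itself a character value and requests a character column, and rep("character", 7) is a single list element rather than seven. Below is a minimal sketch on a made-up four-column file; the file contents and the names tmp, wht, cols, d and d2 are hypothetical, and the real file would need one list element for each of its 27 columns.

```r
# Small made-up tab-delimited file with mixed column types.
tmp <- tempfile(fileext = ".txt")
writeLines(c("station\tcount\tflag\trate",
             "WSAH\t12\tTRUE\t0.5",
             "MM1\t7\tFALSE\t1.25"), tmp)

# One list element per column; the element's type sets the column type:
# "" gives character, 0 gives numeric, NA gives logical.
wht <- list("", 0, NA, 0)

# With a list for 'what', scan() returns a list with one vector per column.
cols <- scan(tmp, what = wht, sep = "\t", skip = 1, quote = "", quiet = TRUE)
names(cols) <- strsplit(readLines(tmp, n = 1), "\t")[[1]]
d <- as.data.frame(cols, stringsAsFactors = FALSE)
sapply(d, class)

# An alternative route: read.delim() with explicit colClasses.
d2 <- read.delim(tmp, quote = "", stringsAsFactors = FALSE,
                 colClasses = c("character", "numeric", "logical", "numeric"))
sapply(d2, class)
```

For a run of identical types, something like c(rep(list(""), 7), list(0, NA, 0, "")) builds the list without typing every element, since rep() on a list repeats its elements.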