Just use what='' in your scan since all your data appears to be character.
Also use comment.char='', quote='' if there is the possibility of a
misplaced quote or a comment character. I have no problem reading in
files of that size. Also use 'count.fields' to see what your file looks
like.

On Wed, Jan 20, 2010 at 8:22 PM, Robin Jeffries <rjeffr...@ucla.edu> wrote:
> I have been through the help file archives a number of times, and still
> cannot figure out what is wrong.
> I have a tab-delimited text file. 76Mb, so while it's large.. it's not
> -that- large. I'm running Win7 x64 w/4G RAM and R 2.10.1
>
> When I open this data in Excel, i have 27 columns and 450932 rows, excluding
> the first row containing variable names.
>
> I am trying to get this into R as a dataset for analysis.
>
> zz<-"Data/media1y.txt"
> f=file(zz,'r')  # open the file
> rl = readLines(f,1) # Read the first line
> colnames<-strsplit(rl, '\t')
> p = length(colnames[[1]]) # count the number of columns
> nobs<-450932
> close(f)
>
> Using:
> d1<-matrix(scan(zz,skip=1,sep="\t",fill=TRUE,what=rep("character",p),
> nlines=nobs),ncol=p,nrow=nobs, byrow=TRUE,
> dimnames=list(NULL,colnames[[1]]))
>
> produces the error
> Read 5761719 items
> Warning message:
> In matrix(scan(zz, skip = 1, sep = "\t", fill = TRUE, what =
> rep("character",  :
> data length [5761719] is not a sub-multiple or multiple of the number of
> rows [10]
>
> Now, 5761719/27 = 213397.
> If I change nobs<-213397 it reads in the file with no errors. It produces a
> matrix that I can work with from here. But the file obviously is not
> complete.
>
> At first I thought it might be reading the first x amount of rows. So I
> sorted by the first variable alphabetically in Excel before saving it as a
> txt file and reading it into R.
> head(d1) shows the correct first 6 rows, but when I ask for tail(d1) the
> entry for the first variable in the last row is [213397,] "WSAH"
> The 213397th row in Excel, starts with "MM1" and the actual last row starts
> with "YE". The "WSA" in question can be found on Excel row # 397548
>
> That, confuses the heck out of me. There are no blank lines.
>
> Since there are >1000 categories for that first variable, i'm not going to
> manually match all of the frequencies, but the first 10 were exact, "MM1"
> was correct, and the last few before "WSA" was also correct. "WSA" itself
> had 3001 observations in R, whereas Excel has 3093. That also makes it seem
> that R is stopping reading the table at some point.
>
>
>
> It shouldn't be a memory issue.... right?
>> object.size(d1)
> 56328480 bytes
>> memory.size(max=TRUE)
> [1] 444.06
>> memory.size(max=NA)
> [1] 3583.88
>> memory.size(max=FALSE)
> [1] 251.09
>
>
>
> As a side question, i'm reading it all in as characters for now because when
> i tried to define a vector of column types wht
> <-list(rep("character",7),0,"logical",0,"character")) to use in scan(), it
> still read everything in as character. I'm also not sure about the "" 's, I
> had to put them in to get list() to even accept that. Or c(). Any ideas with
> this?
>
> Thanks!
>
> --
> Robin Jeffries
> Dr.P.H. Candidate
> Department of Biostatistics
> UCLA School of Public Health
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
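
To make the count.fields and quote='' suggestion concrete, here is a minimal sketch. It does not use the real media1y.txt: the small temporary file and the variable names (tmp, n_default, n_noquote, hdr) are made up for illustration, and I am only assuming that stray quote characters are what is hiding rows in the original file.

```r
# Hypothetical stand-in for the real file: 3 tab-delimited columns,
# with a stray double quote inside a field on two of the data lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tval",
             "1\tabc\tx",
             "2\tde\"f\ty",
             "3\tghi\tz",
             "4\tjk\"l\tw"), tmp)

# Fields per line with the default quote handling. Wrapped in try()
# because the exact result (odd counts, NA, or an error) depends on
# where the stray quotes fall.
n_default <- try(count.fields(tmp, sep = "\t"), silent = TRUE)
print(n_default)

# Fields per line with quoting and comment handling switched off:
# every line should now report the same number of fields.
n_noquote <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")
print(table(n_noquote))

# Read the header, then the body, as character with quoting disabled.
hdr <- strsplit(readLines(tmp, n = 1), "\t", fixed = TRUE)[[1]]
p   <- length(hdr)
x   <- scan(tmp, what = "", sep = "\t", skip = 1,
            quote = "", comment.char = "", quiet = TRUE)

# Only build the matrix if the item count is a multiple of p.
stopifnot(length(x) %% p == 0)
d1 <- matrix(x, ncol = p, byrow = TRUE, dimnames = list(NULL, hdr))
print(d1)
```

On a real file, any line whose count in n_noquote differs from p is worth opening in a text editor, and a mismatch between n_default and n_noquote points at quote or comment characters as the cause.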
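
On the side question about column types: a likely reason everything came back as character is that rep("character",7) inside list() is one list element (a character vector of length 7), and "logical" is itself a character string, so scan() takes both as character templates. The sketch below shows one way to build a per-column 'what' list. The file, the six column names and the types are hypothetical, chosen only to mirror the character/numeric/logical mix described in the question.

```r
# Hypothetical 6-column tab-delimited file: two character columns,
# a numeric, a logical, a numeric and a final character column.
tmp <- tempfile(fileext = ".txt")
writeLines(c("s1\ts2\tn1\tflag\tn2\ts3",
             "x\ty\t1.5\tTRUE\t2\tp",
             "u\tv\t3\tFALSE\t4.25\tq"), tmp)

# One template per column: "" for character, 0 for numeric,
# TRUE for logical. rep(list(""), 2) gives two separate list elements.
wht <- c(rep(list(""), 2), list(0, TRUE, 0, ""))

# With a list for 'what', scan() returns a list with one vector per column.
cols <- scan(tmp, what = wht, sep = "\t", skip = 1,
             quote = "", comment.char = "", quiet = TRUE)
names(cols) <- strsplit(readLines(tmp, n = 1), "\t", fixed = TRUE)[[1]]

d <- as.data.frame(cols, stringsAsFactors = FALSE)
print(sapply(d, class))
```

For the layout in the question, the same pattern would start with rep(list(""), 7) for the seven leading character columns, assuming that is what the original wht was meant to describe.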