[R] Problems completely reading in a "large" sized data set

Robin Jeffries Wed, 20 Jan 2010 17:23:33 -0800

I have been through the help file archives a number of times, and still
cannot figure out what is wrong.
I have a tab-delimited text file. It is 76 MB, so while it's large, it's not
-that- large. I'm running Win7 x64 with 4 GB RAM and R 2.10.1.


When I open this data in Excel, I have 27 columns and 450932 rows, excluding
the first row containing variable names.

I am trying to get this into R as a dataset for analysis.

zz <- "Data/media1y.txt"
f <- file(zz, 'r')              # open the file
rl <- readLines(f, 1)           # read the first line (the header)
colnames <- strsplit(rl, '\t')
p <- length(colnames[[1]])      # count the number of columns
nobs <- 450932
close(f)

Using:
d1<-matrix(scan(zz,skip=1,sep="\t",fill=TRUE,what=rep("character",p),
nlines=nobs),ncol=p,nrow=nobs, byrow=TRUE,
dimnames=list(NULL,colnames[[1]]))

produces the following warning:
Read 5761719 items
Warning message:
In matrix(scan(zz, skip = 1, sep = "\t", fill = TRUE, what =
rep("character",  :
  data length [5761719] is not a sub-multiple or multiple of the number of
rows [10]

Now, 5761719/27 = 213397.
If I change nobs<-213397 it reads in the file with no errors. It produces a
matrix that I can work with from here. But the file obviously is not
complete.

At first I thought it might be reading the first x amount of rows. So I
sorted by the first variable alphabetically in Excel before saving it as a
txt file and reading it into R.
head(d1) shows the correct first 6 rows, but when I ask for tail(d1) the
entry for the first variable in the last row is [213397,] "WSAH"
The 213397th row in Excel, starts with "MM1" and the actual last row starts
with "YE". The "WSA" in question can be found on Excel row # 397548

That confuses the heck out of me. There are no blank lines.

Since there are >1000 categories for that first variable, I'm not going to
manually match all of the frequencies, but the first 10 were exact, "MM1"
was correct, and the last few before "WSA" were also correct. "WSA" itself
had 3001 observations in R, whereas Excel has 3093. That also makes it seem
that R stops reading the table at some point.
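
One thing that might be worth checking here, purely as an assumption since
nothing in this post confirms it, is whether the data contain stray quote
characters. With a non-default sep, scan() treats both ' and " as quote
characters by default, so an unmatched apostrophe can swallow tabs and line
breaks until the next one. That would give fewer items than rows * columns
without any rows looking blank. The sketch below uses a small made-up file
(tmp_demo and its contents are hypothetical, not the real data) and compares
the default behaviour with quoting switched off.

```r
# Hypothetical miniature file: 4 data rows, 3 tab-separated columns,
# with an apostrophe in data rows 2 and 4.
tmp_demo <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tvalue",
             "A\tplain\t1",
             "B\tO'Brien\t2",
             "C\tplain\t3",
             "D\tit's\t4"), tmp_demo)

# Fields per line, first with the default quote set, then with quoting
# and comment characters disabled.
n_default <- count.fields(tmp_demo, sep = "\t")
n_noquote <- count.fields(tmp_demo, sep = "\t", quote = "", comment.char = "")

# Items read by scan() under the same two settings.
x_default <- scan(tmp_demo, what = "character", sep = "\t", skip = 1,
                  quiet = TRUE)
x_noquote <- scan(tmp_demo, what = "character", sep = "\t", skip = 1,
                  quote = "", quiet = TRUE)

print(n_default)
print(n_noquote)
print(c(default = length(x_default), noquote = length(x_noquote)))
```

If running count.fields() on the real file with quote = "" reports 27 fields
on every line while the default call does not, quoting would be a likely
culprit, and adding quote = "" to the scan() call may be enough.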



It shouldn't be a memory issue.... right?
> object.size(d1)
56328480 bytes
> memory.size(max=TRUE)
[1] 444.06
> memory.size(max=NA)
[1] 3583.88
> memory.size(max=FALSE)
[1] 251.09



As a side question, I'm reading it all in as characters for now because when
I tried to define a vector of column types
wht <- list(rep("character",7),0,"logical",0,"character") to use in scan(), it
still read everything in as character. I'm also not sure about the quotes; I
had to put them in to get list() to even accept that. Or c(). Any ideas with
this?
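
For reference on the side question: scan() takes the column type from the
type of each element of what, not from a type name, so "logical" is just a
character string and that column is read as character. Also,
list(rep("character",7), ...) has a single first element holding a vector of
length 7, rather than 7 separate elements. A minimal sketch of a per-column
what list, using a small made-up file (tmp_types, the column names, and the
values are all hypothetical):

```r
# The list from the post has 5 elements, not 11.
wht_post <- list(rep("character", 7), 0, "logical", 0, "character")
length(wht_post)

# One element per column: "" for character, 0 for numeric, NA for logical.
wht_11 <- c(rep(list(""), 7), list(0, NA, 0, ""))
length(wht_11)

# Hypothetical 4-column file to show the types being applied.
tmp_types <- tempfile(fileext = ".txt")
writeLines(c("site\tcount\tflag\tnote",
             "MM1\t3\tTRUE\tfirst",
             "YE\t5\tFALSE\tsecond"), tmp_types)

what_demo <- list(site = "", count = 0, flag = NA, note = "")
cols <- scan(tmp_types, what = what_demo, sep = "\t", skip = 1,
             quote = "", quiet = TRUE)

# scan() returns a list of columns; convert it to a data frame.
d_demo <- as.data.frame(cols, stringsAsFactors = FALSE)
str(d_demo)
```

With a list for what, the result is a list of columns rather than one long
vector, so the matrix(..., byrow=TRUE) step is no longer needed.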

Thanks!

-- 
Robin Jeffries
Dr.P.H. Candidate
Department of Biostatistics
UCLA School of Public Health


