> * peter dalgaard <cqn...@tznvy.pbz> [2012-02-24 08:41:07 +0100]:
>
> On Feb 24, 2012, at 06:58 , Sam Steingold wrote:
>
>> batch is a vector of lines returned by readLines from a
>> NL-line-terminated file, here is the relevant section:
>> =========================================================
>> AA BB CC DD EE FF
>> GG H
>>
>> H JJ KK LL MM
>> =========================================================
>> as you can see, a line is corrupt; two CRLF's are inserted.
>
> Actually, I don't see... (It's pretty hard to count TAB characters by eye.)

how about this?

>> =========================================================
>> AA^IBB^ICC^IDD^I^I^IEE^IFF
>> GG^IH^M
>> ^M
>> H^IJJ^IKK^I^I^ILL^IMM
>> =========================================================

I replaced TAB with ^I and CR with ^M.
is this better?
here I use <TAB> and <CR> instead:

>> =========================================================
>> AA<TAB>BB<TAB>CC<TAB>DD<TAB><TAB><TAB>EE<TAB>FF
>> GG<TAB>H<CR>
>> <CR>
>> H<TAB>JJ<TAB>KK<TAB><TAB><TAB>LL<TAB>MM
>> =========================================================

so, you see, there are two data lines here:
A..F - good, with 8 fields.
G..M - BAD: two CRLF's inserted inside the 2nd field, turning one line
into 3 lines.
so I must drop 3 input lines from the input.

>> This is okay, I drop the bad lines, at least I hope I do:
>>
>> conn <- textConnection(batch)
>> field.counts <- count.fields(conn, sep="\t", comment.char="", quote="")
>> close(conn)
>> good <- field.counts == 8 # this should drop all bad lines
>> if (!all(good))
>>   batch <- batch[good]
>> conn <- textConnection(batch)
>> ret <- read.table(conn, sep="\t", comment.char="", quote="")
>> close(conn)
>>
>> I get this error in read.table():
>>
>> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
>>   line 7151 did not have 8 elements
>>
>> how come?!
>
> You can do better than this in terms of providing clues for us:
> "batch" is a character vector, right? So recheck that count.fields
> returns all 8's after removal of bad lines. Also check that dimensions
> match -- is length(batch) actually the same as length(field.counts)?

batch <- lines[807000:808000]
conn <- textConnection(batch)
field.counts <- count.fields(conn, sep="\t", comment.char="", quote="")
close(conn)
good <- field.counts == length(col.names)
which(!good)
[1] 152 153
## WRONG: it should be 3 lines, 154 is also bad - see above
batch[!good]
[1] "GG\tH" ""
length(batch)
[1] 1001
length(good)
[1] 1000
## WRONG: batch, field.counts and good should have the same length

AHA! blank.lines.skip !!!
I must set it to FALSE!!!
and it does fix the problem...

> Finally, what is in line 7151?

that's the first line with a <CR>:
GG<TAB>H<CR>

>> also, is there some error recovery?
>
> Well you can try().

it appears that try gives me access to the error message, not the
erroneous data, i.e., I still have to reload the file to get the batch
string vector.

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
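
P.S. for the archives: a minimal sketch of the length mismatch described above and of the blank.lines.skip=FALSE fix. The 5-line batch is made up for illustration (the "NN..SS" line and the helper name count_fields are not from the real data or from base R), and the tryCatch() at the end is only one possible way to recover from a read.table() failure while batch is still in scope, so nothing has to be reloaded.

```r
# Hypothetical 5-line batch: lines 1 and 5 are good (8 tab-separated fields);
# lines 2-4 are a single record broken up by two stray line breaks.
batch <- c("AA\tBB\tCC\tDD\t\t\tEE\tFF",
           "GG\tH",
           "",
           "H\tJJ\tKK\t\t\tLL\tMM",
           "NN\tOO\tPP\tQQ\t\t\tRR\tSS")

# Helper (name invented for this sketch): count the fields in each element
# of a character vector and close the connection when done.
count_fields <- function(lines, ...) {
  conn <- textConnection(lines)
  on.exit(close(conn))
  count.fields(conn, sep = "\t", quote = "", comment.char = "", ...)
}

fc_default <- count_fields(batch)                            # blank line is skipped
fc_fixed   <- count_fields(batch, blank.lines.skip = FALSE)  # one count per line

length(fc_default) == length(batch)  # FALSE: counts no longer line up with batch
length(fc_fixed)   == length(batch)  # TRUE

# With aligned counts, the logical index drops all three broken lines.
good  <- fc_fixed == 8
clean <- batch[good]

# Error recovery: on failure, report the message and return NULL;
# batch, fc_fixed and good are all still available for inspection.
ret <- tryCatch(
  read.table(text = clean, sep = "\t", quote = "", comment.char = ""),
  error = function(e) {
    message("read.table failed: ", conditionMessage(e))
    NULL
  })
```

With the default blank.lines.skip=TRUE the count vector is one element shorter than batch, so batch[good] silently picks the wrong lines; that is how a bad line can survive the filter and reach read.table().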