Re: [R] read.table mystery

David Winsemius Sun, 06 Mar 2011 10:49:48 -0800


On Mar 6, 2011, at 12:47 PM, Johannes Graumann wrote:

Thank you for pointing this out. This is really inconvenient as I do not know a priori how many and where those darn cases containing an additional
(or more) ":" might be ...


There is a count.fields function that might assist with this task.

You seem to have a multiline (variable number of lines)  format of:

NNNN:>sp|header with "|" AND white space separators
NNNN:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEEEEE
NNNN+60:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEE
NNNN+120:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDE
NNNN+180:EXCEPT_LAST

No way that read.table can work. You might create an index with the location of the high-count headers and then reprocess.


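# TRUE for lines with more than one whitespace-separated field (the headers)
log.idx <- count.fields("/tmp/testfile.txt") > 1
# All lines of the file as a character vector
corpus <- readLines("/tmp/testfile.txt")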

Then parse the headers and rejoin the broken multi-line content. There may be worked examples in the archive for variable number multi-line file formats.
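
The parse-and-rejoin step could be sketched roughly as follows. This is only an illustration under assumptions: the sample lines are made-up placeholders in the format described above (not the poster's real file), header lines are assumed to be exactly those whose content starts with ">", and the names offset, content, is.header, record and result are hypothetical.

```r
# Made-up sample in the described format; with the real file one would use
# corpus <- readLines("/tmp/testfile.txt") instead.
corpus <- c(
  "1:>sp|P00001|AAA_TEST Protein one OS=Test organism",
  "1:MKTAYIAKQR",
  "61:QISFVKSHFS",
  "121:RQ",
  "1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase, mitochondrial",
  "1065:MGINCLNIFR"
)

# Split each line at its FIRST colon only, so later colons stay in the text.
offset  <- as.integer(sub(":.*$", "", corpus))
content <- sub("^[^:]*:", "", corpus)

# Assumption: header lines are the ones whose content starts with ">".
is.header <- grepl("^>", content)

# Each header opens a new record; cumsum() gives every line its record number.
record <- cumsum(is.header)

# Group the sequence lines by record. Fixing the factor levels keeps records
# without any sequence lines (they become "") and drops lines before the
# first header.
rec <- factor(record[!is.header], levels = seq_len(sum(is.header)))
sequences <- vapply(split(content[!is.header], rec),
                    paste, character(1), collapse = "")

result <- data.frame(
  offset   = offset[is.header],
  header   = sub("^>", "", content[is.header]),
  sequence = unname(sequences),
  stringsAsFactors = FALSE
)
```

Splitting at the first colon only is what keeps a header such as "NADPH:adrenodoxin ..." in one piece, which is the case that trips up read.table with sep=":".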


--
David.


This seems to work, but will fail if there's a "1:sdfjhlfkh:2:adlkjf"
somewhere (1 & 2 both integerable).

na.exclude(as.integer(scan("/tmp/testfile.txt",sep=":",what="integer")))


More robust pointers anyone?
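
One possibly more robust variant is to take only the digits in front of the first colon of each line, rather than splitting on every colon. The sketch below assumes every line starts with an integer followed by ":"; the vector x is made-up sample input that includes the failure case mentioned above, and first.field and offsets are hypothetical names.

```r
# Made-up sample lines, including the problematic "1:...:2:..." case.
x <- c("1:>sp|header text", "1:ABCDEF", "61:GHIJKL", "1:sdfjhlfkh:2:adlkjf")

# Keep only the digits that precede the first colon on each line.
# Lines that do not match the pattern are returned unchanged by sub()
# and then become NA (with a warning) in as.integer().
first.field <- sub("^([0-9]+):.*$", "\\1", x)
offsets <- as.integer(first.field)
```

Because the pattern is anchored at the start of the line, the "2" later in the last line is never considered, so each input line yields exactly one value.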

Joh

Sarah Goslee wrote:

Not so much a mystery. read.table() only looks at the first 5 lines when
deciding how many columns your file has (as described in the Details
section of the help).
The easiest solution is to add a col.names argument to read.table() with
the correct number of names.
You may want to also include as.is=TRUE if you don't want your data to be
imported as factors. If you expect character but have factor you may
get unexpected results later.
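
A minimal sketch of the col.names suggestion, assuming the number of names is taken from count.fields() over the whole file. The temporary file and its contents are made-up stand-ins for the real data, and n.col and dat are hypothetical names.

```r
# Made-up stand-in for the real file: the widest line is not among the first five.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "1:AAAA", "61:BBBB", "121:CCCC", "181:DDDD", "241:EEEE", "301:FFFF",
  "1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase"
), tmp)

# Count colon-separated fields on every line, with quoting and comment
# handling switched off, and take the maximum as the number of columns.
n.col <- max(count.fields(tmp, sep = ":", quote = "", comment.char = ""))

dat <- read.table(
  tmp,
  sep = ":",
  header = FALSE,
  quote = "",
  comment.char = "",
  fill = TRUE,    # pad short lines with blank fields
  as.is = TRUE,   # keep character columns as character, not factor
  col.names = paste0("V", seq_len(n.col))
)

unlink(tmp)
```

Note that this only fixes the column count. A header containing two or more extra ":" would simply produce more columns, so it does not by itself recover the intended layout.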

Sarah

On Sun, Mar 6, 2011 at 5:04 AM, Johannes Graumann
<johannes_graum...@web.de> wrote:
Hello,

Please have a look at the code below, which I use to read in the attached
file. As line 18 of the file reads "1065:>sp|Q9V3T9|ADRO_DROME
NADPH:adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
melanogaster GN=dare PE=2 SV=1", I expect the code below to produce a 3
column data frame with most of the last column empty and line 18 to
produce a data.frame row like so:

V1    1065
V2    sp|Q9V3T9|ADRO_DROME NADPH
V3    adrenodoxin oxidoreductase, mitochondrial OS=Drosophila melanogaster GN=dare PE=2 SV=1

Why is that not so?

Thanks for any hint.

Sincerely, Joh

read.table(
"/tmp/testfile.txt",
sep=":",
header=FALSE,
quote="",
fill=TRUE
)[19,]


---
Sarah Goslee
http://www.functionaldiversity.org


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
Heritage Laboratories
West Hartford, CT
