On Mar 6, 2011, at 12:47 PM, Johannes Graumann wrote:
Thank you for pointing this out. This is really inconvenient as I do not
know a priori how many and where those darn cases containing an additional
(or more) ":" might be ...
There is a count.fields function that might assist with this task.
You seem to have a multiline (variable number of lines) format of:
NNNN:>sp|header with "|" AND white space separators
NNNN:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEEEEE
NNNN+60:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEE
NNNN+120:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDE
NNNN+180:EXCEPT_LAST
No way that read.table can work. You might create an index with the
location of the high-count headers and then reprocess.
log.idx <- count.fields("/tmp/testfile.txt") > 1
corpus <- readLines("/tmp/testfile.txt")
Then parse the headers and rejoin the broken multi-line content. There
may be worked examples in the archive for variable number multi-line
file formats.
--
David.
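A minimal sketch of the "index the headers, then rejoin" idea, under stated assumptions: the sample lines are invented to mimic the layout described above (they are not the poster's file), and header lines are detected by the ">" that follows the first colon rather than by count.fields(), which is a swapped-in test for illustration.

```r
# Hypothetical sample mimicking the described layout (not the real file)
sample_lines <- c(
  "1:>sp|P1|AAA_TEST First protein OS=Test",
  "1:MKTAYIAKQR",
  "11:QISFVKSHFS",
  "21:>sp|P2|BBB_TEST NADPH:something oxidoreductase OS=Test",
  "21:GSHMLEDPVD"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

corpus <- readLines(tmp)

# Strip the leading "NNNN:" offset; only the first colon is touched
body <- sub("^[0-9]+:", "", corpus)

# Assumption: a header is any line whose content starts with ">"
is_header <- grepl("^>", body)

# Each record is one header plus the sequence lines that follow it
record_id <- cumsum(is_header)
groups <- factor(record_id[!is_header], levels = seq_len(sum(is_header)))
parts <- split(body[!is_header], groups)

result <- data.frame(
  header = sub("^>", "", body[is_header]),
  sequence = vapply(parts, paste, character(1), collapse = "",
                    USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

Using a factor with explicit levels keeps a header that has no sequence lines in step with the others; its sequence simply comes out as an empty string.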
This seems to work, but will fail if there's a "1:sdfjhlfkh:2:adlkjf"
somewhere (1 & 2 both convertible to integer).
na.exclude(as.integer(scan("/tmp/testfile.txt",sep=":",what="integer")))
More robust pointers anyone?
Joh
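One possibly more robust direction is to split each line at its first ":" only, so that later colons (and later integers) stay inside the text. The sketch below is illustrative: the input lines are made up from the examples quoted in this thread, and the variable names are hypothetical.

```r
# Made-up lines based on the examples quoted in this thread
raw_lines <- c(
  "1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase",
  "1065:MGINCLNIFR",
  "1:sdfjhlfkh:2:adlkjf"
)

# Position of the first colon in each line (-1 if there is none)
first_colon <- as.integer(regexpr(":", raw_lines, fixed = TRUE))

# Everything before the first colon is the numeric offset
offset <- as.integer(substr(raw_lines, 1, first_colon - 1))

# Everything after the first colon is kept verbatim, colons included
rest <- substr(raw_lines, first_colon + 1, nchar(raw_lines))
```

With this split, the "2" in the third line is never seen as a separate field, so it cannot be mistaken for an offset.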
Sarah Goslee wrote:
Not so much a mystery. read.table() only looks at the first 5 lines when
deciding how many columns your file has (as described in the Details
section of the help).
The easiest solution is to add a col.names argument to read.table() with
the correct number of names.
You may also want to include as.is=TRUE if you don't want your data to
be imported as factors. If you expect character but have factor, you may
get unexpected results later.
Sarah
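The col.names suggestion could look roughly like the sketch below. The sample file is invented so that the first five lines have two fields and a later line has three; using count.fields() to find the widest line is one way to get "the correct number of names" and is an assumption, not something stated above.

```r
# Hypothetical sample: the first 5 lines have 2 fields, line 6 has 3
sample_lines <- c(
  "1:>sp|P1|AAA_TEST First protein",
  "1:MKTAYIAKQR",
  "11:QISFVKSHFS",
  "21:MKTAYIAKQR",
  "31:QISFVKSHFS",
  "41:>sp|P2|BBB_TEST NADPH:something oxidoreductase",
  "41:GSHMLEDPVD"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Widest line in the whole file, not just the first five lines
n_cols <- max(count.fields(tmp, sep = ":", quote = "", comment.char = ""))

dat <- read.table(
  tmp,
  sep = ":",
  header = FALSE,
  quote = "",
  comment.char = "",
  fill = TRUE,       # pad short lines with empty fields
  as.is = TRUE,      # keep text as character, not factor
  col.names = paste0("V", seq_len(n_cols))
)
```

Lines with even more colons would simply widen the data frame, so this fixes the column count but does not by itself keep the header text in one piece.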
On Sun, Mar 6, 2011 at 5:04 AM, Johannes Graumann
<johannes_graum...@web.de> wrote:
Hello,
Please have a look at the code below, which I use to read in the attached
file. As line 18 of the file reads "1065:>sp|Q9V3T9|ADRO_DROME
NADPH:adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
melanogaster GN=dare PE=2 SV=1", I expect the code below to produce a
3 column data frame with most of the last column empty and line 18 to
produce a data.frame row like so:
V1: 1065
V2: sp|Q9V3T9|ADRO_DROME NADPH
V3: adrenodoxin oxidoreductase, mitochondrial OS=Drosophila melanogaster GN=dare PE=2 SV=1
Why is that not so?
Thanks for any hint.
Sincerely, Joh
read.table(
"/tmp/testfile.txt",
sep=":",
header=FALSE,
quote="",
fill=TRUE
)[19,]
---
Sarah Goslee
http://www.functionaldiversity.org
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT