On Mar 6, 2011, at 12:47 PM, Johannes Graumann wrote:

Thank you for pointing this out. This is really inconvenient as I do not know a priori how many and where those darn cases containing an additional
(or more) ":" might be ...

There is a count.fields function that might assist with this task.

You seem to have a multiline (variable number of lines) format of:

NNNN:>sp|header with "|" AND white space separators
NNNN:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEEEEE
NNNN+60:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEE
NNNN+120:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDE
NNNN+180:EXCEPT_LAST

No way that read.table can work. You might create an index with the location of the high-count headers and then reprocess.

log.idx <- count.fields("/tmp/testfile.txt") > 1
corpus <- readLines("/tmp/testfile.txt")

Then parse the headers and rejoin the broken multi-line content. There may be worked examples in the archive for variable number multi-line file formats.
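A minimal sketch of that parse-and-rejoin step follows. It is not code from the thread: the sample lines are invented to mimic the layout described above, the header test looks for the ">" marker rather than using count.fields(), and the variable names (offset, body, record, result) are hypothetical.

```r
# Hypothetical sample in the layout described above (not the poster's file).
sample.lines <- c(
  "1:>sp|P00001|AAA_TEST First protein OS=Test organism",
  "1:MKVLA",
  "61:GHT",
  "1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase, mitochondrial",
  "1065:MGINC",
  "1125:LLK"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample.lines, tmp)

corpus <- readLines(tmp)

# Split every line at the FIRST ":" only, so later colons stay in the text.
offset <- as.integer(sub(":.*$", "", corpus))
body   <- sub("^[^:]*:", "", corpus)

# Assumption: header lines are the ones whose content starts with ">".
is.header <- grepl("^>", body)

# Each header opens a new record; cumsum() labels the lines belonging to it.
record  <- cumsum(is.header)
headers <- sub("^>", "", body[is.header])

# Paste the sequence chunks of each record back together. The explicit
# factor levels keep records that have no sequence lines (they become "").
chunks <- split(
  body[!is.header],
  factor(record[!is.header], levels = seq_along(headers))
)
sequences <- vapply(chunks, paste, character(1), collapse = "")

result <- data.frame(
  offset   = offset[is.header],
  header   = headers,
  sequence = unname(sequences),
  stringsAsFactors = FALSE
)
```

The result has one row per header, with the full header text (extra colons included) in one column and the rejoined sequence in another.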

--
David.



This seems to work, but will fail if there's a "1:sdfjhlfkh:2:adlkjf"
somewhere (1 & 2 both integerable).

na.exclude(as.integer(scan("/tmp/testfile.txt",sep=":",what="integer")))

More robust pointers anyone?
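One possible approach, offered as a sketch rather than a tested solution for the real file, is to read whole lines and cut each one at its first ":" only, so that later integer-like fields are never seen as separate values. The sample vector x and the names first, idx and rest are hypothetical.

```r
# Hypothetical lines, including the problem case mentioned above.
x <- c(
  "1:sdfjhlfkh:2:adlkjf",
  "1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin",
  "61:MKV"
)

# Position of the first ":" in each line (-1 if there is none).
first <- regexpr(":", x, fixed = TRUE)

# Everything before the first colon is the index, everything after is content.
# Lines without a colon yield NA for idx.
idx  <- as.integer(substr(x, 1, first - 1))
rest <- substr(x, first + 1, nchar(x))
```

With a real file, x would come from readLines() instead of being typed in.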

Joh

Sarah Goslee wrote:

Not so much a mystery. read.table() only looks at the first 5 lines when
deciding how many columns your file has (as described in the Details
section of the help).

The easiest solution is to add a col.names argument to read.table() with
the correct number of names.

You may want to also include as.is=TRUE if you don't want your data to be imported as factors. If you expect character but have factor you may
get unexpected results later.
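As a rough illustration of the two suggestions above, assuming you do not know the column count in advance, count.fields() can supply it. The sample file and the names n.col and dat are made up for this sketch; the widest line is deliberately placed after the first five.

```r
# Hypothetical sample: the line with an extra ":" comes sixth.
sample.lines <- c(
  "1:AAA", "61:BBB", "121:CCC", "181:DDD", "241:EEE",
  "301:>sp|X|Y NADPH:adrenodoxin"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample.lines, tmp)

# Largest number of ":"-separated fields on any line of the file.
n.col <- max(count.fields(tmp, sep = ":", quote = "", comment.char = ""))

dat <- read.table(
  tmp,
  sep = ":",
  header = FALSE,
  quote = "",
  comment.char = "",
  fill = TRUE,
  col.names = paste0("V", seq_len(n.col)),  # forces the full column count
  as.is = TRUE                              # keep text as character
)
```

Supplying col.names for every expected column means read.table() no longer has to guess the width from the first lines.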

Sarah

On Sun, Mar 6, 2011 at 5:04 AM, Johannes Graumann
<johannes_graum...@web.de> wrote:
Hello,


Please have a look at the code below, which I use to read in the attached
file. As line 18 of the file reads "1065:>sp|Q9V3T9|ADRO_DROME
NADPH:adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
melanogaster GN=dare PE=2 SV=1", I expect the code below to produce a 3
column data frame with most of the last column empty and line 18 to
produce a data.frame row like so:

V1
      1065
V2
sp|Q9V3T9|ADRO_DROME NADPH
V3
      adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
melanogaster GN=dare PE=2 SV=1

Why is that not so?

Thanks for any hint.

Sincerely, Joh

read.table(
"/tmp/testfile.txt",
sep=":",
header=FALSE,
quote="",
fill=TRUE
)[19,]

---
Sarah Goslee
http://www.functionaldiversity.org

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
