Dear R experts, I have a large table saved in a file called "plant_genome.gff". The file has 481848 lines with nine TAB-delimited columns and is 53 megabytes in size. For anyone who knows the GFF3 format: the table holds a plant genome's annotation.
If I read in the table with read.table( "plant_genome.gff" ) I get the following error: "line 2 did not have 12 elements". If I read it in with read.table( "plant_genome.gff", sep="\t" ) no error or warning is given, but the resulting table has only 193547 rows instead of the expected 481848! 60% of the lines are omitted.

Passing the argument as.is = TRUE, or setting the columns' classes with colClasses = c( "character", …, "integer", "integer", "numeric", "character", … ) (columns 4 and 5 are integers, column 6 is numeric, all others are characters), does not resolve the problem.

If I read in the file with readLines, split the lines manually using strsplit(…), and combine them into a data.frame with as.data.frame( do.call( "rbind", splitted.lines ), colClasses=…) I get the expected and correct data.frame, representing my GFF3 data.

My questions are:
1) Am I using read.table wrong, or did I miss something in the documentation?
2) Or is this a known problem with large TAB delimited tables whose columns contain white-spaces and are not surrounded by quotes?

Unfortunately, because the plant genome is unpublished, I am not allowed to give access to the GFF table that causes this problem. Any ideas, hints, help - or comments on my stupidity having missed something important - will be much appreciated!

Cheers!
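
P.S. In case it helps, here is a minimal sketch of the readLines workaround described above. Since I cannot share the real data, the three-line file written here is a made-up stand-in, and the names gff.file, gff.lines and gff are only for illustration. The column types are converted explicitly afterwards, because as far as I can tell as.data.frame does not act on a colClasses argument.

```r
# Made-up stand-in for the real file: nine TAB-delimited columns,
# with unquoted white-space in the last column (an assumption about
# what the real attribute column looks like).
gff.file <- tempfile(fileext = ".gff")
writeLines(c(
  "chr1\tsrc\tgene\t1\t100\t0.5\t+\t.\tID=g1;Note=protein of unknown function",
  "chr1\tsrc\tmRNA\t1\t100\t0.9\t+\t.\tID=m1;Note=putative transcript",
  "chr2\tsrc\tgene\t5\t250\t1.0\t-\t.\tID=g2;Note=kinase"
), gff.file)

# Read the raw lines and split each one on literal TAB characters.
gff.lines <- readLines(gff.file)
splitted.lines <- strsplit(gff.lines, "\t", fixed = TRUE)

# Stack the per-line character vectors into a matrix, then convert.
gff <- as.data.frame(do.call("rbind", splitted.lines),
                     stringsAsFactors = FALSE)

# Columns 4 and 5 are integers, column 6 is numeric, the rest stay character.
gff[[4]] <- as.integer(gff[[4]])
gff[[5]] <- as.integer(gff[[5]])
gff[[6]] <- as.numeric(gff[[6]])

# Sanity check: one row per input line.
stopifnot(nrow(gff) == length(gff.lines))
```

On the real file, the same check against length(gff.lines) is how I confirmed that this route keeps all 481848 rows.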