Ah!!! It was count.fields() which we had overlooked! We discovered a work-around, which involved using
  Data0 <- readLines(file)

to create a vector of strings, one for each line of the input file, and then using

  NF <- unlist(lapply(Data0, function(x)
          length(unlist(gregexpr(";", x, fixed=TRUE, useBytes=TRUE)))))

to count the number of occurrences of ";" (the separator) in each line.
(NF + 1) produces the same result as count.fields(file, sep=";").
Thanks for pointing out the existence of count.fields()!
Ted.

On 31-May-09 15:04:23, jim holtman wrote:
> You can do something like this: count the number of fields in each line
> of the file and use the max to determine the number of columns for
> read.table:
>
>   file <- '/tempxx.txt'
>   maxFields <- max(count.fields(file))   # max
>   # now setup read.table for max number
>   input <- read.table(file, colClasses=rep(NA, maxFields), fill=TRUE,
>                       col.names=paste("V", seq(maxFields), sep=''))
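For reference, a minimal sketch of the same recipe adapted to the ";"-separated
files discussed in this thread; the file path below is illustrative, and
quote="" only reflects the assumption that the data contain no quote characters:

  ## A sketch only: "myfile.csv" is an illustrative path, not from the thread
  file <- "myfile.csv"

  ## count.fields() gives the number of ";"-separated fields on every line,
  ## so the maximum tells read.table() how many columns to allocate up front
  maxFields <- max(count.fields(file, sep = ";", quote = ""))

  ## fill=TRUE pads short lines with NA; col.names fixes the full width, so
  ## the column count is no longer guessed from the first five data lines
  input <- read.table(file, sep = ";", quote = "", fill = TRUE,
                      colClasses = rep(NA, maxFields),
                      col.names = paste("V", seq_len(maxFields), sep = ""))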
> On Sun, May 31, 2009 at 6:06 AM, Martin Tomko <martin.to...@geo.uzh.ch> wrote:
>
>> Dear Jim,
>> with the help of Ted, we diagnosed that the cause is the extreme
>> variability in line length during reading in. As the number of table
>> columns is apparently determined from the first five lines, whatever
>> exceeds that width automatically ends up on the next line.
>> I am now trying to find a way to read in the data despite this. I have
>> no control over the table extent; the only thing that would make sense
>> for my data would be to read in a fixed number of columns and merge
>> all remaining columns into one long string in the last one. No idea
>> how to do this, though.
>>
>> Thanks
>> Martin
>>
>> jim holtman wrote:
>>
>>> It is still not clear to me exactly how you want to read the lines
>>> in. If the lines have a variable number of fields, and some of the
>>> lines might be wrapped, is there some way to determine where the
>>> start of each line is? If you are reading them in with read.csv,
>>> then the system is assuming that each line starts a new row. If this
>>> is not the case, then you will have to state the rules that determine
>>> where the lines start. You can always read the data in with 'scan'
>>> to separate each line and then do whatever processing is required to
>>> put together the rows in a data frame that you want.
>>> In one of your examples, you indicated that the line was split
>>> starting at the word "kempten"; if this is in the middle of the line,
>>> then you would have to create the break after reading the line in
>>> with 'scan' and then creating the rows in the data frame. All of this
>>> can be done in R if you can state what the criteria are.
>>>
>>> On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
>>> <martin.to...@geo.uzh.ch> wrote:
>>>
>>>   Jim,
>>>   the two lines I put in are the actual problematic input lines.
>>>   In these examples there are no quotes nor # signs, although I have
>>>   no means to make sure they do not occur in the inputs (any hints
>>>   how I could deal with that?).
>>>   I am trying to avoid as much pre-processing outside R as possible,
>>>   and I have to process about 500 files with up to 3000 records each,
>>>   so I need a more or less automated/batch solution -- so any string
>>>   substitution will have to occur in R. But for the moment, I do not
>>>   see a reason for substitution, and the wrapping still occurs.
>>>
>>>   Cheers
>>>   Martin
>>>
>>>   jim holtman wrote:
>>>
>>>     You need to supply the actual input line so we can see what is
>>>     happening. Are you sure you do not have unbalanced quotes in your
>>>     input (try quote='') or do you have comment characters ("#") in
>>>     your input?
>>>
>>>     On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
>>>     <martin.to...@geo.uzh.ch> wrote:
>>>
>>>       Dear All,
>>>       I am observing strange behaviour, and searching the archives
>>>       and help pages didn't help much.
>>>       I have a csv with a variable number of fields in each line.
>>>
>>>       I use
>>>         dataPoints <- read.csv(inputFile, head=FALSE, sep=";", fill=TRUE);
>>>       to read it in, and it works. But some lines are long and 'wrap',
>>>       or split and continue on the next line. So when I check the dim
>>>       of the frame, it is not correct, and I can see from a printout
>>>       that the line is split into two in the frame. I checked the
>>>       input file and all is good.
>>>
>>>       An example of the input is:
>>>       37;2175168475;13;8.522729;47.19537;16366...@n00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;
>>>
>>>       where the last value ends up on the next line in the data frame.
>>>
>>>       It does not have to be the last value; in the following example,
>>>       the word "kempten" starts the next line:
>>>       39;167757703;12;10.309295;47.724545;21903...@n00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;
>>>
>>>       What could be the reason?
>>>
>>>       I was thinking about solving the issue by using a different
>>>       separator for the first 7 fields and concatenating all of the
>>>       remaining values into a single string value, but could not
>>>       figure out how to do such a substitution in R. Unfortunately,
>>>       on my system I cannot specify a range for sed...
>>>
>>>       Thanks for any help/pointers
>>>       Martin
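Martin's "fixed columns plus one long tail" idea can also be done after reading
the raw lines; a rough sketch, in which the file path, the choice of seven
leading fields, and the column name "tags" are all illustrative:

  ## A sketch only: the path and the name "tags" are illustrative
  file  <- "myfile.csv"
  lines <- readLines(file)                      # one string per record
  parts <- strsplit(lines, ";", fixed = TRUE)   # split each record on ";"

  ## keep the first 7 fields as ordinary columns ...
  head7 <- t(sapply(parts, function(x) x[1:7]))

  ## ... and paste everything after field 7 back into a single string
  tail1 <- sapply(parts, function(x) paste(x[-(1:7)], collapse = ";"))

  dataPoints <- data.frame(head7, tags = tail1, stringsAsFactors = FALSE)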
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>
>> --
>> Martin Tomko
>> Postdoctoral Research Assistant
>> Geographic Information Systems Division
>> Department of Geography
>> University of Zurich - Irchel
>> Winterthurerstr. 190
>> CH-8057 Zurich, Switzerland
>>
>> email: martin.to...@geo.uzh.ch
>> site: http://www.geo.uzh.ch/~mtomko
>> mob: +41-788 629 558
>> tel: +41-44-6355256
>> fax: +41-44-6356848
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
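For completeness, the readLines()/gregexpr() work-around described at the top
of this message can be checked against count.fields() as follows; the file
path is illustrative, and it assumes every line contains at least one ";":

  ## A sketch only: "myfile.csv" is an illustrative path
  file  <- "myfile.csv"
  Data0 <- readLines(file)            # one string per line of the file

  ## count the ";" separators on each line; note gregexpr() returns -1
  ## (still of length 1) when no ";" is present, hence the assumption
  ## that every line has at least one separator
  NF <- unlist(lapply(Data0, function(x)
          length(unlist(gregexpr(";", x, fixed = TRUE, useBytes = TRUE)))))

  ## the number of fields is one more than the number of separators,
  ## so NF + 1 should agree with count.fields() on the same file
  all((NF + 1) == count.fields(file, sep = ";"))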