Re: [R] strange behavior when reading csv - line wraps

jim holtman Sun, 31 May 2009 08:06:41 -0700

You can do something like this: count the number of fields in each line of
the file and use the max to determine the number of columns for read.table:


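file <- '/tempxx.txt'
# count the fields on every line; sep must match the file (';' in your data)
maxFields <- max(count.fields(file, sep=';'))
# now set up read.table for the maximum number of columns
input <- read.table(file, sep=';', colClasses=rep(NA, maxFields), fill=TRUE,
    col.names=paste("V", seq(maxFields), sep=''))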
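
To see why the wrapping happens and what the max-field count changes, here is a small self-contained sketch. The sample lines and the names tmp, wrapped and fixed are made up for illustration; they are not from Martin's files. It relies on the documented behaviour that read.table looks only at the first five lines to decide the number of columns unless col.names is longer.

```r
# Hypothetical sample: five short lines followed by one long line.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1;a;b", "2;a", "3;a", "4;a", "5;a",
             "6;a;b;c;d;e;f"), tmp)

# Without col.names the column count comes from the first five lines,
# so the fields of the long sixth line spill over onto extra rows.
wrapped <- read.table(tmp, sep = ";", fill = TRUE, header = FALSE)

# Counting the fields of every line first and passing that many
# column names keeps each input line on a single row.
maxFields <- max(count.fields(tmp, sep = ";"))
fixed <- read.table(tmp, sep = ";", fill = TRUE, header = FALSE,
                    col.names = paste("V", seq_len(maxFields), sep = ""))

print(dim(wrapped))
print(dim(fixed))
```

Comparing the two dim() results should show more rows than input lines in the first case and one row per input line in the second.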


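The alternative raised in the quoted message below (keep a fixed number of leading columns and merge everything after them into one long string) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name readFixedPlusTail, the column name "tags" and the sample lines are hypothetical, and the split is a plain literal split on ';' that ignores quoting and comment characters.

```r
# Hypothetical helper: keep the first nFixed fields of each line and
# paste all remaining fields back together into one last column.
readFixedPlusTail <- function(path, nFixed = 7, sep = ";") {
    lines <- readLines(path)
    lines <- lines[nzchar(lines)]                # drop empty lines
    parts <- strsplit(lines, sep, fixed = TRUE)  # literal split, no quote handling
    rows <- lapply(parts, function(p) {
        first <- p[seq_len(nFixed)]              # padded with NA if the line is short
        rest <- if (length(p) > nFixed) p[-seq_len(nFixed)] else character(0)
        c(first, paste(rest, collapse = sep))
    })
    out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
    names(out) <- c(paste("V", seq_len(nFixed), sep = ""), "tags")
    # everything was read as character; convert the fixed columns where possible
    out[seq_len(nFixed)] <- lapply(out[seq_len(nFixed)], type.convert, as.is = TRUE)
    out
}

# Usage on made-up lines shaped like the examples in the thread
tmp2 <- tempfile(fileext = ".txt")
writeLines(c("1;10;13;8.5;47.1;u1;30;sculpture;bird;tourism;",
             "2;11;12;10.3;47.7;u2;36;white;"), tmp2)
res <- readFixedPlusTail(tmp2)
str(res)
```

Because the lines are split by hand, stray quotes or '#' characters in the tags cannot change the row structure, which may matter when the inputs cannot be checked in advance.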
On Sun, May 31, 2009 at 6:06 AM, Martin Tomko <martin.to...@geo.uzh.ch> wrote:

> Dear Jim,
> with the help of Ted, we diagnosed that the cause is the extreme
> variability in line length in the input. As the number of table columns is
> apparently determined from the first five lines, whatever exceeds this length
> is automatically put on the next line.
> I am now trying to find a way to read in the data despite this. I have no
> control over the table extent, the only thing that would make sense
> according to my data would be to read in a fixed number of columns and merge
> all remaining columns as a long string in the last one. No idea how to do
> this, though.
>
> Thanks
> Martin
>
>
> jim holtman wrote:
>
>> It is still not clear to me exactly how you want to read the lines in.  If
>> the lines have a variable number of fields, and some of the lines might be
>> wrapped, is there some way to determine where the start of each line is?
>>  If you are reading them in with read.csv, then the system is assuming
>> that each line starts a new row.  If this is not the case, then you will
>> have to state the rules that determine where the lines start.  You can
>> always read the data in with 'scan' to separate each line and then do
>> whatever processing is required to put together the rows in a data frame
>> that you want.
>>  In one of your examples, you indicated that the line was split starting
>> at the word "kempten"; if this is in the middle of the line, then you would
>> have to create the break after reading the line in with 'scan' and then
>> create the rows in the dataframe.  All of this can be done in R if you can
>> state what the criteria are.
>> On Sat, May 30, 2009 at 4:32 AM, Martin Tomko <martin.to...@geo.uzh.ch> wrote:
>>
>>    Jim,
>>    the two lines I put in are the actual problematic input lines.
>>    In these examples, there are no quotes nor # signs, although I
>>    have no means to make sure they do not occur in the inputs (any
>>    hints how I could deal with that?).
>>    I am trying to avoid as much pre-processing outside R as possible,
>>    and I have to process about 500 files with up to 3000 records
>>    each, so I need a more or less automated/batch solution. - so any
>>    string substitution will have to occur in R. But for the moment, I
>>    do not see a reason for substitution, and the wrapping still occurs.
>>
>>    Cheers
>>    Martin
>>
>>
>>
>>    jim holtman wrote:
>>
>>        You need to supply the actual input line so we can see what is
>>        happening.  Are you sure you do not have unbalanced quotes in
>>        your input (try quote='') or do you have comment characters
>>        ("#") in your input?
>>
>>        On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
>>        <martin.to...@geo.uzh.ch> wrote:
>>
>>           Dear All,
>>           I am observing a strange behavior, and searching the archives
>>           and help pages didn't help much.
>>           I have a csv with a variable number of fields in each line.
>>
>>           I use
>>           dataPoints <- read.csv(inputFile, head=FALSE, sep=";", fill=TRUE);
>>
>>           to read it in, and it works. But some lines are long and 'wrap',
>>           or split and continue on the next line. So when I check the dim
>>           of the frame, it is not correct, and I can see when I do a
>>           printout that the line is split into two in the frame. I checked
>>           the input file and all is good.
>>
>>           An example of the input is:
>>                 37;2175168475;13;8.522729;47.19537;16366...@n00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;
>>
>>           where the last value occurs on the next line in the data frame.
>>
>>           It does not have to be the last value, as in the following
>>           example, where the word "kempten" starts the next line:
>>                 39;167757703;12;10.309295;47.724545;21903...@n00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;
>>
>>           What could be the reason?
>>
>>           I was thinking about solving the issue by using a different
>>           separator for the first 7 fields and concatenating all of the
>>           remaining values into a single string value, but could not
>>           figure out how to do such a substitution in R. Unfortunately,
>>           on my system I cannot specify a range for sed...
>>
>>           Thanks for any help/pointers
>>           Martin
>>
>>           ______________________________________________
>>           R-help@r-project.org mailing list
>>           https://stat.ethz.ch/mailman/listinfo/r-help
>>           PLEASE do read the posting guide
>>           http://www.R-project.org/posting-guide.html
>>           and provide commented, minimal, self-contained,
>>           reproducible code.
>>
>>
>>
>>
>>        --
>>        Jim Holtman
>>        Cincinnati, OH
>>        +1 513 646 9390
>>
>>        What is the problem that you are trying to solve?
>>
>>
>>
>>
>>
>>
>
>
> --
> Martin Tomko
> Postdoctoral Research Assistant   Geographic Information Systems Division
> Department of Geography
> University of Zurich - Irchel
> Winterthurerstr. 190
> CH-8057 Zurich, Switzerland
>
> email:  martin.to...@geo.uzh.ch
> site:   http://www.geo.uzh.ch/~mtomko
> mob:    +41-788 629 558
> tel:    +41-44-6355256
> fax:    +41-44-6356848
>
>


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

