Re: [R] strange behavior when reading csv - line wraps

Martin Tomko Sun, 31 May 2009 13:50:28 -0700

Big thanks to Ted and Jim for all the help.
Martin

(Ted Harding) wrote:

Ah!!! It was count.fields() which we had overlooked! We discovered
a workaround which involved using

  Data0 <- readLines(file)


to create a vector of strings, one for each line of the input file,
and then using

  # caveat: gregexpr() returns -1 for a line without ";", which still counts as 1
  NF <- unlist(lapply(Data0,function(x)
        length(unlist(gregexpr(";",x,fixed=TRUE,useBytes=TRUE)))))

to count the number of occurrences of ";" (the separator) in each line.

(NF+1) produces the same result as count.fields(file,sep=";").

Thanks for pointing out the existence of count.fields()!
Ted.
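
The comparison Ted describes can be checked on a small file. This is only a sketch: the temporary file and its three lines are made up for illustration, and the separator count uses sum(m > 0) rather than length() so that a line with no ";" counts as zero separators.

```r
# Made-up sample data: three lines with 3, 5 and 2 fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b;c", "1;2;3;4;5", "x;y"), tmp)

lines <- readLines(tmp)

# Count separators per line. gregexpr() returns -1 when there is no match,
# so only positive match positions are counted.
nSep <- vapply(gregexpr(";", lines, fixed = TRUE),
               function(m) sum(m > 0), integer(1))

# Field counts as reported by count.fields().
nFields <- count.fields(tmp, sep = ";")

print(nSep + 1L)
print(nFields)
```

On data without quotes or comment characters the two vectors should agree; count.fields() additionally interprets quotes and "#" unless told otherwise.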

On 31-May-09 15:04:23, jim holtman wrote:

You can do something like this: count the number of fields in each line
of the file and use the max to determine the number of columns for
read.table:

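file <- '/tempxx.txt'
# the data in this thread is separated by ';', so pass sep to both calls
maxFields <- max(count.fields(file, sep=';'))  # widest line in the file
# now set up read.table for the max number of columns
input <- read.table(file, sep=';', colClasses=rep(NA, maxFields), fill=TRUE,
    col.names=paste("V", seq(maxFields), sep=''))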
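
To see the effect in isolation, here is a small reproduction under assumed data: a made-up six-line file whose first five lines have two fields and whose sixth has five. It first reads the file without col.names, then with col.names sized from count.fields().

```r
# Made-up data: the first five lines are narrow, the sixth is wide.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;a", "2;b", "3;c", "4;d", "5;e", "6;f;g;h;i"), tmp)

# The column count is inferred from the first five lines (2 here),
# so the extra fields of line 6 spill onto additional rows.
wrapped <- read.table(tmp, sep = ";", fill = TRUE, header = FALSE)
print(dim(wrapped))

# Supplying col.names sized to the widest line keeps one row per input line.
maxFields <- max(count.fields(tmp, sep = ";"))
fixed <- read.table(tmp, sep = ";", fill = TRUE, header = FALSE,
                    col.names = paste0("V", seq_len(maxFields)))
print(dim(fixed))
```

The second call should return six rows and five columns, with the short rows padded.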


On Sun, May 31, 2009 at 6:06 AM, Martin Tomko
<martin.to...@geo.uzh.ch> wrote:

Dear Jim,
with the help of Ted, we diagnosed that the cause is the extreme
variability in line length in the input. As the number of table columns
is apparently determined from the first five lines, whatever exceeds
this width is automatically wrapped onto the next row.
I am now trying to find a way to read in the data despite this. I have
no control over the table extent; the only thing that would make sense
for my data would be to read in a fixed number of columns and merge
all remaining columns into one long string in the last one. No idea how
to do this, though.

Thanks
Martin
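
One possible way to do what Martin asks (a fixed number of leading columns, with everything else merged into one string) is sketched below. The function name splitFixed, the column name "tags", and the sample lines (with "owner" standing in for the obfuscated ID field) are all hypothetical; only the separator ";" and the figure of 7 leading fields come from the thread.

```r
# Split each raw line into the first nFixed fields plus one merged remainder.
splitFixed <- function(lines, nFixed = 7, sep = ";") {
  parts <- strsplit(lines, sep, fixed = TRUE)
  rows <- lapply(parts, function(p) {
    lead <- p[seq_len(nFixed)]  # padded with NA if the line is short
    rest <- if (length(p) > nFixed) {
      paste(p[-seq_len(nFixed)], collapse = sep)
    } else {
      ""
    }
    c(lead, rest)
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c(paste0("V", seq_len(nFixed)), "tags")
  out
}

# Made-up lines shaped like the ones in the thread.
lines <- c("37;2175168475;13;8.522729;47.19537;owner;30;sculpture;bird;tourism;",
           "39;167757703;12;10.309295;47.724545;owner;36;white;")
dp <- splitFixed(lines)
print(dp$tags)
```

In practice the lines would come from readLines(inputFile). All columns are character at this point; the numeric ones can be converted afterwards, for example with type.convert().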


jim holtman wrote:

It is still not clear to me exactly how you want to read the lines in.
If the lines have a variable number of fields, and some of the lines
might be wrapped, is there some way to determine where the start of
each line is?  If you are reading them in with read.csv, then the
system is assuming that each line starts a new row.  If this is not the
case, then you will have to state the rules that determine where the
lines start.  You can always read the data in with 'scan' to separate
each line and then do whatever processing is required to put together
the rows in the data frame that you want.

In one of your examples, you indicated that the line was split starting
at the word "kempten"; if this is in the middle of the line, then you
would have to create the break after reading the line in with 'scan'
and then create the rows in the data frame.  All of this can be done in
R if you can state what the criteria are.
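
A rough sketch of the 'scan' route described above, on made-up data: each line is read as one string, split on ";", and padded with NA to the width of the longest record before the rows are bound together. The file contents are assumptions for illustration only.

```r
# Made-up data with 3, 2 and 5 fields per line.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;a;b", "2;c", "3;d;e;f;g"), tmp)

# One string per input line; quote = "" keeps quote characters literal.
raw <- scan(tmp, what = "", sep = "\n", quote = "", quiet = TRUE)

fields <- strsplit(raw, ";", fixed = TRUE)
width <- max(lengths(fields))

# Pad every record with NA up to the widest one, then bind into rows.
padded <- lapply(fields, function(f) c(f, rep(NA_character_, width - length(f))))
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(dim(df))
```

Any rule for re-joining or re-splitting records would be applied to `raw` or `fields` before the padding step.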

On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
<martin.to...@geo.uzh.ch> wrote:

   Jim,
   the two lines I put in are the actual problematic input lines.
   In these examples, there are no quotes nor # signs, although I
   have no means to make sure they do not occur in the inputs (any
   hints how I could deal with that?).
   I am trying to avoid as much pre-processing outside R as possible,
   and I have to process about 500 files with up to 3000 records
   each, so I need a more or less automated/batch solution. - so any
   string substitution will have to occur in R. But for the moment, I
   do not see a reason for substitution, and the wrapping still
   occurs.

   Cheers
   Martin



   jim holtman wrote:

       You need to supply the actual input line so we can see what is
       happening.  Are you sure you do not have unbalanced quotes in
       your input (try quote='') or do you have comment characters
       ("#") in your input?
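
For inputs that may contain quote characters or "#", both can be switched off so that they are read as ordinary text. The sketch below uses a made-up two-line file. Note that read.csv() already defaults to comment.char = "" and quote = "\"", whereas count.fields() defaults to quote = "\"'" and comment.char = "#", so the two can disagree on the same file unless the arguments are set explicitly.

```r
# Made-up lines containing an apostrophe and a '#' inside fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;it's;a#b;x", "2;plain;c;d"), tmp)

# Treat quotes and '#' as ordinary characters when counting fields ...
nf <- count.fields(tmp, sep = ";", quote = "", comment.char = "")
print(nf)

# ... and when reading the data.
dat <- read.table(tmp, sep = ";", quote = "", comment.char = "",
                  header = FALSE, fill = TRUE, stringsAsFactors = FALSE)
print(dat)
```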

       On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
       <martin.to...@geo.uzh.ch> wrote:

          Dear All,
          I am observing a strange behavior, and searching the archives
          and help pages didn't help much.
          I have a csv with a variable number of fields in each line.

          I use
          dataPoints <- read.csv(inputFile, header=FALSE, sep=";", fill=TRUE)

          to read it in, and it works. But some lines are long and 'wrap',
          or split and continue on the next line. So when I check the dim
          of the data frame, the dimensions are not correct, and I can see
          in a printout that the line is split into two in the frame. I
          checked the input file and all is good.

          an example of the input is:
                37;2175168475;13;8.522729;47.19537;16366...@n00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;

          where the last value occurs on the next line in the data frame.

          It does not have to be the last value, as in the following
          example, where the word "kempten" starts the next line:
                39;167757703;12;10.309295;47.724545;21903...@n00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;

          What could be the reason?

          I was thinking about solving the issue by using a different
          separator for the first 7 fields and concatenating all of the
          remaining values into a single string value, but could not
          figure out how to do such a substitution in R. Unfortunately,
          on my system I cannot specify a range for sed...

          Thanks for any help/pointers
          Martin

          ______________________________________________
          R-help@r-project.org mailing list
          https://stat.ethz.ch/mailman/listinfo/r-help
          PLEASE do read the posting guide
          http://www.R-project.org/posting-guide.html
          and provide commented, minimal, self-contained, reproducible code.




       --
       Jim Holtman
       Cincinnati, OH
       +1 513 646 9390

       What is the problem that you are trying to solve?






--
Martin Tomko
Postdoctoral Research Assistant
Geographic Information Systems Division
Department of Geography
University of Zurich - Irchel
Winterthurerstr. 190
CH-8057 Zurich, Switzerland

email:  martin.to...@geo.uzh.ch
site:   http://www.geo.uzh.ch/~mtomko
mob:    +41-788 629 558
tel:    +41-44-6355256
fax:    +41-44-6356848



--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 31-May-09                                       Time: 16:24:27
------------------------------ XFMail ------------------------------



