This looks like a bug. Specifically, inside read.table

    lines <- .External(C_readtablehead, file, nlines, comment.char, 
        blank.lines.skip, quote, sep, skipNul)

returns "lines" as

[1] "ID\tValue"                         "=\"Total\"\t1000"                 
[3] "=\"CJ01   \"\t550\n=\"CF02\"\t450"

Notice the embedded \n in the 3rd line. I.e., there are really 4 lines there. 
This gets pushed back twice and the first 3 (not 4) lines get read again as 
part of the header logic. Then when it comes to reading the data proper, the 
4th line has ended up duplicated as the top row...
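
A minimal sketch, run outside read.table itself, of why the 3rd element above should have been two lines: splitting the same input on "\n" gives four physical lines, and each one holds an even number of double quotes, so no quoted field ought to run across a line break. The names txt, physical and quotes_per_line are illustrative only and do not come from the read.table source.

```r
# The input from the report, with trailing spaces inside the quoted field.
txt <- "ID\tValue\n=\"Total\"\t1000\n=\"CJ01   \"\t550\n=\"CF02\"\t450"

# Split on the newline character to get the physical lines of the input.
physical <- strsplit(txt, "\n", fixed = TRUE)[[1]]
length(physical)

# Count the double quotes on each physical line; an even count on every
# line means no quoted field is left open at a line break.
quotes_per_line <- lengths(regmatches(physical,
                                      gregexpr("\"", physical, fixed = TRUE)))
quotes_per_line
```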

As you suggest, it seems that something is up with the quote matching logic.
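
Since the problem seems tied to quote handling, the workaround from the original report (quote="" and cleaning the strings afterwards) can be sketched roughly as below. This is only one way to do it: unwrap_excel_string() is a hypothetical helper, not an existing function, and the pattern assumes every affected value has exactly the form ="content".

```r
txt <- "ID\tValue\n=\"Total\"\t1000\n=\"CJ01   \"\t550\n=\"CF02\"\t450"

# quote = "" turns off quote processing, so the double quotes are read as
# ordinary characters and no quote matching across lines is attempted.
df <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                 stringsAsFactors = FALSE)

# Hypothetical helper: drop a leading =" and a trailing " from each value,
# keeping whatever sits in between (including any trailing spaces).
unwrap_excel_string <- function(x) sub('^="(.*)"$', "\\1", x)

df$ID <- unwrap_excel_string(df$ID)
df
```

If the trailing spaces are not wanted either, trimws(df$ID) would remove them as a further step.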

-pd


> On 4 Feb 2018, at 23:45 , Michael <michael77al...@gmail.com> wrote:
> 
> I’ve been struggling with seemingly ‘corrupt’ data.frames for a few days, and 
> believe I’ve narrowed the problem down to some odd behaviour from read.table.
> 
> I receive a tab-delimited file from an external provider where strings are 
> encoded as ="content". Not sure why, perhaps because most users open it in 
> Excel. My specific issue is that trailing spaces in any of the strings cause 
> strange results from read.table.
> 
> # No trailing spaces
> read.table(text="ID\tValue\n=\"Total\"\t1000\n=\"CJ01\"\t550\n=\"CF02\"\t450",header=FALSE,sep='\t')
>      V1    V2
> 1     ID Value
> 2 =Total  1000
> 3  =CJ01   550
> 4  =CF02   450
> 
> # Now with trailing spaces in line 3
> read.table(text="ID\tValue\n=\"Total\"\t1000\n=\"CJ01   \"\t550\n=\"CF02\"\t450",header=FALSE,sep='\t')
>        V1    V2
> 1    =CF02   450
> 2       ID Value
> 3   =Total  1000
> 4 =CJ01      550
> 5    =CF02   450
> 
> I solved my specific problem by setting quote="", and extracting the string 
> content after calling read.table. As my original code had header=TRUE, I was 
> finding random rows were being used as column names! 
> 
> Flagging a potential issue with read.table, although I can easily accept I'm 
> missing something obvious here. 
> 
> Best,
> Michael
> 
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-apple-darwin15.6.0 (64-bit)  / x86_64-pc-linux-gnu (64-bit)
> Running under: macOS High Sierra 10.13.2 /  Ubuntu 16.04.3 LTS
> 
> 
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd....@cbs.dk  Priv: pda...@gmail.com
