On May 18, 2012, at 20:19 , Patrick Callier wrote:

> Hi all,
> 
> I am running 64-bit R 2.15.0 on windows 7.  I am trying to use read.delim
> to read from a file that has 2-byte unicode (CJK) characters.
> 
> Here is an example of the data (it is tab-delimited if that gets messed up):
> HITId HITTypeId Title
> 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z 看看句子,写写想法
> 请看以下的句子,再回答问
> 
> So read.delim (code below) doesn't read in correctly.  It reads up until it
> hits the CJK characters and then terminates with a warning:
> Warning messages:
> 1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
>  invalid input found on input connection 'minimal.txt'
> 2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
>  incomplete final line found by readTableHeader on 'minimal.txt'
> 
> The "Title" field gets filled with an NA.  I played around with scan() a
> little bit and it can read the file correctly if I send it an open file
> with the correct encoding given for the "encoding" parameter. It barfs with
> the same warnings if I just send it the filename and set the fileEncoding
> parameter.
> 
> Here is some test code with the above text in a file "minimal.txt"
> # works
> scan(file("minimal.txt",encoding="UTF-16LE"),what=character(),nlines=2)
> # don't work
> scan("minimal.txt",what=character(),nlines=2)     # output is in wrong encoding
> scan("minimal.txt",what=character(),nlines=2,fileEncoding="UTF-16LE")
> #"invalid input found on input connection"
> read.delim(file("minimal.txt",encoding="UTF-16LE"), sep = "\t", header=TRUE)    # ditto
> 
> Is this a bug? Or am I just doing something wrong?  Thanks for any help you
> can provide.

This stuff is highly locale-dependent (and locales are OS-dependent). As I
understand things, the encoding= argument to scan() or read.table() says that
the file is in a foreign encoding and that you want to keep the strings in that
encoding, whereas fileEncoding= means that you want to convert to your current
encoding and then work with the converted data. In the first case, you need to
get the encoding right; in the second, you additionally need to be in a locale
that allows the conversion.
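
As a rough illustration of the difference, here is a minimal sketch. The
temporary file and its contents are made up for the example (a UTF-8 stand-in,
not the original minimal.txt), and what the fileEncoding= call returns depends
on the locale you run it in.

```r
# Hypothetical two-column, tab-delimited test file, written as raw UTF-8 bytes
# so that the result does not depend on the locale used for writing.
title <- "\u770b\u770b\u53e5\u5b50"
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw(enc2utf8(paste0("HITId\tTitle\nabc\t", title, "\n"))), tmp)

# encoding=: no conversion; the bytes are kept and the strings are marked.
a <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(a$Title)

# fileEncoding=: the input is converted to the current locale while reading.
# In a locale that cannot represent the characters, expect warnings and NA.
b <- read.delim(tmp, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
b$Title
```

With encoding= the bytes survive in any locale, even if they cannot be
displayed nicely; with fileEncoding= they only survive if the locale can
represent them.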

For file(), requesting an encoding means asking for conversion, so if that
doesn't work, you are out of luck (and you're just confusing the issue anyway).
Here are a couple of examples in a Latin-1 locale; notice that if you can't
convert Chinese characters to your current locale, then the <U+1234> style
output is the best you can hope for.

Peter-Dalgaards-MacBook-Air:minimal pd$ LC_ALL="da_DK.ISO8859-1" R --vanilla < minimal2.R

R version 2.14.2 (2012-02-29)
....
> read.delim(file("minimal.txt",encoding="UTF-8"), sep = "\t", header=TRUE,encoding="UTF-8")
                           HITId                      HITTypeId Title Question
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z    NA       NA
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'minimal.txt'
> read.delim(file="minimal.txt", encoding="UTF-8")
                           HITId                      HITTypeId
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
                                                                     Title
1 <U+770B><U+770B><U+53E5><U+5B50><U+FF0C><U+5199><U+5199><U+60F3><U+6CD5>
                                                                                          Question
1 <U+8BF7><U+770B><U+4EE5><U+4E0B><U+7684><U+53E5><U+5B50><U+FF0C><U+518D><U+56DE><U+7B54><U+95EE>
> read.delim(file="minimal.txt")
                           HITId                      HITTypeId
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
                                                                                                         Title
1 ?\234\213?\234\213?\217??\220?\214?\206\231?\206\231?\203??\225
                                                                                                                                          Question
1 请?\234\213以?\213?\232\204?\217??\220?\214?\206\215?\233\236?\224?\227?
> read.delim(file="minimal.txt", fileEncoding="UTF-8")
                           HITId                      HITTypeId Title Question
1 2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z    NA       NA
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'minimal.txt'
> 
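
To check up front whether a conversion like the ones above can succeed,
something along these lines may help. It is only a sketch: the UTF-16LE file is
a hypothetical stand-in for the file in the question, and the outcome depends
on the locale it is run in.

```r
# What the current locale can represent.
l10n_info()[c("MBCS", "UTF-8", "Latin-1")]

title <- "\u770b\u770b\u53e5\u5b50"

# iconv() to "" targets the current locale; NA means "not convertible".
representable <- !is.na(iconv(title, from = "UTF-8", to = ""))
representable

# Hypothetical UTF-16LE test file, written as raw bytes.
tmp <- tempfile(fileext = ".txt")
bytes <- iconv(paste0(title, "\n"), from = "UTF-8", to = "UTF-16LE",
               toRaw = TRUE)[[1]]
writeBin(bytes, tmp)

# file(encoding=) converts to the current locale while reading, so this
# only gives the original text back if the locale can represent it.
con <- file(tmp, encoding = "UTF-16LE")
lines <- readLines(con, warn = FALSE)
close(con)
lines
```

If representable comes out FALSE, neither file(encoding=) nor fileEncoding=
can be expected to deliver the text intact in that locale.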

 


> 
> --Pat
> 
> -- 
> Patrick Callier
> Georgetown University
> http://www12.georgetown.edu/students/prc23/
> 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd....@cbs.dk  Priv: pda...@gmail.com
