[R] UTF-16 input and read.delim/scan

Patrick Callier Fri, 18 May 2012 11:21:42 -0700

Hi all,

I am running 64-bit R 2.15.0 on Windows 7.  I am trying to use read.delim
to read a file that contains 2-byte Unicode (CJK) characters.


Here is an example of the data (it is tab-delimited, in case that gets messed up):
HITId HITTypeId Title
2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z 
ççå¥åï¼ååæ³æ³
è¯·çä»¥ä¸çå¥åï¼ååçé®

read.delim (code below) does not read the file correctly.  It reads up to
the CJK characters and then stops with these warnings:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'minimal.txt'

The "Title" field gets filled with NA.  I played around with scan() a
little, and it reads the file correctly if I pass it a file() connection
created with the correct "encoding" argument.  It fails with the same
warnings if I pass just the filename and set the fileEncoding argument.

Here is some test code, with the above text in a file "minimal.txt":
# works
scan(file("minimal.txt", encoding = "UTF-16LE"), what = character(), nlines = 2)
# don't work
# output is in the wrong encoding:
scan("minimal.txt", what = character(), nlines = 2)
# "invalid input found on input connection":
scan("minimal.txt", what = character(), nlines = 2, fileEncoding = "UTF-16LE")
# ditto:
read.delim(file("minimal.txt", encoding = "UTF-16LE"), sep = "\t", header = TRUE)
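
For anyone without the original file, the following is a minimal sketch of how a comparable UTF-16LE test file might be built and then decoded by hand. It is not the approach used above: the file name, the IDs, and the short CJK title are all made up for illustration, and the explicit readBin()/iconv() step is an assumption about one way to sidestep the connection re-encoding, not a confirmed fix for the warnings reported here.

```r
# Hypothetical stand-in for minimal.txt (made-up IDs, a short CJK title).
rows <- c("HITId\tHITTypeId\tTitle",
          "id1\ttype1\t\u53e5\u5b50")
txt <- paste0(rows, "\n", collapse = "")

# Encode as UTF-16LE (no BOM) and write the raw bytes unchanged.
path <- file.path(tempdir(), "minimal-utf16le.txt")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# Read the bytes back and decode them explicitly to UTF-8,
# rather than relying on the connection to re-encode.
raw_bytes <- readBin(path, what = "raw", n = file.size(path))
decoded <- iconv(list(raw_bytes), from = "UTF-16LE", to = "UTF-8")

# Parse the decoded text as a tab-delimited table.
dat <- read.delim(text = decoded, sep = "\t", header = TRUE,
                  stringsAsFactors = FALSE, encoding = "UTF-8")
```

Writing and reading in binary mode keeps the UTF-16LE bytes away from any text-mode translation, so the only conversion is the explicit iconv() call.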

Is this a bug? Or am I just doing something wrong?  Thanks for any help you
can provide.

--Pat

-- 
Patrick Callier
Georgetown University
http://www12.georgetown.edu/students/prc23/


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
