Hi all,
I am running 64-bit R 2.15.0 on Windows 7. I am trying to use read.delim
to read from a file that has 2-byte Unicode (CJK) characters.
Here is an example of the data (it is tab-delimited if that gets messed up):
HITId HITTypeId Title
2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68 2LVJ1LY58B72OP36GNBHH16YF7RS7Z
ççå¥åï¼ååæ³æ³
请ç以ä¸çå¥åï¼ååçé®
read.delim (code below) does not read this file correctly. It reads up
until it hits the CJK characters and then stops with these warnings:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'minimal.txt'
The "Title" field gets filled with an NA. I played around with scan() a
little bit, and it can read the file correctly if I pass it a file()
connection created with the correct "encoding" argument. It fails with
the same warnings if I just pass it the filename and set the fileEncoding
parameter.
Here is some test code with the above text in a file "minimal.txt"
# works
scan(file("minimal.txt",encoding="UTF-16LE"),what=character(),nlines=2)
# don't work
scan("minimal.txt",what=character(),nlines=2) # output is in wrong encoding
scan("minimal.txt",what=character(),nlines=2,fileEncoding="UTF-16LE") # "invalid input found on input connection"
read.delim(file("minimal.txt",encoding="UTF-16LE"), sep = "\t", header=TRUE) # ditto
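In case the sample data above gets mangled in transit, a comparable test file can probably be recreated along these lines. This is only a sketch: the CJK title string is a placeholder, not the real data, and it assumes the file is UTF-16LE with no byte-order mark. It writes raw bytes so that the result should not depend on the locale's native encoding.

```r
# Placeholder CJK text (an assumption, not the actual Title field).
title <- "\u4f60\u597d"

# Header row plus one tab-delimited data row, newline-terminated.
txt <- paste0("HITId\tHITTypeId\tTitle\n",
              "2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68\t",
              "2LVJ1LY58B72OP36GNBHH16YF7RS7Z\t",
              title, "\n")

# Convert to UTF-16LE as a raw vector and write the bytes unchanged.
raw_bytes <- iconv(enc2utf8(txt), from = "UTF-8", to = "UTF-16LE",
                   toRaw = TRUE)[[1]]
writeBin(raw_bytes, "minimal.txt")
```
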
Is this a bug? Or am I just doing something wrong? Thanks for any help you
can provide.
--Pat
--
Patrick Callier
Georgetown University
http://www12.georgetown.edu/students/prc23/