Hi all, I am running 64-bit R 2.15.0 on Windows 7. I am trying to use read.delim to read from a file that has 2-byte Unicode (CJK) characters.
Here is an example of the data (it is tab-delimited, if that gets messed up):

HITId	HITTypeId	Title
2Q69Z6KW4ZMAGKKFRT6Q4ONO6MJF68	2LVJ1LY58B72OP36GNBHH16YF7RS7Z	ççå¥åï¼ååæ³æ³ è¯·ç以ä¸çå¥åï¼ååçé®

read.delim (code below) doesn't read this in correctly. It reads up until it hits the CJK characters and then terminates with warnings:

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  invalid input found on input connection 'minimal.txt'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on 'minimal.txt'

The "Title" field gets filled with an NA.

I played around with scan() a little bit, and it can read the file correctly if I pass it a file() connection created with the correct "encoding" argument. It barfs with the same warnings if I just pass it the filename and set the fileEncoding parameter.

Here is some test code, with the above text in a file "minimal.txt":

# works
scan(file("minimal.txt", encoding="UTF-16LE"), what=character(), nlines=2)

# these don't work
scan("minimal.txt", what=character(), nlines=2)  # output is in wrong encoding
scan("minimal.txt", what=character(), nlines=2, fileEncoding="UTF-16LE")  # "invalid input found on input connection"
read.delim(file("minimal.txt", encoding="UTF-16LE"), sep = "\t", header=TRUE)  # ditto

Is this a bug? Or am I just doing something wrong?

Thanks for any help you can provide.

--Pat

--
Patrick Callier
Georgetown University
http://www12.georgetown.edu/students/prc23/
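
P.S. For anyone who wants to reproduce this without my file, here is a minimal sketch of one way to sidestep the connection-level re-encoding entirely: read the raw bytes, convert them to UTF-8 with iconv(), and parse the result with read.delim(text = ...). This is not the approach from the test code above, just an alternative under stated assumptions: the file really is UTF-16LE with no byte-order mark, and the R version in use has a read.delim that accepts the text argument. The file name, column names, and contents below are made up for illustration.

```r
# Build a small tab-delimited UTF-16LE test file (hypothetical content).
txt <- "HITId\tTitle\nA1\t\u4e2d\u6587\n"
raw_out <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
path <- tempfile(fileext = ".txt")
writeBin(raw_out, path)

# Read the file as raw bytes, so no re-encoding happens on the connection.
raw_in <- readBin(path, what = "raw", n = file.info(path)$size)

# Convert the bytes from UTF-16LE to a single UTF-8 string.
utf8 <- iconv(list(raw_in), from = "UTF-16LE", to = "UTF-8")

# Parse the UTF-8 text as a tab-delimited table.
dat <- read.delim(text = utf8, header = TRUE, stringsAsFactors = FALSE)

unlink(path)
```

If the file starts with a byte-order mark, the first column name would need cleaning up, or the conversion could be tried with from = "UTF-16" instead; I have not checked which applies to my real data.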
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.