First off, thank you very much for taking a look at this. I didn't know "raw=TRUE" would be necessary here.
Unfortunately, I'm stuck with the embedded nulls in the source data at this point. If worst comes to worst, does R have a way to do something like the following?

1. Read the entire file in as raw binary.
2. Replace all embedded nulls with spaces.
3. Output the revised file (as binary) somewhere else.

I imagine it would take a big performance penalty, but at least then I could proceed with importing the revised file.

Thanks again!

On Thu, Feb 5, 2015 at 2:06 PM, John McKown <john.archie.mck...@gmail.com> wrote:

> On Thu, Feb 5, 2015 at 2:08 PM, Brian Trautman <btrautma...@gmail.com>
> wrote:
>
>> I'm trying to read some mainframe data encoded as EBCDIC into R, and am at
>> a loss. I'd like to avoid using an external program to convert the files,
>> since I'm operating in a corporate environment.
>>
>> You can find the example files at the link below, with both ASCII and
>> EBCDIC versions. Note that there are no linebreaks in the EBCDIC versions
>> of the file -- instead, I'd be specifying the width of each line manually.
>> R has the IBM500 encoding available in my environment, which should be the
>> correct one for these files.
>>
>> However, when I run the following commands, R seems to fail entirely. It
>> loads a single record with garbage characters, regardless of the encoding
>> I specified.
>>
>> layout <- read.fwf("EBCDIC_LAYOUT", widths = c(80), fileEncoding='ibm500')
>> data <- read.fwf("EBCDIC_ZIPCODE", widths = c(32), fileEncoding='ibm500')
>>
>> Where might I go from here?
>>
>> Related -- some of the files I expect to use will be fairly large (1 GB or
>> so). Preferably, I'd like a solution that scales reasonably well. (I tried
>> packages like LaF, but they don't have the option to select encoding.)
>>
>> Thank you very much!
>>
>> Example files --
>> https://drive.google.com/open?id=0ByvX1v-WqaaASTdwV2ZYS0pBV00&authuser=0
>
> I gave this a short try.
> What killed me (see below) is that your file EBCDIC_ZIPCODE has embedded
> NULL characters, \0. My transcript:
>
> > file<-file("EBCDIC_ZIPCODE",encoding="IBM500", raw=TRUE);
> > data=read.fwf(file,widths=c(32));
> Warning messages:
> 1: In readLines(file, n = thisblock) :
>   line 1 appears to contain an embedded nul
> 2: In readLines(file, n = thisblock) :
>   incomplete final line found on 'EBCDIC_ZIPCODE'
> > View(data)
>
> I don't know how to get past the embedded NULL. I'm a UNIX user, so my
> thought (not applicable with your restriction of "pure R"), would be to use
> "tr" to convert the \0 to spaces, then use the above.
>
> --
> He's about as useful as a wax frying pan.
>
> 10 to the 12th power microphones = 1 Megaphone
>
> Maranatha! <><
> John McKown
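
For reference, the three steps asked about at the top of this message (read as raw binary, replace the embedded nulls, write the result elsewhere) could be sketched in base R roughly as below. This is a minimal sketch, not something taken from the thread: the function name replace_nuls, its arguments, and the chunk size are all made up for illustration. It also assumes the input is still EBCDIC when the replacement happens, in which case a space is byte 0x40 rather than the ASCII 0x20. Reading in chunks keeps memory use bounded, which may matter for the 1 GB files mentioned in the original question.

```r
# Hypothetical helper (name and arguments are illustrative, not from the thread):
# copy a binary file, replacing every NUL byte (0x00) with a replacement byte.
replace_nuls <- function(infile, outfile,
                         replacement = as.raw(0x40),  # assumption: EBCDIC space
                         chunk_size = 1024L * 1024L) {
  con_in <- file(infile, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(outfile, open = "wb")
  on.exit(close(con_out), add = TRUE)

  repeat {
    # Read up to chunk_size bytes; a zero-length result means end of file.
    bytes <- readBin(con_in, what = "raw", n = chunk_size)
    if (length(bytes) == 0L) break
    bytes[bytes == as.raw(0)] <- replacement
    writeBin(bytes, con_out)
  }
  invisible(outfile)
}

# Small self-check on a temporary file: two EBCDIC letters, each followed by a NUL.
src <- tempfile()
dst <- tempfile()
writeBin(as.raw(c(0xC1, 0x00, 0xC2, 0x00)), src)
replace_nuls(src, dst)
result <- readBin(dst, what = "raw", n = 10)
```

On the real data this would be called along the lines of replace_nuls("EBCDIC_ZIPCODE", "EBCDIC_ZIPCODE_clean"), where the output file name is only a placeholder, after which the import could be retried against the cleaned copy.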