On Windows with locale "English_United States.1252" my R-2.15.1 could
not get that far:

> socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",
+               open="r",encoding="utf-8");
> read.table(socket, quote="", sep="|")
  V1
1  ?
Warning messages:
1: In read.table(socket, quote = "", sep = "|") :
  invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
2: In read.table(socket, quote = "", sep = "|") :
  incomplete final line found by readTableHeader on 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
> str(.Last.value)
'data.frame':   1 obs. of  1 variable:
 $ V1: Factor w/ 1 level "?": 1

An initial readChar was the only way I could get it to work there.

Since Windows software seems to put a BOM at the top of a file to
indicate that it is using UTF-<something>, it would be nice if the
connection code at least had an option to deal with it.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: peter dalgaard [mailto:pda...@gmail.com]
> Sent: Thursday, September 13, 2012 1:43 PM
> To: William Dunlap
> Cc: s...@gnu.org; r-help@r-project.org
> Subject: Re: [R] cannot read iso639 table
>
> Pragmatically, one can zap the BOM from the output with
>
> language.ISO.table[1,1] <- substring(language.ISO.table[1,1],2)
>
> and be gone with it.
>
> It would be nicer to zap the BOM before read.table, though. It does work
> for me with the below (notice that the BOM is a single character if you
> don't use useBytes=).
>
> > get.language.ISO.table
> function () {
>   socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",
>                 open="r",encoding="utf-8");
>   readChar(socket, nchar=1)
>   data <- read.table(socket, as.is = TRUE, sep = "|", header = FALSE,
>                      col.names = c("a3bibliographic","a3terminologic",
>                                    "a2","english","french"), quote="");
>   close(socket);
>   data
> }
>
>
> On Sep 13, 2012, at 22:26 , William Dunlap wrote:
>
> > It would be helpful if you showed your commands and printed
> > outputs, copied directly from your R session, from the beginning
> > to the end. I put the call to sessionInfo() in my message because
> > it is probably relevant. It is nice to completely include the original
> > email when responding to it so others can see the whole story in
> > one place.
> >
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> >
> >
> >> -----Original Message-----
> >> From: Sam Steingold [mailto:sam.steing...@gmail.com] On Behalf Of Sam Steingold
> >> Sent: Thursday, September 13, 2012 1:18 PM
> >> To: William Dunlap
> >> Cc: peter dalgaard; r-help@r-project.org
> >> Subject: Re: [R] cannot read iso639 table
> >>
> >>> * William Dunlap <jqha...@gvopb.pbz> [2012-09-13 19:50:21 +0000]:
> >>>
> >>> On Windows with R-2.15.1 in a 1252 locale, I had to read (and toss) out
> >>> the initial 3 bytes (the byte-order mark?) to make things work:
> >>>
> >>>> socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",open="r",encoding="utf-8")
> >>>> readChar(socket, nchars=3, useBytes=TRUE)
> >>> [1] ""
> >>
> >> confirmed - first 3 bytes are "\357\273\277"
> >>
> >>>> d <- read.table(socket, quote="", sep="|", stringsAsFactors=FALSE)
> >>>> dim(d)
> >>> [1] 485   5
> >>>> head(d)
> >>>    V1 V2 V3             V4      V5
> >>> 1 aar    aa           Afar    afar
> >>> 2 abk    ab      Abkhazian abkhaze
> >>> 3 ace             Achinese    aceh
> >>> 4 ach                Acoli   acoli
> >>> 5 ada              Adangme adangme
> >>> 6 ady       Adyghe; Adygei  adyghé
> >>
> >> alas, this is all I get:
> >>
> >> Warning message:
> >> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
> >>   invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
> >>
> >>   a3bibliographic a3terminologic a2        english  french
> >> 1             aar             NA aa           Afar    afar
> >> 2             abk             NA ab      Abkhazian abkhaze
> >> 3             ace             NA          Achinese    aceh
> >> 4             ach             NA             Acoli   acoli
> >> 5             ada             NA           Adangme adangme
> >> 6             ady             NA    Adyghe; Adygei   adygh
> >>
> >> note that the first non-ASCII character terminates the input.
> >>
> >> so, I still cannot read the data from the URL.
> >>
> >> I can read the file though - with quote="" (thanks Peter!) -
> >> except that the first record is "\357\273\277aar".
> >>
> >>
> >> --
> >> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
> >> http://www.childpsy.net/ http://thereligionofpeace.com
> >> http://mideasttruth.com http://iris.org.il http://jihadwatch.org
> >> The only thing worse than X Windows: (X Windows) - X
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd....@cbs.dk  Priv: pda...@gmail.com
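
For reference, the two workarounds in this thread (dropping the BOM from the first field afterwards, or consuming it before read.table) can be combined into a small sketch that strips the mark from the decoded lines and only then parses them. This is an illustration under assumptions, not code from any of the posters: strip_bom is a hypothetical helper name, the two sample records are a hand-typed stand-in for the downloaded file, and the sketch assumes the lines have already been decoded as UTF-8, so the BOM appears as the single character U+FEFF rather than as three bytes.

```r
# Hypothetical helper (not from the thread): remove a leading byte-order
# mark from the first element of a character vector of decoded lines.
strip_bom <- function(lines) {
  if (length(lines) > 0) {
    # After UTF-8 decoding the BOM is the single character U+FEFF.
    lines[1] <- sub("^\ufeff", "", lines[1])
  }
  lines
}

# Hand-typed stand-in for the first two records of the ISO 639-2 file;
# the first record carries the BOM, as reported above.
sample_lines <- c("\ufeffaar||aa|Afar|afar",
                  "abk||ab|Abkhazian|abkhaze")

# Parse the cleaned lines with the same options used in the thread.
d <- read.table(text = strip_bom(sample_lines), sep = "|", quote = "",
                as.is = TRUE, header = FALSE,
                col.names = c("a3bibliographic", "a3terminologic",
                              "a2", "english", "french"))
```

With real data, the lines would come from something like readLines(con, encoding = "UTF-8") on the connection instead of sample_lines. As far as I know, R 3.0.0 and later also accept the encoding name "UTF-8-BOM" for connections opened for reading, which drops the mark if present; that option was not available in the R 2.15.1 used in this thread.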