On Fri, 26 Jun 2020 15:57:06 -0700 Toby Hocking <tdho...@gmail.com> wrote:
>invalid multibyte string at '<e4>gel-A<6b>iyoshi'

>https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html

The server says that the text is UTF-8:

    curl -sI \
     https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html | \
     grep Content-Type
    # Content-Type: text/html; charset=UTF-8

But it's not, at least not all of it. If you ask readLines to mark the
text as Latin-1, you get Jens Oehlschlägel-Akiyoshi without the mojibake
and invalid multi-byte characters:

    x <- readLines(
     'https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html',
     encoding = 'latin1'
    )[28]
    substr(x, 1, 100)
    # [1] "<I>Jens Oehlschlägel-Akiyoshi"

The behaviour we observe when encoding = 'latin1' is not specified
results from returned lines having "unknown" encoding. The substr()
implementation tries to interpret such strings according to multi-byte
C locale rules (using mbrtowc(3)). On my system (yours too, probably,
if it's GNU/Linux or macOS), the multi-byte C locale encoding is UTF-8,
and this Latin-1 string does not result in valid code points when
decoded as UTF-8.

-- 
Best regards,
Ivan

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
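
A minimal sketch of the mechanism described above, for anyone who wants to
reproduce it without fetching the page. The byte vector is a hypothetical
stand-in for part of the downloaded line (the Latin-1 bytes of
"Oehlschlägel"), and the variable names are made up for illustration. It
only shows that such a string carries no encoding mark and is not valid
UTF-8, plus two ways of dealing with that; it does not call substr() on the
unmarked string, since the result of that depends on the locale.

```r
# Hypothetical input: "Oehlschlägel" encoded as Latin-1 (0xe4 is "ä").
bytes <- as.raw(c(0x4f, 0x65, 0x68, 0x6c, 0x73, 0x63,
                  0x68, 0x6c, 0xe4, 0x67, 0x65, 0x6c))
x <- rawToChar(bytes)

Encoding(x)   # "unknown": no mark, so the native encoding gets assumed
validUTF8(x)  # FALSE: 0xe4 starts a 3-byte UTF-8 sequence, but the
              # following bytes ("g", "e") are not continuation bytes

# Option 1: keep the bytes and only declare what they are.
y <- x
Encoding(y) <- "latin1"
nchar(y)      # 12 characters

# Option 2: re-encode the bytes as UTF-8.
z <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(z)   # "UTF-8"
validUTF8(z)  # TRUE
charToRaw(z)  # 13 bytes: 0xe4 has become 0xc3 0xa4
```

Passing encoding = 'latin1' to readLines() corresponds to option 1: the
bytes are left alone and only the mark changes.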