Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF
Can we use the "bytes" encoding for such environment variables invalid in the current locale? The following patch preserves CE_NATIVE for strings valid in the current UTF-8 or multibyte locale (or non-multibyte strings) but sets CE_BYTES for those that are invalid: Index: src/main/sysutils.c === --- src/main/sysutils.c (revision 83731) +++ src/main/sysutils.c (working copy) @@ -393,8 +393,16 @@ char **e; for (i = 0, e = environ; *e != NULL; i++, e++); PROTECT(ans = allocVector(STRSXP, i)); - for (i = 0, e = environ; *e != NULL; i++, e++) - SET_STRING_ELT(ans, i, mkChar(*e)); + for (i = 0, e = environ; *e != NULL; i++, e++) { + cetype_t enc = known_to_be_latin1 ? CE_LATIN1 : + known_to_be_utf8 ? CE_UTF8 : + CE_NATIVE; + if ( + (utf8locale && !utf8Valid(*e)) + || (mbcslocale && !mbcsValid(*e)) + ) enc = CE_BYTES; + SET_STRING_ELT(ans, i, mkCharCE(*e, enc)); + } #endif } else { PROTECT(ans = allocVector(STRSXP, i)); @@ -416,11 +424,14 @@ if (s == NULL) SET_STRING_ELT(ans, j, STRING_ELT(CADR(args), 0)); else { - SEXP tmp; - if(known_to_be_latin1) tmp = mkCharCE(s, CE_LATIN1); - else if(known_to_be_utf8) tmp = mkCharCE(s, CE_UTF8); - else tmp = mkChar(s); - SET_STRING_ELT(ans, j, tmp); + cetype_t enc = known_to_be_latin1 ? CE_LATIN1 : + known_to_be_utf8 ? CE_UTF8 : + CE_NATIVE; + if ( + (utf8locale && !utf8Valid(s)) + || (mbcslocale && !mbcsValid(s)) + ) enc = CE_BYTES; + SET_STRING_ELT(ans, j, mkCharCE(s, enc)); } #endif } Here are the potential problems with this approach: * I don't know whether known_to_be_utf8 can disagree with utf8locale. known_to_be_utf8 was the original condition for setting CE_UTF8 on the string. I also need to detect non-UTF-8 multibyte locales, so I'm checking for utf8locale and mbcslocale. Perhaps I should be more careful and test for (enc == CE_UTF8) || (utf8locale && enc == CE_NATIVE) instead of just utf8locale. * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid strings in the environment with this patch applied, but now print.Dlist does, because formatDL wants to compute the width of the string which has the 'bytes' encoding. If this is a good way to solve the problem, I can work on suggesting a fix for formatDL to avoid the error. -- Best regards, Ivan __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF
On 1/31/23 01:27, Simon Urbanek wrote: Tomas, I think you're not addressing the actual issue which is a clear regression in Sys.getenv() [because it used to work and still works for single env var, but not a list] and the cryptic error due to that regression (caused by changes in R-devel). So in either case, Sys.getenv needs fixing (i.e., this should really go to the bugzilla). Its behavior is currently inconsistent. The quoted Python discussion diverges very quickly into file name discussions, but I think that is not relevant here - environment variables are essentially data "blobs" that have application-specific meaning. They just chose to drop them because they didn't want to make a decision. Well, yes, dropping such variables is a hack/work-around, but I would not rule that out from consideration. It would be good to have a look hat even other language runtimes do, but I can understand dropping them from getenv() a work-around allowing such environment variables to exist and be inherited by child processes, without breaking what happens in R (the language runtime at hand), makes some sense. The times when we could have (almost) arbitrary bytes in blobs are over, one has to choose and separate the two. Going back to that we allow blobs in multi-byte strings without special treatment is not possible. That's a common thing, not just for R, we can't just "stop checking" as before. R has also the "bytes" encoding which can be used to work with non-strings (encoding agnostic operations on non-ASCII data), and there have been a number of improvements that are now ready in R-devel. In theory, of course, getenv() could allow also to return the results as "bytes" for those who insist they want to treat environment variables with invalid strings. That would be technically possible, but could not be made the default: people would have to opt-in for that by an argument to the function. I am not sure it is worth it, it would not do what Henrik is asking for, but it is possible and would be a conceptually sound solution allowing people to use such variables (with newly written code for it that cannot just use the values as "normal" strings), or to save the whole environment profile. Some of us have spent very long time debating how to go further dealing with invalid strings, and it will still take time to decide and reach a consensus. It is a very hard problem with serious consequences and limitations. In principle one possible solution is to ban invalid strings fully, completely, don't allow them to be created. Another technically valid position is to allow invalid sequences in UTF-8 strings and support a subset of string operations in them, while throwing errors in other (probably only after/when/if R can rely on that UTF-8 is always the native encoding). Another approach discussed was what some of the regular expressions do, when invalid strings automatically become "bytes", to some extent it is the current behavior of some regex operations. The improvements for "bytes" encoding handling so far in R-devel a consequence of what we have already agreed on. So, it is not hard today to find inconsistencies in R in that some functions check for string validity while others don't. It is simply because we have not yet gone all the way to a solution to this big problem. getenv() is just one of the many. In case of getenv(), indeed, it is because of how it is implemented, it wraps OS functions and the OS don't require string validity. Btw Windows have both a single-byte and multi-byte variant of the environment profile which may disagree. R is not at the source of this problem, it is the operating systems which couldn't (yet?) find the decision and where the ambiguity remains. In practice, so far, the problem is small, because [1] (reported by Henrik) is behavior due to an obvious design bug in other software, and luckily this is rare. I don't have a strong opinion on this, but Sys.getenv("FOO") and Sys.getenv()[["FOO"]] should not yield two different results. I would argue that if we want to make specific checks, we should make them conditional - even if the default is to add them. Again, the error is due to the implementation of Sys.getenv() breaking in R-devel, not due to any design decision. The key design decision (and common one) behind is that environment variables are strings. Tomas Cheers, Simon On Jan 31, 2023, at 1:04 PM, Tomas Kalibera wrote: On 1/30/23 23:01, Henrik Bengtsson wrote: /Hello. SUMMARY: $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()" Error in substring(x, m + 1L) : invalid multibyte string at '' $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')" [1] "\xff" BACKGROUND: I launch R through an Son of Grid Engine (SGE) scheduler, where the R process is launched on a compute host via 'qrsh', which part of SGE. Without going into details, 'mpirun' is also involved. Rega
Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF
On 1/31/23 09:48, Ivan Krylov wrote: Can we use the "bytes" encoding for such environment variables invalid in the current locale? The following patch preserves CE_NATIVE for strings valid in the current UTF-8 or multibyte locale (or non-multibyte strings) but sets CE_BYTES for those that are invalid: Index: src/main/sysutils.c === --- src/main/sysutils.c (revision 83731) +++ src/main/sysutils.c (working copy) @@ -393,8 +393,16 @@ char **e; for (i = 0, e = environ; *e != NULL; i++, e++); PROTECT(ans = allocVector(STRSXP, i)); - for (i = 0, e = environ; *e != NULL; i++, e++) - SET_STRING_ELT(ans, i, mkChar(*e)); + for (i = 0, e = environ; *e != NULL; i++, e++) { + cetype_t enc = known_to_be_latin1 ? CE_LATIN1 : + known_to_be_utf8 ? CE_UTF8 : + CE_NATIVE; + if ( + (utf8locale && !utf8Valid(*e)) + || (mbcslocale && !mbcsValid(*e)) + ) enc = CE_BYTES; + SET_STRING_ELT(ans, i, mkCharCE(*e, enc)); + } #endif } else { PROTECT(ans = allocVector(STRSXP, i)); @@ -416,11 +424,14 @@ if (s == NULL) SET_STRING_ELT(ans, j, STRING_ELT(CADR(args), 0)); else { - SEXP tmp; - if(known_to_be_latin1) tmp = mkCharCE(s, CE_LATIN1); - else if(known_to_be_utf8) tmp = mkCharCE(s, CE_UTF8); - else tmp = mkChar(s); - SET_STRING_ELT(ans, j, tmp); + cetype_t enc = known_to_be_latin1 ? CE_LATIN1 : + known_to_be_utf8 ? CE_UTF8 : + CE_NATIVE; + if ( + (utf8locale && !utf8Valid(s)) + || (mbcslocale && !mbcsValid(s)) + ) enc = CE_BYTES; + SET_STRING_ELT(ans, j, mkCharCE(s, enc)); } #endif } Here are the potential problems with this approach: * I don't know whether known_to_be_utf8 can disagree with utf8locale. known_to_be_utf8 was the original condition for setting CE_UTF8 on the string. I also need to detect non-UTF-8 multibyte locales, so I'm checking for utf8locale and mbcslocale. Perhaps I should be more careful and test for (enc == CE_UTF8) || (utf8locale && enc == CE_NATIVE) instead of just utf8locale. * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid strings in the environment with this patch applied, but now print.Dlist does, because formatDL wants to compute the width of the string which has the 'bytes' encoding. If this is a good way to solve the problem, I can work on suggesting a fix for formatDL to avoid the error. Thanks, indeed, type instability is a big problem of the approach "turn invalid strings to bytes". It is something what is historically being done in regular expression operations, but it is brittle and not user friendly: writing code to be agnostic to whether we are dealing with "bytes" or a regular string is very tedious. Pieces of type instability come also from that ASCII strings are always flagged "native" (never "bytes"). Last year I had to revert a big change which broke existing code by introducing some more of this instability due to better dealing with invalid strings in regular expressions. I've made some additions to R-devel allowing to better deal with such instability but it is still a pain and existing code has to be changed (and made more complicated). So, I don't think this is the way to go. Tomas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF
> Tomas Kalibera > on Tue, 31 Jan 2023 10:53:21 +0100 writes: > On 1/31/23 09:48, Ivan Krylov wrote: >> Can we use the "bytes" encoding for such environment variables invalid >> in the current locale? The following patch preserves CE_NATIVE for >> strings valid in the current UTF-8 or multibyte locale (or >> non-multibyte strings) but sets CE_BYTES for those that are invalid: >> >> Index: src/main/sysutils.c >> === >> --- src/main/sysutils.c (revision 83731) . >> >> Here are the potential problems with this approach: >> >> * I don't know whether known_to_be_utf8 can disagree with utf8locale. >> known_to_be_utf8 was the original condition for setting CE_UTF8 on >> the string. I also need to detect non-UTF-8 multibyte locales, so >> I'm checking for utf8locale and mbcslocale. Perhaps I should be more >> careful and test for (enc == CE_UTF8) || (utf8locale && enc == >> CE_NATIVE) instead of just utf8locale. >> >> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid >> strings in the environment with this patch applied, but now >> print.Dlist does, because formatDL wants to compute the width of the >> string which has the 'bytes' encoding. If this is a good way to >> solve the problem, I can work on suggesting a fix for formatDL to >> avoid the error. > Thanks, indeed, type instability is a big problem of the approach "turn > invalid strings to bytes". It is something what is historically being > done in regular expression operations, but it is brittle and not user > friendly: writing code to be agnostic to whether we are dealing with > "bytes" or a regular string is very tedious. Pieces of type instability > come also from that ASCII strings are always flagged "native" (never > "bytes"). Last year I had to revert a big change which broke existing > code by introducing some more of this instability due to better dealing > with invalid strings in regular expressions. I've made some additions to > R-devel allowing to better deal with such instability but it is still a > pain and existing code has to be changed (and made more complicated). > So, I don't think this is the way to go. > Tomas hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy, but of course you are right that type stability is really a valid goal In general, what about behaving close to "old R" and replace all such strings by NA_character_ (and typically raising one warning)? This would keep the result a valid character vector, just with some NA entries. Specifically for Sys.getenv(), I still think Simon has a very valid point of "requiring" (of our design) that Sys.getenv()[["BOOM"]] {double `[[`} should be the same as Sys.getenv("BOOM") Also, as typical R user, I'd definitely want to be able to get all the valid environment variables, even if there are one or more invalid ones. ... and similarly in other cases, it may be a cheap strategy to replace invalid strings ("string" in the sense of length 1 STRSXP, i.e., in R, a "character" of length 1) by NA_character_ and keep all valid parts of the character vector in a valid encoding. - - - - - - In addition to the above cheap "replace by NA"-strategy, and at least once R is "all UTF-8", we could also consider a more expensive strategy that would try to replace invalid characters/byte-sequences by one specific valid UTF-8 character, i.e., glyph (think of a special version of "?") that we would designate as replacement for "invalid-in-current-encoding". Probably such a glyph already exists and we have seen it used in some console's output when having to print such things. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF
On 31/01/2023 5:50 a.m., Martin Maechler wrote: Tomas Kalibera on Tue, 31 Jan 2023 10:53:21 +0100 writes: > On 1/31/23 09:48, Ivan Krylov wrote: >> Can we use the "bytes" encoding for such environment variables invalid >> in the current locale? The following patch preserves CE_NATIVE for >> strings valid in the current UTF-8 or multibyte locale (or >> non-multibyte strings) but sets CE_BYTES for those that are invalid: >> >> Index: src/main/sysutils.c >> === >> --- src/main/sysutils.c (revision 83731) . >> >> Here are the potential problems with this approach: >> >> * I don't know whether known_to_be_utf8 can disagree with utf8locale. >> known_to_be_utf8 was the original condition for setting CE_UTF8 on >> the string. I also need to detect non-UTF-8 multibyte locales, so >> I'm checking for utf8locale and mbcslocale. Perhaps I should be more >> careful and test for (enc == CE_UTF8) || (utf8locale && enc == >> CE_NATIVE) instead of just utf8locale. >> >> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid >> strings in the environment with this patch applied, but now >> print.Dlist does, because formatDL wants to compute the width of the >> string which has the 'bytes' encoding. If this is a good way to >> solve the problem, I can work on suggesting a fix for formatDL to >> avoid the error. > Thanks, indeed, type instability is a big problem of the approach "turn > invalid strings to bytes". It is something what is historically being > done in regular expression operations, but it is brittle and not user > friendly: writing code to be agnostic to whether we are dealing with > "bytes" or a regular string is very tedious. Pieces of type instability > come also from that ASCII strings are always flagged "native" (never > "bytes"). Last year I had to revert a big change which broke existing > code by introducing some more of this instability due to better dealing > with invalid strings in regular expressions. I've made some additions to > R-devel allowing to better deal with such instability but it is still a > pain and existing code has to be changed (and made more complicated). > So, I don't think this is the way to go. > Tomas hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy, but of course you are right that type stability is really a valid goal In general, what about behaving close to "old R" and replace all such strings by NA_character_ (and typically raising one warning)? This would keep the result a valid character vector, just with some NA entries. Specifically for Sys.getenv(), I still think Simon has a very valid point of "requiring" (of our design) that Sys.getenv()[["BOOM"]] {double `[[`} should be the same as Sys.getenv("BOOM") Also, as typical R user, I'd definitely want to be able to get all the valid environment variables, even if there are one or more invalid ones. ... and similarly in other cases, it may be a cheap strategy to replace invalid strings ("string" in the sense of length 1 STRSXP, i.e., in R, a "character" of length 1) by NA_character_ and keep all valid parts of the character vector in a valid encoding. - - - - - - In addition to the above cheap "replace by NA"-strategy, and at least once R is "all UTF-8", we could also consider a more expensive strategy that would try to replace invalid characters/byte-sequences by one specific valid UTF-8 character, i.e., glyph (think of a special version of "?") that we would designate as replacement for "invalid-in-current-encoding". Probably such a glyph already exists and we have seen it used in some console's output when having to print such things. I think it is U+FFFD, described here: https://www.fileformat.info/info/unicode/char/fffd/index.htm Duncan Murdoch __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF
On 1/31/23 11:50, Martin Maechler wrote: Tomas Kalibera on Tue, 31 Jan 2023 10:53:21 +0100 writes: > On 1/31/23 09:48, Ivan Krylov wrote: >> Can we use the "bytes" encoding for such environment variables invalid >> in the current locale? The following patch preserves CE_NATIVE for >> strings valid in the current UTF-8 or multibyte locale (or >> non-multibyte strings) but sets CE_BYTES for those that are invalid: >> >> Index: src/main/sysutils.c >> === >> --- src/main/sysutils.c (revision 83731) . >> >> Here are the potential problems with this approach: >> >> * I don't know whether known_to_be_utf8 can disagree with utf8locale. >> known_to_be_utf8 was the original condition for setting CE_UTF8 on >> the string. I also need to detect non-UTF-8 multibyte locales, so >> I'm checking for utf8locale and mbcslocale. Perhaps I should be more >> careful and test for (enc == CE_UTF8) || (utf8locale && enc == >> CE_NATIVE) instead of just utf8locale. >> >> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid >> strings in the environment with this patch applied, but now >> print.Dlist does, because formatDL wants to compute the width of the >> string which has the 'bytes' encoding. If this is a good way to >> solve the problem, I can work on suggesting a fix for formatDL to >> avoid the error. > Thanks, indeed, type instability is a big problem of the approach "turn > invalid strings to bytes". It is something what is historically being > done in regular expression operations, but it is brittle and not user > friendly: writing code to be agnostic to whether we are dealing with > "bytes" or a regular string is very tedious. Pieces of type instability > come also from that ASCII strings are always flagged "native" (never > "bytes"). Last year I had to revert a big change which broke existing > code by introducing some more of this instability due to better dealing > with invalid strings in regular expressions. I've made some additions to > R-devel allowing to better deal with such instability but it is still a > pain and existing code has to be changed (and made more complicated). > So, I don't think this is the way to go. > Tomas hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy, but of course you are right that type stability is really a valid goal In general, what about behaving close to "old R" and replace all such strings by NA_character_ (and typically raising one warning)? This would keep the result a valid character vector, just with some NA entries. Specifically for Sys.getenv(), I still think Simon has a very valid point of "requiring" (of our design) that Sys.getenv()[["BOOM"]] {double `[[`} should be the same as Sys.getenv("BOOM") Also, as typical R user, I'd definitely want to be able to get all the valid environment variables, even if there are one or more invalid ones. ... and similarly in other cases, it may be a cheap strategy to replace invalid strings ("string" in the sense of length 1 STRSXP, i.e., in R, a "character" of length 1) by NA_character_ and keep all valid parts of the character vector in a valid encoding. In case of specifically getenv(), yes, we could return NA for variables containing invalid strings, both when obtaining a value for a single variable and for multiple, partially matching undocumented and unintentional behavior R had before 4.1, and making getenv(var) and getenv()[[var]] consistent even with invalid strings. Once we decide on how to deal with invalid strings in general, we can change this again accordingly, breaking code for people who depend on these things (but so far I only know about this one case). Perhaps this would be a logical alternative to the Python approach that would be more suitable for R (given we have NAs and given that we happened to have that somewhat similar alternative before). Conceptually it is about the same thing as omitting the variable in Python: R users would not be able to use such variables, but if they don't touch them, they could be inherited to child processes, etc. - - - - - - In addition to the above cheap "replace by NA"-strategy, and at least once R is "all UTF-8", we could also consider a more expensive strategy that would try to replace invalid characters/byte-sequences by one specific valid UTF-8 character, i.e., glyph (think of a special version of "?") that we would designate as replacement for "invalid-in-current-encoding". Probably such a glyph already exists and we have seen it used in some console's output when having to print such things. This is part of the big discussion where we couldn't find a consensus, and I think this happened so recently that it is too early for opening it aga
Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF
> On 31 Jan 2023, at 12:51 , Tomas Kalibera wrote: > > > On 1/31/23 11:50, Martin Maechler wrote: >> hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy, >> but of course you are right that type stability is really a >> valid goal >> >> In general, what about behaving close to "old R" and replace all such >> strings by NA_character_ (and typically raising one warning)? >> This would keep the result a valid character vector, just with some NA >> entries. >> >> Specifically for Sys.getenv(), I still think Simon has a very >> valid point of "requiring" (of our design) that >> Sys.getenv()[["BOOM"]] {double `[[`} should be the same as >> Sys.getenv("BOOM") >> >> Also, as typical R user, I'd definitely want to be able to get all the valid >> environment variables, even if there are one or more invalid >> ones. ... and similarly in other cases, it may be a cheap >> strategy to replace invalid strings ("string" in the sense of >> length 1 STRSXP, i.e., in R, a "character" of length 1) by >> NA_character_ and keep all valid parts of the character vector >> in a valid encoding. > In case of specifically getenv(), yes, we could return NA for variables > containing invalid strings, both when obtaining a value for a single variable > and for multiple, partially matching undocumented and unintentional behavior > R had before 4.1, and making getenv(var) and getenv()[[var]] consistent even > with invalid strings. Once we decide on how to deal with invalid strings in > general, we can change this again accordingly, breaking code for people who > depend on these things (but so far I only know about this one case). Perhaps > this would be a logical alternative to the Python approach that would be more > suitable for R (given we have NAs and given that we happened to have that > somewhat similar alternative before). Conceptually it is about the same thing > as omitting the variable in Python: R users would not be able to use such > variables, but if they don't touch them, they could be inherited to child > processes, etc. Hum, I'm out of my waters here, but offhand I would be wary about approaches that lead to loss of information. Presumably someone will sooner or later actually want to deal with the content of an environment variable with invalid bytes inside. I.e. it would be preferable to keep the content and mark the encoding as something not-multibyte. In fact this is almost what happens (for me...) if I just add Encoding(x) <- "bytes" for the return value of .Internal(Sys.getenv(character(), "")): > Sys.getenv()[["BOOM"]] [1] "\\xff" > Encoding(Sys.getenv()) [1] "unknown" "unknown" "bytes" "unknown" "unknown" "unknown" "unknown" [8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" ... but I suppose that breaks if I have environment variables that actually _are_ utf8, because only plain-ASCII becomes "unknown"? And nchar(Sys.getenv()) also does not work. (And of course I agree that the QRSH thing is Just Wrong; people using 0xff as a separator between utf8 strings deserve the same fate as those who used comma separation between numbers with decimal commas.) -pd -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: pd@cbs.dk Priv: pda...@gmail.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF
On 1/31/23 14:37, peter dalgaard wrote: On 31 Jan 2023, at 12:51 , Tomas Kalibera wrote: On 1/31/23 11:50, Martin Maechler wrote: hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy, but of course you are right that type stability is really a valid goal In general, what about behaving close to "old R" and replace all such strings by NA_character_ (and typically raising one warning)? This would keep the result a valid character vector, just with some NA entries. Specifically for Sys.getenv(), I still think Simon has a very valid point of "requiring" (of our design) that Sys.getenv()[["BOOM"]] {double `[[`} should be the same as Sys.getenv("BOOM") Also, as typical R user, I'd definitely want to be able to get all the valid environment variables, even if there are one or more invalid ones. ... and similarly in other cases, it may be a cheap strategy to replace invalid strings ("string" in the sense of length 1 STRSXP, i.e., in R, a "character" of length 1) by NA_character_ and keep all valid parts of the character vector in a valid encoding. In case of specifically getenv(), yes, we could return NA for variables containing invalid strings, both when obtaining a value for a single variable and for multiple, partially matching undocumented and unintentional behavior R had before 4.1, and making getenv(var) and getenv()[[var]] consistent even with invalid strings. Once we decide on how to deal with invalid strings in general, we can change this again accordingly, breaking code for people who depend on these things (but so far I only know about this one case). Perhaps this would be a logical alternative to the Python approach that would be more suitable for R (given we have NAs and given that we happened to have that somewhat similar alternative before). Conceptually it is about the same thing as omitting the variable in Python: R users would not be able to use such variables, but if they don't touch them, they could be inherited to child processes, etc. Hum, I'm out of my waters here, but offhand I would be wary about approaches that lead to loss of information. Presumably someone will sooner or later actually want to deal with the content of an environment variable with invalid bytes inside. I.e. it would be preferable to keep the content and mark the encoding as something not-multibyte. In fact this is almost what happens (for me...) if I just add Encoding(x) <- "bytes" for the return value of .Internal(Sys.getenv(character(), "")): Sys.getenv()[["BOOM"]] [1] "\\xff" Encoding(Sys.getenv()) [1] "unknown" "unknown" "bytes" "unknown" "unknown" "unknown" "unknown" [8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" ... but I suppose that breaks if I have environment variables that actually _are_ utf8, because only plain-ASCII becomes "unknown"? And nchar(Sys.getenv()) also does not work. Yes, that way you would get even valid UTF-8 strings represented as "bytes". But that is not the main problem. It would technically be possible to keep valid strings in the native encoding (typically UTF-8) as "native", but only those invalid as "bytes", as Ivan also suggested. But the key problem is that it breaks because of that "type instability": other string functions later will start failing on "bytes", resulting in a mess. We could provide new API (e.g. argument "useBytes=TRUE") which would provide all variables as "bytes" (all-ASCII would be "native" per how "bytes" works) and let the users decide whether they want to use iconv() to turn some of them into strings, how to do such conversions (e.g. error, warn, substitute NA, substitute something else). That would allow working with such variables. That would be probably a "clean" solution at least for POSIX system but I doubt anyone would use that. For Windows it would still be questionable (due to the two environment profiles in two different encodings, which may not match). (And of course I agree that the QRSH thing is Just Wrong; people using 0xff as a separator between utf8 strings deserve the same fate as those who used comma separation between numbers with decimal commas.) Indeed. And then I am afraid I have to make my position stronger based on reading more what Windows do. Their API clearly implies that variables are strings, because it automatically converts them between "wide" and "multi-byte". An application can have both of these profiles, then the C runtime manages both, and they may get out of sync and the documentation explicitly warns about confusion due to that some characters cannot be converted to some encodings. Also, the Windows approach that environment values are strings is "compatible" with that different applications on Windows may and often do use different native encoding (some use UTF-8 such as R, but some use the legacy encoding, e.g. Latin1, but a multi-byte encoding for other languages). Imagine that an