Ah, sorry, this is my bad - I was trying on an older version of R and it seemed
to work, but it was apparently my fault for not setting the environment
variable correctly in that test, so I take the regression comment back (I had
the recent encoding changes in mind so). I still think it may be worth thinking
about what we want it to do - the inconsistency is still true:
> Sys.getenv("BOOM")
[1] "\xff"
> Sys.getenv()[["BOOM"]]
Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
In addition: Warning message:
In regexpr("=", x, fixed = TRUE) : input string 2 is invalid in this locale
so the rest of my comments stand ;).
Thanks,
Simon
> On Jan 31, 2023, at 2:11 PM, Henrik Bengtsson <[email protected]>
> wrote:
>
> Thanks for the quick replies.
>
> For me, the main issue is that the value of an environment variable
> can causes errors in R. Say that there's an environment variable that
> is used as a raw data blob (as Simon mentions) and where that data is
> flavor of the day. Then, this would affect R in a random,
> non-predictable way. Unless it can be argued that environment
> variables must not contain random bytes, I argue that R needs to be
> robust against this. If not, we're effectively asking everyone to use
> tryCatch() whenever calling Sys.getenv(). Even so, they wouldn't not
> be able to query *all* environment variables, or even know which they
> are, because Sys.getenv() gives an error. So, I think the only
> reasonable thing to do is to make Sys.getenv() robust against this.
>
> One alternative is to have Sys.getenv() (1) detect for which
> environment variables this happens, (2) produce an informative warning
> that R cannot represent the value correctly, and (3) either drop it,
> or return an adjusted value (e.g. NA_character_). Something like:
>
>> envs <- Sys.getenv()
> Warning message:
> Environment variable 'BOOM' was dropped, because its
> value cannot be represented as a string in R
>
> As you see from my examples below, R used to drop such environment
> variables in R (< 4.1.0).
>
> Simon, your comment saying it used to work, maybe me look back at
> older versions of R. Although I didn't say so, I used R 4.2.2 in my
> original report. It looks like it was in R 4.1.0 that this started to
> fail. I get:
>
> # R 2.15.0 - R 3.1.0
> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()['BOOM']"
> <NA>
> NA
> Warning message:
> In strsplit(.Internal(Sys.getenv(character(), "")), "=", fixed = TRUE) :
> input string 8 is invalid in this locale
>
> # R 3.2.0 - R 4.0.4
> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()['BOOM']"
> NA NA
> Warning message:
> In strsplit(.Internal(Sys.getenv(character(), "")), "=", fixed = TRUE) :
> input string 137 is invalid in this locale
>
> # R 4.1.0 - ...
> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()['BOOM']"
> Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
> Calls: Sys.getenv -> substring
> In addition: Warning message:
> In regexpr("=", x, fixed = TRUE) :
> input string 137 is invalid in this locale
> Execution halted
>
> /Henrik
>
> On Mon, Jan 30, 2023 at 4:27 PM Simon Urbanek
> <[email protected]> wrote:
>>
>> Tomas,
>>
>> I think you're not addressing the actual issue which is a clear regression
>> in Sys.getenv() [because it used to work and still works for single env var,
>> but not a list] and the cryptic error due to that regression (caused by
>> changes in R-devel). So in either case, Sys.getenv needs fixing (i.e., this
>> should really go to the bugzilla). Its behavior is currently inconsistent.
>>
>> The quoted Python discussion diverges very quickly into file name
>> discussions, but I think that is not relevant here - environment variables
>> are essentially data "blobs" that have application-specific meaning. They
>> just chose to drop them because they didn't want to make a decision.
>>
>> I don't have a strong opinion on this, but Sys.getenv("FOO") and
>> Sys.getenv()[["FOO"]] should not yield two different results. I would argue
>> that if we want to make specific checks, we should make them conditional -
>> even if the default is to add them. Again, the error is due to the
>> implementation of Sys.getenv() breaking in R-devel, not due to any design
>> decision.
>>
>> Cheers,
>> Simon
>>
>>
>>> On Jan 31, 2023, at 1:04 PM, Tomas Kalibera <[email protected]>
>>> wrote:
>>>
>>>
>>> On 1/30/23 23:01, Henrik Bengtsson wrote:
>>>> /Hello.
>>>>
>>>> SUMMARY:
>>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
>>>> Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
>>>>
>>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
>>>> [1] "\xff"
>>>>
>>>> BACKGROUND:
>>>> I launch R through an Son of Grid Engine (SGE) scheduler, where the R
>>>> process is launched on a compute host via 'qrsh', which part of SGE.
>>>> Without going into details, 'mpirun' is also involved. Regardless, in
>>>> this process, an 'qrsh'-specific environment variable 'QRSH_COMMAND'
>>>> is set automatically. The value of this variable comprise of a string
>>>> with \xff (ASCII 255) injected between the words. This is by design
>>>> of SGE [1]. Here is an example of what this environment variable may
>>>> look like:
>>>>
>>>> QRSH_COMMAND=
>>>> orted\xff--hnp-topo-sig\xff2N:2S:32L3:128L2:128L1:128C:256H:x86_64\xff-mca\xffess\xff\"env\"\xff-mca\xfforte_ess_jobid\xff\"3473342464\"\xff-mca\xfforte_ess_vpid\xff1\xff-mca\xfforte_ess_num_procs\xff\"3\"\xff-mca\xfforte_hnp_uri\xff\"3473342464.0;tcp://192.168.1.13:50847\"\xff-mca\xffplm\xff\"rsh\"\xff-mca\xfforte_tag_output\xff\"1\"\xff--tree-spawn"
>>>>
>>>> where each \xff is a single byte 255=0xFF=\xFF.
>>>>
>>>>
>>>> ISSUE:
>>>> An environment variable with embedded 0xFF bytes in its value causes
>>>> calls to Sys.getenv() to produce an error when running R in a UTF-8
>>>> locale. Here is a minimal example on Linux:
>>>>
>>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
>>>> Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
>>>> Calls: Sys.getenv -> substring
>>>> In addition: Warning message:
>>>> In regexpr("=", x, fixed = TRUE) :
>>>> input string 134 is invalid in this locale
>>>> Execution halted
>>>>
>>>>
>>>> WORKAROUND:
>>>> The workaround is to (1) identify any environment variables with
>>>> invalid UTF-8 symbols, and (2) prune or unset those variables before
>>>> launching R, e.g. in my SGE case, launching R using:
>>>>
>>>> QRSH_COMMAND= Rscript --vanilla -e "Sys.getenv()"
>>>>
>>>> avoid the problem. Having to unset/modify environment variables
>>>> because R doesn't like them, see a bit of an ad-hoc hack to me. Also,
>>>> if you are not aware of this problem, or not a savvy R user, it can be
>>>> quite tricky to track down the above error message, especially if
>>>> Sys.getenv() is called deep down in some package dependency.
>>>>
>>>>
>>>> DISCUSSION/SUGGESTION/ASK:
>>>> My suggestion would be to make Sys.getenv() robust against any type of
>>>> byte values in environment variable strings.
>>>>
>>>> The error occurs in Sys.getenv() from:
>>>>
>>>> x <- .Internal(Sys.getenv(character(), ""))
>>>> m <- regexpr("=", x, fixed = TRUE) ## produces a warning
>>>> n <- substring(x, 1L, m - 1L)
>>>> v <- substring(x, m + 1L) ## produces the error
>>>>
>>>> I know too little about string encodings, so I'm not sure what the
>>>> best approach would be here, but maybe falling back to parsing strings
>>>> that are invalid in the current locale using the C locale would be
>>>> reasonable? Maybe Sys.getenv() should always use the C locale for
>>>> this. It looks like Sys.getenv(name) does this, e.g.
>>>>
>>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
>>>> [1] "\xff"
>>>>
>>>> I'd appreciate any comments and suggestions. I'm happy to file a bug
>>>> report on BugZilla, if this is a bug.
>>> This discussion comes from Python: https://bugs.python.org/issue4006
>>> (it says Python skips such environment variables)
>>>
>>> The problem of invalid strings in environment variables is a similar to the
>>> problem of invalid strings in file names. Both variables and file names are
>>> something people want to use as strings in programs, scripts, texts, but at
>>> the same time these may in theory not be valid strings. Working with
>>> (potentially) invalid strings (almost) transparently is much harder than
>>> with valid strings; even if R decided to do that, it would be hard to
>>> implement and take long and only work for some operations, most will still
>>> throw errors. In addition, in practice invalid strings are almost always
>>> due to an error, particularly so in file names or environment variables.
>>> Such errors are often worth catching (wrong encoding declaration, etc),
>>> even though perhaps not always.
>>>
>>> In practice, this instance can only be properly fixed at the source, [1]
>>> should not do this. split_command() will run into problems with different
>>> software, not just R.
>>>
>>> There should be a way to split the commands in ASCII (using some sort of
>>> quoting/escaping). Using \xFF is flawed also simply because it may be
>>> present in the commands, if we followed the same logic of that every byte
>>> is fine. So the code is buggy even regardless of multi-byte encodings.
>>>
>>> Re difficulty to debug, I think the error message is clear and if packages
>>> catch and hide errors, that'd be bad design of such packages, R couldn't
>>> really do much about that. This needs to be fixed at [1].
>>>
>>> Tomas
>>>
>>>>
>>>> Henrik
>>>>
>>>> [1]
>>>> https://github.com/gridengine/gridengine/blob/master/source/clients/qrsh/qrsh_starter.c#L462-L466
>>>>
>>>> ______________________________________________
>>>> [email protected] mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>> ______________________________________________
>>> [email protected] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel