OK, let's take a step back. We have identified at least three degrees
of freedom that have been sources of friction with existing string literals:
- Sometimes we don't want traditional escaping (\n, etc.);
- Sometimes we don't want Unicode escaping (\unnnn);
- Sometimes we want to represent multiple lines of text as a single
String.
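
To make the three frictions concrete, here is a small sketch using only today's string literals. The class and constant names are made up for illustration; nothing here is proposed syntax.

```java
import java.util.regex.Pattern;

final class LiteralFriction {
    // Escaping friction: the regex \d+\.\d+ needs every backslash doubled.
    static final Pattern DECIMAL = Pattern.compile("\\d+\\.\\d+");

    // Unicode-escape friction: the escape below is translated to 'A' before
    // the literal is even tokenized, so this string is just "A".
    static final String UNICODE = "\u0041";

    // Multi-line friction: every line needs its own quotes, an explicit \n,
    // and a concatenation operator.
    static final String HTML =
        "<html>\n" +
        "  <body>Hello</body>\n" +
        "</html>\n";

    public static void main(String[] args) {
        System.out.println(DECIMAL.matcher("3.14").matches());
        System.out.println(UNICODE);
        System.out.print(HTML);
    }
}
```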
Traditional strings could be described as (false, false, false) on these
axes; the proposed raw strings are (true, true, true). As a first
evaluation (if these really are the axes), this is encouraging; if
you're going to pick 2 of 2^N prepackaged options, it's often best to
pick the ones with the biggest Hamming distance.
I have a hard time imagining that people really need, for example,
traditional escaping but not Unicode escaping, with any frequency. So
offering all 2^N combinations is not likely to carry its weight.
I think what you are suggesting is that it's fine to lump the first two,
but it might have been a premature move to lump them with the third. (A
second question is: are these the only axes we should be concerned with
right now?) So, let's examine that.
We explored allowing double-quoted strings to span lines too; this gives
you a different stacking: { escaping multi-line, raw multi-line }. But
I think the part that's still unexplored is: do we need to explicitly
surface how source lines are combined into strings?
The assumption we've been working off of is: \n has won (this wasn't
true when Java got started). Is this wishful thinking? And if not, can
the library approach serve this purpose here too:
`a long
string`.toPlatformLineEnding()
(which, as has been observed, can be optimized either by compile-time
evaluation or by link-time evaluation using LDC and ConstantDynamic, so
I think we can ignore the "but then I'm doing work at runtime" aspect of
this.)
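
As a rough sketch of what the library approach could look like with today's syntax: the static helper below is a hypothetical stand-in for toPlatformLineEnding(), and it assumes the input uses \n as its only line terminator. The second method, which takes the target terminator explicitly, is my own addition so the conversion can be exercised on any platform.

```java
final class PlatformLines {
    // Hypothetical stand-in for the toPlatformLineEnding() method above.
    // Assumes the input contains \n as its only line terminator.
    static String toPlatformLineEnding(String s) {
        return toLineEnding(s, System.lineSeparator());
    }

    // Same conversion with the target terminator passed in explicitly
    // (an assumption of this sketch, not part of any proposal).
    static String toLineEnding(String s, String terminator) {
        return s.replace("\n", terminator);
    }

    public static void main(String[] args) {
        String text = "a long\nstring";
        String expected = "a long" + System.lineSeparator() + "string";
        System.out.println(toPlatformLineEnding(text).equals(expected));
    }
}
```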
On 2/5/2018 1:39 PM, Guy Steele wrote:
On Feb 5, 2018, at 1:39 PM, Brian Goetz wrote:
However, I also note that the broad problem may have two or three distinct
symptoms, and:
(1) A solution that addresses one symptom may not address the others, and
(2) On the other hand, it may (or may not) be perfectly reasonable to address the
most painful symptoms in different ways, rather than insisting that a single
solution cover them all.
Indeed so. This is one reason why we resisted the call to do string interpolation (which
many developers conflate with multi-line strings, as many languages with one also have
the other) at the same time. Another way to ask this question is: are we yet
sufficiently minimal? We boiled it down quite a lot already, but are we at
"minimal" yet? Or, did we take a wrong turn in boiling it down, and find
ourselves at only a local minimum?
In particular, I happen to think that the problem of distinguishing snippet
indentation from encoding-program indentation may require a rather different
kind of solution from the problem of escape characters in embedded snippets.
The reason is that in both these cases the painful symptom is visual in nature
rather than logical. That’s why I can understand what drove Tagir to pursue
the pipe-character approach (even though I think it may not be the best
solution to the problem). We may want to use ```…``` to enclose regexes but
also want to use some other approach to solve the multi-line / indentation
problems.
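
To illustrate why indentation is a separate problem from escaping, here is a minimal sketch of one possible treatment: stripping the leading spaces shared by all non-blank lines. This is not the pipe-character approach and not anything proposed in this thread; the class and method names are hypothetical, and only spaces (not tabs) count as indentation.

```java
final class IndentSketch {
    // Remove the largest run of leading spaces shared by all non-blank lines.
    // Assumption of this sketch: only ' ' counts as indentation, not tabs.
    static String stripCommonIndent(String text) {
        String[] lines = text.split("\n", -1);
        int common = Integer.MAX_VALUE;
        for (String line : lines) {
            if (line.trim().isEmpty()) continue; // blank lines do not constrain the margin
            int i = 0;
            while (i < line.length() && line.charAt(i) == ' ') i++;
            common = Math.min(common, i);
        }
        if (common == Integer.MAX_VALUE) common = 0; // no non-blank lines at all
        StringBuilder out = new StringBuilder();
        for (int n = 0; n < lines.length; n++) {
            String line = lines[n];
            // Lines shorter than the margin (necessarily blank) become empty.
            out.append(line.length() >= common ? line.substring(common) : "");
            if (n < lines.length - 1) out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String snippet = "    <p>\n      hi\n    </p>";
        System.out.println(stripCommonIndent(snippet));
    }
}
```

The snippet's own relative indentation (the two extra spaces before "hi") survives, while the indentation that belongs to the enclosing program is dropped.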
OK, so what you're saying here is that it might be a clever self-deception to count
newline handling as "just another aspect of raw-ness"?
Bingo.
Back in the day (I’m talking 1960s) it was ugly and wasteful but predictable:
if there were line breaks at all (as opposed to record-oriented I/O), they were
represented by two characters, CR and then LF, held over from the mechanical
abilities/requirements of Teletype machines.
Then in the mid-1960s an ISO standard allowed plain LF (eventually semi-renamed
Newline) as an alternative, and Multics and then Unix spread this idea
(eventually to Apple as well).
But another branch of the world, notably the CP/M to MS-DOS to Windows line,
continued to use CR/LF. Worse yet, some software came to use CR alone (perhaps
a natural enough theory when you consider that the “Return” key on keyboards
usually generates the CR character rather than the LF character).
It is simply impossible to be compatible with everyone on this issue, and we
are fooling ourselves if we think that raw string representations can solve
this problem in all contexts. Much better, I think, in the absence of
consensus, to have explicit software gatekeepers at the points where data
transitions among these disparate worlds.
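
One way to picture such a gatekeeper, as a sketch only: an inbound step that collapses all three conventions (CR LF, lone CR, lone LF) into \n before the text enters the program. The class and method names are hypothetical.

```java
final class LineEndingGate {
    // Inbound gate: collapse CR LF, lone CR, and lone LF into a single \n.
    // CR LF is replaced first so that it does not turn into two newlines.
    static String normalize(String external) {
        return external.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String mixed = "unix\nwindows\r\nold-mac\rend";
        System.out.println(normalize(mixed).equals("unix\nwindows\nold-mac\nend"));
    }
}
```

A matching outbound step would convert \n to whatever the destination expects, which keeps the choice explicit at the boundary rather than buried in the literal.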