Excerpts from Lewis Hyatt via Gcc-patches's message of January 29, 2021 4:46 pm: > Q1: What is the input charset? > A1: > > libcpp: Whatever was passed to -finput-charset (note, for Fortran, > -finput-charset is not supported though.) > > go: Assume UTF-8. > > D: UTF-8, UTF-16, or UTF-32 (the latter two in either > endianness); determined by inspecting the first bytes of the file. > > Q2: How should a UTF-8 BOM be handled? > A2: > > libcpp: Treat entirely the same, as if it was not present at all. So > a diagnostic referring to the first non-BOM character in the file will > point to column 1, not to column 4. > > go: Treat it like whitespace, ignored for parsing purposes, but still > logically part of the file. A diagnostic referring to the first non-BOM > character in the file will point to column 4. > > D: Same as libcpp. > > So input.c's current behavior (with or without my patch) actually matches > the "go" frontend exactly and does the right thing for it. As you > thought, though, for D it would be desirable to skip over the UTF-8 BOM > too. > > D adds an additional wrinkle in that, instead of using -finput-charset, it > uses its own detection logic -- based on knowledge that the first codepoint > in the file must be ASCII, it is able to deduce the encoding from the first > few bytes. This means that in D, each source file can have a different > encoding, which is not expressible in libcpp frontends currently, and hence > the notion of a global input_cpp_context with a constant charset is not > sufficient to capture this. >
Yes, that is correct for D. > In this iteration of the patch, I replaced the input_cpp_context with a more > general input_context that can handle all of these cases. The context now > consists of a callback function and a bool, which the frontend is supposed > to configure. The bool determines whether or not to skip the BOM. The > callback, when passed a filename, returns the input charset needed to > convert that file. For libcpp, the filename is not used as the charset is > determined from -finput-charset for all files. For D, the callback function > is currently a stub, but it could be expanded to open the file, read the > first few bytes, and determine the charset to be used. I have not > implemented it for now, because it seems inelegant to duplicate the logic D > already has for this detection, but this logic is part of the dmd/ tree > which I think is maintained external to GCC, and so it didn't seem right to > attempt to refactor it. I figured there may not be much interest in this > feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in > D), and this patch makes no change to it. This patch does fix the handling > of a UTF-8 BOM in D diagnostics, however. > And yes, dmd/ is maintained in a separate upstream repository. I wouldn't worry about UTF-16 and UTF-32. I expect all the sources written in those charsets are found only in the D testsuite. > gcc/d/ChangeLog: > > PR other/93067 > * d-lang.cc (d_input_charset_callback): New function. > (d_init): Call new function input_initialize_context(). > > diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc > index 0fd207da7f3..9f170c09886 100644 > --- a/gcc/d/d-lang.cc > +++ b/gcc/d/d-lang.cc > @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3. If not see > #include "output.h" > #include "print-tree.h" > #include "debug.h" > +#include "input.h" > > #include "d-tree.h" > #include "id.h" > @@ -362,6 +363,19 @@ d_option_lang_mask (void) > return CL_D; > } > > +/* Implements input charset and BOM skipping configuration for > + diagnostics. */ > +static const char *d_input_charset_callback (const char * /*filename*/) > +{ > + /* TODO: The input charset is automatically determined by code in > + dmd/dmodule.c based on the contents of the file. If this detection > + logic were factored out and could be reused here, then we would be able > + to return UTF-16 or UTF-32 as needed here. For now, we return always > + NULL, which means no conversion is necessary, i.e. the input is assumed > + to be UTF-8 when diagnostics read this file. */ > + return NULL; > +} > + OK from me. I can have a look into this, but it will have to wait until after I do an impending switch-over of the dmd/ sources from C++ to D. Then I'd have to convince upstream to be open to the change (refactoring has been getting a bad rep as of late for sneaking regressions into upstream D). Iain.