Excerpts from Lewis Hyatt via Gcc-patches's message of January 29, 2021 4:46 pm:
> Q1: What is the input charset?
> A1:
> 
>     libcpp: Whatever was passed to -finput-charset (note, for Fortran,
>     -finput-charset is not supported though.)
> 
>     go: Assume UTF-8.
> 
>     D: UTF-8, UTF-16, or UTF-32 (the latter two in either
>        endianness); determined by inspecting the first bytes of the file.
> 
> Q2: How should a UTF-8 BOM be handled?
> A2:
> 
>     libcpp: Treat entirely the same, as if it was not present at all. So
>     a diagnostic referring to the first non-BOM character in the file will
>     point to column 1, not to column 4.
> 
>     go: Treat it like whitespace, ignored for parsing purposes, but still
>     logically part of the file. A diagnostic referring to the first non-BOM
>     character in the file will point to column 4.
> 
>     D: Same as libcpp.
> 
> So input.c's current behavior (with or without my patch) actually matches
> the "go" frontend exactly and does the right thing for it. As you
> thought, though, for D it would be desirable to skip over the UTF-8 BOM
> too.
> 
> D adds an additional wrinkle in that, instead of using -finput-charset, it
> uses its own detection logic -- based on knowledge that the first codepoint
> in the file must be ASCII, it is able to deduce the encoding from the first
> few bytes. This means that in D, each source file can have a different
> encoding, which is not expressible in libcpp frontends currently, and hence
> the notion of a global input_cpp_context with a constant charset is not
> sufficient to capture this.
> 

Yes, that is correct for D.

> In this iteration of the patch, I replaced the input_cpp_context with a more
> general input_context that can handle all of these cases. The context now
> consists of a callback function and a bool, which the frontend is supposed
> to configure. The bool determines whether or not to skip the BOM. The
> callback, when passed a filename, returns the input charset needed to
> convert that file. For libcpp, the filename is not used as the charset is
> determined from -finput-charset for all files. For D, the callback function
> is currently a stub, but it could be expanded to open the file, read the
> first few bytes, and determine the charset to be used. I have not
> implemented it for now, because it seems inelegant to duplicate the logic D
> already has for this detection, but this logic is part of the dmd/ tree
> which I think is maintained external to GCC, and so it didn't seem right to
> attempt to refactor it. I figured there may not be much interest in this
> feature (diagnostics are already unusable for UTF-16 and UTF-32 sources in
> D), and this patch makes no change to it. This patch does fix the handling
> of a UTF-8 BOM in D diagnostics, however.
> 

And yes, dmd/ is maintained in a separate upstream repository.

I wouldn't worry about UTF-16 and UTF-32.  I expect all the sources
written in those charsets are found only in the D testsuite.

> gcc/d/ChangeLog:
> 
>       PR other/93067
>       * d-lang.cc (d_input_charset_callback): New function.
>       (d_init): Call new function input_initialize_context().
> 

> diff --git a/gcc/d/d-lang.cc b/gcc/d/d-lang.cc
> index 0fd207da7f3..9f170c09886 100644
> --- a/gcc/d/d-lang.cc
> +++ b/gcc/d/d-lang.cc
> @@ -50,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "print-tree.h"
>  #include "debug.h"
> +#include "input.h"
>  
>  #include "d-tree.h"
>  #include "id.h"
> @@ -362,6 +363,19 @@ d_option_lang_mask (void)
>    return CL_D;
>  }
>  
> +/* Implements input charset and BOM skipping configuration for
> +   diagnostics.  */
> +static const char *d_input_charset_callback (const char * /*filename*/)
> +{
> +  /* TODO: The input charset is automatically determined by code in
> +     dmd/dmodule.c based on the contents of the file.  If this detection
> +     logic were factored out and could be reused here, then we would be able
> +     to return UTF-16 or UTF-32 as needed here.  For now, we return always
> +     NULL, which means no conversion is necessary, i.e. the input is assumed
> +     to be UTF-8 when diagnostics read this file.  */
> +  return NULL;
> +}
> +

OK from me.

I can have a look into this, but it will have to wait until after I do
an impending switch-over of the dmd/ sources from C++ to D.  Then I'd
have to convince upstream to be open to the change (refactoring has been
getting a bad rep as of late for sneaking regressions into upstream D).

Iain.

Reply via email to