[Bug libfortran/35863] [F2003] Implement ENCODING="UTF-8"

burnus at gcc dot gnu dot org Tue, 15 Apr 2008 12:47:09 -0700


------- Comment #3 from burnus at gcc dot gnu dot org  2008-04-15 19:46 -------
> > Front end and library are ready to handle this when implemented.
> Front-end is ready?
Yes, it is: the front end accepts ENCODING=, but the rest is implemented
neither in the library nor in the front end. I would not call this "ready",
though.


> Is ENCODING="UTF-8" related to UCS-4 support?

I think it is, in the end. You can already use UTF-8 encoding now, but
'(a2)' might print one (non-ASCII) or two (ASCII) characters. To have something
well-defined, only one-byte-wide characters can be used currently. For anything
beyond that, UCS4 is needed in the front end.
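
To make the byte-versus-character point concrete, here is a small sketch. It
assumes default-kind characters are one byte wide and that char() accepts
values above 127 (both are processor dependent, though they hold for gfortran);
the program and variable names are made up for illustration.

```fortran
program utf8_bytes
  implicit none
  character(len=2) :: sect, ab

  ! U+00A7 (section sign) is the two-byte sequence C2 A7 (194, 167) in UTF-8.
  sect = char(194) // char(167)
  ab   = 'ab'

  ! Both strings have length 2 as far as Fortran is concerned.
  print '(a,i0)', 'len(sect) = ', len(sect)
  print '(a,i0)', 'len(ab)   = ', len(ab)

  ! With '(a2)', a UTF-8 terminal shows one glyph for sect, two for ab.
  print '(a2)', sect
  print '(a2)', ab
end program utf8_bytes
```

So the edit descriptor counts bytes, not characters, as long as only the
default character kind is involved.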

Actually, I do not see how one is supposed to write things like

   character(kind=myUCS4,len=20) :: foo = myUCS4_'Some UCS4 string'

(The problem is switching the encoding within the same file; good luck
finding an editor which supports this.)

If one does not need non-ASCII character literals (i.e. one only reads from /
writes to files), there is no problem.

Possible solutions?
a) Have a UCS-4 input file; then both default_'foo' and ucs4_'foo' work.
b) Expect that for myUCS4_'foo' literals the characters in the quotes are
actually UTF-8.

I'm personally in favour of (b). I'm not quite sure whether this is really
compatible with the Fortran standard, but I like this way of entering the
string.

Otherwise, I think Fortran lacks a good way of entering non-ASCII characters
in an ASCII file. C99 offers '\uXXXX' but, unless I missed something, in Fortran
the equivalent would be:

I think (c) is what most programmers want, but I actually do not see how this
should work syntax-wise; or should an ASCII literal automatically be handled as
UTF-8? Then it would work: when assigning to a UCS4 string, the UTF-8 gets
properly converted and a non-ASCII character then has length one (len(char()
while if one assigns to an ASCII string, non-ASCII characters of course need
more bytes and thus "len('§') == 2".

(b) is also an interesting problem. And (a) of course works, but it is quite
cumbersome to use: Fortran lacks C's \uXXXX way of specifying a Unicode
character; one can probably work with
   myUCS4string = char(int(z'A0FF'),kind=myUCS4)
but this is awful. (Actually, I think the standard does not even guarantee that
this works, as the result of "char" is processor dependent.)
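
The char()-based workaround above, combined with ENCODING="UTF-8" on output,
might look roughly like this. It is only a sketch: it assumes a compiler that
provides the ISO_10646 character kind and UTF-8 formatted I/O (which, per this
report, gfortran did not yet do at the time of writing). The unit number, the
file name ucs4_demo.txt and the code point U+00A7 are arbitrary choices for
illustration.

```fortran
program ucs4_sketch
  implicit none
  ! Assumption: the processor supports ISO 10646; otherwise
  ! selected_char_kind returns -1 and the declaration below is rejected.
  integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
  character(kind=ucs4, len=20) :: foo

  ! ASCII text as a UCS4 literal, plus one non-ASCII character (U+00A7)
  ! built via char(); the result of char() is processor dependent.
  foo = ucs4_'Section sign: ' // char(int(z'00A7'), kind=ucs4)

  ! One character per code point, independent of the UTF-8 byte count.
  print '(a,i0)', 'len_trim(foo) = ', len_trim(foo)

  ! Hypothetical unit number and file name; the file is written as UTF-8.
  open (unit=10, file='ucs4_demo.txt', encoding='UTF-8', action='write')
  write (10, '(a)') trim(foo)
  close (10)
end program ucs4_sketch
```

Only ASCII appears in the source file here, which sidesteps the editor problem
at the cost of readability.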


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35863
