Re: [Rd] latin1,utf-8...encoding and data

Prof Brian Ripley Wed, 25 Oct 2006 02:49:25 -0700

This is indeed unfortunate, but expecting Chinese speakers (20% of the
world's population) to write in Latin-1 was also unfortunate.

What I had (and still have) some hope of doing is being able to mark
character strings as UTF-8, probably via a flag bit on the CHARSXP. Then
output routines could be made to convert (if possible) to the current
locale. But that was before I found out how hard it was to get non-ASCII
characters displayed correctly. Also, any solution to this almost
certainly means abandoning Windows 95/98/ME as they don't have support for
Unicode (and although we could add such support at C level, they would not
have the Unicode fonts needed). (It might be OK to do that now, but it
was not a couple of years ago.)

Don't underestimate the font problem. Last week I gave a seminar about
statistical computing in my own dept, and I thought I would show R
operating in Chinese (we have quite a number of Chinese speakers, indeed
more than from Latin-1 languages). It did not work correctly in my
pre-seminar tests, because there were no CJK fonts installed on the
lecture-room computer. There was no warning nor error, just
unintelligible output.

If you are only concerned with Latin-1 and UTF-8, there is something you
can do. Rather than have a .rda file, store your datasets as .R files,
with another .R file as a driver. So you would need something like


ex1.R:
source("ex1_dat.R", encoding="latin1")

ex1_dat.R:
dump of the object, converted to latin1.

If you don't specify lazydata, this will ensure the object gets converted
to the current locale when the data() statement is executed. If you do
specify lazydata, the conversion will happen when the package is
installed, which is fine if you (and any other users) always use the same
locale (or at least always use a locale with the same encoding, e.g.
always use a UTF-8 locale). However, this is really only of use in
locales that will have font coverage of Latin-1, and R installations
without iconv will not do any necessary conversion (which is why I
suggest dumping in latin1 and not in UTF-8).
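
The two-file layout above can be sketched roughly as follows. This is
only an illustration under assumptions: the object ex1 and its contents
are made up, the files are written to a temporary directory rather than
to a package's data/ directory, and the session is assumed to run in a
UTF-8 locale with iconv available.

```r
# Hypothetical object standing in for a package dataset whose row names
# contain accented characters (written with \u escapes to stay ASCII here).
ex1 <- data.frame(score = c(1, 2),
                  row.names = c("caf\u00e9", "cr\u00e8me"))
expected <- rownames(ex1)

# Step 1 (package author): dump the object to a .R file through a
# connection that re-encodes the output as Latin-1.
dat_file <- file.path(tempdir(), "ex1_dat.R")
con <- file(dat_file, open = "w", encoding = "latin1")
dump("ex1", file = con)
close(con)

# Step 2 (driver, the role of ex1.R): read the dump back, declaring its
# encoding so that the strings are converted to the current locale.
rm(ex1)
source(dat_file, encoding = "latin1")

# The row names should now match the originals.
identical(rownames(ex1), expected)
```

In a package, step 2 would be the whole content of the driver file, and
step 1 would be done once by the maintainer when preparing the data.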



On Thu, 19 Oct 2006, Martin Maechler wrote:

"Stéphane" == Stéphane Dray <[EMAIL PROTECTED]>
    on Thu, 19 Oct 2006 09:46:49 +0200 writes:


   Stéphane> Thanks a lot for this clear answer. So there is no way to
   Stéphane> preserve our French cultural exception (accented characters),

I agree that there are many French cultural exceptions ;-)
--- and as a Swiss, I hold several of them in high esteem ---
however "accented" characters (with the appropriate meaning of "accented")
are not at all a French exception, rather almost a continental
European one {as long as we are staying in the "latin" alphabet
context}.  If I think of what I know of Europe, the only
countries/languages *not* using some version of "accented"
characters are the British and (I think) the Dutch/Flemish.
Everyone else (? probably I forgot some, and don't know about others
like Gaelic,...)  has some kind of accents...

I agree with Stéphane that this is unfortunate for quite a few
of us, and it came as a big surprise to me when I first heard
about this from Brian.  .. aah, life was easy when we western
chauvinists could behave as if the whole relevant part of the
world was happy with iso-latin1...

Martin


   Stéphane> if we want to be international... I have thought
   Stéphane> that the inclusion of a parameter encoding in data
   Stéphane> function (e.g. data(mydata,encoding="latin1"))
   Stéphane> like in the function 'file' could be a way to
   Stéphane> solve the problem. Apparently, the problem is much
   Stéphane> more complicated...

   Stéphane> Sincerely.


   Stéphane> Prof Brian Ripley wrote:

   >> Only ASCII letters are portable: those accented characters do not even
   >> exist in many of the encodings used for R, e.g. Russian and Japanese
   >> on Windows machines.
   >>
   >> There is no way to associate an encoding with a character string in
   >> R.  We considered it, but it would have had severe back-compatibility
   >> problems and little advantage (you cannot display non-ASCII character
   >> strings portably: even if you have a Unicode encoding you still need
   >> to select a suitable font).
   >>
   >> 'B. Ripley' (sic)
   >>
   >>
   >> On Wed, 18 Oct 2006, Stéphane Dray wrote:
   >>
   >>> Hello,
   >>> I have some questions concerning encoding and package distribution. We
   >>> develop the ade4 package. For some data sets included in the package,
   >>> there are accented characters (e.g. é,è...). The data sets have been
   >>> saved using latin1 encoding, but some of us use utf-8 and cannot see
   >>> some data sets which contain accented characters.
   >>> e.g:
   >>>
   >>> library(ade4)
   >>> data(rankrock)
   >>> rankrock
   >>>
   >>> in this case, characters are in rownames. Other data sets have such
   >>> characters in data (e.g. levels of factors..). A solution is to use
   >>> iconv... this is quite easy for us but perhaps more difficult for a user
   >>> who may have no idea of the problem. This problem is quite marginal
   >>> for the moment but some Linux distributions are utf-8 by default (e.g.
   >>> Ubuntu) and I suppose that the problem will be more and more present in
   >>> the future.
   >>>
   >>> So we wonder if there is a proper way to code and save these data sets.
   >>> I have found some documents of B. Ripley and this note :
   >>>
   >>> http://developer.r-project.org/210update.txt
   >>>
   >>> -  Names in data objects (e.g. in .rda files) are problematic.  It
   >>> is likely that by release time these will be treated as in
   >>> Latin-1.
   >>>
   >>> If I am correct, I did not find an answer to this problem.
   >>>
   >>> What are the plans of R gurus on this question ?
   >>> Thanks a lot.
   >>> Sincerely.
   >>>
   >>> Please add my address in answers as I am not a subscriber of this list.

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
