This is indeed unfortunate, but expecting Chinese speakers (20% of the world's population) to write in Latin-1 was also unfortunate.

What I had (and still have) some hope of doing is being able to mark character strings as UTF-8, probably via a flag bit on the CHARSXP. Then output routines could be made to convert (if possible) to the current locale. But that was before I found out how hard it was to get non-ASCII characters displayed correctly. Also, any solution to this almost certainly means abandoning Windows 95/98/ME as they don't have support for Unicode (and although we could add such support at C level, they would not have the Unicode fonts needed). (It might be OK to do that now, but it was not a couple of years ago.)
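As a rough R-level illustration of "convert (if possible) to the current locale", the sketch below uses iconv() on strings that are known to be UTF-8. It is not the CHARSXP flag design described above, which would live at C level; the helper name to_encoding and the sample strings are made up for the example.

```r
# Hypothetical helper: convert UTF-8 text to a target encoding.
# iconv() is documented to return NA for strings it cannot convert.
to_encoding <- function(x, to) iconv(x, from = "UTF-8", to = to)

french  <- "caf\u00e9"       # e-acute exists in Latin-1
chinese <- "\u4e2d\u6587"    # CJK characters do not

to_latin1 <- to_encoding(c(french, chinese), "latin1")

# Which strings could not be represented in the target encoding?
is.na(to_latin1)

# Bytes of the converted French string (a single byte for e-acute).
charToRaw(to_latin1[1])
```

An output routine built on this idea would still need a fallback for the strings that cannot be converted, and a successful conversion says nothing about whether a suitable font is available.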

Don't underestimate the font problem. Last week I gave a seminar about statistical computing in my own dept, and I thought I would show R operating in Chinese (we have quite a number of Chinese speakers, indeed more than from Latin-1 languages). It did not work correctly in my pre-seminar tests, because there were no CJK fonts installed on the lecture-room computer. There was no warning nor error, just unintelligible output.

If you are only concerned with Latin-1 and UTF-8, there is something you can do. Rather than have a .rda file, store your datasets as .R files, with another .R file as a driver. So you would need something like

ex1.R:
source("ex1_dat.R", encoding="latin1")

ex1_dat.R:
dump of the object, converted to latin1.

If you don't specify lazydata, this will ensure the object gets converted to the current locale when the data() statement is executed. If you do specify lazydata, the conversion will happen when the package is installed, which is fine if you (and any other users) always use the same locale (or at least always use a locale with the same encoding, e.g. always use a UTF-8 locale). However, this is really only of use in locales that will have font coverage of Latin-1, and R installations without iconv will not do any necessary conversion (which is why I suggest dumping in latin1 and not in UTF-8).
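A minimal sketch of how the two files could be produced and used, assuming a session in a UTF-8 or Latin-1 locale with iconv available. The object rankdemo, its contents, and the use of tempdir() in place of the package's data directory are invented for illustration; only the source(..., encoding="latin1") call comes from the recipe above.

```r
# Hypothetical data set with accented row names. The \u escapes keep this
# script itself pure ASCII, so it reads the same in any locale.
rankdemo <- data.frame(score = c(1, 2),
                       row.names = c("caf\u00e9", "cr\u00e8me"))

# Stand-in for the package's data/ directory.
datadir <- tempdir()
datfile <- file.path(datadir, "ex1_dat.R")

# ex1_dat.R: dump the object through a connection that re-encodes the
# text to Latin-1 as it is written.
con <- file(datfile, open = "w", encoding = "latin1")
dump("rankdemo", file = con)
close(con)

# ex1.R would contain just a source() call like the one below. Declaring
# the encoding lets R convert the text to the current locale as it is read.
# A separate environment is used here only to keep the demo tidy.
env <- new.env()
source(datfile, local = env, encoding = "latin1")
restored <- get("rankdemo", envir = env)
rownames(restored)
```

In a locale that cannot represent the accented characters (for example the C locale), the dump step may not write them faithfully, which matches the caveat above about locales without Latin-1 coverage.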


On Thu, 19 Oct 2006, Martin Maechler wrote:

"Stéphane" == Stéphane Dray <[EMAIL PROTECTED]>
    on Thu, 19 Oct 2006 09:46:49 +0200 writes:

   Stéphane> Thanks a lot for this clear answer. So there is no way to preserve
   Stéphane> our French cultural exception (accented characters),

I agree that there are many French cultural exceptions ;-)
--- and as a Swiss, I hold several of them in high esteem ---
however "accented" characters (with the appropriate meaning of "accented")
are not at all a French exception, rather almost a continental
European one {as long as we are staying in the "latin" alphabet
context}.  If I think of what I know of Europe, the only
countries/languages *not* using some version of "accented"
characters are the British and (I think) the Dutch/Flemish.
Everyone else (? probably I forgot some, and don't know about others
like Gaelic,...)  has some kind of accents...

I agree with Stéphane that this is unfortunate for quite a few
of us, and it came as a big surprise to me when I first heard
about this from Brian.  .. aah, life was easy when we western
chauvinists could behave as if the whole relevant part of the
world was happy with iso-latin1...

Martin


   Stéphane> if we want to be international... I have thought
   Stéphane> that the inclusion of a parameter encoding in data
   Stéphane> function (e.g. data(mydata,encoding="latin1"))
   Stéphane> like in the function 'file' could be a way to
   Stéphane> solve the problem. Apparently, the problem is much
   Stéphane> more complicated...

   Stéphane> Sincerely.


   Stéphane> Prof Brian Ripley wrote:

   >> Only ASCII letters are portable: those accented characters do not even
   >> exist in many of the encodings used for R, e.g. Russian and Japanese
   >> on Windows machines.
   >>
   >> There is no way to associate an encoding with a character string in
   >> R.  We considered it, but it would have had severe back-compatibility
   >> problems and little advantage (you cannot display non-ASCII character
   >> strings portably: even if you have a Unicode encoding you still need
   >> to select a suitable font).
   >>
   >> 'B. Ripley' (sic)
   >>
   >>
   >> On Wed, 18 Oct 2006, Stéphane Dray wrote:
   >>
   >>> Hello,
   >>> I have some questions concerning encoding and package distribution. We
   >>> develop the ade4 package. For some data sets included in the package,
   >>> there are accented characters (e.g. é,è...). The data sets have been
   >>> saved using latin1 encoding, but some of us use utf-8 and can not see
   >>> some data sets which contain accented characters.
   >>> e.g:
   >>>
   >>> library(ade4)
   >>> data(rankrock)
   >>> rankrock
   >>>
   >>> in this case, characters are in rownames. Other data sets have such
   >>> characters in data (e.g. levels of factors..). A solution is to use
   >>> iconv... this is quite easy for us but perhaps more difficult for a user
   >>> who may have no idea of the problem. This problem is quite marginal
   >>> for the moment but some Linux distributions are utf-8 by default (e.g.
   >>> Ubuntu) and I suppose that the problem will be more and more present in
   >>> the future.
   >>>
   >>> So we wonder if there is a proper way to code and save these data sets.
   >>> I have found some documents of B. Ripley and this note :
   >>>
   >>> http://developer.r-project.org/210update.txt
   >>>
   >>> -  Names in data objects (e.g. in .rda files) are problematic.  It
   >>> is likely that by release time these will be treated as in
   >>> Latin-1.
   >>>
   >>> If I am correct, I did not find an answer to this problem.
   >>>
   >>> What are the plans of R gurus on this question ?
   >>> Thanks a lot.
   >>> Sincerely.
   >>>
   >>> Please add my address in answers as I am not a subscriber of this list.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595