Re: [R] How to remove non-UTF-8 characters from a string

Prof Brian Ripley Fri, 26 Oct 2007 07:47:32 -0700

That is not a well-defined concept.  To define 'character' you need to 
know the encoding, since that determines how to split the bytes into 
characters.  So only whole strings can be UTF-8 or not.  You can say which 
bytes in a stream of bytes would be valid in UTF-8, but if not all of them 
are then almost certainly it would be incorrect to interpret any of them 
in UTF-8.

In a UTF-8 locale, you can find out whether a stream of bytes is valid 
UTF-8 by calling nchar(x, "chars", allowNA=TRUE) and testing for NA 
elements in the result.
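
A minimal sketch of that check follows. The sample strings and the names 
good, bad, invalid and x_valid are hypothetical, and marking the input as 
UTF-8 with Encoding() is an assumption added here so that the result should 
not depend on the session's locale; it is not part of the original advice.

```r
# Hypothetical sample data: one valid UTF-8 string and one byte sequence
# ending in a lone 0xE9 (Latin-1 e-acute), which is not valid UTF-8.
good <- "caf\u00e9"
bad  <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
x <- c(good, bad)

# Assumption: declare the strings as UTF-8 so the check does not rely on
# the current session running in a UTF-8 locale.
Encoding(x) <- "UTF-8"

# Count characters; with allowNA = TRUE an invalid multibyte string
# yields NA instead of raising an error.
n <- nchar(x, type = "chars", allowNA = TRUE)
invalid <- is.na(n)

# Keep only the strings that are valid as a whole, rather than trying
# to drop individual "characters" from an invalid one.
x_valid <- x[!invalid]
```

Here invalid flags whole strings, in line with the point that only whole 
strings can be UTF-8 or not. More recent versions of R (3.3.0 and later) 
also provide validUTF8(x), which should give a comparable per-string answer.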

On Fri, 26 Oct 2007, Bos, Roger wrote:

> All,
>
> I am trying to post text from an XLS spreadsheet to my wiki and I need to
> remove any characters that are not UTF-8.  Is there an easy gsub command
> that can do this?
>
> (I previously sent this same email to r-sig-gui.  That was a mistake and
> I apologize for the duplication.)
>
> Thanks, Roger J. Bos

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
