Re: [R] reading a character translation table into R

Duncan Murdoch Sat, 08 Jun 2013 13:54:06 -0700

On 13-06-08 4:31 PM, Michael Friendly wrote:

I have a txt file (attached) that defines equivalents among characters
in latin1 (or iso-8859-1), numeric &#xxx; codes, HTML entities
and latex equivalents.  A portion of the file is shown inline below, but
may not be rendered well in this email.


I'd like to read this into R to use as a character translation table,
but am stuck on two things:
- The 5 fields in the file are column-aligned and are separated by 2+
white space characters.
In perl this is trivial to read and parse via something like
          @entries = split("\n", $charTable);
          foreach (@entries) {
                  ($desc, $char, $code, $html, $tex) = split(/\s\s+/);
          }
AFAIK, the only function for reading such data is utils::read.fwf, but I
have to specify the field widths.
I don't know of any function that allows even a simple regex like this
as a sep= argument.

I see two ways to do this. Work out the column numbers and use read.fwf, or read whole lines and use sub() to extract columns. The latter is pretty close to the spirit of the Perl method, e.g.


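lines <- readLines(filename)   # filename: path to the translation table
regex <- paste(rep("([^[:space:]]*)[[:space:]]*", 5), collapse = "")
desc <- sub(regex, "\\1", lines)
char <- sub(regex, "\\2", lines)
# ... and likewise "\\3", "\\4", "\\5" for the code, html and tex fields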

(Actually, this doesn't work, because the desc field contains embedded spaces; I don't think the Perl would work either. But if you can work out the regexp to match the first field, or just extract it using substr(), you're good.)
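
One way around the embedded spaces, offered only as a rough sketch: anchor each line on the numeric &#ddd; code, which appears in every data row shown, and split what lies to its left and right. The helper names parse_line and pick are made up for illustration, and the sample lines are copied from the listing and typed as R string literals (so the backslashes are doubled here and the non-ASCII characters are written as \u escapes). It assumes descriptions never contain two or more consecutive spaces.

```r
# Hypothetical sample input; with a real file this would be readLines(filename).
lines <- c(
  "Description                         Char   Code      HTML        TeX",
  "ampersand                            &    &#038; &amp        \\&",
  "pound sterling                       \u00a3    &#163; &pound;     \\pounds",
  "soft hyphen                              &#173; &shy;",
  "capital A, grave accent              \u00c0    &#192; &Agrave;    \\`A"
)

# Return element i of x, or "" when the field is missing.
pick <- function(x, i) if (length(x) >= i) x[i] else ""

parse_line <- function(line) {
  m <- regexpr("&#[0-9]+;", line)            # locate the numeric code
  start <- as.integer(m)
  if (start == -1L) return(NULL)             # header or blank line: skip
  len   <- attr(m, "match.length")
  code  <- substr(line, start, start + len - 1L)
  left  <- trimws(substr(line, 1L, start - 1L))
  right <- trimws(substr(line, start + len, nchar(line)))
  lparts <- strsplit(left, "[[:space:]]{2,}")[[1]]  # desc, then char if present
  rparts <- strsplit(right, "[[:space:]]+")[[1]]    # html, then tex if present
  data.frame(desc = pick(lparts, 1), char = pick(lparts, 2), code = code,
             html = pick(rparts, 1), tex = paste(rparts[-1], collapse = " "),
             stringsAsFactors = FALSE)
}

rows <- lapply(lines, parse_line)
tab  <- do.call(rbind, Filter(Negate(is.null), rows))
print(tab)
```

Rows with no Char or no TeX entry come back with an empty string in that column, so the table stays rectangular.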


Duncan Murdoch


- The TeX field contains many backslashed codes that need to be escaped
in R. Is it necessary to manually edit the file to change
'\pounds' --> '\\pounds', '\S' --> '\\S', etc., or is there something
like raw mode input that would do this where necessary?
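
On the escaping question, a small check that may help: as far as I know, doubling is only needed for string literals typed in R source code, and readLines() returns the characters in the file unchanged. The sketch below writes two made-up lines to a temporary file and reads them back; the file contents and the variable names are assumptions for illustration only.

```r
tmp <- tempfile(fileext = ".txt")

# The backslashes are doubled here only because these are R string literals;
# the file on disk ends up containing a single backslash in each line.
writeLines(c("pound sterling  \\pounds", "section sign  \\S"), tmp)

x <- readLines(tmp)      # no manual editing of the file is involved
print(x)                 # print() displays each backslash in escaped form
cat(x, sep = "\n")       # cat() writes the actual characters

# Drop everything up to the last run of 2+ whitespace characters.
tex <- sub("^.*[[:space:]]{2,}", "", x)
print(nchar(tex))        # character counts of the extracted TeX fields

unlink(tmp)
```

Comparing the print() and cat() output shows that the doubled backslash is a display convention, not something stored in the string.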

Description                         Char   Code      HTML        TeX
double quote                         "    &#034; &quot;
ampersand                            &    &#038; &amp        \&
apostrophe                           '    &#039; &apos;
less than                            <    &#060; &lt;        $<$
greater than                         >    &#062; &gt;        $>$
non-breaking space                   .    &#160; &nbsp;      ~
inverted exclamation                 ¡    &#161; &iexcl;     !'
cent sign                            ¢    &#162; &cent;
pound sterling                       £    &#163; &pound;     \pounds
general currency sign                ¤    &#164; &curren;
yen sign                             ¥    &#165; &yen;
broken vertical bar                  ¦    &#166; &brvbar;
section sign                         §    &#167; &sect;      \S
umlaut (dieresis)                    ¨    &#168; &uml;       \"{}
copyright                            ©    &#169; &copy;      \copyright
feminine ordinal                     ª    &#170; &ordf;      $^a$
left angle quote, guillemotleft      «    &#171; &laquo;     \guillemotleft
not sign                             ¬    &#172; &not;
soft hyphen                              &#173; &shy;
registered trademark                 ®    &#174; &reg;       \textregistered
macron accent                        ¯    &#175; &macr;
degree sign                          °    &#176; &deg;       $^o$
plus or minus                        ±    &#177; &plusmn;    $\pm$
superscript two                      ²    &#178; &sup2;      $^2$
superscript three                    ³    &#179; &sup3;      $^3$
acute accent                         ´    &#180; &acute;     \'{}
micro sign                           µ    &#181; &micro;     $\mu$
paragraph sign                       ¶    &#182; &para;      \P
middle dot                           ·    &#183; &middot;    $\cdot$
cedilla                              ¸    &#184; &cedil;     \c{}
superscript one                      ¹    &#185; &sup1;      $^1$
masculine ordinal                    º    &#186; &ordm;      $^o$
right angle quote, guillemotright    »    &#187; &raquo;     \guillemotright
fraction one-fourth                  ¼    &#188; &frac14;    $\frac14$
fraction one-half                    ½    &#189; &frac12;    $\frac12$
fraction three-fourths               ¾    &#190; &frac34;    $\frac34$
inverted question mark               ¿    &#191; &iquest;    ?'
capital A, grave accent              À    &#192; &Agrave;    \`A
capital A, acute accent              Á    &#193; &Aacute;    \'A
capital A, circumflex accent         Â    &#194; &Acirc;     \^A
capital A, tilde                     Ã    &#195; &Atilde;    \~A
capital A, dieresis or umlaut mark   Ä    &#196; &Auml;      \"A
capital A, ring                      Å    &#197; &Aring;     \AA
capital AE diphthong (ligature)      Æ    &#198; &AElig;     \AE



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.