On 13-06-08 4:31 PM, Michael Friendly wrote:
I have a txt file (attached) that defines equivalents among characters
in latin1 (or iso-8859-1), numeric &#xxx; codes, HTML entities
and latex equivalents. A portion of the file is shown inline below, but
may not be rendered well in this email.
I'd like to read this into R to use as a character translation table,
but am stuck on two things:
- The 5 fields in the file are column-aligned and are separated by 2+
white space characters.
In Perl this is trivial to read and parse via something like
@entries = split("\n", $charTable);
foreach (@entries) {
($desc, $char, $code, $html, $tex) = split(/\s\s+/);
}
AFAIK, the only function for reading such data is utils::read.fwf, but I
have to specify the field widths.
I don't know of any function that allows even a simple regex like this
as a sep= argument.
I see two ways to do this. Work out the column numbers and use
read.fwf, or read whole lines, and use sub() to extract columns. The
latter is pretty close to the spirit of the Perl method, e.g.
lines <- readLines( filename )
regex <- paste(rep("([^[:space:]]*)[[:space:]]*", 5), collapse="")
desc <- sub(regex, "\\1", lines)
char <- sub(regex, "\\2", lines)
etc.
(Actually, this doesn't work, because the desc field contains embedded
spaces; I don't think the Perl would work either. But if you can work
out the regexp to match the first field, or just extract it using
substr(), you're good.)
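
One way to get close to the Perl split in R is to read whole lines and split on runs of two or more whitespace characters with strsplit(). What follows is only a minimal sketch under assumptions: the three sample lines are made up to mimic the file layout, the names charTable, desc, char, code, html and tex are just borrowed from the Perl snippet, and every field is assumed to be separated from its neighbour by at least two spaces.

```r
# Sample lines in the layout described above (fields separated by 2+ spaces).
# These lines are illustrative stand-ins, not the attached file.
lines <- c(
  "pound sterling        \u00a3  &#163;  &pound;  \\pounds",
  "section sign          \u00a7  &#167;  &sect;   \\S",
  "cent sign             \u00a2  &#162;  &cent;"
)

# Split each line on runs of two or more whitespace characters,
# so single spaces inside the description stay intact.
fields <- strsplit(lines, "[[:space:]]{2,}")

# Pad short rows (e.g. a missing TeX field) with NA so every row has 5 entries.
fields <- lapply(fields, function(x) { length(x) <- 5; x })

# Bind the rows into a character matrix, then into a data frame.
charTable <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(charTable) <- c("desc", "char", "code", "html", "tex")
```

This only lines up when missing fields are at the end of a line. If a middle field is blank (as the soft hyphen row may be), the split shifts the later fields left, and extracting by column position with substr() or read.fwf is the safer route.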
Duncan Murdoch
- The TeX field contains many backslashed codes that need to be escaped
in R. Is it necessary
to manually edit the file to change '\pounds' --> '\\pounds', '\S' -->
'\\S', etc. or is there something
like raw mode input that would do this where necessary?
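
On the escaping question: backslashes only need doubling inside string literals in R source code. readLines() returns the text of the file as it is, so a file containing \pounds should not need editing. A small sketch to check this, assuming a throwaway temporary file with one made-up line:

```r
# Write one line containing a literal backslash to a temporary file.
# In R source the backslash must be doubled; the file receives a single one.
tmp <- tempfile(fileext = ".txt")
writeLines("pound sterling  \\pounds", tmp)

x <- readLines(tmp)

# Keep only the last field (everything after the final run of 2+ spaces).
tex <- sub(".*[[:space:]]{2,}", "", x)

nchar(tex)      # 7: one backslash plus the six letters of "pounds"
cat(tex, "\n")  # cat() shows the string as stored, with a single backslash
print(tex)      # print() shows the escaped form, with a doubled backslash

unlink(tmp)
```

The doubled backslash shown by print() is only the display form of the string; the stored value has a single backslash.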
Description Char Code HTML TeX
double quote " " "
ampersand & & & \&
apostrophe ' ' '
less than < < < $<$
greater than > > > $>$
non-breaking space .   ~
inverted exclamation ¡ ¡ ¡ !'
cent sign ¢ ¢ ¢
pound sterling £ £ £ \pounds
general currency sign ¤ ¤ ¤
yen sign ¥ ¥ ¥
broken vertical bar ¦ ¦ ¦
section sign § § § \S
umlaut (dieresis) ¨ ¨ ¨ \"{}
copyright © © © \copyright
feminine ordinal ª ª ª $^a$
left angle quote, guillemotleft « « « \guillemotleft
not sign ¬ ¬ ¬
soft hyphen ­ ­
registered trademark ® ® ® \textregistered
macron accent ¯ ¯ ¯
degree sign ° ° ° $^o$
plus or minus ± ± ± $\pm$
superscript two ² ² ² $^2$
superscript three ³ ³ ³ $^3$
acute accent ´ ´ ´ \'{}
micro sign µ µ µ $\mu$
paragraph sign ¶ ¶ ¶ \P
middle dot · · · $\cdot$
cedilla ¸ ¸ ¸ \c{}
superscript one ¹ ¹ ¹ $^1$
masculine ordinal º º º $^o$
right angle quote, guillemotright » » » \guillemotright
fraction one-fourth ¼ ¼ ¼ $\frac14$
fraction one-half ½ ½ ½ $\frac12$
fraction three-fourths ¾ ¾ ¾ $\frac34$
inverted question mark ¿ ¿ ¿ ?'
capital A, grave accent À À À \`A
capital A, acute accent Á Á Á \'A
capital A, circumflex accent Â Â Â \^A
capital A, tilde à à à \~A
capital A, dieresis or umlaut mark Ä Ä Ä \"A
capital A, ring Å Å Å \AA
capital AE diphthong (ligature) Æ Æ Æ \AE
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.