Re: [R] TM reader with text

Mickael R problem Thu, 01 Mar 2012 09:25:21 -0800

Hi Richard,
there is clearly a problem with Latin ligatures: the terms returned by
findFreqTerms include words such as

"<U+FB01>n" "<U+FB01>nancement" "<U+FB01>nancier" "<U+FB01>nancière"
"<U+FB01>nancières" "<U+FB01>nanciers" "<U+FB01>xe"

where U+FB01 is the code point of the Latin small ligature "fi". So the
problem is now well identified.


Now, how can I deal with it? The tau package seems to offer a solution for
plain text (character vectors) but not for a corpus.

Quotation from the tau documentation:

  translate: Translate Unicode Latin Ligatures

  Description
    Translate Unicode “Latin ligature” characters to their respective
    constituents.

  Usage
    translate_Unicode_latin_ligatures(x)

  Arguments
    x    a character vector in UTF-8 encoding.

  Details
    In typography, a ligature occurs where two or more graphemes are
    joined as a single glyph. (See
    http://en.wikipedia.org/wiki/Typographic_ligature for more information.)
    Unicode (http://www.unicode.org/) lists the following “Latin” ligatures:

      Code  Name
      0132  LATIN CAPITAL LIGATURE IJ
      0133  LATIN SMALL LIGATURE IJ
      0152  LATIN CAPITAL LIGATURE OE
      0153  LATIN SMALL LIGATURE OE
      FB00  LATIN SMALL LIGATURE FF
      FB01  LATIN SMALL LIGATURE FI
      FB02  LATIN SMALL LIGATURE FL
      FB03  LATIN SMALL LIGATURE FFI
      FB04  LATIN SMALL LIGATURE FFL
      FB05  LATIN SMALL LIGATURE LONG S T
      FB06  LATIN SMALL LIGATURE ST

    translate_Unicode_latin_ligatures translates these to their respective
    constituent characters.

I need this kind of function for a corpus, not only for text or character
vectors. Any ideas?
Thanks for your comments and answers. We are making progress!
Mickaël
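
One possible way to apply such a translation to a whole corpus is sketched
below. It is only an illustration under stated assumptions: expand_ligatures
is a hypothetical base-R helper (not part of tau or tm) that covers the
ligatures listed above, the mapping of U+FB05 to "st" is a simplification,
and the corpus step assumes a tm version that provides content_transformer()
(tm 0.6 or later).

```r
# Hypothetical helper: expand Latin ligatures in a character vector.
# The code points follow the list quoted from the tau documentation.
expand_ligatures <- function(x) {
  from <- c("\u0132", "\u0133", "\u0152", "\u0153",
            "\ufb00", "\ufb01", "\ufb02", "\ufb03",
            "\ufb04", "\ufb05", "\ufb06")
  # Assumption: U+FB05 (long s + t) is simplified to "st".
  to <- c("IJ", "ij", "OE", "oe",
          "ff", "fi", "fl", "ffi",
          "ffl", "st", "st")
  for (i in seq_along(from)) {
    # fixed = TRUE: treat the ligature as a literal string, not a regex.
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

words <- c("\ufb01nancement", "\ufb01xe")
print(expand_ligatures(words))  # [1] "financement" "fixe"

# Apply the same function to every document of a tm corpus,
# but only if tm is installed.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(words))
  corpus <- tm::tm_map(corpus, tm::content_transformer(expand_ligatures))
  print(as.character(corpus[[1]]))
}
```

If tau is installed, its translate_Unicode_latin_ligatures function could
presumably be wrapped in content_transformer() in the same way instead of
the hand-written helper; that variant is not tested here.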

