On Thursday 1 March 2012 at 07:07 -0800, Mickael R problem wrote:
> Hi Richard,
>
> Clearly there is a problem with Latin ligatures, because the result of
> my call to findFreqTerms includes terms such as
>
>   "<U+FB01>n" "<U+FB01>nancement" "<U+FB01>nancier" "<U+FB01>nancière"
>   "<U+FB01>nancières" "<U+FB01>nanciers" "<U+FB01>xe"
>
> where U+FB01 is the code point of a Latin ligature. So the problem is
> well identified.
>
> Now, how can I treat it? The tau package seems to offer a solution for
> text but not for a corpus.
>
> Quotation from the tau documentation:
>
>   translate    Translate Unicode Latin Ligatures
>
>   Description
>     Translate Unicode “Latin ligature” characters to their respective
>     constituents.
>
>   Usage
>     translate_Unicode_latin_ligatures(x)
>
>   Arguments
>     x    a character vector in UTF-8 encoding.
>
>   Details
>     In typography, a ligature occurs where two or more graphemes are
>     joined as a single glyph. (See
>     http://en.wikipedia.org/wiki/Typographic_ligature for more
>     information.)
>     Unicode (http://www.unicode.org/) lists the following “Latin”
>     ligatures:
>
>       Code  Name
>       0132  LATIN CAPITAL LIGATURE IJ
>       0133  LATIN SMALL LIGATURE IJ
>       0152  LATIN CAPITAL LIGATURE OE
>       0153  LATIN SMALL LIGATURE OE
>       FB00  LATIN SMALL LIGATURE FF
>       FB01  LATIN SMALL LIGATURE FI
>       FB02  LATIN SMALL LIGATURE FL
>       FB03  LATIN SMALL LIGATURE FFI
>       FB04  LATIN SMALL LIGATURE FFL
>       FB05  LATIN SMALL LIGATURE LONG S T
>       FB06  LATIN SMALL LIGATURE ST
>
>     translate_Unicode_latin_ligatures translates these to their
>     respective constituent characters.
>
> I need this kind of function for a corpus, not only for text or
> characters. Any ideas?

Try:

    corpus <- tm_map(corpus, translate_Unicode_latin_ligatures)

(with 'corpus' being your corpus, of course ;-)
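
For illustration, the kind of translation described in the tau documentation can be sketched in base R. This is only a sketch under assumptions: expand_ligatures is a hypothetical helper name, it is not the tau implementation, and writing U+FB05 (long s + t) as plain "st" is a choice made here for text-mining purposes.

```r
# Sketch only: expand_ligatures is a hypothetical helper, not the tau code.
# Code points of the Latin ligatures listed in the tau documentation.
from <- c("\u0132", "\u0133", "\u0152", "\u0153", "\ufb00", "\ufb01",
          "\ufb02", "\ufb03", "\ufb04", "\ufb05", "\ufb06")

# Replacement strings, in the same order as 'from'.
# Assumption of this sketch: U+FB05 (long s + t) is written as plain "st".
to <- c("IJ", "ij", "OE", "oe", "ff", "fi",
        "fl", "ffi", "ffl", "st", "st")

expand_ligatures <- function(x) {
  # x: a character vector in UTF-8 encoding
  for (i in seq_along(from)) {
    # fixed = TRUE: match the ligature literally, not as a regular expression
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

terms <- c("\ufb01nancement", "\ufb01xe", "c\u0153ur")
expand_ligatures(terms)
# returns c("financement", "fixe", "coeur")
```

A function of this shape takes a character vector and returns one, so it could be handed to tm_map in the same way as the tau function. Newer tm releases may require wrapping such a function in content_transformer(); whether that applies depends on the installed tm version.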