Hi Richard,

There is clearly a problem with Latin ligatures: the terms returned by findFreqTerms include words such as

"<U+FB01>n" "<U+FB01>nancement" "<U+FB01>nancier" "<U+FB01>nancière" "<U+FB01>nancières" "<U+FB01>nanciers" "<U+FB01>xe"

where U+FB01 is the code point of the Latin small ligature "fi". So the problem is now well identified.
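To make the symptom concrete, here is a minimal base R sketch on a plain character vector. The vector `terms` is made up for illustration (it is not my real findFreqTerms output), and the fix shown only handles the single ligature U+FB01.

```r
# Hypothetical sample terms; "\ufb01" is U+FB01, LATIN SMALL LIGATURE FI.
terms <- c("\ufb01nancement", "\ufb01xe", "finance")

# Flag the terms that contain the ligature character.
has_ligature <- grepl("\ufb01", terms, fixed = TRUE)

# Replace the single ligature character with the two letters "f" and "i".
fixed_terms <- gsub("\ufb01", "fi", terms, fixed = TRUE)

print(has_ligature)
print(fixed_terms)
```

This works on a character vector only, which is exactly the limitation discussed next.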
Now, how can I treat it? The tau package seems to offer a solution for text (character vectors) but not for a corpus. Quoting the tau documentation:

  translate    Translate Unicode Latin Ligatures

  Description
    Translate Unicode “Latin ligature” characters to their respective constituents.

  Usage
    translate_Unicode_latin_ligatures(x)

  Arguments
    x    a character vector in UTF-8 encoding.

  Details
    In typography, a ligature occurs where two or more graphemes are joined as a single glyph. (See http://en.wikipedia.org/wiki/Typographic_ligature for more information.)

    Unicode (http://www.unicode.org/) lists the following “Latin” ligatures:

    Code  Name
    0132  LATIN CAPITAL LIGATURE IJ
    0133  LATIN SMALL LIGATURE IJ
    0152  LATIN CAPITAL LIGATURE OE
    0153  LATIN SMALL LIGATURE OE
    FB00  LATIN SMALL LIGATURE FF
    FB01  LATIN SMALL LIGATURE FI
    FB02  LATIN SMALL LIGATURE FL
    FB03  LATIN SMALL LIGATURE FFI
    FB04  LATIN SMALL LIGATURE FFL
    FB05  LATIN SMALL LIGATURE LONG S T
    FB06  LATIN SMALL LIGATURE ST

    translate_Unicode_latin_ligatures translates these to their respective constituent characters.

I need this kind of function for a corpus, not only for text or character vectors. Any ideas?

Thanks for your comments and answers. We are making progress!

Mickaël
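P.S. As a rough sketch of the kind of function I am after, here is a base R version that follows the ligature table quoted above and applies it to every document in a collection. The names `from`, `to`, `translate_ligatures` and `docs` are my own, the plain list `docs` only stands in for a real tm corpus, and I am assuming that U+FB05 should become long s (U+017F) followed by "t"; tau itself may behave differently.

```r
# Ligature code points (from the table quoted from the tau documentation)
# and their constituent characters. The mapping for U+FB05 is an assumption.
from <- c("\u0132", "\u0133", "\u0152", "\u0153",
          "\ufb00", "\ufb01", "\ufb02", "\ufb03",
          "\ufb04", "\ufb05", "\ufb06")
to   <- c("IJ", "ij", "OE", "oe",
          "ff", "fi", "fl", "ffi",
          "ffl", "\u017ft", "st")

# Replace every ligature in a character vector by its constituents.
translate_ligatures <- function(x) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Stand-in for a corpus: a plain list with one character vector per document.
docs <- list(
  doc1 = c("le \ufb01nancement", "taux \ufb01xe"),
  doc2 = "un c\u0153ur, une \ufb02eur"
)

# Apply the character-level function to each document in turn.
clean_docs <- lapply(docs, translate_ligatures)
print(clean_docs)
```

With a real tm corpus, the same character-level function (or tau's translate_Unicode_latin_ligatures) could presumably be passed to tm_map, wrapped in content_transformer in recent tm versions, but I have not tested that here.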