"Try this before running removePuncutation():
corpus <- tm_map(corpus, function(x) gsub("[\'\U2019«»]", " ", x))"
It will replace quotation marks with a space, and that's enough to
separate them from the rest of the word.
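As a minimal sketch of what that substitution does, here it is on a plain character vector in base R. The helper name strip_quotes and the sample words are made up for illustration; the guillemets are written as escapes so the file encoding does not matter.

```r
# Hypothetical helper: replace straight and typographic apostrophes and
# French guillemets with a space, so tokens such as "d'une" split cleanly.
strip_quotes <- function(x) {
  # ' (U+0027), right single quote (U+2019), « (U+00AB), » (U+00BB)
  gsub("['\u2019\u00ab\u00bb]", " ", x)
}

# Illustrative input only, not taken from the original corpus
words <- c("d'une", "l\u2019analyse", "\u00abtexte\u00bb")
print(strip_quotes(words))
```

In recent versions of tm, a custom function passed to tm_map() generally has to be wrapped in content_transformer() to be applied to a Corpus; it is worth checking this against the installed version.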
I tried your solution. It works only for character vectors, not for a Corpus,
Hello everybody,
I haven't given up the fight, but it's hard. I have found a solution for the
ligatures by using a better converter, which translates PDF to plain text
more precisely. But a new problem has occurred. In French particularly, though
it should be the case in English too, I have a big problem ' " bracke
Hi Richard,
clearly there is a problem with Latin ligatures, because the result of my
call to findFreqTerms gives me words such as:
"n" "nancement"
"nancier" "nancière" "nancières"
"nanciers" "xe"
where U+FB01 is the code for a Latin ligature. The problem: my computer runs
Windows Vista 64-bit SP2. As for the question about encoding, I don't
understand it, sorry.
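One possible workaround, sketched here in base R, is to expand the ligature code points back into ordinary letters before building the corpus. U+FB00 to U+FB04 are the Unicode ligatures ff, fi, fl, ffi and ffl; the function name expand_ligatures and the sample words are assumptions for illustration, not part of tm.

```r
# Latin ligature code points and their plain-letter expansions
lig_from <- c("\uFB00", "\uFB01", "\uFB02", "\uFB03", "\uFB04")
lig_to   <- c("ff",     "fi",     "fl",     "ffi",    "ffl")

# Hypothetical helper: replace each ligature with its plain letters
expand_ligatures <- function(x) {
  for (i in seq_along(lig_from)) {
    # fixed = TRUE treats the ligature as a literal string, not a regex
    x <- gsub(lig_from[i], lig_to[i], x, fixed = TRUE)
  }
  x
}

# Illustrative input: words as they might come out of a PDF converter
print(expand_ligatures(c("\uFB01nancement", "\uFB01xe")))
```

This only helps if the ligature characters are still present in the text when it is read into R; if the converter or the encoding step has already dropped them, the missing letters cannot be recovered this way.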
--
View this message in context:
http://r.789695.n4.nabble.com/TM-reader-with-text-tp4433394p4433526.html
Sent from the R help mailing list archive at Nabble.com.
Hello everybody,
I am working, or trying to, with tm, but I have a problem with some special
words in French. I think this is due to the way the PDF was converted to
text, but I'm not entirely sure.
Here is an example:
findFreqTerms(tdm1,30)
[33] """n" "nancement"
"nancier"
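To check whether the PDF conversion is the cause, one option is to test the raw text for ligature code points before building the term-document matrix. The sketch below uses only base R; the name has_ligature is hypothetical, and the range U+FB00 to U+FB06 covers the Latin ligatures in Unicode's Alphabetic Presentation Forms block.

```r
# Hypothetical diagnostic: TRUE for each string containing a Latin ligature
has_ligature <- function(x) {
  vapply(
    x,
    function(s) {
      # Convert to code points and look for any in U+FB00..U+FB06
      any(utf8ToInt(enc2utf8(s)) %in% 0xFB00:0xFB06)
    },
    logical(1),
    USE.NAMES = FALSE
  )
}

# Illustrative input only
print(has_ligature(c("\uFB01nancier", "financier")))
```

If this returns TRUE for lines of the converted text, the ligatures survived the conversion and are being lost later, for example when the text is read in with a different encoding.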