Hi all,

Currently I am working on code that calculates the word occurrence rate in a set of tweets.
First, I have 'tweets', which contains all the tweets I grabbed, and from it I build 'words', which contains every unique word in 'tweets'.
After that I use sapply to calculate the probability of each word appearing in 'tweets'.
The main problem is speed: before using sapply I used a simple for loop, which took a really long time to finish, though I could at least print a simple ETA inside the loop.
After I learned to use sapply and applied it to the code, the speed improved greatly, but I no longer have an ETA, so I just wait for the result to appear.
Using just 5% of my data I have waited for hours and R is still busy with no output.
Is there a faster solution, or a useful package that could help with my problem?

Here is my code:

sample.num <- 100000

tweets <- read.csv('data_conv.csv', sep = ',', header = TRUE, stringsAsFactors = FALSE)
tweets.num <- nrow(tweets)
tweets <- tweets[sample(tweets.num, sample.num, replace = FALSE), 1] # sample rows; tweet text is in column 1
tweets.num <- length(tweets)

words <- paste(tweets, collapse = ' ')
words <- gsub("\\r\\n", " ", words, perl = TRUE)    # remove newlines
words <- gsub(" *\\d+ *", " ", words, perl = TRUE)  # remove digits
words <- gsub("[^\\w@]+", " ", words, perl = TRUE)  # remove nonwords
words <- unique(strsplit(tolower(words), split = ' ')[[1]]) # unique words
words <- sort(words)                                        # sort them
words.num <- length(words)

result <- data.frame(words, stringsAsFactors = FALSE)
result$prob <- sapply(words, function(w)
  sum(grepl(sprintf('\\b%s\\b', w), tweets,
            ignore.case = TRUE, perl = TRUE)) / tweets.num) # Loooong time here
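For comparison, here is a sketch of one possible faster approach: instead of running one regex scan over all tweets per word, tokenise every tweet once and count document frequencies with table(). The toy 'tweets' vector below is only for illustration; the cleanup pattern mirrors the gsub calls above.

```r
# Sketch: tokenise each tweet once, then count in how many tweets each word
# appears, avoiding a per-word grepl() pass over the whole corpus.
tweets <- c("the cat sat", "the dog ran", "a cat ran")  # toy data for illustration
tweets.num <- length(tweets)

clean  <- tolower(gsub("[^\\w@]+", " ", tweets, perl = TRUE)) # same cleanup idea as above
tokens <- lapply(strsplit(clean, " +"), unique)  # each word counted at most once per tweet
counts <- table(unlist(tokens))                  # document frequency per word
result <- data.frame(words = names(counts),
                     prob  = as.numeric(counts) / tweets.num,
                     stringsAsFactors = FALSE)
```

This does a single tokenisation pass over the data, so the cost grows with the total number of tokens rather than with (number of words) x (number of tweets).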

Thank you,
Bembi


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.