Dear experts,
I’ve been on this for weeks now, and couldn’t find a solution..Sorry for the 
long description. I figured I post many details, so you get the problem 
entirely, although it’s not hard to grasp.

**Situation:**
Data frame consisting of 4 million entries (total size: 250 MB). Two columns: 
`ID` and `TEXT`. Text strings are each up to 200 characters.


**Task:**
Preprocessing the text strings

Example Data:


    +——————+—————————————————+
    |  ID    |                     Text                                         
        |  
    +——+—————————————————————+
    | 123  | $AAPL is up +5%                                                |  
    | 456  | $MSFT , $EBAY doing great.  www.url.com       |
                                              ..
    +——+—————————————————————+

Should become

    +——————+——————————————————————————————————-——+
    |  ID    |                     Text clean                                   
     |  First Ticker  |  All Ticker       |   Ticker Count      
    +——+————————————————————+——————+———— +———————-—+
    | 123  | [ticker] is up [positive_percentage]                       |       
$aapl       |   $aapl            |          1
    | 456  | [ticker] [ticker] doing great [url] [pos_emotion]     |       
$msft       |   $msft,$ebay  |          2
                                              ..
    +——+————————————————————+——————-+——————+——————+



**Problem:**
It takes too long. On my 8GB RAM Dual-Core machine: Cancelled after 1 day. On a 
70GB 8-Core Amazon EC2 instance: Cancelled after 1 day.


**Details:**
I am basically 

 - Counting how often certain words appear in one string
 - Write this number into a new column (COUNT)
 - Replace this (counted) word
 - Replace other words (which I don't need to count before)
 - Replace some regular expressions

The vectors which are used as patterns look like this:

    "\\bWORD1\\b|\\bWORD2\\b|\\bWORD3\\b|\\bWORD4\\b..."

Thus, those 'replacement vectors' are character vectors of length 1, each 
containing up to 800 words



**Main code:**

    library("parallel")
    library("stringr")

    preprocessText<-function(x){
      
      # Replace the 'html-and'
      arguments<-list(pattern="\\&amp\\;",replacement="and",x=x, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)  
      
      # Remove some special characters
       
arguments<-list(pattern="[^-[:alnum:]\\'\\:\\/\\$\\%\\.\\,\\+\\-\\#\\@\\_\\!\\?+[:space:]]",replacement="",x=y,
 ignore.case=TRUE)
      y<-do.call(gsub, arguments)  
      
      # Lowercase 
      arguments<-list(string=y,pattern=tolower(rep_ticker))
      first<-do.call(str_match,arguments)  
      
      # Identify signal words and count them
      # Need to be done in parts, because otherwise R can't handle this many at 
once
      arguments<-list(string=x, pattern=rep_words_part1)
      t1<-do.call(str_extract_all,arguments)
   
      arguments<-list(string=x, pattern=rep_words_part2)
      t2<-do.call(str_extract_all,arguments)
      
      arguments<-list(string=x, pattern=rep_words_part3)
      t3<-do.call(str_extract_all,arguments)
      
      arguments<-list(string=x, pattern=rep_words_part4)
      t4<-do.call(str_extract_all,arguments)
      
      count=length(t1[[1]])+length(t2[[1]])+length(t3[[1]])+length(t4[[1]])
      signal_words=c(t1[[1]],t2[[1]],t3[[1]],t4[[1]])
      

      # Replacements
      
      arguments<-list(pattern=rep_wordsA,replacement="[ticker]",x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments) 
       
      arguments<-list(pattern=rep_wordB_part1,replacement="[ticker] ",x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)   

      arguments<-list(pattern=rep_wordB_part2,replacement="[ticker] ",x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)   

      arguments<-list(pattern=rep_wordB_part3,replacement="[ticker2] ",x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)   

      arguments<-list(pattern=rep_wordB_part4,replacement=“[ticker2] ",x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)   
      
      arguments<-list(pattern=rep_email,replacement=" [email_address] ",x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)   
      
      arguments<-list(pattern=rep_url,replacement=" [url] ",x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)   
         
      arguments<-list(pattern=rep_wordC,replacement=" [wordC] ",x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)   
      
      # Some regular expressions
      arguments<-list(pattern="\\+[[:digit:]]*.?[[:digit:]]+%",replacement=" 
[positive_percentage] ",x=y, ignore.case=TRUE)
      y<-do.call(gsub, arguments)   
      
      arguments<-list(pattern="-[[:digit:]]*.?[[:digit:]]+%",replacement=" 
[negative_percentage] ",x=y, ignore.case=TRUE)
      y<-do.call(gsub, arguments)   
      
      arguments<-list(pattern="[[:digit:]]*.?[[:digit:]]+%",replacement=" 
[percentage] ",x=y, ignore.case=TRUE)
      y<-do.call(gsub, arguments)   
      
      arguments<-list(pattern="\\$[[:digit:]]*.?[[:digit:]]+",replacement=" 
[dollar_value] ",x=y,ignore.case=TRUE)
      y<-do.call(gsub, arguments)   
      
      arguments<-list(pattern="\\+[[:digit:]]*.?[[:digit:]]+",replacement=" 
[pos_number] ",x=y, ignore.case=TRUE)# remaining numbers 
      y<-do.call(gsub, arguments)   
      
      arguments<-list(pattern="\\-[[:digit:]]*.?[[:digit:]]+",replacement=" 
[neg_number] ",x=y, ignore.case=TRUE)
      y<-do.call(gsub, arguments)   
      
      arguments<-list(pattern="[[:digit:]]*.?[[:digit:]]+",replacement=" 
[number] ",x=y, ignore.case=TRUE)
      y<-do.call(gsub, arguments)   
      
      arguments<-list(pattern=rep_question,replacement=" [question] ", x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)    
      
      
      # Unify synonyms
      arguments<-list(pattern=rep_syno1,replacement="happy", x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)  
      
      arguments<-list(pattern=rep_syno2,replacement="sad", x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)  
      
      arguments<-list(pattern=rep_syno3,replacement="people", x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)  
      
      arguments<-list(pattern=rep_syno4,replacement="father", x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)  
      
      arguments<-list(pattern=rep_syno5,replacement="mother", x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)  
      
      arguments<-list(pattern=rep_syno6,replacement="money", x=y, 
ignore.case=TRUE)
      y<-do.call(gsub, arguments)  
      
      # Remove words
      # Punctuation (I know there a pre-defined R commands for this, but I need 
to customize this
      arguments<-list(pattern=rem_punct,replacement="", x=y, ignore.case=TRUE) 
      y<-do.call(gsub, arguments)  
      
      arguments<-list(pattern=rem_linebreak,replacement=" ", x=y, 
ignore.case=TRUE) #Remove line breaks
      y<-do.call(gsub, arguments) 
     
      #Append Positive or Negative Emotion  
      arguments<-list(x=y)
      y<-do.call(appendEmotion, arguments)  
      
 
      # Output
      result<-list(
        textclean=y,
        first_ticker=first,
        all_ticker=signal_words,
        ticker_count=count)
      
      return(result)
    }
   
    resultList<-mclapply(dataframe$text_column,preprocessText)

** end main code **

(The return would be a list, which I plan to convert to a data.frame. Don’t get 
that far though).


Before, I also tried to call each `gsub` seperately, thus performing the first 
`gsub` on every text string, then the second `gsub` and so on.. but I guess 
that this was even less efficient.

The code itself works, but for me it seems that this can be speeded up. 
Unfortunately I'm not familiar with hash tables, which is what I heard could be 
a solution.

Appreciate your ideas and help very much!




*Definition of the one function called inside `preprocessText`*

    appendEmotion<-function(x){
      
      if (grepl(app_pos,x)){
        x<-paste(x," [pos_emotion] ")
      } 
      if(grepl(app_neg,x)){
        x<-paste(x," [neg_emotion] ")
      }  
      #Output
      return(x)
    }

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to