Re: [R] speed issue: gsub on large data frame

2013-11-06 Thread Carl Witthoft
If you could, please identify which responder's idea you used, as well as the strsplit-related code you ended up with. That may help someone who browses the mail archives in the future. Carl

SPi wrote:
> I'll answer myself: using strsplit with fixed=TRUE took like 2 minutes!

Re: [R] speed issue: gsub on large data frame

2013-11-06 Thread SPi
I'll answer myself: using strsplit with fixed=TRUE took like 2 minutes!
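
A minimal sketch of the fix described above, with a toy stand-in for the poster's 4-million-row data frame (the frame and column names are assumptions, not from the thread):

  dataframe <- data.frame(text_column = c("AAPL rose 5% today",
                                          "EBAY fell after earnings"),
                          stringsAsFactors = FALSE)

  # fixed = TRUE treats the separator as a literal string and bypasses
  # the regex engine entirely, which is where the speed-up comes from
  splittedStrings <- strsplit(dataframe$text_column, " ", fixed = TRUE)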

Re: [R] speed issue: gsub on large data frame

2013-11-06 Thread SPi
Good idea! I'm trying your approach right now, but I am wondering whether str_split (package 'stringr') or strsplit is the right way to go in terms of speed? I ran str_split over the text column of the data frame and it has been processing for 2 hours now..? I did: splittedStrings <- str_split(dataframe$text_column, …
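
A likely factor here is that str_split compiles its separator as a regular expression by default. If stringr is preferred over base strsplit, wrapping the separator in fixed() requests literal matching instead; a sketch, reusing the hypothetical data frame from above:

  library(stringr)
  splittedStrings <- str_split(dataframe$text_column, fixed(" "))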

Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Carl Witthoft
My feeling is that the **result** you want is far more easily achievable via a substitution table or a hash table. Someone better versed in those areas may want to chime in. I'm thinking more or less of splitting your character strings into vectors (separate elements at whitespace) and chunking through the replacements with a lookup table…
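
One way to realize the lookup-table idea: split each string at whitespace, replace known tokens through a named lookup vector, and paste the pieces back together. The ticker symbols and the token below are hypothetical:

  lookup <- c(AAPL = "[ticker]", EBAY = "[ticker]")

  replace_tokens <- function(s) {
    words <- strsplit(s, " ", fixed = TRUE)
    vapply(words, function(w) {
      hit <- w %in% names(lookup)   # which words appear in the table
      w[hit] <- lookup[w[hit]]      # substitute matched words
      paste(w, collapse = " ")      # reassemble the string
    }, character(1))
  }

  replace_tokens(c("AAPL rose today", "nothing to replace"))
  # [1] "[ticker] rose today"  "nothing to replace"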

Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Simon Pickert
Thanks everybody! Now I understand the need for more details: the patterns for the gsubs are of different kinds. First, I have character strings I need to replace. Therefore, I have around 5000 stock ticker symbols (e.g. c('AAPL', 'EBAY', …)) distributed across 10 vectors. Second, I have four vectors of…
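
For the ticker-symbol case, one commonly suggested alternative to hundreds of separate gsub calls is collapsing each vector into a single alternation. A sketch, assuming tickers should match only as whole words (the vector here stands in for one of the ten):

  tickers <- c("AAPL", "EBAY", "GOOG")
  pattern <- paste0("\\b(", paste(tickers, collapse = "|"), ")\\b")
  gsub(pattern, "[ticker]", "AAPL and EBAY moved today", perl = TRUE)
  # [1] "[ticker] and [ticker] moved today"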

Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Prof Brian Ripley
But note too what the help says:

Performance considerations: If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and 'fixed = TRUE' faster still…
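
A quick way to see that advice in action is to time the three engines on the same literal substitution (the sizes and strings here are arbitrary):

  x <- rep("AAPL rose while EBAY fell", 1e5)
  system.time(gsub("AAPL", "[ticker]", x))               # default engine
  system.time(gsub("AAPL", "[ticker]", x, perl = TRUE))  # PCRE
  system.time(gsub("AAPL", "[ticker]", x, fixed = TRUE)) # literal matching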

Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Jim Holtman
What is missing is any idea of what the 'patterns' are that you are searching for. Regular expressions are very sensitive to how you specify the pattern. You indicated that you have up to 500 elements in the pattern, so what does it look like? Alternation and backtracking can be very expensive…

Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Simon Pickert
How's that not reproducible?
1. Data frame, one column with text strings
2. Size of data frame = 4 million observations
3. A bunch of gsubs in a row: gsub(patternvector, "[token]", dataframe$text_column)
4. General question: how to speed up string operations on 'large' data sets?
Please let me…

Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Jeff Newmiller
It is not reproducible [1] because I cannot run your (representative) example. The type of regex pattern, token, and even the character of the data you are searching can affect possible optimizations. Note that a non-memory-resident tool such as sed or perl may be an appropriate tool for a problem…
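
For the sed route, a hedged sketch of streaming the substitution outside R, driven from R via system2 (the file names and pattern are hypothetical, and sed must be on the PATH):

  system2("sed",
          args   = shQuote("s/AAPL/[token]/g"),
          stdin  = "tweets.txt",
          stdout = "tweets_tokenized.txt")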

Re: [R] speed issue: gsub on large data frame

2013-11-04 Thread Jeff Newmiller
Example not reproducible. Communication fail. Please refer to Posting Guide.

[R] speed issue: gsub on large data frame

2013-11-04 Thread Simon Pickert
Hi R'lers, I'm running into speed issues performing a bunch of gsub(patternvector, "[token]", dataframe$text_column) calls on a data frame containing >4 million entries. (The patternvectors contain up to 500 elements.) Is there any better/faster way than performing like 20 gsub commands in a row?
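
A reconstruction of the setup as described, with hypothetical pattern vectors and data (only the shape of the loop comes from the post):

  dataframe <- data.frame(text_column = c("AAPL up", "foo down"),
                          stringsAsFactors = FALSE)
  patternvectors <- list(c("AAPL", "EBAY"),   # each up to 500 elements,
                         c("foo", "bar"))     # about 20 vectors in total
  for (pv in patternvectors) {
    dataframe$text_column <- gsub(paste(pv, collapse = "|"),
                                  "[token]", dataframe$text_column)
  }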