If you could, please identify which responder's idea you used, as well as the
strsplit-related code you ended up with.
That may help someone who browses the mail archives in the future.
Carl
SPi wrote:
> I'll answer myself:
> using strsplit with fixed=TRUE took like 2 minutes!
I'll answer myself:
using strsplit with fixed=TRUE took like 2 minutes!
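For anyone who finds this in the archive, a minimal sketch of that kind of call, with an invented two-row stand-in for the real 4-million-row data frame (the column name and separator are assumptions, not taken from the original post):

## toy stand-in for the large data frame
df <- data.frame(text_column = c("AAPL rose today", "EBAY fell sharply"),
                 stringsAsFactors = FALSE)

## fixed = TRUE bypasses the regex engine entirely, which is where the speed comes from
tokens <- strsplit(df$text_column, " ", fixed = TRUE)
str(tokens)  # a list with one character vector of words per row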
Good idea!
I'm trying your approach right now, but I am wondering whether str_split
(package 'stringr') or strsplit is the right way to go in terms of speed. I
ran str_split over the text column of the data frame and it has been
processing for 2 hours now...?
I did:
splittedStrings<-str_split(datafr
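For readers of the archive: the call above was cut off; it was presumably something along these lines, with the column name taken from later messages in the thread and the whitespace pattern only a guess:

library(stringr)

## toy stand-in for the real data frame
dataframe <- data.frame(text_column = c("AAPL rose today", "EBAY fell"),
                        stringsAsFactors = FALSE)

## str_split() treats its pattern as a regex by default ...
splittedStrings <- str_split(dataframe$text_column, " ")

## ... wrapping the pattern in fixed() skips the regex engine, the stringr
## analogue of strsplit(..., fixed = TRUE)
splittedStrings <- str_split(dataframe$text_column, fixed(" "))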
My feeling is that the **result** you want is far more easily achievable via
a substitution table or a hash table. Someone better versed in those areas
may want to chime in. I'm thinking more or less of splitting your character
strings into vectors (separate elements at whitespace) and chunking a
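A rough sketch of that idea, with invented tokens and a named character vector standing in for the substitution/hash table (none of this is from the original posts):

## substitution table: names are the words to find, values are the replacements
subst <- c(AAPL = "[ticker]", EBAY = "[ticker]")

txt <- c("AAPL rose today", "no symbols here")

## split each string at whitespace, swap known words via the table, re-join
out <- vapply(strsplit(txt, " ", fixed = TRUE), function(words) {
  hit <- words %in% names(subst)
  words[hit] <- subst[words[hit]]
  paste(words, collapse = " ")
}, character(1))
out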
Thanks everybody! Now I understand the need for more details:
the patterns for the gsubs are of different kinds. First, I have character
strings I need to replace: around 5000 stock ticker symbols
(e.g. c("AAPL", "EBAY", ...)) distributed across 10 vectors.
Second, I have four vec
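Whatever the remaining vectors contain, ticker symbols like these are fixed strings, so one common approach (a sketch only, not the poster's code) is to collapse each vector into a single word-bounded alternation and call gsub once per vector instead of once per symbol:

tickers <- c("AAPL", "EBAY", "GOOG")   # stand-in for one of the ~500-element vectors

pat <- paste0("\\b(?:", paste(tickers, collapse = "|"), ")\\b")
gsub(pat, "[ticker]", "AAPL and EBAY both moved", perl = TRUE)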
But note too what the help says:
Performance considerations:
If you are doing a lot of regular expression matching, including
on very long strings, you will want to consider the options used.
Generally PCRE will be faster than the default regular expression
engine, and 'fixed = TRUE' faster still.
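To make that concrete, one way to compare the three engines on throwaway data (sizes and strings invented):

x <- rep("AAPL rose while EBAY fell", 1e5)

system.time(gsub("AAPL", "[ticker]", x))                # default engine
system.time(gsub("AAPL", "[ticker]", x, perl = TRUE))   # PCRE
system.time(gsub("AAPL", "[ticker]", x, fixed = TRUE))  # no regex at all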
What is missing is any idea of what the 'patterns' are that you are searching
for. Regular expressions are very sensitive to how you specify the pattern.
You indicated that you have up to 500 elements in the pattern vector, so what
does it look like? Alternation and backtracking can be very expensive.
How's that not reproducible?
1. Data frame, one column with text strings
2. Size of data frame = 4 million observations
3. A bunch of gsubs in a row ( gsub(patternvector,
"[token]", dataframe$text_column) )
4. General question: How to speed up string operations on 'large' data sets?
Please let m
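For the archive, the kind of self-contained example being asked for might look like this (all data and sizes invented, scaled down so it runs quickly):

set.seed(1)
dataframe <- data.frame(
  text_column = paste("AAPL", sample(letters, 1e5, replace = TRUE), "EBAY"),
  stringsAsFactors = FALSE
)
patternvector <- c("AAPL", "EBAY")   # stand-in for a 500-element vector

for (p in patternvector) {
  dataframe$text_column <- gsub(p, "[token]", dataframe$text_column, fixed = TRUE)
}
head(dataframe$text_column)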
It is not reproducible [1] because I cannot run your (representative) example.
The type of regex pattern, token, and even the character of the data you are
searching can affect possible optimizations. Note that a non-memory-resident
tool such as sed or perl may be an appropriate tool for a problem like this.
Example not reproducible. Communication fail. Please refer to Posting Guide.
---
Jeff Newmiller
Hi R'lers,
I'm running into speed issues performing a bunch of
gsub(patternvector, [token], dataframe$text_column)
calls on a data frame containing >4 million entries.
(The pattern vectors contain up to 500 elements.)
Is there any better/faster way than performing like 20 gsub commands in a row?
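A note for anyone reading this in the archive: gsub() only uses a single pattern, so passing a whole patternvector as written above matches just its first element (with a warning). In practice that usually means looping over the vector or collapsing it into one pattern, e.g. (toy values):

patternvector <- c("AAPL", "EBAY")
x <- "AAPL and EBAY"

gsub(patternvector, "[token]", x)                          # warning: only "AAPL" is used
gsub(paste(patternvector, collapse = "|"), "[token]", x)   # both symbols replaced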