> Please, consider that some SKUs have "-" in the middle, for example: "PG-9021".
Then you need to include these in the list of patterns you gave. Try
it again -- this time with a **complete** list.

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Aug 27, 2017 at 10:01 PM, Omar André Gonzáles Díaz
<oma.gonza...@gmail.com> wrote:

> Hi Bert,
>
> I would say that the delimiter is "blank"; every other row with "-" as
> delimiter should be ignored. Please, consider that some SKUs have "-"
> in the middle, for example: "PG-9021".
>
> As for the <end of character string>, it's now corrected. There
> shouldn't be any case of this (if there are, just ignore them).
>
> I've tried to apply different gsub operations to capture different
> cases, for example:
>
> ecommerce$sku <-
>   gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
>        ecommerce$producto)
>
> ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)",
>                       "\\2", ecommerce$sku)
>
> ecommerce$sku <-
>   gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{1}[a-zA-Z]{1})(.*)", "\\2",
>        ecommerce$sku)
>
> ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{2})(.*)",
>                       "\\2", ecommerce$sku)
>
> ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{3,4})(.*)", "\\2",
>                       ecommerce$sku)
>
> I don't know if that is the best approach, but I couldn't capture the
> case in the initial question. And as I've said, the important thing is
> to capture as many SKUs as possible.
>
> Thank you for your time, Sir.
>
> 2017-08-27 18:01 GMT-05:00 Bert Gunter <bgunter.4...@gmail.com>:
> > Omar:
> >
> > I don't think this can work. For example number-letter patterns 4),
> > 5), and 6) would all be matched by pattern 6).
> >
> > As Jeff indicated, you need to provide the delimiters -- what
> > characters come before and after the SKU patterns -- to be able to
> > recognize them.
> > In a quick look at the text file you attached, the
> > delimiters appeared to be either "-" or " " (blank) and perhaps <end
> > of character string>. If that is correct or if you can tell us how to
> > make it correct, then it's straightforward to proceed. Otherwise, I am
> > unable to help. Maybe someone else can.
> >
> > Cheers,
> > Bert
> >
> > On Sun, Aug 27, 2017 at 11:47 AM, Omar André Gonzáles Díaz
> > <oma.gonza...@gmail.com> wrote:
> >> Hi Jeff, Bert, thank you for your input.
> >>
> >> I'm attaching a sample of the data, feel free to explore it.
> >>
> >> As I said, I need to extract the SKUs of the products (a key that
> >> identifies every product). Not every product (row) has a SKU; in this
> >> case "no SKU" should be the output.
> >>
> >> I've identified these patterns so far:
> >>
> >> 1.- 75Q8C: 2 numbers, 1 letter, 1 number, 1 letter.
> >> 2.- OLED65E7P: 4 letters, 2 numbers, 1 letter, 1 number, 1 letter.
> >> 3.- MT48AF: 2 letters, 2 numbers, 2 letters.
> >> 4.- LH5000: 2 letters, 4 numbers.
> >> 5.- B8500: 1 letter, 4 numbers.
> >> 6.- E310: 1 letter, 3 numbers.
> >> 7.- X541UJ: 1 letter, 3 numbers, 2 letters.
> >>
> >> I think those cover the majority of SKUs. So I would appreciate some
> >> guidance on how to extract all those different patterns.
> >>
> >> Related, but not the question asked: the idea is that after extracting
> >> the SKUs, there should be SKUs repeated across the different e-commerce
> >> sites. Those SKUs would permit us to compare the products and their
> >> prices.
> >>
> >> Thank you in advance.
> >>
> >> 2017-08-27 12:10 GMT-05:00 Bert Gunter <bgunter.4...@gmail.com>:
> >>> You may have to provide us more detail on **exactly** the sorts of
> >>> patterns you wish to "capture" -- including exactly what you mean by
> >>> "capture" (what value do you wish to return?)
> >>> -- as the "obvious"
> >>> answer is probably not sufficient:
> >>>
> >>> ## using your example -- thank you
> >>>
> >>> > gsub(".*(49MU6300|LE32S5970).*","\\1",ecommerce[[2]])
> >>> [1] "49MU6300"  "LE32S5970"
> >>>
> >>> Cheers,
> >>> Bert
> >>> Bert Gunter
> >>>
> >>> "The trouble with having an open mind is that people keep coming along
> >>> and sticking things into it."
> >>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>
> >>> On Sun, Aug 27, 2017 at 9:18 AM, Omar André Gonzáles Díaz
> >>> <oma.gonza...@gmail.com> wrote:
> >>>> Hello, I need some help with regex.
> >>>>
> >>>> I have these two sentences. I need to extract both "49MU6300" and
> >>>> "LE32S5970" and put them in a new column "SKU".
> >>>>
> >>>> A) SMART TV UHD 49'' CURVO 49MU6300
> >>>> B) SMART TV HD 32'' LE32S5970
> >>>>
> >>>> DataFrame for testing:
> >>>>
> >>>> ecommerce <- data.frame(a = c(1,2),
> >>>>                         producto = c("SMART TV UHD 49'' CURVO 49MU6300",
> >>>>                                      "SMART TV HD 32'' LE32S5970"))
> >>>>
> >>>> I'm using gsub like this:
> >>>>
> >>>> 1.- This would capture A as intended but only "32S5970" from B
> >>>> (missing "LE").
> >>>>
> >>>> ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)",
> >>>>                       "\\2", ecommerce$producto)
> >>>>
> >>>> 2.- This would capture "LE32S5970" but not "49MU6300".
> >>>>
> >>>> ecommerce$sku <-
> >>>>   gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
> >>>>        ecommerce$producto)
> >>>>
> >>>> 3.- If I make the first 2 letters optional with:
> >>>>
> >>>> ecommerce$sku <-
> >>>>   gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
> >>>>        ecommerce$producto)
> >>>>
> >>>> "49MU6300" is captured, but again only "32S5970" from B (missing "LE").
> >>>>
> >>>> What should I do? How would you approach it?
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
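
For readers following the thread, here is a minimal sketch of the delimiter-based approach Bert describes, assuming (as Omar states) that a SKU is separated from the rest of the description by blanks. The names extract_sku, sku_patterns and sku_regex, and the third test row, are illustrative assumptions and not code from the thread. The pattern list is assembled from the seven patterns and the examples quoted above, so it is almost certainly incomplete for the real data. Each blank-separated token is tested against the whole alternation, anchored with ^ and $, so a short pattern such as the one for E310 cannot match inside LH5000 or B8500.

```r
# Sketch only: the pattern list is an assumption built from the examples
# in the thread (upper-case letters assumed); extend it for the real data.
sku_patterns <- c(
  "[0-9]{2}[A-Z][0-9][A-Z]",          # 75Q8C
  "[A-Z]{4}[0-9]{2}[A-Z][0-9][A-Z]",  # OLED65E7P
  "[A-Z]{2}[0-9]{2}[A-Z]{2}",         # MT48AF
  "[A-Z]{2}[0-9]{4}",                 # LH5000
  "[A-Z][0-9]{4}",                    # B8500
  "[A-Z][0-9]{3}",                    # E310
  "[A-Z][0-9]{3}[A-Z]{2}",            # X541UJ
  "[0-9]{2}[A-Z]{2}[0-9]{4}",         # 49MU6300
  "[A-Z]{2}[0-9]{2}[A-Z][0-9]{4}",    # LE32S5970
  "[A-Z]{2}-[0-9]{4}"                 # PG-9021 (hyphen inside the SKU)
)

# One anchored alternation: a token must match a pattern in full.
sku_regex <- paste0("^(", paste(sku_patterns, collapse = "|"), ")$")

# Hypothetical helper: returns the first token that looks like a SKU,
# or "no SKU" when no token matches.
extract_sku <- function(x) {
  tokens <- strsplit(as.character(x), " +")  # blank is the delimiter
  vapply(tokens, function(tok) {
    hit <- tok[grepl(sku_regex, tok)]
    if (length(hit) > 0) hit[1] else "no SKU"
  }, character(1))
}

ecommerce <- data.frame(
  a = c(1, 2, 3),
  producto = c("SMART TV UHD 49'' CURVO 49MU6300",
               "SMART TV HD 32'' LE32S5970",
               "SMART TV HD 32''"),   # made-up row without a SKU
  stringsAsFactors = FALSE
)
ecommerce$sku <- extract_sku(ecommerce$producto)
```

Splitting on blanks first sidesteps the greedy "(.*)" prefix in the gsub attempts, which can swallow the leading letters of a SKU before the capture group gets a chance to match them. If a description can contain more than one SKU-like token, this sketch keeps only the first.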