Does this do it for you (or get you closer):

gsub("\\[.*\\]|[\\\\] |/ ","",tmp$Text)
[1] "Я досяг того, чого хотів"
[2] "Мені вдалося\nзробити бажане"
[3] "Я досяг (досягла) того, чого хотів (хотіла)"
[4] "Я\nдосяг(-ла) речей, яких хотілося досягти"
[5] "Я досяг/ла того, чого\nхотів/ла"
[6] "Я досяг\\досягла того, чого прагнув\\прагнула"
[7] "Я\nдосягнув(ла) того, чого хотів(ла)"
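For what it's worth, the pattern in the gsub() call above looks for square brackets, and for a backslash or slash followed directly by a space, none of which occur in the sample, so the strings come back unchanged. Below is a minimal sketch of one pattern that follows the three rules in the question (round brackets, slash up to the next space, backslash up to the next space). The names `txt` and `strip_feminine` are my own and not from the thread, and the sample vector is retyped from the question's dput() on the assumption that the embedded line breaks are e-mail wrapping rather than data.

```r
# Sample strings from the question, retyped without the e-mail line wraps
# (assumption: the "\n" in the output above comes from wrapping, not the data).
txt <- c("Я досяг того, чого хотів",
         "Мені вдалося зробити бажане",
         "Я досяг (досягла) того, чого хотів (хотіла)",
         "Я досяг(-ла) речей, яких хотілося досягти",
         "Я досяг/ла того, чого хотів/ла",
         "Я досяг\\досягла того, чого прагнув\\прагнула",
         "Я досягнув(ла) того, чого хотів(ла)")

# Hypothetical helper (not from the thread): drop the feminine alternatives.
strip_feminine <- function(x) {
  pattern <- paste0(
    "[[:space:]]*\\([^)]*\\)",  # optional whitespace, then "(" ... ")"
    "|",
    "[/\\\\][^[:space:]]*"      # "/" or "\" and everything up to the next whitespace
  )
  gsub(pattern, "", x)
}

cleaned <- strip_feminine(txt)
print(cleaned)
```

Two design notes. `[^)]*` is used instead of `.*` so that a line with two bracketed forms loses each one separately rather than everything between the first "(" and the last ")". The slash and backslash branch also removes any punctuation glued to the suffix (a trailing full stop, say), which matches "anything up to the next space" but may be more than is wanted. As for the error in the quoted str_replace() attempt: "\\(.*\\)" matches literal parentheses, so the pattern has no capture group for "\\1" to refer to; replacing with "" avoids that, and the same pattern should also work in str_replace_all().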

On Tue, Jun 27, 2023 at 10:16 AM Chris Evans via R-help <r-help@r-project.org> wrote:

> I am sure this is easy for people who are good at regexps but I'm
> failing with it. The situation is that I have hundreds of lines of
> Ukrainian translations of some English. They contain things like this:
>
> 1 "Я досяг того, чого хотів"
> 2 "Мені вдалося зробити бажане"
> 3 "Я досяг (досягла) того, чого хотів (хотіла)"
> 4 "Я досяг(-ла) речей, яких хотілося досягти"
> 5 "Я досяг/ла того, чого хотів/ла"
> 6 "Я досяг\\досягла того, чого прагнув\\прагнула."
> 7 "Я досягнув(ла) того, чого хотів(ла)"
>
> Using dput():
>
> tmp <- structure(list(Text = c("Я досяг того, чого хотів",
>     "Мені вдалося зробити бажане",
>     "Я досяг (досягла) того, чого хотів (хотіла)",
>     "Я досяг(-ла) речей, яких хотілося досягти",
>     "Я досяг/ла того, чого хотів/ла",
>     "Я досяг\\досягла того, чого прагнув\\прагнула",
>     "Я досягнув(ла) того, чого хотів(ла)")),
>     row.names = c(NA, -7L),
>     class = c("tbl_df", "tbl", "data.frame"))
>
> Those show four different ways translators have handled gendered words:
>
> 1) Ignore them and (I'm guessing) only give the masculine
> 2) Give the feminine form of the word (or just the feminine suffix)
>    in brackets
> 3) Give the feminine form/suffix prefixed by a forward slash
> 4) Give the feminine form/suffix prefixed by backslash (here a double
>    backslash)
>
> I would like just to drop all these feminine gendered options. (Don't
> worry, they'll get back in later.) So I would like to replace
>
> 1) anything between brackets with nothing!
> 2) anything between a forward slash and the next space with nothing
> 3) anything between a backslash and the next space with nothing
>
> but preserving the rest of the text.
>
> I have been trying to achieve this using str_replace_all() but I am
> failing utterly. Here's a silly little example of my failures.
> This was just trying to get the text I wanted to replace (as I was
> trying to simplify the issues for my tired wetware):
>
> tmp %>%
>   as_tibble() %>%
>   rename(Text = value) %>%
>   mutate(Text = str_replace_all(Text, fixed("."), "")) %>%
>   filter(row_number() < 4) %>%
>   mutate(Text2 = str_replace(Text, "\\(.*\\)", "\\1"))
> Error in `mutate()`:
> ℹ In argument: `Text2 = str_replace(Text, "\\(.*\\)", "\\1")`.
> Caused by error in `stri_replace_first_regex()`:
> ! Trying to access the index that is out of bounds. (U_INDEX_OUTOFBOUNDS_ERROR)
> Run `rlang::last_trace()` to see where the error occurred.
>
> I have tried gurgling around the internet but am striking out so
> throwing myself on the list. Apologies if this is trivial but I'd hate
> to have to clean these hundreds of lines by hand though it's starting
> to look as if I'd achieve that faster by hand than I will by banging
> my ignorance of R regexp syntax on the problem.
>
> TIA,
>
> Chris
>
> --
> Chris Evans (he/him)
> Visiting Professor, UDLA, Quito, Ecuador & Honorary Professor,
> University of Roehampton, London, UK.
> Work web site: https://www.psyctc.org/psyctc/
> CORE site: http://www.coresystemtrust.org.uk/
> Personal site: https://www.psyctc.org/pelerinage2016/
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.