For context:

> In gsub(b, "\\1<\\2> ", a) the work is done by the backreferences \\1 and
> \\2. The expression says:
> Substitute ALL of the match with the first captured expression, then " <",
> then the second captured expression, then "> ". The rest of the line is not
> substituted and appears as-is.
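That substitution can be checked directly at the console. A minimal sketch, run on one illustrative line in the same format as the transcript (not the real file):

```r
# One line in the video-chat format (illustrative, not from the real file)
x <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"
b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"

# The whole match -- "2016-03-20 19:29:37 *** Jane Doe" -- is replaced by
# backreference 1 ("2016-03-20 19:29:37 "), then "<", then backreference 2
# ("Jane Doe"), then "> ". The "[*]{3}" part is matched (and therefore
# consumed) but NOT captured, because it is not inside parentheses. The
# unmatched tail, "started a video chat", is left as-is.
gsub(b, "\\1<\\2> ", x)
# [1] "2016-03-20 19:29:37 <Jane Doe> started a video chat"
```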
Back to me: I guess what's giving me trouble is where to draw the line in terms of the end or edge of the expression. Given the code, then,

> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8")
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1<\\2> ", a)

to me, it would seem as though ^([0-9-]{10} [0-9:]{8} ) is the first captured expression, that is, as though \\1 refers back to it, since [0-9-]{10} [0-9:]{8} is enclosed in parentheses. Then it would seem as though [*]{3} is the second expression, and (\\w+ \\w+) is the third. According to this (admittedly wrong) logic, it would seem as though the <> would go around the date, like

2016-03-20 <19:29:37> *** Jane Doe started a video chat

The backreferences here recall David's code earlier:

> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)

There, commas were put around everything, and there you can see the edge of each expression very well: ^(.{10}) = first, (.{8}) = second, (<.+>) = third, (.+$) = fourth. So, by the same logic, it would seem as though in

> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"

^([0-9-]{10} [0-9:]{8} ) is first, [*]{3} is second, and (\\w+ \\w+) is third. But if Boris is to be right, and he is, obviously, then it would have to be the case that this entire thing, namely ^([0-9-]{10} [0-9:]{8} )[*]{3}, is the first expression, since only if that were true would the <> be able to go around the names, as in

[3] "2016-01-27 09:15:20 <Jane Doe> Hey "

Again, 2016-01-27 09:15:20 would have to be an entire unit, an expression. So I guess what I don't understand is how ^([0-9-]{10} [0-9:]{8} )[*]{3} can be an entire expression, although my hunch would be that it has something to do with the ^ or with the space after the } and before the (, as in

> {3} (\\w+

Back to earlier:

> The rest of the line is not substituted and appears as-is.

Is that due to the space after the \\2?
That is, in "\\1<\\2> "? Notice the space after the > and before the closing ". Michael On Sun, May 19, 2019 at 2:31 PM Boris Steipe <boris.ste...@utoronto.ca> wrote: > > Inline ... > > > On 2019-05-19, at 13:56, Michael Boulineau <michael.p.boulin...@gmail.com> > > wrote: > > > >> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" > > > > so the ^ signals that the regex BEGINS with a number (that could be > > any number, 0-9) that is only 10 characters long (then there's the > > dash in there, too, with the 0-9-, which I assume enabled the regex to > > grab the - that's between the numbers in the date) > > That's right. Note that within a "character class" the hyphen can have two > meanings: normally it defines a range of characters, but if it appears as the > last character before "]" it is a literal hyphen. > > > , followed by a > > single space, followed by a unit that could be any number, again, but > > that is only 8 characters long this time. For that one, it will > > include the colon, hence the 9:, although for that one ([0-9:]{8} ), > > Right. > > > > I > > don't get why the space is on the inside in that one, after the {8}, > > The space needs to be preserved between the time and the name. I wrote > b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" # space in the first > captured expression > c <- gsub(b, "\\1<\\2> ", a) > ... but I could have written > b <- "^([0-9-]{10} [0-9:]{8})[*]{3} (\\w+ \\w+)" > c <- gsub(b, "\\1 <\\2> ", a) # space in the substituted string > ... same result > > > > whereas the space is on the outside with the other one ^([0-9-]{10} , > > directly after the {10}. Why is that? > > In the second case, I capture without a space, because I don't want the space > in the results, after the time. > > > > > > Then three *** [*]{3}, then the (\\w+ \\w+)", which Boris explained so > > well above. I guess I still don't get why this one seemed to have > > deleted the *** out of the mix, plus I still don't see why it didn't > > remove the *** from the first one.
> > Because the entire first line was not matched since it had a malformed > character preceding the date. > > > > > 2016-03-20 19:29:37 *** Jane Doe started a video chat > > 2016-03-20 19:30:35 *** John Doe ended a video chat > > 2016-04-02 12:59:36 *** Jane Doe started a video chat > > 2016-04-02 13:00:43 *** John Doe ended a video chat > > 2016-04-02 13:01:08 *** Jane Doe started a video chat > > 2016-04-02 13:01:41 *** John Doe ended a video chat > > 2016-04-02 13:03:51 *** John Doe started a video chat > > 2016-04-02 13:06:35 *** John Doe ended a video chat > > > > This is a random sample from the beginning of the txt file with no > > edits. The ***s were deleted, all but the first one, the one that had > > the  but that was taken out by the encoding = "UTF-8". I know that > > the function was c <- gsub(b, "\\1<\\2> ", a), so it had a gsub () on > > there, the point of which is to do substitution work. > > > > Oh, I get it, I think. The \\1<\\2> in the gsub () puts the <> around > > the names, so that it's consistent with the rest of the data, so that > > the names in the text about that aren't enclosed in the <> are > > enclosed like the rest of them. But I still don't get why or how the > > gsub () replaced the *** with the <>... > > In gsub(b, "\\1<\\2> ", a) the work is done by the backreferences \\1 and > \\2. The expression says: > Substitute ALL of the match with the first captured expression, then " <", > then the second captured expression, then "> ". The rest of the line is not > substituted and appears as-is. > > > > > > This one is more straightforward. > > > >> d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$" > > > > any number with - for 10 characters, followed by a space. Oh, there's > > no space in this one ([0-9:]{8}), after the {8}. Hu. So, then, any > > number with : for 8 characters, followed by any two words separated by > > a space and enclosed in <>. And then the \\s* is followed by a single > > space? 
Or maybe it puts space on both sides (on the side of the #s to > > the left, and then the comment to the right). The (.+)$ is anything > > whatsoever until the end. > > \s is the metacharacter for "whitespace". \s* means zero or more whitespace. > I'm matching that OUTSIDE of the captured expression, to remove any leading > spaces from the data that goes into the data frame. > > > Cheers, > Boris > > > > > > > > Michael > > > > > > On Sun, May 19, 2019 at 4:37 AM Boris Steipe <boris.ste...@utoronto.ca> > > wrote: > >> > >> Inline > >> > >> > >> > >>> On 2019-05-18, at 20:34, Michael Boulineau > >>> <michael.p.boulin...@gmail.com> wrote: > >>> > >>> It appears to have worked, although there were three little quirks. > >>> The ; close(con); rm(con) didn't work for me; the first row of the > >>> data.frame was all NAs, when all was said and done; > >> > >> You will get NAs for lines that can't be matched to the regular > >> expression. That's a good thing, it allows you to test whether your > >> assumptions were valid for the entire file: > >> > >> # number of failed strcapture() > >> sum(is.na(e$date)) > >> > >> > >>> and then there > >>> were still three *** on the same line where the  was apparently > >>> deleted. > >> > >> This is a sign that something else happened with the line that prevented > >> the regex from matching. In that case you need to investigate more. I see > >> an invalid multibyte character at the beginning of the line you posted > >> below. > >> > >>> > >>>> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8") > >>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" > >>>> c <- gsub(b, "\\1<\\2> ", a) > >>>> head (c) > >>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat" > >>> [2] "2016-01-27 09:15:20 <Jane Doe> > >>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf" > >> > >> [...]
> >> > >>> But, before I do anything else, I'm going to study the regex in this > >>> particular code. For example, I'm still not sure why there has to the > >>> second \\w+ in the (\\w+ \\w+). Little things like that. > >> > >> \w is the metacharacter for alphanumeric characters, \w+ designates > >> something we could call a word. Thus \w+ \w+ are two words separated by a > >> single blank. This corresponds to your example, but, as I wrote > >> previously, you need to think very carefully whether this covers all > >> possible cases (Could there be only one word? More than one blank? Could > >> letters be separated by hyphens or periods?) In most cases we could have > >> more robustly matched everything between "<" and ">" (taking care to test > >> what happens if the message contains those characters). But for the video > >> chat lines we need to make an assumption about what is name and what is > >> not. If "started a video chat" is the only possibility in such lines, you > >> can use this information instead. If there are other possibilities, you > >> need a different strategy. In NLP there is no one-approach-fits-all. > >> > >> To validate the structure of the names in your transcripts, you can look at > >> > >> patt <- " <.+?> " # " <any string, not greedy> " > >> m <- regexpr(patt, c) > >> unique(regmatches(c, m)) > >> > >> > >> > >> B. 
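On a few illustrative lines in the transcript format (not the real file), Boris's validation snippet looks like this:

```r
# Illustrative lines in the transcript format
c <- c("2016-01-27 09:15:20 <Jane Doe> Hey ",
       "2016-01-27 09:15:22 <John Doe> ended a video chat",
       "2016-01-27 21:07:11 <Jane Doe> started a video chat")

patt <- " <.+?> "          # " <any string, not greedy> "
m <- regexpr(patt, c)      # position/length of the first match in each line
unique(regmatches(c, m))   # the distinct name tags actually present
# [1] " <Jane Doe> " " <John Doe> "
```

Anything unexpected in that output (one-word names, hyphens, extra blanks) tells you the "\\w+ \\w+" assumption does not hold for the whole file.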
> >> > >> > >> > >>> > >>> Michael > >>> > >>> > >>> On Sat, May 18, 2019 at 4:30 PM Boris Steipe <boris.ste...@utoronto.ca> > >>> wrote: > >>>> > >>>> This works for me: > >>>> > >>>> # sample data > >>>> c <- character() > >>>> c[1] <- "2016-01-27 09:14:40 <Jane Doe> started a video chat" > >>>> c[2] <- "2016-01-27 09:15:20 <Jane Doe> > >>>> https://lh3.googleusercontent.com/" > >>>> c[3] <- "2016-01-27 09:15:20 <Jane Doe> Hey " > >>>> c[4] <- "2016-01-27 09:15:22 <John Doe> ended a video chat" > >>>> c[5] <- "2016-01-27 21:07:11 <Jane Doe> started a video chat" > >>>> c[6] <- "2016-01-27 21:26:57 <John Doe> ended a video chat" > >>>> > >>>> > >>>> # regex ^(year) (time) <(word word)>\\s*(string)$ > >>>> patt <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$" > >>>> proto <- data.frame(date = character(), > >>>> time = character(), > >>>> name = character(), > >>>> text = character(), > >>>> stringsAsFactors = TRUE) > >>>> d <- strcapture(patt, c, proto) > >>>> > >>>> > >>>> > >>>> date time name text > >>>> 1 2016-01-27 09:14:40 Jane Doe started a video chat > >>>> 2 2016-01-27 09:15:20 Jane Doe https://lh3.googleusercontent.com/ > >>>> 3 2016-01-27 09:15:20 Jane Doe Hey > >>>> 4 2016-01-27 09:15:22 John Doe ended a video chat > >>>> 5 2016-01-27 21:07:11 Jane Doe started a video chat > >>>> 6 2016-01-27 21:26:57 John Doe ended a video chat > >>>> > >>>> > >>>> > >>>> B. 
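A follow-up worth sketching, since it comes up repeatedly in this thread: lines that strcapture() cannot match come back as a row of NAs, which is exactly how you find them. Illustrative data below (the second line deliberately uses the *** form that this pattern does not match):

```r
c <- c("2016-01-27 09:15:20 <Jane Doe> Hey ",
       "2016-01-27 09:14:40 *** Jane Doe started a video chat")

patt  <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
proto <- data.frame(date = character(), time = character(),
                    name = character(), text = character(),
                    stringsAsFactors = FALSE)
d <- strcapture(patt, c, proto)

bad <- is.na(d$date)   # TRUE for rows the pattern failed to match
sum(bad)               # how many lines failed
c[bad]                 # inspect them before fixing the regex or the data
```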
> >>>> > >>>> > >>>>> On 2019-05-18, at 18:32, Michael Boulineau > >>>>> <michael.p.boulin...@gmail.com> wrote: > >>>>> > >>>>> Going back and thinking through what Boris and William were saying > >>>>> (also Ivan), I tried this: > >>>>> > >>>>> a <- readLines ("hangouts-conversation-6.csv.txt") > >>>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" > >>>>> c <- gsub(b, "\\1<\\2> ", a) > >>>>>> head (c) > >>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat" > >>>>> [2] "2016-01-27 09:15:20 <Jane Doe> > >>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf" > >>>>> [3] "2016-01-27 09:15:20 <Jane Doe> Hey " > >>>>> [4] "2016-01-27 09:15:22 <John Doe> ended a video chat" > >>>>> [5] "2016-01-27 21:07:11 <Jane Doe> started a video chat" > >>>>> [6] "2016-01-27 21:26:57 <John Doe> ended a video chat" > >>>>> > >>>>> The  is still there, since I forgot to do what Ivan had suggested, > >>>>> namely, > >>>>> > >>>>> a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding > >>>>> = "UTF-8")); close(con); rm(con) > >>>>> > >>>>> But then the new code is still turning out only NAs when I apply > >>>>> strcapture (). This was what happened next: > >>>>> > >>>>>> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} > >>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", > >>>>> + c, proto=data.frame(stringsAsFactors=FALSE, When="", > >>>>> Who="", > >>>>> + What="")) > >>>>>> head (d) > >>>>> When Who What > >>>>> 1 <NA> <NA> <NA> > >>>>> 2 <NA> <NA> <NA> > >>>>> 3 <NA> <NA> <NA> > >>>>> 4 <NA> <NA> <NA> > >>>>> 5 <NA> <NA> <NA> > >>>>> 6 <NA> <NA> <NA> > >>>>> > >>>>> I've been reading up on regular expressions, too, so this code seems > >>>>> spot on. What's going wrong? 
> >>>>> > >>>>> Michael > >>>>> > >>>>> On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.ste...@utoronto.ca> > >>>>> wrote: > >>>>>> > >>>>>> Don't start putting in extra commas and then reading this as csv. That > >>>>>> approach is broken. The correct approach is what Bill outlined: read > >>>>>> everything with readLines(), and then use a proper regular expression > >>>>>> with strcapture(). > >>>>>> > >>>>>> You need to pre-process the object that readLines() gives you: replace > >>>>>> the contents of the videochat lines, and make it conform to the format > >>>>>> of the other lines before you process it into your data frame. > >>>>>> > >>>>>> Approximately something like > >>>>>> > >>>>>> # read the raw data > >>>>>> tmp <- readLines("hangouts-conversation-6.csv.txt") > >>>>>> > >>>>>> # process all video chat lines > >>>>>> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) " # (year time > >>>>>> )*** (word word) > >>>>>> tmp <- gsub(patt, "\\1<\\2> ", tmp) > >>>>>> > >>>>>> # next, use strcapture() > >>>>>> > >>>>>> Note that this makes the assumption that your names are always exactly > >>>>>> two words containing only letters. If that assumption is not true, > >>>>>> more thought needs to go into the regex. But you can test that: > >>>>>> > >>>>>> patt <- " <\\w+ \\w+> " #" <word word> " > >>>>>> sum( ! grepl(patt, tmp)) > >>>>>> > >>>>>> ... will give the number of lines that remain in your file that do not > >>>>>> have a tag that can be interpreted as "Who" > >>>>>> > >>>>>> Once that is fine, use Bill's approach - or a regular expression of > >>>>>> your own design - to create your data frame. > >>>>>> > >>>>>> Hope this helps, > >>>>>> Boris > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>>> On 2019-05-17, at 16:18, Michael Boulineau > >>>>>>> <michael.p.boulin...@gmail.com> wrote: > >>>>>>> > >>>>>>> Very interesting. I'm sure I'll be trying to get rid of the byte order > >>>>>>> mark eventually.
But right now, I'm more worried about getting the > >>>>>>> character vector into either a csv file or data.frame; that way, I can > >>>>>>> be able to work with the data neatly tabulated into four columns: > >>>>>>> date, time, person, comment. I assume it's a write.csv function, but I > >>>>>>> don't know what arguments to put in it. header=FALSE? fill=T? > >>>>>>> > >>>>>>> Micheal > >>>>>>> > >>>>>>> On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller > >>>>>>> <jdnew...@dcn.davis.ca.us> wrote: > >>>>>>>> > >>>>>>>> If byte order mark is the issue then you can specify the file > >>>>>>>> encoding as "UTF-8-BOM" and it won't show up in your data any more. > >>>>>>>> > >>>>>>>> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help > >>>>>>>> <r-help@r-project.org> wrote: > >>>>>>>>> The pattern I gave worked for the lines that you originally showed > >>>>>>>>> from > >>>>>>>>> the > >>>>>>>>> data file ('a'), before you put commas into them. If the name is > >>>>>>>>> either of > >>>>>>>>> the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed > >>>>>>>>> so > >>>>>>>>> something like "(<[^>]*>|[*]{3})". > >>>>>>>>> > >>>>>>>>> The " " at the start of the imported data may come from the byte > >>>>>>>>> order > >>>>>>>>> mark that Windows apps like to put at the front of a text file in > >>>>>>>>> UTF-8 > >>>>>>>>> or > >>>>>>>>> UTF-16 format. 
> >>>>>>>>> > >>>>>>>>> Bill Dunlap > >>>>>>>>> TIBCO Software > >>>>>>>>> wdunlap tibco.com > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau < > >>>>>>>>> michael.p.boulin...@gmail.com> wrote: > >>>>>>>>> > >>>>>>>>>> This seemed to work: > >>>>>>>>>> > >>>>>>>>>>> a <- readLines ("hangouts-conversation-6.csv.txt") > >>>>>>>>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a) > >>>>>>>>>>> b [1:84] > >>>>>>>>>> > >>>>>>>>>> And the first 85 lines looks like this: > >>>>>>>>>> > >>>>>>>>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat" > >>>>>>>>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat" > >>>>>>>>>> > >>>>>>>>>> Then they transition to the commas: > >>>>>>>>>> > >>>>>>>>>>> b [84:100] > >>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" > >>>>>>>>>> [2] "2016-07-01,02:50:35,<John Doe>,hey" > >>>>>>>>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh" > >>>>>>>>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo" > >>>>>>>>>> > >>>>>>>>>> Even the strange bit on line 6347 was caught by this: > >>>>>>>>>> > >>>>>>>>>>> b [6346:6348] > >>>>>>>>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe" > >>>>>>>>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242" > >>>>>>>>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion" > >>>>>>>>>> > >>>>>>>>>> Perhaps most awesomely, the code catches spaces that are interposed > >>>>>>>>>> into the comment itself: > >>>>>>>>>> > >>>>>>>>>>> b [4] > >>>>>>>>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey " > >>>>>>>>>>> b [85] > >>>>>>>>>> [1] "2016-07-01,02:50:35,<John Doe>,hey" > >>>>>>>>>> > >>>>>>>>>> Notice whether there is a space after the "hey" or not. 
> >>>>>>>>>> > >>>>>>>>>> These are the first two lines: > >>>>>>>>>> > >>>>>>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat" > >>>>>>>>>> [2] "2016-01-27,09:15:20,<Jane > >>>>>>>>>> Doe>, > >>>>>>>>>> > >>>>>>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf > >>>>>>>>>> " > >>>>>>>>>> > >>>>>>>>>> So, who knows what happened with the  at the beginning of [1] > >>>>>>>>>> directly above. But notice how there are no commas in [1] but there > >>>>>>>>>> appear in [2]. I don't see why really long ones like [2] directly > >>>>>>>>>> above would be a problem, were they to be translated into a csv or > >>>>>>>>>> data frame column. > >>>>>>>>>> > >>>>>>>>>> Now, with the commas in there, couldn't we write this into a csv > >>>>>>>>>> or a > >>>>>>>>>> data.frame? Some of this data will end up being garbage, I imagine. > >>>>>>>>>> Like in [2] directly above. Or with [83] and [84] at the top of > >>>>>>>>>> this > >>>>>>>>>> discussion post/email. Embarrassingly, I've been trying to convert > >>>>>>>>>> this into a data.frame or csv but I can't manage to. I've been > >>>>>>>>>> using > >>>>>>>>>> the write.csv function, but I don't think I've been getting the > >>>>>>>>>> arguments correct. > >>>>>>>>>> > >>>>>>>>>> At the end of the day, I would like a data.frame and/or csv with > >>>>>>>>>> the > >>>>>>>>>> following four columns: date, time, person, comment. 
> >>>>>>>>>> > >>>>>>>>>> I tried this, too: > >>>>>>>>>> > >>>>>>>>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} > >>>>>>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", > >>>>>>>>>> + a, proto=data.frame(stringsAsFactors=FALSE, > >>>>>>>>> When="", > >>>>>>>>>> Who="", > >>>>>>>>>> + What="")) > >>>>>>>>>> > >>>>>>>>>> But all I got was this: > >>>>>>>>>> > >>>>>>>>>>> c [1:100, ] > >>>>>>>>>> When Who What > >>>>>>>>>> 1 <NA> <NA> <NA> > >>>>>>>>>> 2 <NA> <NA> <NA> > >>>>>>>>>> 3 <NA> <NA> <NA> > >>>>>>>>>> 4 <NA> <NA> <NA> > >>>>>>>>>> 5 <NA> <NA> <NA> > >>>>>>>>>> 6 <NA> <NA> <NA> > >>>>>>>>>> > >>>>>>>>>> It seems to have caught nothing. > >>>>>>>>>> > >>>>>>>>>>> unique (c) > >>>>>>>>>> When Who What > >>>>>>>>>> 1 <NA> <NA> <NA> > >>>>>>>>>> > >>>>>>>>>> But I like that it converted into columns. That's a really great > >>>>>>>>>> format. With a little tweaking, it'd be a great code for this data > >>>>>>>>>> set. > >>>>>>>>>> > >>>>>>>>>> Michael > >>>>>>>>>> > >>>>>>>>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help > >>>>>>>>>> <r-help@r-project.org> wrote: > >>>>>>>>>>> > >>>>>>>>>>> Consider using readLines() and strcapture() for reading such a > >>>>>>>>> file. 
> >>>>>>>>>> E.g., > >>>>>>>>>>> suppose readLines(files) produced a character vector like > >>>>>>>>>>> > >>>>>>>>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login", > >>>>>>>>>>> "2016-10-21 10:56:29 <John Doe> John_Doe", > >>>>>>>>>>> "2016-10-21 10:56:37 <John Doe> Admit#8242", > >>>>>>>>>>> "October 23, 1819 12:34 <Jane Eyre> I am not an angel") > >>>>>>>>>>> > >>>>>>>>>>> Then you can make a data.frame with columns When, Who, and What by > >>>>>>>>>>> supplying a pattern containing three parenthesized capture > >>>>>>>>> expressions: > >>>>>>>>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} > >>>>>>>>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", > >>>>>>>>>>> x, proto=data.frame(stringsAsFactors=FALSE, When="", > >>>>>>>>> Who="", > >>>>>>>>>>> What="")) > >>>>>>>>>>>> str(z) > >>>>>>>>>>> 'data.frame': 4 obs. of 3 variables: > >>>>>>>>>>> $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" > >>>>>>>>> "2016-10-21 > >>>>>>>>>>> 10:56:37" NA > >>>>>>>>>>> $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA > >>>>>>>>>>> $ What: chr "What's your login" "John_Doe" "Admit#8242" NA > >>>>>>>>>>> > >>>>>>>>>>> Lines that don't match the pattern result in NA's - you might make > >>>>>>>>> a > >>>>>>>>>> second > >>>>>>>>>>> pass over the corresponding elements of x with a new pattern. > >>>>>>>>>>> > >>>>>>>>>>> You can convert the When column from character to time with > >>>>>>>>> as.POSIXct(). > >>>>>>>>>>> > >>>>>>>>>>> Bill Dunlap > >>>>>>>>>>> TIBCO Software > >>>>>>>>>>> wdunlap tibco.com > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius > >>>>>>>>> <dwinsem...@comcast.net> > >>>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote: > >>>>>>>>>>>>> OK. 
So, I named the object test > >>>>>>>>> and then checked the 6347th item > >>>>>>>>>>>>> > >>>>>>>>>>>>>> test <- readLines ("hangouts-conversation.txt") > >>>>>>>>>>>>>> test [6347] > >>>>>>>>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242" > >>>>>>>>>>>>> > >>>>>>>>>>>>> Perhaps where it was getting screwed up is, since the end of > >>>>>>>>> this is > >>>>>>>>>> a > >>>>>>>>>>>>> number (8242), then, given that there's no space between the > >>>>>>>>> number > >>>>>>>>>>>>> and what ought to be the next row, R didn't know where to draw > >>>>>>>>> the > >>>>>>>>>>>>> line. Sure enough, it looks like this when I go to the original > >>>>>>>>> file > >>>>>>>>>>>>> and control f "#8242" > >>>>>>>>>>>>> > >>>>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login > >>>>>>>>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe > >>>>>>>>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242 > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> An octothorpe is an end of line signifier and is interpreted as > >>>>>>>>>> allowing > >>>>>>>>>>>> comments. You can prevent that interpretation with suitable > >>>>>>>>> choice of > >>>>>>>>>>>> parameters to `read.table` or `read.csv`. I don't understand why > >>>>>>>>> that > >>>>>>>>>>>> should cause any error or a failure to match that pattern. > >>>>>>>>>>>> > >>>>>>>>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion > >>>>>>>>>>>>> > >>>>>>>>>>>>> Again, it doesn't look like that in the file. Gmail > >>>>>>>>> automatically > >>>>>>>>>>>>> formats it like that when I paste it in. More to the point, it > >>>>>>>>> looks > >>>>>>>>>>>>> like > >>>>>>>>>>>>> > >>>>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 > >>>>>>>>> 10:56:29 > >>>>>>>>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> > >>>>>>>>>> Admit#82422016-10-21 > >>>>>>>>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion > >>>>>>>>>>>>> > >>>>>>>>>>>>> Notice Admit#82422016. So there's that.
> >>>>>>>>>>>>> > >>>>>>>>>>>>> Then I built object test2. > >>>>>>>>>>>>> > >>>>>>>>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", > >>>>>>>>> test) > >>>>>>>>>>>>> > >>>>>>>>>>>>> This worked for 84 lines, then this happened. > >>>>>>>>>>>> > >>>>>>>>>>>> It may have done something but as you later discovered my first > >>>>>>>>> code > >>>>>>>>>> for > >>>>>>>>>>>> the pattern was incorrect. I had tested it (and pasted in the > >>>>>>>>> results > >>>>>>>>>> of > >>>>>>>>>>>> the test) . The way to refer to a capture class is with > >>>>>>>>> back-slashes > >>>>>>>>>>>> before the numbers, not forward-slashes. Try this: > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", > >>>>>>>>> "\\1,\\2,\\3,\\4", > >>>>>>>>>> chrvec) > >>>>>>>>>>>>> newvec > >>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey" > >>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh" > >>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo" > >>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, > >>>>>>>>> not > >>>>>>>>>> really" > >>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, > >>>>>>>>> didn't > >>>>>>>>>> sleep" > >>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or > >>>>>>>>> where I am > >>>>>>>>>>>> really" > >>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london" > >>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" > >>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good > >>>>>>>>> eay" > >>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>" > >>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>" > >>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little > >>>>>>>>> more > >>>>>>>>>>>> rigorous..." > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> I made note of the fact that the 10th and 11th lines had no > >>>>>>>>> commas. 
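The reason those two lines kept no commas can be sketched directly on the sample lines from above: the pattern demands a non-empty comment field after the name, and sub() returns non-matching lines unchanged.

```r
x <- c("2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay",
       "2016-07-01 02:58:56 <jone>")

# (.+$) requires at least one character after "<name> ", so the second
# line has no match and sub() hands it back untouched
sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", x)
# [1] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
# [2] "2016-07-01 02:58:56 <jone>"
```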
> >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> test2 [84] > >>>>>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" > >>>>>>>>>>>> > >>>>>>>>>>>> That line didn't have any "<" so wasn't matched. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> You could remove all non-matching lines for the pattern > >>>>>>>>>>>> > >>>>>>>>>>>> dates<space>times<space>"<"<name>">"<space><anything> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> with: > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)] > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Do read: > >>>>>>>>>>>> > >>>>>>>>>>>> ?read.csv > >>>>>>>>>>>> > >>>>>>>>>>>> ?regex > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> > >>>>>>>>>>>> David > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>>> test2 [85] > >>>>>>>>>>>>> [1] "//1,//2,//3,//4" > >>>>>>>>>>>>>> test [85] > >>>>>>>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey" > >>>>>>>>>>>>> > >>>>>>>>>>>>> Notice how I toggled back and forth between test and test2 > >>>>>>>>> there. So, > >>>>>>>>>>>>> whatever happened with the regex, it happened in the switch > >>>>>>>>> from 84 > >>>>>>>>>> to > >>>>>>>>>>>>> 85, I guess. It went on like > >>>>>>>>>>>>> > >>>>>>>>>>>>> [990] "//1,//2,//3,//4" > >>>>>>>>>>>>> [991] "//1,//2,//3,//4" > >>>>>>>>>>>>> [992] "//1,//2,//3,//4" > >>>>>>>>>>>>> [993] "//1,//2,//3,//4" > >>>>>>>>>>>>> [994] "//1,//2,//3,//4" > >>>>>>>>>>>>> [995] "//1,//2,//3,//4" > >>>>>>>>>>>>> [996] "//1,//2,//3,//4" > >>>>>>>>>>>>> [997] "//1,//2,//3,//4" > >>>>>>>>>>>>> [998] "//1,//2,//3,//4" > >>>>>>>>>>>>> [999] "//1,//2,//3,//4" > >>>>>>>>>>>>> [1000] "//1,//2,//3,//4" > >>>>>>>>>>>>> > >>>>>>>>>>>>> up until line 1000, then I reached max.print.
> >>>>>>>>>>>> > >>>>>>>>>>>>> Michael > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius < > >>>>>>>>>> dwinsem...@comcast.net> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote: > >>>>>>>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure and > >>>>>>>>> not do > >>>>>>>>>>>> that again. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I tried the read.fwf from the foreign package, with a code > >>>>>>>>> like > >>>>>>>>>> this: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> d <- read.fwf("hangouts-conversation.txt", > >>>>>>>>>>>>>>> widths= c(10,10,20,40), > >>>>>>>>>>>>>>> > >>>>>>>>> col.names=c("date","time","person","comment"), > >>>>>>>>>>>>>>> strip.white=TRUE) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> But it threw this error: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote = > >>>>>>>>> quote, > >>>>>>>>>> dec > >>>>>>>>>>>> = dec, : > >>>>>>>>>>>>>>> line 6347 did not have 4 elements > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> So what does line 6347 look like? (Use `readLines` and print > >>>>>>>>> it > >>>>>>>>>> out.) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Interestingly, though, the error only happened when I > >>>>>>>>> increased the > >>>>>>>>>>>>>>> width size. But I had to increase the size, or else I > >>>>>>>>> couldn't > >>>>>>>>>> "see" > >>>>>>>>>>>>>>> anything. The comment was so small that nothing was being > >>>>>>>>>> captured by > >>>>>>>>>>>>>>> the size of the column. so to speak. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> It seems like what's throwing me is that there's no comma > >>>>>>>>> that > >>>>>>>>>>>>>>> demarcates the end of the text proper. For example: > >>>>>>>>>>>>>> Not sure why you thought there should be a comma. Lines > >>>>>>>>> usually end > >>>>>>>>>>>>>> with <cr> and or a <lf>. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Once you have the raw text in a character vector from > >>>>>>>>> `readLines` > >>>>>>>>>> named, > >>>>>>>>>>>>>> say, 'chrvec', then you could selectively substitute commas > >>>>>>>>> for > >>>>>>>>>> spaces > >>>>>>>>>>>>>> with regex. (Now that you no longer desire to remove the dates > >>>>>>>>> and > >>>>>>>>>>>> times.) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> This will not do any replacements when the pattern is not > >>>>>>>>> matched. > >>>>>>>>>> See > >>>>>>>>>>>>>> this test: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", > >>>>>>>>> "\\1,\\2,\\3,\\4", > >>>>>>>>>>>> chrvec) > >>>>>>>>>>>>>>> newvec > >>>>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey" > >>>>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to > >>>>>>>>> Edinburgh" > >>>>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo" > >>>>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has > >>>>>>>>> happened, not > >>>>>>>>>>>> really" > >>>>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, > >>>>>>>>> didn't > >>>>>>>>>>>> sleep" > >>>>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or > >>>>>>>>> where > >>>>>>>>>> I am > >>>>>>>>>>>>>> really" > >>>>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london" > >>>>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" > >>>>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a > >>>>>>>>> good > >>>>>>>>>> eay" > >>>>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>" > >>>>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>" > >>>>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little > >>>>>>>>> more > >>>>>>>>>>>>>> rigorous..." > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> You should probably remove the "empty comment" lines. 
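A sketch of that removal, on illustrative lines from earlier in the thread; grepl() keeps only the lines that have all four fields, so the empty-comment lines drop out:

```r
chrvec <- c("2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay",
            "2016-07-01 02:58:56 <jone>",
            "2016-07-01 03:02:48 <john> British security is a little more rigorous...")

# keep only lines with date, time, <name>, and a non-empty comment
chrvec <- chrvec[grepl("^.{10} .{8} <.+> .+$", chrvec)]
length(chrvec)
# [1] 2
```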
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> -- > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> David. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a > >>>>>>>>>> starbucks2016-07-01 > >>>>>>>>>>>>>>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 > >>>>>>>>> <Jane > >>>>>>>>>>>>>>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> > >>>>>>>>> There was > >>>>>>>>>>>>>>> lots of Starbucks in my day2016-07-01 15:35:47 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> It was interesting, too, when I pasted the text into the > >>>>>>>>> email, it > >>>>>>>>>>>>>>> self-formatted into the way I wanted it to look. I had to > >>>>>>>>> manually > >>>>>>>>>>>>>>> make it look like it does above, since that's the way that it > >>>>>>>>>> looks in > >>>>>>>>>>>>>>> the txt file. I wonder if it's being organized by XML or > >>>>>>>>> something. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Anyways, There's always a space between the two sideways > >>>>>>>>> carrots, > >>>>>>>>>> just > >>>>>>>>>>>>>>> like there is right now: <John Doe> See. Space. And there's > >>>>>>>>> always > >>>>>>>>>> a > >>>>>>>>>>>>>>> space between the data and time. Like this. 2016-07-01 > >>>>>>>>> 15:34:30 > >>>>>>>>>> See. > >>>>>>>>>>>>>>> Space. But there's never a space between the end of the > >>>>>>>>> comment and > >>>>>>>>>>>>>>> the next date. Like this: We were in a starbucks2016-07-01 > >>>>>>>>> 15:35:02 > >>>>>>>>>>>>>>> See. starbucks and 2016 are smooshed together. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> This code is also on the table right now too. 
> > > a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
> > >                 quote="\"", comment.char="", fill=TRUE)
> > >
> > > h <- cbind(hangouts.conversation2[,1:2], hangouts.conversation2[,3:5],
> > >            hangouts.conversation2[,6:9])
> > >
> > > aa <- gsub("[^[:digit:]]", "", h)
> > > my.data.num <- as.numeric(str_extract(h, "[0-9]+"))  # str_extract needs library(stringr)
> > >
> > > Those last lines are a work in progress. I wish I could import a
> > > picture of what it looks like when it's translated into a data frame.
> > > The fill=TRUE helped to get the data into a table that kind of sort of
> > > works, but the comments keep bleeding into the date and time columns.
> > > It's like
> > >
> > > 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
> > > 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> > >
> > > And then, maybe, the "seriously" will be in a column all to itself, as
> > > will be the "I've" and the "never" etc.
> > >
> > > I will use a regular expression if I have to, but it would be nice to
> > > keep the dates and times on there. Originally, I thought they were
> > > meaningless, but I've since changed my mind on that count. The time of
> > > day isn't so important. But, especially since, say, Gmail itself knows
> > > how to quickly recognize what it is, I know it can be done. I know
> > > this data has structure to it.
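The "smooshed together" records described above can be re-split before any tabular parsing, which also avoids the column-bleeding that `read.table(..., fill=TRUE)` produces. This is only a sketch, under the assumption that every glued-on timestamp has the fixed form `YYYY-MM-DD HH:MM:SS`:

```r
# Insert a newline between a comment's last non-space character and a
# following timestamp, then split on the newlines. "\\1" in the
# replacement puts back the character the first group consumed.
x <- "We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting"
y <- gsub("(\\S)([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})",
          "\\1\n\\2", x)
strsplit(y, "\n")[[1]]
# [1] "We were in a starbucks"
# [2] "2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting"
```

Once each record sits on its own line, the fixed-width date/time prefix makes the `sub()` approach shown earlier in the thread straightforward.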
> > >
> > > Michael
> > >
> > > On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net> wrote:
> > > > On 5/15/19 4:07 PM, Michael Boulineau wrote:
> > > > > I have a wild and crazy text file, the head of which looks like this:
> > > > >
> > > > > 2016-07-01 02:50:35 <john> hey
> > > > > 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> > > > > 2016-07-01 02:51:45 <john> thinking about my boo
> > > > > 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> > > > > 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> > > > > 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> > > > > 2016-07-01 02:54:17 <john> just know it's london
> > > > > 2016-07-01 02:56:44 <jane> you are probably asleep
> > > > > 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> > > > > 2016-07-01 02:58:56 <jone>
> > > > > 2016-07-01 02:59:34 <jane>
> > > > > 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> > > >
> > > > Looks entirely not-"crazy". Typical log file format.
> > > >
> > > > Two possibilities: 1) use `read.fwf` (in the utils package); 2) use
> > > > regex (i.e. the sub function) to strip everything up to the "<". Read
> > > > `?regex`. Since "<" is not a metacharacter, you could use the pattern
> > > > ".+<" and replace with "".
> > > >
> > > > And do read the Posting Guide.
> > > > Cross-posting to StackOverflow and R-help, at least within hours of
> > > > each other, is considered poor manners.
> > > >
> > > > --
> > > >
> > > > David.
> > > >
> > > > > It goes on for a while. It's a big file. But I feel like it's going to
> > > > > be difficult to annotate with the coreNLP library or package. I'm
> > > > > doing natural language processing. In other words, I'm curious as to
> > > > > how I would shave off the dates, that is, to make it look like:
> > > > >
> > > > > <john> hey
> > > > > <jane> waiting for plane to Edinburgh
> > > > > <john> thinking about my boo
> > > > > <jane> nothing crappy has happened, not really
> > > > > <john> plane went by pretty fast, didn't sleep
> > > > > <jane> no idea what time it is or where I am really
> > > > > <john> just know it's london
> > > > > <jane> you are probably asleep
> > > > > <jane> I hope fish was fishy in a good eay
> > > > > <jone>
> > > > > <jane>
> > > > > <john> British security is a little more rigorous...
> > > > >
> > > > > To be clear, then, I'm trying to clean a large text file by writing a
> > > > > regular expression, such that I create a new object with no numbers or
> > > > > dates.
> > > > >
> > > > > Michael
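David's second suggestion (strip everything up to the `<`) can be sketched as follows. One small adjustment: replacing with `"<"` rather than `""` keeps the opening bracket that the desired output above retains. Note also that the greedy `.+<` matches up to the *last* `<` on the line, which is safe here only because each line contains a single `<`:

```r
# Strip the leading date/time by replacing everything up to the first
# (and only) "<" with a bare "<", preserving the <name> tag.
chrvec <- c("2016-07-01 02:50:35 <john> hey",
            "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh")
sub(".+<", "<", chrvec)
# [1] "<john> hey"
# [2] "<jane> waiting for plane to Edinburgh"
```

Lines that don't match the pattern (should any exist) pass through `sub()` unchanged, just as with the comma-substitution example earlier in the thread.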
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.