Inline
> On 2019-05-19, at 18:11, Michael Boulineau <michael.p.boulin...@gmail.com> > wrote: > > For context: > >> In gsub(b, "\\1<\\2> ", a) the work is done by the backreferences \\1 and >> \\2. The expression says: >> Substitute ALL of the match with the first captured expression, then " <", >> then the second captured expression, then "> ". The rest of the line is not >> substituted and appears as-is. > > Back to me: I guess what's giving me trouble is where to draw the line > in terms of the end or edge of the expression. Given the code, then, > >> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8") >> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" >> c <- gsub(b, "\\1<\\2> ", a) > > to me, it would seem as though this is the first captured expression, > that is, as though \\1 refers back to ^([0-9-]{10} [0-9:]{8} ), since > there are parentheses around it, or since [0-9-]{10} [0-9:]{8} is > enclosed in parentheses. That's correct: parentheses in regular expressions delimit captured substrings. > Then it would seem as though [*]{3} is the > second expression, and (\\w+ \\w+) is the third. Note that "[*]{3}" has no parentheses, is not captured and is not accounted for in the back-references. \\1 and \\2 refer only to the captured substrings - everything else contributes to whether the regex matches at all, but is no longer considered after the match. > According to this > (admittedly wrong) logic, it would seem as though the <> would go > around the date--like No: it goes around \\2, which is (\\w+ \\w+) > >> 2016-03-20 <19:29:37> *** Jane Doe started a video chat > > The back references here recall David's code earlier: > >> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec) > > There, commas were put around everything, and there you can see the > edge of the expression very well. ^(.{10}) = first. (.{8}) = second. > (<.+>) = third. (.+$) = fourth. 
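The matched-but-not-captured distinction discussed above can be seen in a minimal, self-contained sketch; the sample line below is assumed for illustration, not taken from the actual file:

```r
# "[*]{3}" must match for the regex to succeed, but only the parenthesized
# parts are captured and available as \1 and \2 in the replacement.
x <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"
p <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
out <- gsub(p, "\\1<\\2> ", x)
# The "***" is consumed by the match but, having no parentheses, is not
# echoed back by any backreference, so it disappears from the result.
```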
So, by the same logic, it would seem > as though in > >> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" > > that ^([0-9-]{10} [0-9:]{8} ) is first, that [*]{3} is second, and > that (\\w+ \\w+) is third. > > But, if Boris is to be right, and he is, obviously, then it would have > to be the case that this entire thing, namely, ^([0-9-]{10} [0-9:]{8} > )[*]{3}, is the first expression, Actually "[*]{3}" is not part of the first expression - it is discarded because not in parentheses > since only if that were true would > the <> be able to go around the names, as in > > [3] "2016-01-27 09:15:20 <Jane Doe> Hey " > > Again, so 2016-01-27 09:15:20 would have to be an entire unit, an > expression. The word "expression" has a different technical meaning, but colloquially you are right. > So I guess what I don't understand is how ^([0-9-]{10} > [0-9:]{8} )[*]{3} can be an entire expression, although my hunch would > be that it has something to do with the ^ or with the space after the > } and before the (, as in > >> {3} (\\w+ > No. Just the parentheses. > Back to earlier: > >> The rest of the line is not substituted and appears as-is. > > Is that due to the space after the \\2? in > >> "\\1<\\2> No, that is because the substitution in gsub() targets only the match of the regex - and the string to the end is not part of the regex. Cheers, Boris > Notice space after > and before " > > Michael > > On Sun, May 19, 2019 at 2:31 PM Boris Steipe <boris.ste...@utoronto.ca> wrote: >> >> Inline ... >> >>> On 2019-05-19, at 13:56, Michael Boulineau <michael.p.boulin...@gmail.com> >>> wrote: >>> >>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" >>> >>> so the ^ signals that the regex BEGINS with a number (that could be >>> any number, 0-9) that is only 10 characters long (then there's the >>> dash in there, too, with the 0-9-, which I assume enabled the regex to >>> grab the - that's between the numbers in the date) >> >> That's right. 
Note that within a "character class" the hyphen can have two >> meanings: normally it defines a range of characters, but if it appears as >> the last character before "]" it is a literal hyphen. >> >>> , followed by a >>> single space, followed by a unit that could be any number, again, but >>> that is only 8 characters long this time. For that one, it will >>> include the colon, hence the 9:, although for that one ([0-9:]{8} ), >> >> Right. >> >> >>> I >>> don't get why the space is on the inside in that one, after the {8}, >> >> The space needs to be preserved between the time and the name. I wrote >> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" # space in the first >> captured expression >> c <- gsub(b, "\\1<\\2> ", a) >> ... but I could have written >> b <- "^([0-9-]{10} [0-9:]{8}) [*]{3} (\\w+ \\w+)" >> c <- gsub(b, "\\1 <\\2> ", a) # space in the substituted string >> ... same result >> >> >>> whereas the space is on the outside with the other one ^([0-9-]{10} , >>> directly after the {10}. Why is that? >> >> In the second case, I capture without a space, because I don't want the >> space in the results, after the time. >> >> >>> >>> Then three *** [*]{3}, then the (\\w+ \\w+)", which Boris explained so >>> well above. I guess I still don't get why this one seemed to have >>> deleted the *** out of the mix, plus I still don't know why it didn't >>> remove the *** from the first one. >> >> Because the entire first line was not matched since it had a malformed >> character preceding the date. 
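The two meanings of the hyphen described above can be demonstrated with a quick sketch (sample strings assumed):

```r
# Inside a character class, a hyphen between two characters defines a range;
# a hyphen placed last, just before "]", is a literal hyphen.
ok_with_hyphen    <- grepl("^[0-9-]+$", "2016-03-20")  # digits or literal "-"
ok_without_hyphen <- grepl("^[0-9]+$",  "2016-03-20")  # digits only
# The first matches the date; the second fails because of the hyphens.
```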
>> >>> >>> 2016-03-20 19:29:37 *** Jane Doe started a video chat >>> 2016-03-20 19:30:35 *** John Doe ended a video chat >>> 2016-04-02 12:59:36 *** Jane Doe started a video chat >>> 2016-04-02 13:00:43 *** John Doe ended a video chat >>> 2016-04-02 13:01:08 *** Jane Doe started a video chat >>> 2016-04-02 13:01:41 *** John Doe ended a video chat >>> 2016-04-02 13:03:51 *** John Doe started a video chat >>> 2016-04-02 13:06:35 *** John Doe ended a video chat >>> >>> This is a random sample from the beginning of the txt file with no >>> edits. The ***s were deleted, all but the first one, the one that had >>> the  but that was taken out by the encoding = "UTF-8". I know that >>> the function was c <- gsub(b, "\\1<\\2> ", a), so it had a gsub () on >>> there, the point of which is to do substitution work. >>> >>> Oh, I get it, I think. The \\1<\\2> in the gsub () puts the <> around >>> the names, so that it's consistent with the rest of the data, so that >>> the names in the text about that aren't enclosed in the <> are >>> enclosed like the rest of them. But I still don't get why or how the >>> gsub () replaced the *** with the <>... >> >> In gsub(b, "\\1<\\2> ", a) the work is done by the backreferences \\1 and >> \\2. The expression says: >> Substitute ALL of the match with the first captured expression, then " <", >> then the second captured expression, then "> ". The rest of the line is not >> substituted and appears as-is. >> >> >>> >>> This one is more straightforward. >>> >>>> d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$" >>> >>> any number with - for 10 characters, followed by a space. Oh, there's >>> no space in this one ([0-9:]{8}), after the {8}. Hu. So, then, any >>> number with : for 8 characters, followed by any two words separated by >>> a space and enclosed in <>. And then the \\s* is followed by a single >>> space? Or maybe it puts space on both sides (on the side of the #s to >>> the left, and then the comment to the right). 
The (.+)$ is anything >>> whatsoever until the end. >> >> \s is the metacharacter for "whitespace". \s* means zero or more whitespace. >> I'm matching that OUTSIDE of the captured expression, to remove any leading >> spaces from the data that goes into the data frame. >> >> >> Cheers, >> Boris >> >> >> >> >>> >>> Michael >>> >>> >>> On Sun, May 19, 2019 at 4:37 AM Boris Steipe <boris.ste...@utoronto.ca> >>> wrote: >>>> >>>> Inline >>>> >>>> >>>> >>>>> On 2019-05-18, at 20:34, Michael Boulineau >>>>> <michael.p.boulin...@gmail.com> wrote: >>>>> >>>>> It appears to have worked, although there were three little quirks. >>>>> The ; close(con); rm(con) didn't work for me; the first row of the >>>>> data.frame was all NAs, when all was said and done; >>>> >>>> You will get NAs for lines that can't be matched to the regular >>>> expression. That's a good thing, it allows you to test whether your >>>> assumptions were valid for the entire file: >>>> >>>> # number of failed strcapture() >>>> sum(is.na(e$date)) >>>> >>>> >>>>> and then there >>>>> were still three *** on the same line where the  was apparently >>>>> deleted. >>>> >>>> This is a sign that something else happened with the line that prevented >>>> the regex from matching. In that case you need to investigate more. I see >>>> an invalid multibyte character at the beginning of the line you posted >>>> below. >>>> >>>>> >>>>>> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8") >>>>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" >>>>>> c <- gsub(b, "\\1<\\2> ", a) >>>>>> head (c) >>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat" >>>>> [2] "2016-01-27 09:15:20 <Jane Doe> >>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf" >>>> >>>> [...] >>>> >>>>> But, before I do anything else, I'm going to study the regex in this 
For example, I'm still not sure why there has to be the >>>>> second \\w+ in the (\\w+ \\w+). Little things like that. >>>> >>>> \w is the metacharacter for alphanumeric characters, \w+ designates >>>> something we could call a word. Thus \w+ \w+ are two words separated by a >>>> single blank. This corresponds to your example, but, as I wrote >>>> previously, you need to think very carefully whether this covers all >>>> possible cases (Could there be only one word? More than one blank? Could >>>> letters be separated by hyphens or periods?) In most cases we could have >>>> more robustly matched everything between "<" and ">" (taking care to test >>>> what happens if the message contains those characters). But for the video >>>> chat lines we need to make an assumption about what is name and what is >>>> not. If "started a video chat" is the only possibility in such lines, you >>>> can use this information instead. If there are other possibilities, you >>>> need a different strategy. In NLP there is no one-approach-fits-all. >>>> >>>> To validate the structure of the names in your transcripts, you can look at >>>> >>>> patt <- " <.+?> " # " <any string, not greedy> " >>>> m <- regexpr(patt, c) >>>> unique(regmatches(c, m)) >>>> >>>> >>>> >>>> B. 
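The validation snippet above assumes the processed vector c already exists; a self-contained version of the same check, with assumed sample lines, looks like this:

```r
# List the distinct "<name>" tags in the data, using a non-greedy match.
txt <- c("2016-01-27 09:15:20 <Jane Doe> Hey ",
         "2016-01-27 09:15:22 <John Doe> ended a video chat",
         "2016-01-27 21:07:11 <Jane Doe> started a video chat")
patt <- " <.+?> "                   # " <any string, not greedy> "
m    <- regexpr(patt, txt)          # first match position in each line
tags <- unique(regmatches(txt, m))  # the distinct name tags found
```

If anything other than the expected two-word names shows up in `tags`, the `(\\w+ \\w+)` assumption needs revisiting.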
>>>> >>>> >>>> >>>>> >>>>> Michael >>>>> >>>>> >>>>> On Sat, May 18, 2019 at 4:30 PM Boris Steipe <boris.ste...@utoronto.ca> >>>>> wrote: >>>>>> >>>>>> This works for me: >>>>>> >>>>>> # sample data >>>>>> c <- character() >>>>>> c[1] <- "2016-01-27 09:14:40 <Jane Doe> started a video chat" >>>>>> c[2] <- "2016-01-27 09:15:20 <Jane Doe> >>>>>> https://lh3.googleusercontent.com/" >>>>>> c[3] <- "2016-01-27 09:15:20 <Jane Doe> Hey " >>>>>> c[4] <- "2016-01-27 09:15:22 <John Doe> ended a video chat" >>>>>> c[5] <- "2016-01-27 21:07:11 <Jane Doe> started a video chat" >>>>>> c[6] <- "2016-01-27 21:26:57 <John Doe> ended a video chat" >>>>>> >>>>>> >>>>>> # regex ^(year) (time) <(word word)>\\s*(string)$ >>>>>> patt <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$" >>>>>> proto <- data.frame(date = character(), >>>>>> time = character(), >>>>>> name = character(), >>>>>> text = character(), >>>>>> stringsAsFactors = TRUE) >>>>>> d <- strcapture(patt, c, proto) >>>>>> >>>>>> >>>>>> >>>>>> date time name text >>>>>> 1 2016-01-27 09:14:40 Jane Doe started a video chat >>>>>> 2 2016-01-27 09:15:20 Jane Doe https://lh3.googleusercontent.com/ >>>>>> 3 2016-01-27 09:15:20 Jane Doe Hey >>>>>> 4 2016-01-27 09:15:22 John Doe ended a video chat >>>>>> 5 2016-01-27 21:07:11 Jane Doe started a video chat >>>>>> 6 2016-01-27 21:26:57 John Doe ended a video chat >>>>>> >>>>>> >>>>>> >>>>>> B. 
>>>>>> >>>>>> >>>>>>> On 2019-05-18, at 18:32, Michael Boulineau >>>>>>> <michael.p.boulin...@gmail.com> wrote: >>>>>>> >>>>>>> Going back and thinking through what Boris and William were saying >>>>>>> (also Ivan), I tried this: >>>>>>> >>>>>>> a <- readLines ("hangouts-conversation-6.csv.txt") >>>>>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" >>>>>>> c <- gsub(b, "\\1<\\2> ", a) >>>>>>>> head (c) >>>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat" >>>>>>> [2] "2016-01-27 09:15:20 <Jane Doe> >>>>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf" >>>>>>> [3] "2016-01-27 09:15:20 <Jane Doe> Hey " >>>>>>> [4] "2016-01-27 09:15:22 <John Doe> ended a video chat" >>>>>>> [5] "2016-01-27 21:07:11 <Jane Doe> started a video chat" >>>>>>> [6] "2016-01-27 21:26:57 <John Doe> ended a video chat" >>>>>>> >>>>>>> The  is still there, since I forgot to do what Ivan had suggested, >>>>>>> namely, >>>>>>> >>>>>>> a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding >>>>>>> = "UTF-8")); close(con); rm(con) >>>>>>> >>>>>>> But then the new code is still turning out only NAs when I apply >>>>>>> strcapture (). This was what happened next: >>>>>>> >>>>>>>> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} >>>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", >>>>>>> + c, proto=data.frame(stringsAsFactors=FALSE, When="", >>>>>>> Who="", >>>>>>> + What="")) >>>>>>>> head (d) >>>>>>> When Who What >>>>>>> 1 <NA> <NA> <NA> >>>>>>> 2 <NA> <NA> <NA> >>>>>>> 3 <NA> <NA> <NA> >>>>>>> 4 <NA> <NA> <NA> >>>>>>> 5 <NA> <NA> <NA> >>>>>>> 6 <NA> <NA> <NA> >>>>>>> >>>>>>> I've been reading up on regular expressions, too, so this code seems >>>>>>> spot on. What's going wrong? 
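One way to narrow down an all-NA result like this (a hedged sketch, not a diagnosis of the actual file): count how many input lines the pattern matches before calling strcapture(); zero matches means every row comes back NA. Note also that a string literal typed across two console lines keeps the embedded newline, which silently becomes part of the regex.

```r
# Count matches before capturing; 0 here would explain a data frame of NAs.
lines <- c("2016-01-27 09:15:20 <Jane Doe> Hey ",
           "2016-01-27 09:14:40 *** Jane Doe started a video chat")
patt <- paste0("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}",
               " [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2})",
               " +(<[^>]*>) *(.*$)")
n_matched <- sum(grepl(patt, lines))  # only the "<name>" line matches here
```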
>>>>>>> >>>>>>> Michael >>>>>>> >>>>>>> On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.ste...@utoronto.ca> >>>>>>> wrote: >>>>>>>> >>>>>>>> Don't start putting in extra commas and then reading this as csv. That >>>>>>>> approach is broken. The correct approach is what Bill outlined: read >>>>>>>> everything with readLines(), and then use a proper regular expression >>>>>>>> with strcapture(). >>>>>>>> >>>>>>>> You need to pre-process the object that readLines() gives you: replace >>>>>>>> the contents of the videochat lines, and make it conform to the format >>>>>>>> of the other lines before you process it into your data frame. >>>>>>>> >>>>>>>> Approximately something like >>>>>>>> >>>>>>>> # read the raw data >>>>>>>> tmp <- readLines("hangouts-conversation-6.csv.txt") >>>>>>>> >>>>>>>> # process all video chat lines >>>>>>>> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) " # (year time >>>>>>>> )*** (word word) >>>>>>>> tmp <- gsub(patt, "\\1<\\2> ", tmp) >>>>>>>> >>>>>>>> # next, use strcapture() >>>>>>>> >>>>>>>> Note that this makes the assumption that your names are always exactly >>>>>>>> two words containing only letters. If that assumption is not true, >>>>>>>> more thought needs to go into the regex. But you can test that: >>>>>>>> >>>>>>>> patt <- " <\\w+ \\w+> " #" <word word> " >>>>>>>> sum( ! grepl(patt, tmp)) >>>>>>>> >>>>>>>> ... will give the number of lines that remain in your file that do not >>>>>>>> have a tag that can be interpreted as "Who" >>>>>>>> >>>>>>>> Once that is fine, use Bill's approach - or a regular expression of >>>>>>>> your own design - to create your data frame. >>>>>>>> >>>>>>>> Hope this helps, >>>>>>>> Boris >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> On 2019-05-17, at 16:18, Michael Boulineau >>>>>>>>> <michael.p.boulin...@gmail.com> wrote: >>>>>>>>> >>>>>>>>> Very interesting. I'm sure I'll be trying to get rid of the byte order >>>>>>>>> mark eventually. 
But right now, I'm more worried about getting the >>>>>>>>> character vector into either a csv file or data.frame; that way, I can >>>>>>>>> be able to work with the data neatly tabulated into four columns: >>>>>>>>> date, time, person, comment. I assume it's a write.csv function, but I >>>>>>>>> don't know what arguments to put in it. header=FALSE? fill=T? >>>>>>>>> >>>>>>>>> Michael >>>>>>>>> >>>>>>>>> On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller >>>>>>>>> <jdnew...@dcn.davis.ca.us> wrote: >>>>>>>>>> >>>>>>>>>> If byte order mark is the issue then you can specify the file >>>>>>>>>> encoding as "UTF-8-BOM" and it won't show up in your data any more. >>>>>>>>>> >>>>>>>>>> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help >>>>>>>>>> <r-help@r-project.org> wrote: >>>>>>>>>>> The pattern I gave worked for the lines that you originally showed >>>>>>>>>>> from >>>>>>>>>>> the >>>>>>>>>>> data file ('a'), before you put commas into them. If the name is >>>>>>>>>>> either of >>>>>>>>>>> the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed >>>>>>>>>>> to >>>>>>>>>>> something like "(<[^>]*>|[*]{3})". >>>>>>>>>>> >>>>>>>>>>> The " " at the start of the imported data may come from the byte >>>>>>>>>>> order >>>>>>>>>>> mark that Windows apps like to put at the front of a text file in >>>>>>>>>>> UTF-8 >>>>>>>>>>> or >>>>>>>>>>> UTF-16 format. 
>>>>>>>>>>> >>>>>>>>>>> Bill Dunlap >>>>>>>>>>> TIBCO Software >>>>>>>>>>> wdunlap tibco.com >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau < >>>>>>>>>>> michael.p.boulin...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> This seemed to work: >>>>>>>>>>>> >>>>>>>>>>>>> a <- readLines ("hangouts-conversation-6.csv.txt") >>>>>>>>>>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a) >>>>>>>>>>>>> b [1:84] >>>>>>>>>>>> >>>>>>>>>>>> And the first 85 lines looks like this: >>>>>>>>>>>> >>>>>>>>>>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat" >>>>>>>>>>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat" >>>>>>>>>>>> >>>>>>>>>>>> Then they transition to the commas: >>>>>>>>>>>> >>>>>>>>>>>>> b [84:100] >>>>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" >>>>>>>>>>>> [2] "2016-07-01,02:50:35,<John Doe>,hey" >>>>>>>>>>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh" >>>>>>>>>>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo" >>>>>>>>>>>> >>>>>>>>>>>> Even the strange bit on line 6347 was caught by this: >>>>>>>>>>>> >>>>>>>>>>>>> b [6346:6348] >>>>>>>>>>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe" >>>>>>>>>>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242" >>>>>>>>>>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion" >>>>>>>>>>>> >>>>>>>>>>>> Perhaps most awesomely, the code catches spaces that are interposed >>>>>>>>>>>> into the comment itself: >>>>>>>>>>>> >>>>>>>>>>>>> b [4] >>>>>>>>>>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey " >>>>>>>>>>>>> b [85] >>>>>>>>>>>> [1] "2016-07-01,02:50:35,<John Doe>,hey" >>>>>>>>>>>> >>>>>>>>>>>> Notice whether there is a space after the "hey" or not. 
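The comma-insertion step above can be reproduced on one assumed sample line, using the backslash form of the backreferences:

```r
# Rewrite "date time <name> text" as comma-separated fields via backreferences.
line <- "2016-07-01 02:50:35 <John Doe> hey"
res  <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", line)
res
# -> "2016-07-01,02:50:35,<John Doe>,hey"
```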
>>>>>>>>>>>> >>>>>>>>>>>> These are the first two lines: >>>>>>>>>>>> >>>>>>>>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat" >>>>>>>>>>>> [2] "2016-01-27,09:15:20,<Jane >>>>>>>>>>>> Doe>, >>>>>>>>>>>> >>>>>>>>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf >>>>>>>>>>>> " >>>>>>>>>>>> >>>>>>>>>>>> So, who knows what happened with the  at the beginning of [1] >>>>>>>>>>>> directly above. But notice how there are no commas in [1] but there >>>>>>>>>>>> appear in [2]. I don't see why really long ones like [2] directly >>>>>>>>>>>> above would be a problem, were they to be translated into a csv or >>>>>>>>>>>> data frame column. >>>>>>>>>>>> >>>>>>>>>>>> Now, with the commas in there, couldn't we write this into a csv >>>>>>>>>>>> or a >>>>>>>>>>>> data.frame? Some of this data will end up being garbage, I imagine. >>>>>>>>>>>> Like in [2] directly above. Or with [83] and [84] at the top of >>>>>>>>>>>> this >>>>>>>>>>>> discussion post/email. Embarrassingly, I've been trying to convert >>>>>>>>>>>> this into a data.frame or csv but I can't manage to. I've been >>>>>>>>>>>> using >>>>>>>>>>>> the write.csv function, but I don't think I've been getting the >>>>>>>>>>>> arguments correct. >>>>>>>>>>>> >>>>>>>>>>>> At the end of the day, I would like a data.frame and/or csv with >>>>>>>>>>>> the >>>>>>>>>>>> following four columns: date, time, person, comment. 
>>>>>>>>>>>> >>>>>>>>>>>> I tried this, too: >>>>>>>>>>>> >>>>>>>>>>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} >>>>>>>>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", >>>>>>>>>>>> + a, proto=data.frame(stringsAsFactors=FALSE, >>>>>>>>>>> When="", >>>>>>>>>>>> Who="", >>>>>>>>>>>> + What="")) >>>>>>>>>>>> >>>>>>>>>>>> But all I got was this: >>>>>>>>>>>> >>>>>>>>>>>>> c [1:100, ] >>>>>>>>>>>> When Who What >>>>>>>>>>>> 1 <NA> <NA> <NA> >>>>>>>>>>>> 2 <NA> <NA> <NA> >>>>>>>>>>>> 3 <NA> <NA> <NA> >>>>>>>>>>>> 4 <NA> <NA> <NA> >>>>>>>>>>>> 5 <NA> <NA> <NA> >>>>>>>>>>>> 6 <NA> <NA> <NA> >>>>>>>>>>>> >>>>>>>>>>>> It seems to have caught nothing. >>>>>>>>>>>> >>>>>>>>>>>>> unique (c) >>>>>>>>>>>> When Who What >>>>>>>>>>>> 1 <NA> <NA> <NA> >>>>>>>>>>>> >>>>>>>>>>>> But I like that it converted into columns. That's a really great >>>>>>>>>>>> format. With a little tweaking, it'd be a great code for this data >>>>>>>>>>>> set. >>>>>>>>>>>> >>>>>>>>>>>> Michael >>>>>>>>>>>> >>>>>>>>>>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help >>>>>>>>>>>> <r-help@r-project.org> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Consider using readLines() and strcapture() for reading such a >>>>>>>>>>> file. 
>>>>>>>>>>>> E.g., >>>>>>>>>>>>> suppose readLines(files) produced a character vector like >>>>>>>>>>>>> >>>>>>>>>>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login", >>>>>>>>>>>>> "2016-10-21 10:56:29 <John Doe> John_Doe", >>>>>>>>>>>>> "2016-10-21 10:56:37 <John Doe> Admit#8242", >>>>>>>>>>>>> "October 23, 1819 12:34 <Jane Eyre> I am not an angel") >>>>>>>>>>>>> >>>>>>>>>>>>> Then you can make a data.frame with columns When, Who, and What by >>>>>>>>>>>>> supplying a pattern containing three parenthesized capture >>>>>>>>>>> expressions: >>>>>>>>>>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} >>>>>>>>>>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", >>>>>>>>>>>>> x, proto=data.frame(stringsAsFactors=FALSE, When="", >>>>>>>>>>> Who="", >>>>>>>>>>>>> What="")) >>>>>>>>>>>>>> str(z) >>>>>>>>>>>>> 'data.frame': 4 obs. of 3 variables: >>>>>>>>>>>>> $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" >>>>>>>>>>> "2016-10-21 >>>>>>>>>>>>> 10:56:37" NA >>>>>>>>>>>>> $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA >>>>>>>>>>>>> $ What: chr "What's your login" "John_Doe" "Admit#8242" NA >>>>>>>>>>>>> >>>>>>>>>>>>> Lines that don't match the pattern result in NA's - you might make >>>>>>>>>>> a >>>>>>>>>>>> second >>>>>>>>>>>>> pass over the corresponding elements of x with a new pattern. >>>>>>>>>>>>> >>>>>>>>>>>>> You can convert the When column from character to time with >>>>>>>>>>> as.POSIXct(). >>>>>>>>>>>>> >>>>>>>>>>>>> Bill Dunlap >>>>>>>>>>>>> TIBCO Software >>>>>>>>>>>>> wdunlap tibco.com >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius >>>>>>>>>>> <dwinsem...@comcast.net> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote: >>>>>>>>>>>>>>> OK. 
So, I named the object test and then checked the 6347th >>>>>>>>>>> item >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> test <- readLines ("hangouts-conversation.txt") >>>>>>>>>>>>>>>> test [6347] >>>>>>>>>>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242" >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Perhaps where it was getting screwed up is, since the end of >>>>>>>>>>> this is >>>>>>>>>>>> a >>>>>>>>>>>>>>> number (8242), then, given that there's no space between the >>>>>>>>>>> number >>>>>>>>>>>>>>> and what ought to be the next row, R didn't know where to draw >>>>>>>>>>> the >>>>>>>>>>>>>>> line. Sure enough, it looks like this when I go to the original >>>>>>>>>>> file >>>>>>>>>>>>>>> and control f "#8242" >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login >>>>>>>>>>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe >>>>>>>>>>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> An octothorpe is by default interpreted as introducing a comment, so the rest of the line is >>>>>>>>>>>> ignored. You can prevent that interpretation with suitable >>>>>>>>>>> choice of >>>>>>>>>>>>>> parameters to `read.table` or `read.csv`. I don't understand why >>>>>>>>>>> that >>>>>>>>>>>>>> should cause an error or a failure to match that pattern. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Again, it doesn't look like that in the file. Gmail >>>>>>>>>>> automatically >>>>>>>>>>>>>>> formats it like that when I paste it in. More to the point, it >>>>>>>>>>> looks >>>>>>>>>>>>>>> like >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 >>>>>>>>>>> 10:56:29 >>>>>>>>>>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> >>>>>>>>>>>> Admit#82422016-10-21 >>>>>>>>>>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Notice Admit#82422016. So there's that. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Then I built object test2. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", >>>>>>>>>>> test) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This worked for 84 lines, then this happened. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It may have done something but as you later discovered my first >>>>>>>>>>> code >>>>>>>>>>>> for >>>>>>>>>>>>>> the pattern was incorrect. I had tested it (and pasted in the >>>>>>>>>>> results >>>>>>>>>>>> of >>>>>>>>>>>>>> the test) . The way to refer to a capture class is with >>>>>>>>>>> back-slashes >>>>>>>>>>>>>> before the numbers, not forward-slashes. Try this: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", >>>>>>>>>>> "\\1,\\2,\\3,\\4", >>>>>>>>>>>> chrvec) >>>>>>>>>>>>>>> newvec >>>>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey" >>>>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh" >>>>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo" >>>>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, >>>>>>>>>>> not >>>>>>>>>>>> really" >>>>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, >>>>>>>>>>> didn't >>>>>>>>>>>> sleep" >>>>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or >>>>>>>>>>> where I am >>>>>>>>>>>>>> really" >>>>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london" >>>>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" >>>>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good >>>>>>>>>>> eay" >>>>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>" >>>>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>" >>>>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little >>>>>>>>>>> more >>>>>>>>>>>>>> rigorous..." >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I made note of the fact that the 10th and 11th lines had no >>>>>>>>>>> commas. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> test2 [84] >>>>>>>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" >>>>>>>>>>>>>> >>>>>>>>>>>>>> That line didn't have any "<" so wasn't matched. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> You could remove all non-matching lines for pattern of >>>>>>>>>>>>>> >>>>>>>>>>>>>> dates<space>times<space>"<"<name>">"<space><anything> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> with: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)] >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Do read: >>>>>>>>>>>>>> >>>>>>>>>>>>>> ?read.csv >>>>>>>>>>>>>> >>>>>>>>>>>>>> ?regex >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> >>>>>>>>>>>>>> David >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>> test2 [85] >>>>>>>>>>>>>>> [1] "//1,//2,//3,//4" >>>>>>>>>>>>>>>> test [85] >>>>>>>>>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey" >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Notice how I toggled back and forth between test and test2 >>>>>>>>>>> there. So, >>>>>>>>>>>>>>> whatever happened with the regex, it happened in the switch >>>>>>>>>>> from 84 >>>>>>>>>>>> to >>>>>>>>>>>>>>> 85, I guess. It went on like >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [990] "//1,//2,//3,//4" >>>>>>>>>>>>>>> [991] "//1,//2,//3,//4" >>>>>>>>>>>>>>> [992] "//1,//2,//3,//4" >>>>>>>>>>>>>>> [993] "//1,//2,//3,//4" >>>>>>>>>>>>>>> [994] "//1,//2,//3,//4" >>>>>>>>>>>>>>> [995] "//1,//2,//3,//4" >>>>>>>>>>>>>>> [996] "//1,//2,//3,//4" >>>>>>>>>>>>>>> [997] "//1,//2,//3,//4" >>>>>>>>>>>>>>> [998] "//1,//2,//3,//4" >>>>>>>>>>>>>>> [999] "//1,//2,//3,//4" >>>>>>>>>>>>>>> [1000] "//1,//2,//3,//4" >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> up until line 1000, then I reached max.print. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Michael >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius < >>>>>>>>>>>> dwinsem...@comcast.net> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote: >>>>>>>>>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure and >>>>>>>>>>> not do >>>>>>>>>>>>>> that again. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I tried the read.fwf from the foreign package, with a code >>>>>>>>>>> like >>>>>>>>>>>> this: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> d <- read.fwf("hangouts-conversation.txt", >>>>>>>>>>>>>>>>> widths= c(10,10,20,40), >>>>>>>>>>>>>>>>> >>>>>>>>>>> col.names=c("date","time","person","comment"), >>>>>>>>>>>>>>>>> strip.white=TRUE) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> But it threw this error: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote = >>>>>>>>>>> quote, >>>>>>>>>>>> dec >>>>>>>>>>>>>> = dec, : >>>>>>>>>>>>>>>>> line 6347 did not have 4 elements >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> So what does line 6347 look like? (Use `readLines` and print >>>>>>>>>>> it >>>>>>>>>>>> out.) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Interestingly, though, the error only happened when I >>>>>>>>>>> increased the >>>>>>>>>>>>>>>>> width size. But I had to increase the size, or else I >>>>>>>>>>> couldn't >>>>>>>>>>>> "see" >>>>>>>>>>>>>>>>> anything. The comment was so small that nothing was being >>>>>>>>>>>> captured by >>>>>>>>>>>>>>>>> the size of the column. so to speak. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> It seems like what's throwing me is that there's no comma >>>>>>>>>>> that >>>>>>>>>>>>>>>>> demarcates the end of the text proper. For example: >>>>>>>>>>>>>>>> Not sure why you thought there should be a comma. Lines >>>>>>>>>>> usually end >>>>>>>>>>>>>>>> with <cr> and or a <lf>. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Once you have the raw text in a character vector from >>>>>>>>>>> `readLines` >>>>>>>>>>>> named, >>>>>>>>>>>>>>>> say, 'chrvec', then you could selectively substitute commas >>>>>>>>>>> for >>>>>>>>>>>> spaces >>>>>>>>>>>>>>>> with regex. (Now that you no longer desire to remove the dates >>>>>>>>>>> and >>>>>>>>>>>>>> times.) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This will not do any replacements when the pattern is not >>>>>>>>>>> matched. >>>>>>>>>>>> See >>>>>>>>>>>>>>>> this test: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", >>>>>>>>>>> "\\1,\\2,\\3,\\4", >>>>>>>>>>>>>> chrvec) >>>>>>>>>>>>>>>>> newvec >>>>>>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey" >>>>>>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to >>>>>>>>>>> Edinburgh" >>>>>>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo" >>>>>>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has >>>>>>>>>>> happened, not >>>>>>>>>>>>>> really" >>>>>>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, >>>>>>>>>>> didn't >>>>>>>>>>>>>> sleep" >>>>>>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or >>>>>>>>>>> where >>>>>>>>>>>> I am >>>>>>>>>>>>>>>> really" >>>>>>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london" >>>>>>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" >>>>>>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a >>>>>>>>>>> good >>>>>>>>>>>> eay" >>>>>>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>" >>>>>>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>" >>>>>>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little >>>>>>>>>>> more >>>>>>>>>>>>>>>> rigorous..." >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> You should probably remove the "empty comment" lines. 
>>>>>
>>>>> --
>>>>>
>>>>> David.
>>>>>
>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
>>>>>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
>>>>>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
>>>>>> lots of Starbucks in my day2016-07-01 15:35:47
>>>>>>
>>>>>> It was interesting, too, when I pasted the text into the email, it
>>>>>> self-formatted into the way I wanted it to look. I had to manually
>>>>>> make it look like it does above, since that's the way that it looks in
>>>>>> the txt file. I wonder if it's being organized by XML or something.
>>>>>>
>>>>>> Anyways, there's always a space between the two sideways carrots, just
>>>>>> like there is right now: <John Doe> See. Space. And there's always a
>>>>>> space between the date and time. Like this: 2016-07-01 15:34:30 See.
>>>>>> Space. But there's never a space between the end of the comment and
>>>>>> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
>>>>>> See. starbucks and 2016 are smooshed together.
>>>>>>
>>>>>> This code is also on the table right now too.
>>>>>>
>>>>>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
>>>>>>                 quote = "\"", comment.char = "", fill = TRUE)
>>>>>>
>>>>>> h <- cbind(hangouts.conversation2[, 1:2],
>>>>>>            hangouts.conversation2[, 3:5],
>>>>>>            hangouts.conversation2[, 6:9])
>>>>>>
>>>>>> aa <- gsub("[^[:digit:]]", "", h)
>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
>>>>>>
>>>>>> Those last lines are a work in progress. I wish I could import a
>>>>>> picture of what it looks like when it's translated into a data frame.
>>>>>> The fill=TRUE helped to get the data in a table that kind of sort of
>>>>>> works, but the comments keep bleeding into the date and time column.
>>>>>> It's like
>>>>>>
>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
>>>>>>
>>>>>> And then, maybe, the "seriously" will be in a column all to itself, as
>>>>>> will be the "I've" and the "never" etc.
>>>>>>
>>>>>> I will use a regular expression if I have to, but it would be nice to
>>>>>> keep the dates and times on there. Originally, I thought they were
>>>>>> meaningless, but I've since changed my mind on that count. The time of
>>>>>> day isn't so important. But, especially since, say, Gmail itself knows
>>>>>> how to quickly recognize what it is, I know it can be done. I know
>>>>>> this data has structure to it.
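One possible regex route that keeps the dates and times, sketched here for illustration only (it is not code from the thread, and it assumes every stamp looks exactly like "YYYY-MM-DD HH:MM:SS" and every name sits in <...>): first un-smoosh the run-together lines at each date stamp, then capture the four fields into a data frame with utils::strcapture().

```r
# Sketch, not thread code: keep dates/times by capturing them as columns.
smushed <- "Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting"

# 1) break run-together text before each date stamp
stamp <- "[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}"
lines <- strsplit(gsub(paste0("(", stamp, ")"), "\n\\1", smushed), "\n")[[1]]
lines <- lines[grepl(stamp, lines)]     # keep only full records

# 2) capture the four fields into a data frame
d <- strcapture("^([0-9-]{10}) ([0-9:]{8}) <([^>]+)> (.*)$", lines,
                proto = data.frame(date = character(), time = character(),
                                   person = character(), comment = character(),
                                   stringsAsFactors = FALSE))
d$person
# [1] "Jane Doe"
```

With the fields in named columns, the comments can no longer bleed into the date and time columns the way fill=TRUE allowed.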
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net> wrote:
>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
>>>>>>>>
>>>>>>>> 2016-07-01 02:50:35 <john> hey
>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
>>>>>>>> 2016-07-01 02:58:56 <jone>
>>>>>>>> 2016-07-01 02:59:34 <jane>
>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>>>>>>>
>>>>>>> Looks entirely not-"crazy". Typical log file format.
>>>>>>>
>>>>>>> Two possibilities: 1) Use `read.fwf` from pkg utils; 2) Use regex
>>>>>>> (i.e. the sub function) to strip everything up to the "<". Read
>>>>>>> `?regex`. Since "<" is not a metacharacter you could use a pattern
>>>>>>> ".+<" and replace with "".
>>>>>>>
>>>>>>> And do read the Posting Guide.
>>>>>>> Cross-posting to StackOverflow and R-help, at least within hours of
>>>>>>> each other, is considered poor manners.
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> David.
>>>>>>>
>>>>>>>> It goes on for a while. It's a big file. But I feel like it's going to
>>>>>>>> be difficult to annotate with the coreNLP library or package. I'm
>>>>>>>> doing natural language processing. In other words, I'm curious as to
>>>>>>>> how I would shave off the dates, that is, to make it look like:
>>>>>>>>
>>>>>>>> <john> hey
>>>>>>>> <jane> waiting for plane to Edinburgh
>>>>>>>> <john> thinking about my boo
>>>>>>>> <jane> nothing crappy has happened, not really
>>>>>>>> <john> plane went by pretty fast, didn't sleep
>>>>>>>> <jane> no idea what time it is or where I am really
>>>>>>>> <john> just know it's london
>>>>>>>> <jane> you are probably asleep
>>>>>>>> <jane> I hope fish was fishy in a good eay
>>>>>>>> <jone>
>>>>>>>> <jane>
>>>>>>>> <john> British security is a little more rigorous...
>>>>>>>>
>>>>>>>> To be clear, then, I'm trying to clean a large text file by writing a
>>>>>>>> regular expression, such that I create a new object with no numbers or
>>>>>>>> dates.
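A sketch of David's second option, for illustration only (and assuming the comment text itself never contains a "<", since ".+" is greedy): replacing ".+<" with "" would strip the "<" as well, so the version below puts it back in the replacement.

```r
x <- c("2016-07-01 02:50:35 <john> hey",
       "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh")

# David's pattern, keeping the "<" by re-inserting it in the replacement:
sub(".+<", "<", x)
# [1] "<john> hey"
# [2] "<jane> waiting for plane to Edinburgh"

# Equivalent here: the "YYYY-MM-DD HH:MM:SS " prefix is exactly 20 characters.
sub("^.{20}", "", x)
```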
>>>>>>>>
>>>>>>>> Michael
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.