> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"

So the ^ anchors the match to the beginning of the line. [0-9-]{10} matches exactly ten characters, each of which may be a digit or a hyphen; the hyphen inside the class is what lets it grab the dashes between the numbers in the date. That is followed by a single space and then [0-9:]{8}: eight characters, each a digit or a colon, which covers the time, colons included. What confused me at first is the space placement: in b there is a single capture group, ([0-9-]{10} [0-9:]{8} ), and both spaces, the one after the {10} and the one after the {8}, sit inside it, so group 1 captures the date, the time, and a trailing space together, whereas in the pattern d below the spaces sit outside the groups. Why does that placement matter?
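A quick experiment shows what the space placement changes and what it doesn't. This is a sketch on a made-up line, not the real transcript; both patterns match exactly the same text, they only differ in what group 1 captures:

```r
# A made-up line in the format under discussion
x <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"

# Space inside the capture group vs. outside it
inside  <- regmatches(x, regexec("^([0-9-]{10} )", x))[[1]]
outside <- regmatches(x, regexec("^([0-9-]{10}) ", x))[[1]]

inside[1]; outside[1]   # the full match is identical either way
inside[2]               # group 1 includes the trailing space: "2016-03-20 "
outside[2]              # group 1 stops at the date:           "2016-03-20"
```

So the space only decides whether the captured field carries a trailing blank; it makes no difference to which lines match.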
Then [*]{3} matches the three literal asterisks, and (\\w+ \\w+) is the part Boris explained so well above. I guess I still didn't see at first why this seemed to delete the *** from every line except the first one:

2016-03-20 19:29:37 *** Jane Doe started a video chat
2016-03-20 19:30:35 *** John Doe ended a video chat
2016-04-02 12:59:36 *** Jane Doe started a video chat
2016-04-02 13:00:43 *** John Doe ended a video chat
2016-04-02 13:01:08 *** Jane Doe started a video chat
2016-04-02 13:01:41 *** John Doe ended a video chat
2016-04-02 13:03:51 *** John Doe started a video chat
2016-04-02 13:06:35 *** John Doe ended a video chat

This is a random sample from the beginning of the txt file, with no edits. The ***s were deleted, all but the first one, the one that had the  at the start (the character the encoding = "UTF-8" was supposed to take out), which, as Boris says above, is what kept the regex from matching there. The function was c <- gsub(b, "\\1<\\2> ", a), and the point of gsub() is substitution. Oh, I get it, I think: the \\1<\\2> in the replacement puts <> around the names, so that the names in the lines that aren't enclosed in <> become consistent with the rest of the data. And that also answers how gsub() removed the ***: the whole matched stretch, date, time, ***, and name, is replaced by group 1 and group 2 only, and since [*]{3} is matched but not captured, the asterisks simply aren't written back.

This one is more straightforward:

> d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"

Ten digit-or-hyphen characters, followed by a space. Oh, there's no space inside ([0-9:]{8}) this time, after the {8}. Huh. So: eight digit-or-colon characters, then a space, then two words separated by a space and enclosed in <>. The \\s* matches any run of whitespace (including none) between the closing > and the comment, and (.+)$ captures everything else up to the end of the line.
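That disappearing-*** behaviour is easy to reproduce. A sketch on made-up lines, using the variant of the pattern with the trailing space (as in Boris's later message): the asterisks are matched but sit outside both capture groups, so the replacement never writes them back.

```r
a <- c("2016-03-20 19:29:37 *** Jane Doe started a video chat",
       "2016-03-20 19:40:01 <Jane Doe> hello")  # already-tagged line

b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "
gsub(b, "\\1<\\2> ", a)
# [1] "2016-03-20 19:29:37 <Jane Doe> started a video chat"
# [2] "2016-03-20 19:40:01 <Jane Doe> hello"
```

The second line has no ***, so the pattern doesn't match and gsub() leaves it untouched.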
Michael On Sun, May 19, 2019 at 4:37 AM Boris Steipe <boris.ste...@utoronto.ca> wrote: > > Inline > > > > > On 2019-05-18, at 20:34, Michael Boulineau <michael.p.boulin...@gmail.com> > > wrote: > > > > It appears to have worked, although there were three little quirks. > > The ; close(con); rm(con) didn't work for me; the first row of the > > data.frame was all NAs, when all was said and done; > > You will get NAs for lines that can't be matched to the regular expression. > That's a good thing, it allows you to test whether your assumptions were > valid for the entire file: > > # number of failed strcapture() > sum(is.na(e$date)) > > > > and then there > > were still three *** on the same line where the  was apparently > > deleted. > > This is a sign that something else happened with the line that prevented the > regex from matching. In that case you need to investigate more. I see an > invalid multibyte character at the beginning of the line you posted below. > > > > >> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8") > >> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" > >> c <- gsub(b, "\\1<\\2> ", a) > >> head (c) > > [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat" > > [2] "2016-01-27 09:15:20 <Jane Doe> > > https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf" > > [...] > > > But, before I do anything else, I'm going to study the regex in this > > particular code. For example, I'm still not sure why there has to the > > second \\w+ in the (\\w+ \\w+). Little things like that. > > \w is the metacharacter for alphanumeric characters, \w+ designates something > we could call a word. Thus \w+ \w+ are two words separated by a single blank. > This corresponds to your example, but, as I wrote previously, you need to > think very carefully whether this covers all possible cases (Could there be > only one word? More than one blank? 
Could letters be separated by hyphens or > periods?) In most cases we could have more robustly matched everything > between "<" and ">" (taking care to test what happens if the message contains > those characters). But for the video chat lines we need to make an assumption > about what is name and what is not. If "started a video chat" is the only > possibility in such lines, you can use this information instead. If there are > other possibilities, you need a different strategy. In NLP there is no > one-approach-fits-all. > > To validate the structure of the names in your transcripts, you can look at > > patt <- " <.+?> " # " <any string, not greedy> " > m <- regexpr(patt, c) > unique(regmatches(c, m)) > > > > B. > > > > > > > Michael > > > > > > On Sat, May 18, 2019 at 4:30 PM Boris Steipe <boris.ste...@utoronto.ca> > > wrote: > >> > >> This works for me: > >> > >> # sample data > >> c <- character() > >> c[1] <- "2016-01-27 09:14:40 <Jane Doe> started a video chat" > >> c[2] <- "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/" > >> c[3] <- "2016-01-27 09:15:20 <Jane Doe> Hey " > >> c[4] <- "2016-01-27 09:15:22 <John Doe> ended a video chat" > >> c[5] <- "2016-01-27 21:07:11 <Jane Doe> started a video chat" > >> c[6] <- "2016-01-27 21:26:57 <John Doe> ended a video chat" > >> > >> > >> # regex ^(year) (time) <(word word)>\\s*(string)$ > >> patt <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$" > >> proto <- data.frame(date = character(), > >> time = character(), > >> name = character(), > >> text = character(), > >> stringsAsFactors = TRUE) > >> d <- strcapture(patt, c, proto) > >> > >> > >> > >> date time name text > >> 1 2016-01-27 09:14:40 Jane Doe started a video chat > >> 2 2016-01-27 09:15:20 Jane Doe https://lh3.googleusercontent.com/ > >> 3 2016-01-27 09:15:20 Jane Doe Hey > >> 4 2016-01-27 09:15:22 John Doe ended a video chat > >> 5 2016-01-27 21:07:11 Jane Doe started a video chat > >> 6 2016-01-27 21:26:57 John Doe ended a video 
chat > >> > >> > >> > >> B. > >> > >> > >>> On 2019-05-18, at 18:32, Michael Boulineau > >>> <michael.p.boulin...@gmail.com> wrote: > >>> > >>> Going back and thinking through what Boris and William were saying > >>> (also Ivan), I tried this: > >>> > >>> a <- readLines ("hangouts-conversation-6.csv.txt") > >>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" > >>> c <- gsub(b, "\\1<\\2> ", a) > >>>> head (c) > >>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat" > >>> [2] "2016-01-27 09:15:20 <Jane Doe> > >>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf" > >>> [3] "2016-01-27 09:15:20 <Jane Doe> Hey " > >>> [4] "2016-01-27 09:15:22 <John Doe> ended a video chat" > >>> [5] "2016-01-27 21:07:11 <Jane Doe> started a video chat" > >>> [6] "2016-01-27 21:26:57 <John Doe> ended a video chat" > >>> > >>> The  is still there, since I forgot to do what Ivan had suggested, > >>> namely, > >>> > >>> a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding > >>> = "UTF-8")); close(con); rm(con) > >>> > >>> But then the new code is still turning out only NAs when I apply > >>> strcapture (). This was what happened next: > >>> > >>>> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} > >>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", > >>> + c, proto=data.frame(stringsAsFactors=FALSE, When="", > >>> Who="", > >>> + What="")) > >>>> head (d) > >>> When Who What > >>> 1 <NA> <NA> <NA> > >>> 2 <NA> <NA> <NA> > >>> 3 <NA> <NA> <NA> > >>> 4 <NA> <NA> <NA> > >>> 5 <NA> <NA> <NA> > >>> 6 <NA> <NA> <NA> > >>> > >>> I've been reading up on regular expressions, too, so this code seems > >>> spot on. What's going wrong? > >>> > >>> Michael > >>> > >>> On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.ste...@utoronto.ca> > >>> wrote: > >>>> > >>>> Don't start putting in extra commas and then reading this as csv. That > >>>> approach is broken. 
The correct approach is what Bill outlined: read > >>>> everything with readLines(), and then use a proper regular expression > >>>> with strcapture(). > >>>> > >>>> You need to pre-process the object that readLines() gives you: replace > >>>> the contents of the videochat lines, and make it conform to the format > >>>> of the other lines before you process it into your data frame. > >>>> > >>>> Approximately something like > >>>> > >>>> # read the raw data > >>>> tmp <- readLines("hangouts-conversation-6.csv.txt") > >>>> > >>>> # process all video chat lines > >>>> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) " # (year time )*** > >>>> (word word) > >>>> tmp <- gsub(patt, "\\1<\\2> ", tmp) > >>>> > >>>> # next, use strcapture() > >>>> > >>>> Note that this makes the assumption that your names are always exactly > >>>> two words containing only letters. If that assumption is not true, more > >>>> thought needs to go into the regex. But you can test that: > >>>> > >>>> patt <- " <\\w+ \\w+> " #" <word word> " > >>>> sum( ! grepl(patt, tmp)) > >>>> > >>>> ... will give the number of lines that remain in your file that do not > >>>> have a tag that can be interpreted as "Who" > >>>> > >>>> Once that is fine, use Bill's approach - or a regular expression of your > >>>> own design - to create your data frame. > >>>> > >>>> Hope this helps, > >>>> Boris > >>>> > >>>> > >>>> > >>>> > >>>>> On 2019-05-17, at 16:18, Michael Boulineau > >>>>> <michael.p.boulin...@gmail.com> wrote: > >>>>> > >>>>> Very interesting. I'm sure I'll be trying to get rid of the byte order > >>>>> mark eventually. But right now, I'm more worried about getting the > >>>>> character vector into either a csv file or data.frame; that way, I can > >>>>> work with the data neatly tabulated into four columns: > >>>>> date, time, person, comment. I assume it's a write.csv function, but I > >>>>> don't know what arguments to put in it. header=FALSE? fill=T?
> >>>>> > >>>>> Micheal > >>>>> > >>>>> On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller > >>>>> <jdnew...@dcn.davis.ca.us> wrote: > >>>>>> > >>>>>> If byte order mark is the issue then you can specify the file encoding > >>>>>> as "UTF-8-BOM" and it won't show up in your data any more. > >>>>>> > >>>>>> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help > >>>>>> <r-help@r-project.org> wrote: > >>>>>>> The pattern I gave worked for the lines that you originally showed > >>>>>>> from > >>>>>>> the > >>>>>>> data file ('a'), before you put commas into them. If the name is > >>>>>>> either of > >>>>>>> the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed so > >>>>>>> something like "(<[^>]*>|[*]{3})". > >>>>>>> > >>>>>>> The " " at the start of the imported data may come from the byte > >>>>>>> order > >>>>>>> mark that Windows apps like to put at the front of a text file in > >>>>>>> UTF-8 > >>>>>>> or > >>>>>>> UTF-16 format. > >>>>>>> > >>>>>>> Bill Dunlap > >>>>>>> TIBCO Software > >>>>>>> wdunlap tibco.com > >>>>>>> > >>>>>>> > >>>>>>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau < > >>>>>>> michael.p.boulin...@gmail.com> wrote: > >>>>>>> > >>>>>>>> This seemed to work: > >>>>>>>> > >>>>>>>>> a <- readLines ("hangouts-conversation-6.csv.txt") > >>>>>>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a) > >>>>>>>>> b [1:84] > >>>>>>>> > >>>>>>>> And the first 85 lines looks like this: > >>>>>>>> > >>>>>>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat" > >>>>>>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat" > >>>>>>>> > >>>>>>>> Then they transition to the commas: > >>>>>>>> > >>>>>>>>> b [84:100] > >>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" > >>>>>>>> [2] "2016-07-01,02:50:35,<John Doe>,hey" > >>>>>>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh" > >>>>>>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo" > >>>>>>>> > >>>>>>>> Even the 
strange bit on line 6347 was caught by this: > >>>>>>>> > >>>>>>>>> b [6346:6348] > >>>>>>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe" > >>>>>>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242" > >>>>>>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion" > >>>>>>>> > >>>>>>>> Perhaps most awesomely, the code catches spaces that are interposed > >>>>>>>> into the comment itself: > >>>>>>>> > >>>>>>>>> b [4] > >>>>>>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey " > >>>>>>>>> b [85] > >>>>>>>> [1] "2016-07-01,02:50:35,<John Doe>,hey" > >>>>>>>> > >>>>>>>> Notice whether there is a space after the "hey" or not. > >>>>>>>> > >>>>>>>> These are the first two lines: > >>>>>>>> > >>>>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat" > >>>>>>>> [2] "2016-01-27,09:15:20,<Jane > >>>>>>>> Doe>, > >>>>>>>> > >>>>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf > >>>>>>>> " > >>>>>>>> > >>>>>>>> So, who knows what happened with the  at the beginning of [1] > >>>>>>>> directly above. But notice how there are no commas in [1] but there > >>>>>>>> appear in [2]. I don't see why really long ones like [2] directly > >>>>>>>> above would be a problem, were they to be translated into a csv or > >>>>>>>> data frame column. > >>>>>>>> > >>>>>>>> Now, with the commas in there, couldn't we write this into a csv or a > >>>>>>>> data.frame? Some of this data will end up being garbage, I imagine. > >>>>>>>> Like in [2] directly above. Or with [83] and [84] at the top of this > >>>>>>>> discussion post/email. Embarrassingly, I've been trying to convert > >>>>>>>> this into a data.frame or csv but I can't manage to. I've been using > >>>>>>>> the write.csv function, but I don't think I've been getting the > >>>>>>>> arguments correct. 
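On the write.csv() arguments being guessed at in the thread: write.csv() has no header= or fill= arguments (those belong to the read.* functions); for a data frame with date, time, person, comment columns, row.names = FALSE is usually the only option needed. A minimal sketch with made-up rows and a made-up file name:

```r
# Stand-in for the data frame produced by strcapture()
d <- data.frame(date    = c("2016-07-01", "2016-07-01"),
                time    = c("02:50:35", "02:51:26"),
                person  = c("<John Doe>", "<John Doe>"),
                comment = c("hey", "waiting for plane to Edinburgh"),
                stringsAsFactors = FALSE)

# Drop the rows where strcapture() failed to match, then write
d <- d[!is.na(d$date), ]
write.csv(d, "hangouts-conversation.csv", row.names = FALSE)
```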
> >>>>>>>> > >>>>>>>> At the end of the day, I would like a data.frame and/or csv with the > >>>>>>>> following four columns: date, time, person, comment. > >>>>>>>> > >>>>>>>> I tried this, too: > >>>>>>>> > >>>>>>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} > >>>>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", > >>>>>>>> + a, proto=data.frame(stringsAsFactors=FALSE, > >>>>>>> When="", > >>>>>>>> Who="", > >>>>>>>> + What="")) > >>>>>>>> > >>>>>>>> But all I got was this: > >>>>>>>> > >>>>>>>>> c [1:100, ] > >>>>>>>> When Who What > >>>>>>>> 1 <NA> <NA> <NA> > >>>>>>>> 2 <NA> <NA> <NA> > >>>>>>>> 3 <NA> <NA> <NA> > >>>>>>>> 4 <NA> <NA> <NA> > >>>>>>>> 5 <NA> <NA> <NA> > >>>>>>>> 6 <NA> <NA> <NA> > >>>>>>>> > >>>>>>>> It seems to have caught nothing. > >>>>>>>> > >>>>>>>>> unique (c) > >>>>>>>> When Who What > >>>>>>>> 1 <NA> <NA> <NA> > >>>>>>>> > >>>>>>>> But I like that it converted into columns. That's a really great > >>>>>>>> format. With a little tweaking, it'd be a great code for this data > >>>>>>>> set. > >>>>>>>> > >>>>>>>> Michael > >>>>>>>> > >>>>>>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help > >>>>>>>> <r-help@r-project.org> wrote: > >>>>>>>>> > >>>>>>>>> Consider using readLines() and strcapture() for reading such a > >>>>>>> file. 
> >>>>>>>> E.g., > >>>>>>>>> suppose readLines(files) produced a character vector like > >>>>>>>>> > >>>>>>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login", > >>>>>>>>> "2016-10-21 10:56:29 <John Doe> John_Doe", > >>>>>>>>> "2016-10-21 10:56:37 <John Doe> Admit#8242", > >>>>>>>>> "October 23, 1819 12:34 <Jane Eyre> I am not an angel") > >>>>>>>>> > >>>>>>>>> Then you can make a data.frame with columns When, Who, and What by > >>>>>>>>> supplying a pattern containing three parenthesized capture > >>>>>>> expressions: > >>>>>>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} > >>>>>>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", > >>>>>>>>> x, proto=data.frame(stringsAsFactors=FALSE, When="", > >>>>>>> Who="", > >>>>>>>>> What="")) > >>>>>>>>>> str(z) > >>>>>>>>> 'data.frame': 4 obs. of 3 variables: > >>>>>>>>> $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" > >>>>>>> "2016-10-21 > >>>>>>>>> 10:56:37" NA > >>>>>>>>> $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA > >>>>>>>>> $ What: chr "What's your login" "John_Doe" "Admit#8242" NA > >>>>>>>>> > >>>>>>>>> Lines that don't match the pattern result in NA's - you might make > >>>>>>> a > >>>>>>>> second > >>>>>>>>> pass over the corresponding elements of x with a new pattern. > >>>>>>>>> > >>>>>>>>> You can convert the When column from character to time with > >>>>>>> as.POSIXct(). > >>>>>>>>> > >>>>>>>>> Bill Dunlap > >>>>>>>>> TIBCO Software > >>>>>>>>> wdunlap tibco.com > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius > >>>>>>> <dwinsem...@comcast.net> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote: > >>>>>>>>>>> OK. 
So, I named the object test and then checked the 6347th > >>>>>>> item > >>>>>>>>>>> > >>>>>>>>>>>> test <- readLines ("hangouts-conversation.txt) > >>>>>>>>>>>> test [6347] > >>>>>>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242" > >>>>>>>>>>> > >>>>>>>>>>> Perhaps where it was getting screwed up is, since the end of > >>>>>>> this is > >>>>>>>> a > >>>>>>>>>>> number (8242), then, given that there's no space between the > >>>>>>> number > >>>>>>>>>>> and what ought to be the next row, R didn't know where to draw > >>>>>>> the > >>>>>>>>>>> line. Sure enough, it looks like this when I go to the original > >>>>>>> file > >>>>>>>>>>> and control f "#8242" > >>>>>>>>>>> > >>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login > >>>>>>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe > >>>>>>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242 > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> An octothorpe is an end of line signifier and is interpreted as > >>>>>>>> allowing > >>>>>>>>>> comments. You can prevent that interpretation with suitable > >>>>>>> choice of > >>>>>>>>>> parameters to `read.table` or `read.csv`. I don't understand why > >>>>>>> that > >>>>>>>>>> should cause anu error or a failure to match that pattern. > >>>>>>>>>> > >>>>>>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion > >>>>>>>>>>> > >>>>>>>>>>> Again, it doesn't look like that in the file. Gmail > >>>>>>> automatically > >>>>>>>>>>> formats it like that when I paste it in. More to the point, it > >>>>>>> looks > >>>>>>>>>>> like > >>>>>>>>>>> > >>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 > >>>>>>> 10:56:29 > >>>>>>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> > >>>>>>>> Admit#82422016-10-21 > >>>>>>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion > >>>>>>>>>>> > >>>>>>>>>>> Notice Admit#82422016. So there's that. > >>>>>>>>>>> > >>>>>>>>>>> Then I built object test2. 
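David's octothorpe point above is easy to verify: read.table() treats # as starting a comment by default, and comment.char = "" switches that off. A sketch using inline text rather than the real file (sep = "\t" just keeps the whole line as one field):

```r
txt <- "2016-10-21 10:56:37 <John Doe> Admit#8242"

# Default comment.char = "#": everything from the '#' on is silently dropped
read.table(text = txt, sep = "\t", stringsAsFactors = FALSE)[1, 1]
# [1] "2016-10-21 10:56:37 <John Doe> Admit"

# comment.char = "" keeps the line intact
read.table(text = txt, sep = "\t", comment.char = "", stringsAsFactors = FALSE)[1, 1]
# [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
```

readLines(), by contrast, does no comment handling at all, which is one more reason it is the safer way to ingest this file.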
> >>>>>>>>>>> > >>>>>>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", > >>>>>>> test) > >>>>>>>>>>> > >>>>>>>>>>> This worked for 84 lines, then this happened. > >>>>>>>>>> > >>>>>>>>>> It may have done something but as you later discovered my first > >>>>>>> code > >>>>>>>> for > >>>>>>>>>> the pattern was incorrect. I had tested it (and pasted in the > >>>>>>> results > >>>>>>>> of > >>>>>>>>>> the test) . The way to refer to a capture class is with > >>>>>>> back-slashes > >>>>>>>>>> before the numbers, not forward-slashes. Try this: > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", > >>>>>>> "\\1,\\2,\\3,\\4", > >>>>>>>> chrvec) > >>>>>>>>>>> newvec > >>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey" > >>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh" > >>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo" > >>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, > >>>>>>> not > >>>>>>>> really" > >>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, > >>>>>>> didn't > >>>>>>>> sleep" > >>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or > >>>>>>> where I am > >>>>>>>>>> really" > >>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london" > >>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" > >>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good > >>>>>>> eay" > >>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>" > >>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>" > >>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little > >>>>>>> more > >>>>>>>>>> rigorous..." > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> I made note of the fact that the 10th and 11th lines had no > >>>>>>> commas. > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> test2 [84] > >>>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" > >>>>>>>>>> > >>>>>>>>>> That line didn't have any "<" so wasn't matched. 
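The //1 mixup above is worth a two-line demonstration: in a replacement string, \\1 is a backreference to capture group 1, while //1 is just literal text, so a pattern that matches the whole line wipes it out and leaves the literal "//1,//2,//3,//4" behind. A sketch:

```r
x <- "2016-07-01 02:50:35 <John Doe> hey"
patt <- "^(.{10}) (.{8}) (<.+>) (.+$)"

sub(patt, "//1,//2,//3,//4", x)   # literal replacement: the line is lost
# [1] "//1,//2,//3,//4"
sub(patt, "\\1,\\2,\\3,\\4", x)   # backreferences: fields kept, commas added
# [1] "2016-07-01,02:50:35,<John Doe>,hey"
```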
> >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> You could remove all non-matching lines for pattern of > >>>>>>>>>> > >>>>>>>>>> dates<space>times<space>"<"<name>">"<space><anything> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> with: > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)] > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Do read: > >>>>>>>>>> > >>>>>>>>>> ?read.csv > >>>>>>>>>> > >>>>>>>>>> ?regex > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> -- > >>>>>>>>>> > >>>>>>>>>> David > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>>> test2 [85] > >>>>>>>>>>> [1] "//1,//2,//3,//4" > >>>>>>>>>>>> test [85] > >>>>>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey" > >>>>>>>>>>> > >>>>>>>>>>> Notice how I toggled back and forth between test and test2 > >>>>>>> there. So, > >>>>>>>>>>> whatever happened with the regex, it happened in the switch > >>>>>>> from 84 > >>>>>>>> to > >>>>>>>>>>> 85, I guess. It went on like > >>>>>>>>>>> > >>>>>>>>>>> [990] "//1,//2,//3,//4" > >>>>>>>>>>> [991] "//1,//2,//3,//4" > >>>>>>>>>>> [992] "//1,//2,//3,//4" > >>>>>>>>>>> [993] "//1,//2,//3,//4" > >>>>>>>>>>> [994] "//1,//2,//3,//4" > >>>>>>>>>>> [995] "//1,//2,//3,//4" > >>>>>>>>>>> [996] "//1,//2,//3,//4" > >>>>>>>>>>> [997] "//1,//2,//3,//4" > >>>>>>>>>>> [998] "//1,//2,//3,//4" > >>>>>>>>>>> [999] "//1,//2,//3,//4" > >>>>>>>>>>> [1000] "//1,//2,//3,//4" > >>>>>>>>>>> > >>>>>>>>>>> up until line 1000, then I reached max.print. > >>>>>>>>>> > >>>>>>>>>>> Michael > >>>>>>>>>>> > >>>>>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius < > >>>>>>>> dwinsem...@comcast.net> > >>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote: > >>>>>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure and > >>>>>>> not do > >>>>>>>>>> that again. 
> >>>>>>>>>>>>> > >>>>>>>>>>>>> I tried the read.fwf from the foreign package, with a code > >>>>>>> like > >>>>>>>> this: > >>>>>>>>>>>>> > >>>>>>>>>>>>> d <- read.fwf("hangouts-conversation.txt", > >>>>>>>>>>>>> widths= c(10,10,20,40), > >>>>>>>>>>>>> > >>>>>>> col.names=c("date","time","person","comment"), > >>>>>>>>>>>>> strip.white=TRUE) > >>>>>>>>>>>>> > >>>>>>>>>>>>> But it threw this error: > >>>>>>>>>>>>> > >>>>>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote = > >>>>>>> quote, > >>>>>>>> dec > >>>>>>>>>> = dec, : > >>>>>>>>>>>>> line 6347 did not have 4 elements > >>>>>>>>>>>> > >>>>>>>>>>>> So what does line 6347 look like? (Use `readLines` and print > >>>>>>> it > >>>>>>>> out.) > >>>>>>>>>>>> > >>>>>>>>>>>>> Interestingly, though, the error only happened when I > >>>>>>> increased the > >>>>>>>>>>>>> width size. But I had to increase the size, or else I > >>>>>>> couldn't > >>>>>>>> "see" > >>>>>>>>>>>>> anything. The comment was so small that nothing was being > >>>>>>>> captured by > >>>>>>>>>>>>> the size of the column. so to speak. > >>>>>>>>>>>>> > >>>>>>>>>>>>> It seems like what's throwing me is that there's no comma > >>>>>>> that > >>>>>>>>>>>>> demarcates the end of the text proper. For example: > >>>>>>>>>>>> Not sure why you thought there should be a comma. Lines > >>>>>>> usually end > >>>>>>>>>>>> with <cr> and or a <lf>. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Once you have the raw text in a character vector from > >>>>>>> `readLines` > >>>>>>>> named, > >>>>>>>>>>>> say, 'chrvec', then you could selectively substitute commas > >>>>>>> for > >>>>>>>> spaces > >>>>>>>>>>>> with regex. (Now that you no longer desire to remove the dates > >>>>>>> and > >>>>>>>>>> times.) > >>>>>>>>>>>> > >>>>>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec) > >>>>>>>>>>>> > >>>>>>>>>>>> This will not do any replacements when the pattern is not > >>>>>>> matched. 
> >>>>>>>> See > >>>>>>>>>>>> this test: > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", > >>>>>>> "\\1,\\2,\\3,\\4", > >>>>>>>>>> chrvec) > >>>>>>>>>>>>> newvec > >>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey" > >>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to > >>>>>>> Edinburgh" > >>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo" > >>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has > >>>>>>> happened, not > >>>>>>>>>> really" > >>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, > >>>>>>> didn't > >>>>>>>>>> sleep" > >>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or > >>>>>>> where > >>>>>>>> I am > >>>>>>>>>>>> really" > >>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london" > >>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" > >>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a > >>>>>>> good > >>>>>>>> eay" > >>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>" > >>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>" > >>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little > >>>>>>> more > >>>>>>>>>>>> rigorous..." > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> You should probably remove the "empty comment" lines. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> > >>>>>>>>>>>> David. > >>>>>>>>>>>> > >>>>>>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a > >>>>>>>> starbucks2016-07-01 > >>>>>>>>>>>>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 > >>>>>>> <Jane > >>>>>>>>>>>>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> > >>>>>>> There was > >>>>>>>>>>>>> lots of Starbucks in my day2016-07-01 15:35:47 > >>>>>>>>>>>>> > >>>>>>>>>>>>> It was interesting, too, when I pasted the text into the > >>>>>>> email, it > >>>>>>>>>>>>> self-formatted into the way I wanted it to look. 
I had to > >>>>>>> manually > >>>>>>>>>>>>> make it look like it does above, since that's the way that it > >>>>>>>> looks in > >>>>>>>>>>>>> the txt file. I wonder if it's being organized by XML or > >>>>>>> something. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Anyways, There's always a space between the two sideways > >>>>>>> carrots, > >>>>>>>> just > >>>>>>>>>>>>> like there is right now: <John Doe> See. Space. And there's > >>>>>>> always > >>>>>>>> a > >>>>>>>>>>>>> space between the data and time. Like this. 2016-07-01 > >>>>>>> 15:34:30 > >>>>>>>> See. > >>>>>>>>>>>>> Space. But there's never a space between the end of the > >>>>>>> comment and > >>>>>>>>>>>>> the next date. Like this: We were in a starbucks2016-07-01 > >>>>>>> 15:35:02 > >>>>>>>>>>>>> See. starbucks and 2016 are smooshed together. > >>>>>>>>>>>>> > >>>>>>>>>>>>> This code is also on the table right now too. > >>>>>>>>>>>>> > >>>>>>>>>>>>> a <- read.table("E:/working > >>>>>>>>>>>>> directory/-189/hangouts-conversation2.txt", quote="\"", > >>>>>>>>>>>>> comment.char="", fill=TRUE) > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>> h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9]) > >>>>>>>>>>>>> > >>>>>>>>>>>>> aa<-gsub("[^[:digit:]]","",h) > >>>>>>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+")) > >>>>>>>>>>>>> > >>>>>>>>>>>>> Those last lines are a work in progress. I wish I could > >>>>>>> import a > >>>>>>>>>>>>> picture of what it looks like when it's translated into a > >>>>>>> data > >>>>>>>> frame. > >>>>>>>>>>>>> The fill=TRUE helped to get the data in table that kind of > >>>>>>> sort of > >>>>>>>>>>>>> works, but the comments keep bleeding into the data and time > >>>>>>>> column. 
> >>>>>>>>>>>>> It's like > >>>>>>>>>>>>> > >>>>>>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been > >>>>>>>>>>>>> over there > >>>>>>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :( > >>>>>>>>>>>>> > >>>>>>>>>>>>> And then, maybe, the "seriously" will be in a column all to > >>>>>>>> itself, as > >>>>>>>>>>>>> will be the "I've'"and the "never" etc. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I will use a regular expression if I have to, but it would be > >>>>>>> nice > >>>>>>>> to > >>>>>>>>>>>>> keep the dates and times on there. Originally, I thought they > >>>>>>> were > >>>>>>>>>>>>> meaningless, but I've since changed my mind on that count. > >>>>>>> The > >>>>>>>> time of > >>>>>>>>>>>>> day isn't so important. But, especially since, say, Gmail > >>>>>>> itself > >>>>>>>> knows > >>>>>>>>>>>>> how to quickly recognize what it is, I know it can be done. I > >>>>>>> know > >>>>>>>>>>>>> this data has structure to it. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Michael > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius < > >>>>>>>>>> dwinsem...@comcast.net> wrote: > >>>>>>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote: > >>>>>>>>>>>>>>> I have a wild and crazy text file, the head of which looks > >>>>>>> like > >>>>>>>> this: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 2016-07-01 02:50:35 <john> hey > >>>>>>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh > >>>>>>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo > >>>>>>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not > >>>>>>>> really > >>>>>>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, > >>>>>>> didn't > >>>>>>>> sleep > >>>>>>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where > >>>>>>> I am > >>>>>>>>>> really > >>>>>>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london > >>>>>>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep > >>>>>>>>>>>>>>> 
2016-07-01 02:58:45 <jane> I hope fish was fishy in a good > >>>>>>> eay > >>>>>>>>>>>>>>> 2016-07-01 02:58:56 <jone> > >>>>>>>>>>>>>>> 2016-07-01 02:59:34 <jane> > >>>>>>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little > >>>>>>> more > >>>>>>>>>> rigorous... > >>>>>>>>>>>>>> Looks entirely not-"crazy". Typical log file format. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Two possibilities: 1) Use `read.fwf` from pkg foreign; 2) > >>>>>>> Use > >>>>>>>> regex > >>>>>>>>>>>>>> (i.e. the sub-function) to strip everything up to the "<". > >>>>>>> Read > >>>>>>>>>>>>>> `?regex`. Since that's not a metacharacters you could use a > >>>>>>>> pattern > >>>>>>>>>>>>>> ".+<" and replace with "". > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> And do read the Posting Guide. Cross-posting to > >>>>>>> StackOverflow and > >>>>>>>>>> Rhelp, > >>>>>>>>>>>>>> at least within hours of each, is considered poor manners. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> -- > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> David. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> It goes on for a while. It's a big file. But I feel like > >>>>>>> it's > >>>>>>>> going > >>>>>>>>>> to > >>>>>>>>>>>>>>> be difficult to annotate with the coreNLP library or > >>>>>>> package. I'm > >>>>>>>>>>>>>>> doing natural language processing. 
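One caution on the ".+<" suggestion above: .+ is greedy, the replacement consumes the "<" itself, and on lines whose message contains a "<" it would eat everything up to the last one. Anchoring on the date/time shape instead keeps the name tag intact. A sketch on the sample lines:

```r
x <- c("2016-07-01 02:50:35 <john> hey",
       "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh")

# Remove only the leading date and time; leave the <name> tag in place
sub("^[0-9-]{10} [0-9:]{8} ", "", x)
# [1] "<john> hey"
# [2] "<jane> waiting for plane to Edinburgh"
```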
>>> In other words, I'm curious as to how I would shave off the dates,
>>> that is, to make it look like:
>>>
>>> <john> hey
>>> <jane> waiting for plane to Edinburgh
>>> <john> thinking about my boo
>>> <jane> nothing crappy has happened, not really
>>> <john> plane went by pretty fast, didn't sleep
>>> <jane> no idea what time it is or where I am really
>>> <john> just know it's london
>>> <jane> you are probably asleep
>>> <jane> I hope fish was fishy in a good eay
>>> <jone>
>>> <jane>
>>> <john> British security is a little more rigorous...
>>>
>>> To be clear, then, I'm trying to clean a large text file by writing a
>>> regular expression, such that I create a new object with no numbers
>>> or dates.
>>>
>>> Michael
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
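For what it's worth, the sub() approach David describes can be sketched in a couple of lines of R. One caveat: the literal pattern ".+<" also consumes the "<" itself, so a variant is shown below that keeps the <name> brackets intact. The sample lines are taken from the log format shown above.

```r
# Two sample lines in the format shown above
x <- c("2016-07-01 02:50:35 <john> hey",
       "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh")

sub(".+<", "", x)     # strips through the "<": "john> hey", ...
sub("^[^<]+", "", x)  # keeps the brackets:     "<john> hey", ...
```

Applied to the whole file, the same idea would be something like `a <- readLines("chat.txt", encoding = "UTF-8"); writeLines(sub("^[0-9-]{10} [0-9:]{8} ", "", a), "chat-clean.txt")`, where "chat.txt" is a placeholder for the real file name and the `[0-9-]{10} [0-9:]{8} ` character classes match the date and time fields discussed earlier in the thread.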