Going back and thinking through what Boris and William were saying (also Ivan), I tried this:
a <- readLines("hangouts-conversation-6.csv.txt")
b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
c <- gsub(b, "\\1<\\2> ", a)

> head(c)
[1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
[2] "2016-01-27 09:15:20 <Jane Doe> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
[3] "2016-01-27 09:15:20 <Jane Doe> Hey "
[4] "2016-01-27 09:15:22 <John Doe> ended a video chat"
[5] "2016-01-27 21:07:11 <Jane Doe> started a video chat"
[6] "2016-01-27 21:26:57 <John Doe> ended a video chat"

The byte order mark is still there at the start of [1], since I forgot to do
what Ivan had suggested, namely,

a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding = "UTF-8")); close(con); rm(con)

But the new code is still turning out only NAs when I apply strcapture().
This was what happened next:

> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
+ [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
+ c, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
+ What=""))
> head(d)
  When  Who What
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
6 <NA> <NA> <NA>

I've been reading up on regular expressions, too, and to me this code seems
spot on. What's going wrong?

Michael

On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
>
> Don't start putting in extra commas and then reading this as csv. That
> approach is broken. The correct approach is what Bill outlined: read
> everything with readLines(), and then use a proper regular expression with
> strcapture().
>
> You need to pre-process the object that readLines() gives you: replace the
> contents of the videochat lines, and make it conform to the format of the
> other lines before you process it into your data frame.
>
> Approximately something like
>
> # read the raw data
> tmp <- readLines("hangouts-conversation-6.csv.txt")
>
> # process all video chat lines
> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "  # (year time )*** (word word)
> tmp <- gsub(patt, "\\1<\\2> ", tmp)
>
> # next, use strcapture()
>
> Note that this makes the assumption that your names are always exactly two
> words containing only letters. If that assumption is not true, more thought
> needs to go into the regex. But you can test that:
>
> patt <- " <\\w+ \\w+> "  # " <word word> "
> sum( ! grepl(patt, tmp))
>
> ... will give the number of lines that remain in your file that do not
> have a tag that can be interpreted as "Who".
>
> Once that is fine, use Bill's approach - or a regular expression of your
> own design - to create your data frame.
>
> Hope this helps,
> Boris
>
>
> > On 2019-05-17, at 16:18, Michael Boulineau <michael.p.boulin...@gmail.com> wrote:
> >
> > Very interesting. I'm sure I'll be trying to get rid of the byte order
> > mark eventually. But right now, I'm more worried about getting the
> > character vector into either a csv file or a data.frame, so that I can
> > work with the data neatly tabulated into four columns: date, time,
> > person, comment. I assume it's a write.csv function, but I don't know
> > what arguments to put in it. header=FALSE? fill=T?
> >
> > Michael
> >
> > On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> >>
> >> If the byte order mark is the issue then you can specify the file
> >> encoding as "UTF-8-BOM" and it won't show up in your data any more.
> >>
> >> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help
> >> <r-help@r-project.org> wrote:
> >>> The pattern I gave worked for the lines that you originally showed
> >>> from the data file ('a'), before you put commas into them.
> >>> If the name is either of
> >>> the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed to
> >>> something like "(<[^>]*>|[*]{3})".
> >>>
> >>> The stray "<U+FEFF>" at the start of the imported data may come from
> >>> the byte order mark that Windows apps like to put at the front of a
> >>> text file in UTF-8 or UTF-16 format.
> >>>
> >>> Bill Dunlap
> >>> TIBCO Software
> >>> wdunlap tibco.com
> >>>
> >>>
> >>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <
> >>> michael.p.boulin...@gmail.com> wrote:
> >>>
> >>>> This seemed to work:
> >>>>
> >>>>> a <- readLines("hangouts-conversation-6.csv.txt")
> >>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
> >>>>> b[1:84]
> >>>>
> >>>> And the first 84 lines look like this:
> >>>>
> >>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
> >>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>
> >>>> Then they transition to the commas:
> >>>>
> >>>>> b[84:100]
> >>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>> [2] "2016-07-01,02:50:35,<John Doe>,hey"
> >>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
> >>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
> >>>>
> >>>> Even the strange bit on line 6347 was caught by this:
> >>>>
> >>>>> b[6346:6348]
> >>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
> >>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
> >>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
> >>>>
> >>>> Perhaps most awesomely, the code catches spaces that are interposed
> >>>> into the comment itself:
> >>>>
> >>>>> b[4]
> >>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
> >>>>> b[85]
> >>>> [1] "2016-07-01,02:50:35,<John Doe>,hey"
> >>>>
> >>>> Notice whether there is a space after the "hey" or not.
> >>>>
> >>>> These are the first two lines:
> >>>>
> >>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> >>>> [2] "2016-01-27,09:15:20,<Jane Doe>,https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
> >>>>
> >>>> So, who knows what happened with the <U+FEFF> at the beginning of [1]
> >>>> directly above. But notice how there are no commas in [1] while they
> >>>> appear in [2]. I don't see why really long ones like [2] directly
> >>>> above would be a problem, were they to be translated into a csv or
> >>>> data frame column.
> >>>>
> >>>> Now, with the commas in there, couldn't we write this into a csv or a
> >>>> data.frame? Some of this data will end up being garbage, I imagine.
> >>>> Like in [2] directly above. Or with [83] and [84] at the top of this
> >>>> discussion post/email. Embarrassingly, I've been trying to convert
> >>>> this into a data.frame or csv but I can't manage to. I've been using
> >>>> the write.csv function, but I don't think I've been getting the
> >>>> arguments correct.
> >>>>
> >>>> At the end of the day, I would like a data.frame and/or csv with the
> >>>> following four columns: date, time, person, comment.
> >>>>
> >>>> I tried this, too:
> >>>>
> >>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>> + a, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> >>>> + What=""))
> >>>>
> >>>> But all I got was this:
> >>>>
> >>>>> c[1:100, ]
> >>>>   When  Who What
> >>>> 1 <NA> <NA> <NA>
> >>>> 2 <NA> <NA> <NA>
> >>>> 3 <NA> <NA> <NA>
> >>>> 4 <NA> <NA> <NA>
> >>>> 5 <NA> <NA> <NA>
> >>>> 6 <NA> <NA> <NA>
> >>>>
> >>>> It seems to have caught nothing.
> >>>>
> >>>>> unique(c)
> >>>>   When  Who What
> >>>> 1 <NA> <NA> <NA>
> >>>>
> >>>> But I like that it converted into columns. That's a really great
> >>>> format.
> >>>> With a little tweaking, it'd be great code for this data set.
> >>>>
> >>>> Michael
> >>>>
> >>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help
> >>>> <r-help@r-project.org> wrote:
> >>>>>
> >>>>> Consider using readLines() and strcapture() for reading such a file.
> >>>>> E.g., suppose readLines(files) produced a character vector like
> >>>>>
> >>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
> >>>>>        "2016-10-21 10:56:29 <John Doe> John_Doe",
> >>>>>        "2016-10-21 10:56:37 <John Doe> Admit#8242",
> >>>>>        "October 23, 1819 12:34 <Jane Eyre> I am not an angel")
> >>>>>
> >>>>> Then you can make a data.frame with columns When, Who, and What by
> >>>>> supplying a pattern containing three parenthesized capture expressions:
> >>>>>
> >>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>>     x, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> >>>>>     What=""))
> >>>>>> str(z)
> >>>>> 'data.frame': 4 obs. of 3 variables:
> >>>>>  $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21 10:56:37" NA
> >>>>>  $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA
> >>>>>  $ What: chr "What's your login" "John_Doe" "Admit#8242" NA
> >>>>>
> >>>>> Lines that don't match the pattern result in NA's - you might make a
> >>>>> second pass over the corresponding elements of x with a new pattern.
> >>>>>
> >>>>> You can convert the When column from character to time with as.POSIXct().
> >>>>>
> >>>>> Bill Dunlap
> >>>>> TIBCO Software
> >>>>> wdunlap tibco.com
> >>>>>
> >>>>>
> >>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsem...@comcast.net>
> >>>>> wrote:
> >>>>>>
> >>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote:
> >>>>>>> OK.
> >>>>>>> So, I named the object test and then checked the 6347th item:
> >>>>>>>
> >>>>>>>> test <- readLines("hangouts-conversation.txt")
> >>>>>>>> test[6347]
> >>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
> >>>>>>>
> >>>>>>> Perhaps where it was getting screwed up is: since the end of this
> >>>>>>> is a number (8242), and there's no space between the number and
> >>>>>>> what ought to be the next row, R didn't know where to draw the
> >>>>>>> line. Sure enough, it looks like this when I go to the original
> >>>>>>> file and control-F "#8242":
> >>>>>>>
> >>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login
> >>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe
> >>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242
> >>>>>>
> >>>>>> An octothorpe is an end of line signifier and is interpreted as
> >>>>>> allowing comments. You can prevent that interpretation with a
> >>>>>> suitable choice of parameters to `read.table` or `read.csv`. I
> >>>>>> don't understand why that should cause any error or a failure to
> >>>>>> match that pattern.
> >>>>>>
> >>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
> >>>>>>>
> >>>>>>> Again, it doesn't look like that in the file. Gmail automatically
> >>>>>>> formats it like that when I paste it in. More to the point, it
> >>>>>>> looks like
> >>>>>>>
> >>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29
> >>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21
> >>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion
> >>>>>>>
> >>>>>>> Notice Admit#82422016. So there's that.
> >>>>>>>
> >>>>>>> Then I built object test2:
> >>>>>>>
> >>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
> >>>>>>>
> >>>>>>> This worked for 84 lines, then this happened.
> >>>>>>
> >>>>>> It may have done something, but as you later discovered my first
> >>>>>> code for the pattern was incorrect. I had tested it (and pasted in
> >>>>>> the results of the test). The way to refer to a capture class is
> >>>>>> with back-slashes before the numbers, not forward-slashes. Try this:
> >>>>>>
> >>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> >>>>>>> newvec
> >>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
> >>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >>>>>> [10] "2016-07-01 02:58:56 <jone>"
> >>>>>> [11] "2016-07-01 02:59:34 <jane>"
> >>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>>>>>
> >>>>>> I made note of the fact that the 10th and 11th lines had no commas.
> >>>>>>
> >>>>>>>
> >>>>>>>> test2[84]
> >>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>
> >>>>>> That line didn't have any "<" so wasn't matched.
> >>>>>>
> >>>>>> You could remove all non-matching lines for a pattern of
> >>>>>>
> >>>>>> dates<space>times<space>"<"<name>">"<space><anything>
> >>>>>>
> >>>>>> with:
> >>>>>>
> >>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)]
> >>>>>>
> >>>>>> Do read:
> >>>>>>
> >>>>>> ?read.csv
> >>>>>>
> >>>>>> ?regex
> >>>>>>
> >>>>>> --
> >>>>>>
> >>>>>> David
> >>>>>>
> >>>>>>>
> >>>>>>>> test2[85]
> >>>>>>> [1] "//1,//2,//3,//4"
> >>>>>>>> test[85]
> >>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey"
> >>>>>>>
> >>>>>>> Notice how I toggled back and forth between test and test2 there.
> >>>>>>> So, whatever happened with the regex, it happened in the switch
> >>>>>>> from 84 to 85, I guess. It went on like
> >>>>>>>
> >>>>>>>  [990] "//1,//2,//3,//4"
> >>>>>>>  [991] "//1,//2,//3,//4"
> >>>>>>>  [992] "//1,//2,//3,//4"
> >>>>>>>  [993] "//1,//2,//3,//4"
> >>>>>>>  [994] "//1,//2,//3,//4"
> >>>>>>>  [995] "//1,//2,//3,//4"
> >>>>>>>  [996] "//1,//2,//3,//4"
> >>>>>>>  [997] "//1,//2,//3,//4"
> >>>>>>>  [998] "//1,//2,//3,//4"
> >>>>>>>  [999] "//1,//2,//3,//4"
> >>>>>>> [1000] "//1,//2,//3,//4"
> >>>>>>>
> >>>>>>> up until line 1000, then I reached max.print.
> >>>>>>>
> >>>>>>> Michael
> >>>>>>>
> >>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsem...@comcast.net>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> >>>>>>>>> Thanks for this tip on etiquette, David. I will be sure not to
> >>>>>>>>> do that again.
> >>>>>>>>>
> >>>>>>>>> I tried the read.fwf from the foreign package, with code like this:
> >>>>>>>>>
> >>>>>>>>> d <- read.fwf("hangouts-conversation.txt",
> >>>>>>>>>               widths= c(10,10,20,40),
> >>>>>>>>>               col.names=c("date","time","person","comment"),
> >>>>>>>>>               strip.white=TRUE)
> >>>>>>>>>
> >>>>>>>>> But it threw this error:
> >>>>>>>>>
> >>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
> >>>>>>>>>   line 6347 did not have 4 elements
> >>>>>>>>
> >>>>>>>> So what does line 6347 look like? (Use `readLines` and print it out.)
> >>>>>>>>
> >>>>>>>>> Interestingly, though, the error only happened when I increased
> >>>>>>>>> the width size. But I had to increase the size, or else I
> >>>>>>>>> couldn't "see" anything. The comment was so small that nothing
> >>>>>>>>> was being captured by the size of the column, so to speak.
> >>>>>>>>>
> >>>>>>>>> It seems like what's throwing me is that there's no comma that
> >>>>>>>>> demarcates the end of the text proper. For example:
> >>>>>>>>
> >>>>>>>> Not sure why you thought there should be a comma. Lines usually
> >>>>>>>> end with a <cr> and/or a <lf>.
> >>>>>>>>
> >>>>>>>> Once you have the raw text in a character vector from `readLines`
> >>>>>>>> named, say, 'chrvec', then you could selectively substitute commas
> >>>>>>>> for spaces with regex. (Now that you no longer desire to remove
> >>>>>>>> the dates and times.)
> >>>>>>>>
> >>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
> >>>>>>>>
> >>>>>>>> This will not do any replacements when the pattern is not matched.
> >>>>>>>> See this test:
> >>>>>>>>
> >>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> >>>>>>>>> newvec
> >>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
> >>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
> >>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
> >>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>>>>>>>
> >>>>>>>> You should probably remove the "empty comment" lines.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> David.
> >>>>>>>>
> >>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
> >>>>>>>>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
> >>>>>>>>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
> >>>>>>>>> lots of Starbucks in my day2016-07-01 15:35:47
> >>>>>>>>>
> >>>>>>>>> It was interesting, too, when I pasted the text into the email, it
> >>>>>>>>> self-formatted into the way I wanted it to look. I had to manually
> >>>>>>>>> make it look like it does above, since that's the way that it looks
> >>>>>>>>> in the txt file. I wonder if it's being organized by XML or something.
> >>>>>>>>>
> >>>>>>>>> Anyways, there's always a space between the two sideways carrots,
> >>>>>>>>> just like there is right now: <John Doe> See. Space. And there's
> >>>>>>>>> always a space between the date and time. Like this: 2016-07-01
> >>>>>>>>> 15:34:30 See. Space. But there's never a space between the end of
> >>>>>>>>> the comment and the next date. Like this: We were in a
> >>>>>>>>> starbucks2016-07-01 15:35:02 See. starbucks and 2016 are smooshed
> >>>>>>>>> together.
> >>>>>>>>>
> >>>>>>>>> This code is also on the table right now too:
> >>>>>>>>>
> >>>>>>>>> a <- read.table("E:/working
> >>>>>>>>> directory/-189/hangouts-conversation2.txt", quote="\"",
> >>>>>>>>> comment.char="", fill=TRUE)
> >>>>>>>>>
> >>>>>>>>> h <- cbind(hangouts.conversation2[,1:2], hangouts.conversation2[,3:5],
> >>>>>>>>>            hangouts.conversation2[,6:9])
> >>>>>>>>>
> >>>>>>>>> aa <- gsub("[^[:digit:]]","",h)
> >>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
> >>>>>>>>>
> >>>>>>>>> Those last lines are a work in progress. I wish I could import a
> >>>>>>>>> picture of what it looks like when it's translated into a data
> >>>>>>>>> frame. The fill=TRUE helped to get the data in a table that kind
> >>>>>>>>> of sort of works, but the comments keep bleeding into the date
> >>>>>>>>> and time column. It's like
> >>>>>>>>>
> >>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
> >>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> >>>>>>>>>
> >>>>>>>>> And then, maybe, the "Seriously" will be in a column all to
> >>>>>>>>> itself, as will be the "I've" and the "never", etc.
> >>>>>>>>>
> >>>>>>>>> I will use a regular expression if I have to, but it would be
> >>>>>>>>> nice to keep the dates and times on there. Originally, I thought
> >>>>>>>>> they were meaningless, but I've since changed my mind on that count.
> >>>>>>>>> The time of
> >>>>>>>>> day isn't so important. But, especially since, say, Gmail itself
> >>>>>>>>> knows how to quickly recognize what it is, I know it can be done.
> >>>>>>>>> I know this data has structure to it.
> >>>>>>>>>
> >>>>>>>>> Michael
> >>>>>>>>>
> >>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net> wrote:
> >>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>>>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
> >>>>>>>>>>>
> >>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
> >>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
> >>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>> 2016-07-01 02:58:56 <jone>
> >>>>>>>>>>> 2016-07-01 02:59:34 <jane>
> >>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >>>>>>>>>>
> >>>>>>>>>> Looks entirely not-"crazy". Typical log file format.
> >>>>>>>>>>
> >>>>>>>>>> Two possibilities: 1) Use `read.fwf` from pkg foreign; 2) Use
> >>>>>>>>>> regex (i.e. the sub function) to strip everything up to the "<".
> >>>>>>>>>> Read `?regex`. Since that's not a metacharacter you could use a
> >>>>>>>>>> pattern ".+<" and replace with "".
> >>>>>>>>>>
> >>>>>>>>>> And do read the Posting Guide.
> >>>>>>>>>> Cross-posting to StackOverflow and
> >>>>>>>>>> R-help, at least within hours of each other, is considered poor
> >>>>>>>>>> manners.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> David.
> >>>>>>>>>>
> >>>>>>>>>>> It goes on for a while. It's a big file. But I feel like it's
> >>>>>>>>>>> going to be difficult to annotate with the coreNLP library or
> >>>>>>>>>>> package. I'm doing natural language processing. In other words,
> >>>>>>>>>>> I'm curious as to how I would shave off the dates, that is, to
> >>>>>>>>>>> make it look like:
> >>>>>>>>>>>
> >>>>>>>>>>> <john> hey
> >>>>>>>>>>> <jane> waiting for plane to Edinburgh
> >>>>>>>>>>> <john> thinking about my boo
> >>>>>>>>>>> <jane> nothing crappy has happened, not really
> >>>>>>>>>>> <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>> <jane> no idea what time it is or where I am really
> >>>>>>>>>>> <john> just know it's london
> >>>>>>>>>>> <jane> you are probably asleep
> >>>>>>>>>>> <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>> <jone>
> >>>>>>>>>>> <jane>
> >>>>>>>>>>> <john> British security is a little more rigorous...
> >>>>>>>>>>>
> >>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by
> >>>>>>>>>>> writing a regular expression, such that I create a new object
> >>>>>>>>>>> with no numbers or dates.
> >>>>>>>>>>>
> >>>>>>>>>>> Michael
> >>>>>>>>>>>
> >>>>>>>>>>> ______________________________________________
> >>>>>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>>>>> PLEASE do read the posting guide
> >>>>>>>>>>> http://www.R-project.org/posting-guide.html
> >>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >> --
> >> Sent from my phone. Please excuse my brevity.
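
[Putting the thread's suggestions together, a minimal end-to-end sketch might look like the following. This is not code anyone posted verbatim: the two sample lines stand in for the real file, and the "UTF-8-BOM" encoding, the one-line strcapture() pattern, and the as.POSIXct() conversion are assembled from Jeff's, Boris's, and Bill's advice above.]

```r
## Minimal sketch combining the advice in this thread.
## The sample vector below stands in for
##   readLines("hangouts-conversation-6.csv.txt", encoding = "UTF-8-BOM")
## which also disposes of the byte order mark.
x <- c("2016-01-27 09:14:40 *** Jane Doe started a video chat",
       "2016-07-01 02:50:35 <John Doe> hey")

## Boris's step: rewrite "*** First Last" as "<First Last>" so the
## video chat lines match the format of the ordinary lines.
patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "
x <- gsub(patt, "\\1<\\2> ", x)

## Bill's step: capture When / Who / What. Note the pattern is written
## on ONE line: a string continued across console lines keeps the
## embedded newline, which can make every line fail to match (all NAs).
d <- strcapture("^([0-9-]{10} [0-9:]{8}) +(<[^>]*>) *(.*)$", x,
                proto = data.frame(When = "", Who = "", What = "",
                                   stringsAsFactors = FALSE))
d$When <- as.POSIXct(d$When)
```

With the two sample lines above, d$Who comes out as "<Jane Doe>" and "<John Doe>", and d$What as "started a video chat" and "hey".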