OK. So, I named the object test and then checked the 6347th item:

> test <- readLines("hangouts-conversation.txt")
> test[6347]
[1] "2016-10-21 10:56:37 <John Doe> Admit#8242"

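A side note on the line printed above. `read.fwf` hands the parsing to `read.table`, whose default `comment.char` is "#", so one plausible reason for the "did not have 4 elements" error is that everything from the `#` in `Admit#8242` onward is discarded. That is an assumption, not something confirmed in this thread. Below is a minimal sketch using two lines copied from the file and written to a temporary file; the names `tmp`, `res_default` and `res_fixed` are made up for illustration.

```r
# Two records copied from the thread; the second one contains a '#'
lines <- c("2016-10-21 10:56:29 <John Doe> John_Doe",
           "2016-10-21 10:56:37 <John Doe> Admit#8242")
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

widths <- c(10, 10, 20, 40)
cols   <- c("date", "time", "person", "comment")

# Default settings: '#' starts a comment, so the second record loses
# its last field; the error is captured as a message instead of stopping
res_default <- tryCatch(
  read.fwf(tmp, widths = widths, col.names = cols, strip.white = TRUE),
  error = function(e) conditionMessage(e)
)

# Same call with comment handling switched off
res_fixed <- read.fwf(tmp, widths = widths, col.names = cols,
                      strip.white = TRUE, comment.char = "")

print(res_default)
print(res_fixed)
```

Even with `comment.char = ""`, the `person` column still swallows part of the comment, because the names are not a fixed width. That may be a reason to prefer the regex route discussed in the quoted messages.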
Perhaps where it was getting screwed up is that, since this line ends in
a number (8242) and there's no space between that number and what ought
to be the next row, R didn't know where to draw the line. Sure enough,
it looks like this when I go to the original file and Ctrl+F "#8242":

2016-10-21 10:35:36 <Jane Doe> What's your login
2016-10-21 10:56:29 <John Doe> John_Doe
2016-10-21 10:56:37 <John Doe> Admit#8242
2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion

Again, it doesn't look like that in the file. Gmail automatically
formats it like that when I paste it in. More to the point, it looks
like

2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29 <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion

Notice Admit#82422016. So there's that.

Then I built object test2.

test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)

This worked for 84 lines, then this happened.

> test2[84]
[1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> test2[85]
[1] "//1,//2,//3,//4"
> test[85]
[1] "2016-07-01 02:50:35 <John Doe> hey"

Notice how I toggled back and forth between test and test2 there. So,
whatever happened with the regex, it happened in the switch from 84 to
85, I guess. It went on like

 [990] "//1,//2,//3,//4"
 [991] "//1,//2,//3,//4"
 [992] "//1,//2,//3,//4"
 [993] "//1,//2,//3,//4"
 [994] "//1,//2,//3,//4"
 [995] "//1,//2,//3,//4"
 [996] "//1,//2,//3,//4"
 [997] "//1,//2,//3,//4"
 [998] "//1,//2,//3,//4"
 [999] "//1,//2,//3,//4"
[1000] "//1,//2,//3,//4"

up until line 1000, then I reached max.print.

Michael

On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsem...@comcast.net> wrote:
>
> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> > Thanks for this tip on etiquette, David. I will be sure and not do that
> > again.
> >
> > I tried read.fwf (from the utils package), with code like this:
> >
> > d <- read.fwf("hangouts-conversation.txt",
> >               widths = c(10,10,20,40),
> >               col.names = c("date","time","person","comment"),
> >               strip.white = TRUE)
> >
> > But it threw this error:
> >
> > Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
> >   line 6347 did not have 4 elements
>
> So what does line 6347 look like? (Use `readLines` and print it out.)
>
> >
> > Interestingly, though, the error only happened when I increased the
> > width size. But I had to increase the size, or else I couldn't "see"
> > anything. The comment was so small that nothing was being captured by
> > the size of the column, so to speak.
> >
> > It seems like what's throwing me is that there's no comma that
> > demarcates the end of the text proper. For example:
>
> Not sure why you thought there should be a comma. Lines usually end
> with <cr> and/or <lf>.
>
> Once you have the raw text in a character vector from `readLines` named,
> say, 'chrvec', then you could selectively substitute commas for spaces
> with regex. (Now that you no longer desire to remove the dates and times.)
>
> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>
> This will not do any replacements when the pattern is not matched.
> See this test:
>
>  > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>  > newvec
>   [1] "2016-07-01,02:50:35,<john>,hey"
>   [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>   [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>   [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>   [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>   [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>   [7] "2016-07-01,02:54:17,<john>,just know it's london"
>   [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>   [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>  [10] "2016-07-01 02:58:56 <jone>"
>  [11] "2016-07-01 02:59:34 <jane>"
>  [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>
> You should probably remove the "empty comment" lines.
>
> --
>
> David.
>
> >
> > 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was lots of Starbucks in my day2016-07-01 15:35:47
> >
> > It was interesting, too: when I pasted the text into the email, it
> > self-formatted into the way I wanted it to look. I had to manually
> > make it look like it does above, since that's the way that it looks in
> > the txt file. I wonder if it's being organized by XML or something.
> >
> > Anyways, there's always a space between the two sideways carets, just
> > like there is right now: <John Doe> See. Space. And there's always a
> > space between the date and time. Like this. 2016-07-01 15:34:30 See.
> > Space. But there's never a space between the end of the comment and
> > the next date. Like this: We were in a starbucks2016-07-01 15:35:02
> > See. starbucks and 2016 are smooshed together.
> >
> > This code is also on the table right now too.
> >
> > a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
> >                 quote="\"", comment.char="", fill=TRUE)
> >
> > h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9])
> >
> > aa<-gsub("[^[:digit:]]","",h)
> > my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
> >
> > Those last lines are a work in progress. I wish I could import a
> > picture of what it looks like when it's translated into a data frame.
> > The fill=TRUE helped to get the data into a table that kind of sort of
> > works, but the comments keep bleeding into the date and time columns.
> > It's like
> >
> > 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
> > over there
> > 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> >
> > And then, maybe, the "Seriously" will be in a column all to itself, as
> > will be the "I've" and the "never" etc.
> >
> > I will use a regular expression if I have to, but it would be nice to
> > keep the dates and times on there. Originally, I thought they were
> > meaningless, but I've since changed my mind on that count. The time of
> > day isn't so important. But, especially since, say, Gmail itself knows
> > how to quickly recognize what it is, I know it can be done. I know
> > this data has structure to it.
> >
> > Michael
> >
> >
> > On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net>
> > wrote:
> >>
> >> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>> I have a wild and crazy text file, the head of which looks like this:
> >>>
> >>> 2016-07-01 02:50:35 <john> hey
> >>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>> 2016-07-01 02:54:17 <john> just know it's london
> >>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>> 2016-07-01 02:58:56 <jone>
> >>> 2016-07-01 02:59:34 <jane>
> >>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >> Looks entirely not-"crazy". Typical log file format.
> >>
> >> Two possibilities: 1) Use `read.fwf` from pkg utils; 2) Use regex
> >> (i.e. the `sub` function) to strip everything up to the "<". Read
> >> `?regex`. Since that's not a metacharacter, you could use a pattern
> >> ".+<" and replace with "".
> >>
> >> And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp,
> >> at least within hours of each, is considered poor manners.
> >>
> >>
> >> --
> >>
> >> David.
> >>
> >>> It goes on for a while. It's a big file. But I feel like it's going to
> >>> be difficult to annotate with the coreNLP library or package. I'm
> >>> doing natural language processing.
> >>> In other words, I'm curious as to
> >>> how I would shave off the dates, that is, to make it look like:
> >>>
> >>> <john> hey
> >>> <jane> waiting for plane to Edinburgh
> >>> <john> thinking about my boo
> >>> <jane> nothing crappy has happened, not really
> >>> <john> plane went by pretty fast, didn't sleep
> >>> <jane> no idea what time it is or where I am really
> >>> <john> just know it's london
> >>> <jane> you are probably asleep
> >>> <jane> I hope fish was fishy in a good eay
> >>> <jone>
> >>> <jane>
> >>> <john> British security is a little more rigorous...
> >>>
> >>> To be clear, then, I'm trying to clean a large text file by writing a
> >>> regular expression, such that I create a new object with no numbers or
> >>> dates.
> >>>
> >>> Michael
> >>>
> >>> ______________________________________________
> >>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
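A short sketch of the behaviour seen in the thread, in case it helps. In an R replacement string a backreference is written `"\\1"` (a backslash, escaped inside the string literal). `"//1"` is just literal text, which would explain every matching line turning into the same string while non-matching lines (such as the `***` status line) pass through untouched. The sample lines are copied from the messages above. `wrong`, `right`, `unmatched` and `parsed` are illustrative names. Using `strcapture` from utils (available in reasonably recent versions of R) in place of comma-joining is my suggestion, not something proposed in the thread.

```r
# Sample lines taken from the thread
chrvec <- c(
  "2016-06-28 21:12:43 *** John Doe ended a video chat",
  "2016-07-01 02:50:35 <John Doe> hey",
  "2016-07-01 02:58:56 <jone>",
  "2016-10-21 10:56:37 <John Doe> Admit#8242"
)

pat <- "^(.{10}) (.{8}) (<.+>) (.+$)"

# "//1" is literal text, so each matching line becomes the same string
wrong <- sub(pat, "//1,//2,//3,//4", chrvec)

# "\\1" refers to the first capture group, "\\2" to the second, and so on
right <- sub(pat, "\\1,\\2,\\3,\\4", chrvec)

# Lines the pattern does not match come back unchanged; list them
unmatched <- grep(pat, chrvec, invert = TRUE, value = TRUE)

# Alternative that skips the comma step (comments may contain commas):
# capture the groups straight into a data frame, NA where nothing matched
proto <- data.frame(date = character(), time = character(),
                    person = character(), comment = character(),
                    stringsAsFactors = FALSE)
parsed <- strcapture(pat, chrvec, proto)

print(wrong)
print(right)
print(unmatched)
print(parsed)
```

Rows of `parsed` that are all `NA` correspond to the status lines and empty comments, so they can be dropped with something like `parsed[!is.na(parsed$date), ]` if they are not needed.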