If the byte order mark is the issue, then you can specify the file encoding as "UTF-8-BOM" and it won't show up in your data any more.
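A minimal sketch of that fix (the sample line is borrowed from the output quoted later in the thread; the file is a throwaway temp file): write a small UTF-8 file that starts with a byte order mark, then read it back through a connection opened with encoding = "UTF-8-BOM".

```r
## Sketch: create a tiny UTF-8 file that begins with a byte order mark,
## then read it back.  Opening the connection with encoding = "UTF-8-BOM"
## strips the mark before the text reaches your data.
tmp <- tempfile(fileext = ".txt")
bom <- as.raw(c(0xef, 0xbb, 0xbf))
txt <- charToRaw("2016-06-28 21:02:28 *** Jane Doe started a video chat\n")
writeBin(c(bom, txt), tmp)

con <- file(tmp, encoding = "UTF-8-BOM")
clean <- readLines(con)
close(con)
clean[1]
## [1] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
```

A plain readLines(tmp) would instead leave the mark glued to the front of the first element, which is exactly the stray character discussed below.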
On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help <r-help@r-project.org> wrote:
>The pattern I gave worked for the lines that you originally showed from the
>data file ('a'), before you put commas into them. If the name is either of
>the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed to
>something like "(<[^>]*>|[*]{3})".
>
>The " " at the start of the imported data may come from the byte order
>mark that Windows apps like to put at the front of a text file in UTF-8 or
>UTF-16 format.
>
>Bill Dunlap
>TIBCO Software
>wdunlap tibco.com
>
>On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <
>michael.p.boulin...@gmail.com> wrote:
>
>> This seemed to work:
>>
>> > a <- readLines("hangouts-conversation-6.csv.txt")
>> > b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
>> > b[1:84]
>>
>> And the first 84 lines look like this:
>>
>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>
>> Then they transition to the commas:
>>
>> > b[84:100]
>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>> [2] "2016-07-01,02:50:35,<John Doe>,hey"
>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
>>
>> Even the strange bit on line 6347 was caught by this:
>>
>> > b[6346:6348]
>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
>>
>> Perhaps most awesomely, the code catches spaces that are interposed
>> into the comment itself:
>>
>> > b[4]
>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
>> > b[85]
>> [1] "2016-07-01,02:50:35,<John Doe>,hey"
>>
>> Notice whether there is a space after the "hey" or not.
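Folding Bill's suggested alternation into the substitution might look like this (a sketch; the two sample lines are taken from the output quoted above):

```r
## Sketch: extend the third field of the pattern so that both "<name>"
## chat lines and "***" status lines get comma-delimited.
a <- c("2016-06-28 21:02:28 *** Jane Doe started a video chat",
       "2016-07-01 02:50:35 <John Doe> hey")
b <- sub("^(.{10}) (.{8}) (<[^>]*>|[*]{3}) (.+$)", "\\1,\\2,\\3,\\4", a)
b
## [1] "2016-06-28,21:02:28,***,Jane Doe started a video chat"
## [2] "2016-07-01,02:50:35,<John Doe>,hey"
```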
>>
>> These are the first two lines:
>>
>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
>> [2] "2016-01-27,09:15:20,<Jane
>> Doe>,
>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf
>> "
>>
>> So, who knows what happened with the stray character at the beginning
>> of [1] directly above. But notice how there are no commas in [1] while
>> they appear in [2]. I don't see why really long ones like [2] directly
>> above would be a problem, were they to be translated into a csv or
>> data frame column.
>>
>> Now, with the commas in there, couldn't we write this into a csv or a
>> data.frame? Some of this data will end up being garbage, I imagine.
>> Like in [2] directly above. Or with [83] and [84] at the top of this
>> discussion post/email. Embarrassingly, I've been trying to convert
>> this into a data.frame or csv but I can't manage to. I've been using
>> the write.csv function, but I don't think I've been getting the
>> arguments correct.
>>
>> At the end of the day, I would like a data.frame and/or csv with the
>> following four columns: date, time, person, comment.
>>
>> I tried this, too:
>>
>> > c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>> +          a, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
>> +          What=""))
>>
>> But all I got was this:
>>
>> > c[1:100, ]
>>    When  Who What
>> 1  <NA> <NA> <NA>
>> 2  <NA> <NA> <NA>
>> 3  <NA> <NA> <NA>
>> 4  <NA> <NA> <NA>
>> 5  <NA> <NA> <NA>
>> 6  <NA> <NA> <NA>
>>
>> It seems to have caught nothing.
>>
>> > unique(c)
>>    When  Who What
>> 1  <NA> <NA> <NA>
>>
>> But I like that it converted into columns. That's a really great
>> format. With a little tweaking, it'd be a great code for this data
>> set.
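A sketch of the data.frame/csv step asked about above: read.csv() can parse a character vector directly via its text argument, and write.csv() saves the result. The two-element 'b' here is a hypothetical stand-in for the full comma-separated vector.

```r
## Hypothetical stand-in for the comma-delimited lines built earlier.
b <- c("2016-07-01,02:50:35,<John Doe>,hey",
       "2016-10-21,10:56:37,<John Doe>,Admit#8242")

## quote = "" and comment.char = "" keep stray quotation marks and "#"
## (as in Admit#8242) from being treated specially.
d <- read.csv(text = b, header = FALSE, quote = "", comment.char = "",
              col.names = c("date", "time", "person", "comment"),
              stringsAsFactors = FALSE)
d$comment          # "hey" and "Admit#8242"

## Then write it out (temp dir used here for illustration):
write.csv(d, file.path(tempdir(), "hangouts-conversation.csv"),
          row.names = FALSE)
```

One caveat: a comment that itself contains a comma would spill into a fifth column; splitting on the full pattern with strcapture(), as Bill suggests, avoids that.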
>>
>> Michael
>>
>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help
>> <r-help@r-project.org> wrote:
>> >
>> > Consider using readLines() and strcapture() for reading such a file. E.g.,
>> > suppose readLines(file) produced a character vector like
>> >
>> > x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
>> >        "2016-10-21 10:56:29 <John Doe> John_Doe",
>> >        "2016-10-21 10:56:37 <John Doe> Admit#8242",
>> >        "October 23, 1819 12:34 <Jane Eyre> I am not an angel")
>> >
>> > Then you can make a data.frame with columns When, Who, and What by
>> > supplying a pattern containing three parenthesized capture expressions:
>> > > z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
>> > [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>> >          x, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
>> >          What=""))
>> > > str(z)
>> > 'data.frame': 4 obs. of 3 variables:
>> >  $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21
>> > 10:56:37" NA
>> >  $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA
>> >  $ What: chr "What's your login" "John_Doe" "Admit#8242" NA
>> >
>> > Lines that don't match the pattern result in NAs - you might make a second
>> > pass over the corresponding elements of x with a new pattern.
>> >
>> > You can convert the When column from character to time with as.POSIXct().
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>> >
>> > On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsem...@comcast.net>
>> > wrote:
>> >
>> > > On 5/16/19 3:53 PM, Michael Boulineau wrote:
>> > > > OK.
>> > > > So, I named the object test and then checked the 6347th item
>> > > >
>> > > >> test <- readLines("hangouts-conversation.txt")
>> > > >> test[6347]
>> > > > [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
>> > > >
>> > > > Perhaps where it was getting screwed up is: since the end of this is a
>> > > > number (8242), then, given that there's no space between the number
>> > > > and what ought to be the next row, R didn't know where to draw the
>> > > > line. Sure enough, it looks like this when I go to the original file
>> > > > and Ctrl-F "#8242"
>> > > >
>> > > > 2016-10-21 10:35:36 <Jane Doe> What's your login
>> > > > 2016-10-21 10:56:29 <John Doe> John_Doe
>> > > > 2016-10-21 10:56:37 <John Doe> Admit#8242
>> > >
>> > > An octothorpe ('#') is by default treated as starting a comment, so the
>> > > rest of the line is dropped. You can prevent that interpretation with a
>> > > suitable choice of parameters to `read.table` or `read.csv`
>> > > (comment.char = ""). I don't understand why that should cause any error
>> > > or a failure to match that pattern.
>> > >
>> > > > 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
>> > > >
>> > > > Again, it doesn't look like that in the file. Gmail automatically
>> > > > formats it like that when I paste it in. More to the point, it looks
>> > > > like
>> > > >
>> > > > 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29
>> > > > <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21
>> > > > 11:00:13 <Jane Doe> Okay so you have a discussion
>> > > >
>> > > > Notice Admit#82422016. So there's that.
>> > > >
>> > > > Then I built object test2.
>> > > >
>> > > > test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
>> > > >
>> > > > This worked for 84 lines, then this happened.
>> > >
>> > > It may have done something but as you later discovered my first code for
>> > > the pattern was incorrect. I had tested it (and pasted in the results of
>> > > the test).
>> > > The way to refer to a capture group is with backslashes
>> > > before the numbers, not forward slashes. Try this:
>> > >
>> > > > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>> > > > newvec
>> > >  [1] "2016-07-01,02:50:35,<john>,hey"
>> > >  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>> > >  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>> > >  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>> > >  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>> > >  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>> > >  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>> > >  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>> > >  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>> > > [10] "2016-07-01 02:58:56 <jone>"
>> > > [11] "2016-07-01 02:59:34 <jane>"
>> > > [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>> > >
>> > > I made note of the fact that the 10th and 11th lines had no commas.
>> > >
>> > > >> test2[84]
>> > > > [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>> > >
>> > > That line didn't have any "<" so wasn't matched.
>> > >
>> > > You could remove all non-matching lines for the pattern
>> > >
>> > > dates<space>times<space>"<"<name>">"<space><anything>
>> > >
>> > > with:
>> > >
>> > > chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)]
>> > >
>> > > Do read:
>> > >
>> > > ?read.csv
>> > > ?regex
>> > >
>> > > --
>> > > David
>> > >
>> > > >> test2[85]
>> > > > [1] "//1,//2,//3,//4"
>> > > >> test[85]
>> > > > [1] "2016-07-01 02:50:35 <John Doe> hey"
>> > > >
>> > > > Notice how I toggled back and forth between test and test2 there.
>> > > > So,
>> > > > whatever happened with the regex, it happened in the switch from 84 to
>> > > > 85, I guess. It went on like
>> > > >
>> > > >  [990] "//1,//2,//3,//4"
>> > > >  [991] "//1,//2,//3,//4"
>> > > >  [992] "//1,//2,//3,//4"
>> > > >  [993] "//1,//2,//3,//4"
>> > > >  [994] "//1,//2,//3,//4"
>> > > >  [995] "//1,//2,//3,//4"
>> > > >  [996] "//1,//2,//3,//4"
>> > > >  [997] "//1,//2,//3,//4"
>> > > >  [998] "//1,//2,//3,//4"
>> > > >  [999] "//1,//2,//3,//4"
>> > > > [1000] "//1,//2,//3,//4"
>> > > >
>> > > > up until line 1000, then I reached max.print.
>> > > >
>> > > > Michael
>> > > >
>> > > > On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsem...@comcast.net>
>> > > wrote:
>> > > >>
>> > > >> On 5/16/19 12:30 PM, Michael Boulineau wrote:
>> > > >>> Thanks for this tip on etiquette, David. I will be sure and not do
>> > > >>> that again.
>> > > >>>
>> > > >>> I tried read.fwf (from the utils package), with code like this:
>> > > >>>
>> > > >>> d <- read.fwf("hangouts-conversation.txt",
>> > > >>>               widths = c(10, 10, 20, 40),
>> > > >>>               col.names = c("date", "time", "person", "comment"),
>> > > >>>               strip.white = TRUE)
>> > > >>>
>> > > >>> But it threw this error:
>> > > >>>
>> > > >>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
>> > > >>>   line 6347 did not have 4 elements
>> > > >>
>> > > >> So what does line 6347 look like? (Use `readLines` and print it out.)
>> > > >>
>> > > >>> Interestingly, though, the error only happened when I increased the
>> > > >>> width size. But I had to increase the size, or else I couldn't "see"
>> > > >>> anything. The comment column was so small that nothing was being
>> > > >>> captured by the width of the column, so to speak.
>> > > >>>
>> > > >>> It seems like what's throwing me is that there's no comma that
>> > > >>> demarcates the end of the text proper. For example:
>> > > >>
>> > > >> Not sure why you thought there should be a comma.
>> > > >> Lines usually end
>> > > >> with a <cr> and/or a <lf>.
>> > > >>
>> > > >> Once you have the raw text in a character vector from `readLines` named,
>> > > >> say, 'chrvec', then you could selectively substitute commas for spaces
>> > > >> with regex. (Now that you no longer desire to remove the dates and times.)
>> > > >>
>> > > >> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
>> > > >>
>> > > >> This will not do any replacements when the pattern is not matched. See
>> > > >> this test:
>> > > >>
>> > > >> > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>> > > >> > newvec
>> > > >>  [1] "2016-07-01,02:50:35,<john>,hey"
>> > > >>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>> > > >>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>> > > >>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>> > > >>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>> > > >>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>> > > >>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>> > > >>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>> > > >>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>> > > >> [10] "2016-07-01 02:58:56 <jone>"
>> > > >> [11] "2016-07-01 02:59:34 <jane>"
>> > > >> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>> > > >>
>> > > >> You should probably remove the "empty comment" lines.
>> > > >>
>> > > >> --
>> > > >> David.
>> > > >>
>> > > >>> 2016-07-01 15:34:30 <John Doe> Lame.
>> > > >>> We were in a starbucks2016-07-01
>> > > >>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
>> > > >>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
>> > > >>> lots of Starbucks in my day2016-07-01 15:35:47
>> > > >>>
>> > > >>> It was interesting, too: when I pasted the text into the email, it
>> > > >>> self-formatted into the way I wanted it to look. I had to manually
>> > > >>> make it look like it does above, since that's the way that it looks in
>> > > >>> the txt file. I wonder if it's being organized by XML or something.
>> > > >>>
>> > > >>> Anyways, there's always a space between the two angle brackets, just
>> > > >>> like there is right now: <John Doe> See. Space. And there's always a
>> > > >>> space between the date and time. Like this: 2016-07-01 15:34:30 See.
>> > > >>> Space. But there's never a space between the end of the comment and
>> > > >>> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
>> > > >>> See. starbucks and 2016 are smooshed together.
>> > > >>>
>> > > >>> This code is also on the table right now too:
>> > > >>>
>> > > >>> a <- read.table("E:/working
>> > > >>> directory/-189/hangouts-conversation2.txt", quote="\"",
>> > > >>> comment.char="", fill=TRUE)
>> > > >>>
>> > > >>> h <- cbind(hangouts.conversation2[, 1:2], hangouts.conversation2[, 3:5],
>> > > >>> hangouts.conversation2[, 6:9])
>> > > >>>
>> > > >>> aa <- gsub("[^[:digit:]]", "", h)
>> > > >>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))  # str_extract is from stringr
>> > > >>>
>> > > >>> Those last lines are a work in progress. I wish I could import a
>> > > >>> picture of what it looks like when it's translated into a data frame.
>> > > >>> The fill=TRUE helped to get the data into a table that kind of sort of
>> > > >>> works, but the comments keep bleeding into the date and time column.
>> > > >>> It's like
>> > > >>>
>> > > >>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
>> > > >>> over there
>> > > >>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
>> > > >>>
>> > > >>> And then, maybe, the "Seriously" will be in a column all to itself, as
>> > > >>> will the "I've" and the "never", etc.
>> > > >>>
>> > > >>> I will use a regular expression if I have to, but it would be nice to
>> > > >>> keep the dates and times on there. Originally, I thought they were
>> > > >>> meaningless, but I've since changed my mind on that count. The time of
>> > > >>> day isn't so important. But, especially since, say, Gmail itself knows
>> > > >>> how to quickly recognize what it is, I know it can be done. I know
>> > > >>> this data has structure to it.
>> > > >>>
>> > > >>> Michael
>> > > >>>
>> > > >>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net> wrote:
>> > > >>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
>> > > >>>>> I have a wild and crazy text file, the head of which looks like this:
>> > > >>>>>
>> > > >>>>> 2016-07-01 02:50:35 <john> hey
>> > > >>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
>> > > >>>>> 2016-07-01 02:51:45 <john> thinking about my boo
>> > > >>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
>> > > >>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
>> > > >>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
>> > > >>>>> 2016-07-01 02:54:17 <john> just know it's london
>> > > >>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
>> > > >>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
>> > > >>>>> 2016-07-01 02:58:56 <jone>
>> > > >>>>> 2016-07-01 02:59:34 <jane>
>> > > >>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>> > > >>>>
>> > > >>>> Looks entirely not-"crazy". Typical log file format.
>> > > >>>>
>> > > >>>> Two possibilities: 1) Use `read.fwf` (in the utils package); 2) Use regex
>> > > >>>> (i.e. the sub function) to strip everything up to the "<". Read
>> > > >>>> `?regex`. Since "<" is not a metacharacter you could use a pattern
>> > > >>>> ".+<" and replace with "".
>> > > >>>>
>> > > >>>> And do read the Posting Guide. Cross-posting to StackOverflow and R-help,
>> > > >>>> at least within hours of each other, is considered poor manners.
>> > > >>>>
>> > > >>>> --
>> > > >>>> David.
>> > > >>>>
>> > > >>>>> It goes on for a while. It's a big file. But I feel like it's going to
>> > > >>>>> be difficult to annotate with the coreNLP library or package. I'm
>> > > >>>>> doing natural language processing. In other words, I'm curious as to
>> > > >>>>> how I would shave off the dates, that is, to make it look like:
>> > > >>>>>
>> > > >>>>> <john> hey
>> > > >>>>> <jane> waiting for plane to Edinburgh
>> > > >>>>> <john> thinking about my boo
>> > > >>>>> <jane> nothing crappy has happened, not really
>> > > >>>>> <john> plane went by pretty fast, didn't sleep
>> > > >>>>> <jane> no idea what time it is or where I am really
>> > > >>>>> <john> just know it's london
>> > > >>>>> <jane> you are probably asleep
>> > > >>>>> <jane> I hope fish was fishy in a good eay
>> > > >>>>> <jone>
>> > > >>>>> <jane>
>> > > >>>>> <john> British security is a little more rigorous...
>> > > >>>>>
>> > > >>>>> To be clear, then, I'm trying to clean a large text file by writing a
>> > > >>>>> regular expression such that I create a new object with no numbers or
>> > > >>>>> dates.
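A sketch of the prefix-stripping idea suggested above. One wrinkle: replacing ".+<" with "" would also delete the "<" the pattern consumed, so this version puts it back. It assumes exactly one "<" per line (the greedy ".+" would otherwise run to the last one). Sample lines are taken from the file head quoted above.

```r
## Sketch: strip the date/time prefix, keeping the "<name>" tag.
x <- c("2016-07-01 02:50:35 <john> hey",
       "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh")
stripped <- sub(".+<", "<", x)
stripped
## [1] "<john> hey"
## [2] "<jane> waiting for plane to Edinburgh"
```

A comment that itself contained a "<" would be over-stripped by the greedy match; anchoring on the fixed-width prefix instead, e.g. sub("^.{20}", "", x), sidesteps that.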
>> > > >>>>>
>> > > >>>>> Michael
>> > > >>>>>
>> > > >>>>> ______________________________________________
>> > > >>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > > >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> > > >>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > > >>>>> and provide commented, minimal, self-contained, reproducible code.
--
Sent from my phone. Please excuse my brevity.