For context:

> In gsub(b, "\\1<\\2> ", a) the work is done by the backreferences \\1 and 
> \\2. The expression says:
> Substitute ALL of the match with the first captured expression, then " <", 
> then the second captured expression, then "> ". The rest of the line is not 
> substituted and appears as-is.

Back to me: I guess what's giving me trouble is where to draw the line
in terms of the end or edge of the expression. Given the code, then,

> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8")
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1<\\2> ", a)

To me, it would seem as though \\1 refers back to ^([0-9-]{10} [0-9:]{8} ),
that is, as though that is the first captured expression, since
[0-9-]{10} [0-9:]{8} is enclosed in parentheses. Then it would seem as
though [*]{3} is the second expression, and (\\w+ \\w+) is the third.
According to this (admittedly wrong) logic, it would seem as though the
<> would go around the time--like

> 2016-03-20 <19:29:37> *** Jane Doe started a video chat

The backreferences here recall David's code from earlier:

> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)

There, commas were put between everything, and you can see the
edges of the expressions very well. ^(.{10}) = first. (.{8}) = second.
(<.+>) = third. (.+$) = fourth. So, by the same logic, it would seem
as though in

> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"

that ^([0-9-]{10} [0-9:]{8} ) is first, that [*]{3} is second, and
that (\\w+ \\w+) is third.
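To see those edges concretely, here's a quick sketch I could run on a
made-up sample line (the string below is my own, not from the file):

```r
# A made-up line in the same shape as the transcript data
chrvec <- "2016-10-21 10:56:29 <John Doe> John_Doe"

# Four parenthesized groups, referenced as \1 through \4 in the replacement
sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
# [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
```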

But, if Boris is right, and he is, obviously, then it would have
to be the case that this entire thing, namely, ^([0-9-]{10} [0-9:]{8}
)[*]{3}, is the first expression, since only if that were true would
the <> be able to go around the names, as in

[3] "2016-01-27 09:15:20 <Jane Doe> Hey "

Again, so 2016-01-27 09:15:20 would have to be an entire unit, an
expression. So I guess what I don't understand is how ^([0-9-]{10}
[0-9:]{8} )[*]{3} can be an entire expression, although my hunch would
be that it has something to do with the ^ or with the space after the
} and before the (, as in

> {3} (\\w+
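One way to test my hunch, I suppose, would be a sketch like this on a
made-up line:

```r
a <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"
b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"

# If only the parenthesized parts get numbers, then \1 is the
# (date time ) group and \2 is the (word word) group; [*]{3} has no
# parentheses, so it is matched (and therefore replaced) but not captured.
gsub(b, "\\1<\\2> ", a)
# [1] "2016-03-20 19:29:37 <Jane Doe>  started a video chat"
```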

Back to earlier:

> The rest of the line is not substituted and appears as-is.

Is that due to the space after the \\2 in

> "\\1<\\2> "

Notice the space after the > and before the closing quote.
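A sketch of the two space placements Boris described, again on a
made-up line:

```r
a <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"

# Space captured inside the first group, none added in the replacement
b1 <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
r1 <- gsub(b1, "\\1<\\2> ", a)

# Space left outside the group, added back in the replacement instead
b2 <- "^([0-9-]{10} [0-9:]{8}) [*]{3} (\\w+ \\w+)"
r2 <- gsub(b2, "\\1 <\\2> ", a)

identical(r1, r2)
# [1] TRUE
```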

Michael

On Sun, May 19, 2019 at 2:31 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
>
> Inline ...
>
> > On 2019-05-19, at 13:56, Michael Boulineau <michael.p.boulin...@gmail.com> 
> > wrote:
> >
> >> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> >
> > so the ^ signals that the regex BEGINS with a number (that could be
> > any number, 0-9) that is only 10 characters long (then there's the
> > dash in there, too, with the 0-9-, which I assume enabled the regex to
> > grab the - that's between the numbers in the date)
>
> That's right. Note that within a "character class" the hyphen can have two 
> meanings: normally it defines a range of characters, but if it appears as the 
> last character before "]" it is a literal hyphen.
>
> > , followed by a
> > single space, followed by a unit that could be any number, again, but
> > that is only 8 characters long this time. For that one, it will
> > include the colon, hence the 9:, although for that one ([0-9:]{8} ),
>
> Right.
>
>
> > I
> > don't get why the space is on the inside in that one, after the {8},
>
> The space needs to be preserved between the time and the name. I wrote
> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" # space in the first 
> captured expression
> c <- gsub(b, "\\1<\\2> ", a)
>  ... but I could have written
> b <- "^([0-9-]{10} [0-9:]{8}) [*]{3} (\\w+ \\w+)"
> c <- gsub(b, "\\1 <\\2> ", a)  # space in the substituted string
> ... same result
>
>
> > whereas the space is on the outside with the other one ^([0-9-]{10} ,
> > directly after the {10}. Why is that?
>
> In the second case, I capture without a space, because I don't want the space 
> in the results, after the time.
>
>
> >
> > Then three *** [*]{3}, then the (\\w+ \\w+)", which Boris explained so
> > well above. I guess I still don't get why this one seemed to have
> > deleted the *** out of the mix, plus I still don't know why it didn't
> > remove the *** from the first one.
>
> Because the entire first line was not matched since it had a malformed 
> character preceding the date.
>
> >
> > 2016-03-20 19:29:37 *** Jane Doe started a video chat
> > 2016-03-20 19:30:35 *** John Doe ended a video chat
> > 2016-04-02 12:59:36 *** Jane Doe started a video chat
> > 2016-04-02 13:00:43 *** John Doe ended a video chat
> > 2016-04-02 13:01:08 *** Jane Doe started a video chat
> > 2016-04-02 13:01:41 *** John Doe ended a video chat
> > 2016-04-02 13:03:51 *** John Doe started a video chat
> > 2016-04-02 13:06:35 *** John Doe ended a video chat
> >
> > This is a random sample from the beginning of the txt file with no
> > edits. The ***s were deleted, all but the first one, the one that had
> > the "" but that was taken out by the encoding = "UTF-8". I know that
> > the function was c <- gsub(b, "\\1<\\2> ", a), so it had a gsub () on
> > there, the point of which is to do substitution work.
> >
> > Oh, I get it, I think. The \\1<\\2> in the gsub () puts the <> around
> > the names, so that it's consistent with the rest of the data, so that
> > the names in the text above that aren't enclosed in the <> are
> > enclosed like the rest of them. But I still don't get why or how the
> > gsub () replaced the *** with the <>...
>
> In gsub(b, "\\1<\\2> ", a) the work is done by the backreferences \\1 and 
> \\2. The expression says:
> Substitute ALL of the match with the first captured expression, then " <", 
> then the second captured expression, then "> ". The rest of the line is not 
> substituted and appears as-is.
>
>
> >
> > This one is more straightforward.
> >
> >> d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
> >
> > any number with - for 10 characters, followed by a space. Oh, there's
> > no space in this one ([0-9:]{8}), after the {8}. Hu. So, then, any
> > number with : for 8 characters, followed by any two words separated by
> > a space and enclosed in <>. And then the \\s* is followed by a single
> > space? Or maybe it puts space on both sides (on the side of the #s to
> > the left, and then the comment to the right). The (.+)$ is anything
> > whatsoever until the end.
>
> \s is the metacharacter for "whitespace". \s* means zero or more whitespace. 
> I'm matching that OUTSIDE of the captured expression, to remove any leading 
> spaces from the data that goes into the data frame.
>
>
> Cheers,
> Boris
>
>
>
>
> >
> > Michael
> >
> >
> > On Sun, May 19, 2019 at 4:37 AM Boris Steipe <boris.ste...@utoronto.ca> 
> > wrote:
> >>
> >> Inline
> >>
> >>
> >>
> >>> On 2019-05-18, at 20:34, Michael Boulineau 
> >>> <michael.p.boulin...@gmail.com> wrote:
> >>>
> >>> It appears to have worked, although there were three little quirks.
> >>> The ; close(con); rm(con) didn't work for me; the first row of the
> >>> data.frame was all NAs, when all was said and done;
> >>
> >> You will get NAs for lines that can't be matched to the regular 
> >> expression. That's a good thing, it allows you to test whether your 
> >> assumptions were valid for the entire file:
> >>
> >> # number of failed strcapture()
> >> sum(is.na(e$date))
> >>
> >>
> >>> and then there
> >>> were still three *** on the same line where the "" was apparently
> >>> deleted.
> >>
> >> This is a sign that something else happened with the line that prevented 
> >> the regex from matching. In that case you need to investigate more. I see 
> >> an invalid multibyte character at the beginning of the line you posted 
> >> below.
> >>
> >>>
> >>>> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8")
> >>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> >>>> c <- gsub(b, "\\1<\\2> ", a)
> >>>> head (c)
> >>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> >>> [2] "2016-01-27 09:15:20 <Jane Doe>
> >>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf";
> >>
> >> [...]
> >>
> >>> But, before I do anything else, I'm going to study the regex in this
> >>> particular code. For example, I'm still not sure why there has to the
> >>> second \\w+ in the (\\w+ \\w+). Little things like that.
> >>
> >> \w is the metacharacter for alphanumeric characters, \w+ designates 
> >> something we could call a word. Thus \w+ \w+ are two words separated by a 
> >> single blank. This corresponds to your example, but, as I wrote 
> >> previously, you need to think very carefully whether this covers all 
> >> possible cases (Could there be only one word? More than one blank? Could 
> >> letters be separated by hyphens or periods?) In most cases we could have 
> >> more robustly matched everything between "<" and ">" (taking care to test 
> >> what happens if the message contains those characters). But for the video 
> >> chat lines we need to make an assumption about what is name and what is 
> >> not. If "started a video chat" is the only possibility in such lines, you 
> >> can use this information instead. If there are other possibilities, you 
> >> need a different strategy. In NLP there is no one-approach-fits-all.
> >>
> >> To validate the structure of the names in your transcripts, you can look at
> >>
> >> patt <- " <.+?> "   # " <any string, not greedy> "
> >> m <- regexpr(patt, c)
> >> unique(regmatches(c, m))
> >>
> >>
> >>
> >> B.
> >>
> >>
> >>
> >>>
> >>> Michael
> >>>
> >>>
> >>> On Sat, May 18, 2019 at 4:30 PM Boris Steipe <boris.ste...@utoronto.ca> 
> >>> wrote:
> >>>>
> >>>> This works for me:
> >>>>
> >>>> # sample data
> >>>> c <- character()
> >>>> c[1] <- "2016-01-27 09:14:40 <Jane Doe> started a video chat"
> >>>> c[2] <- "2016-01-27 09:15:20 <Jane Doe> 
> >>>> https://lh3.googleusercontent.com/";
> >>>> c[3] <- "2016-01-27 09:15:20 <Jane Doe> Hey "
> >>>> c[4] <- "2016-01-27 09:15:22 <John Doe>  ended a video chat"
> >>>> c[5] <- "2016-01-27 21:07:11 <Jane Doe>  started a video chat"
> >>>> c[6] <- "2016-01-27 21:26:57 <John Doe>  ended a video chat"
> >>>>
> >>>>
> >>>> # regex  ^(year)       (time)      <(word word)>\\s*(string)$
> >>>> patt <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
> >>>> proto <- data.frame(date = character(),
> >>>>                   time = character(),
> >>>>                   name = character(),
> >>>>                   text = character(),
> >>>>                   stringsAsFactors = TRUE)
> >>>> d <- strcapture(patt, c, proto)
> >>>>
> >>>>
> >>>>
> >>>>       date     time     name                               text
> >>>> 1 2016-01-27 09:14:40 Jane Doe               started a video chat
> >>>> 2 2016-01-27 09:15:20 Jane Doe https://lh3.googleusercontent.com/
> >>>> 3 2016-01-27 09:15:20 Jane Doe                               Hey
> >>>> 4 2016-01-27 09:15:22 John Doe                 ended a video chat
> >>>> 5 2016-01-27 21:07:11 Jane Doe               started a video chat
> >>>> 6 2016-01-27 21:26:57 John Doe                 ended a video chat
> >>>>
> >>>>
> >>>>
> >>>> B.
> >>>>
> >>>>
> >>>>> On 2019-05-18, at 18:32, Michael Boulineau 
> >>>>> <michael.p.boulin...@gmail.com> wrote:
> >>>>>
> >>>>> Going back and thinking through what Boris and William were saying
> >>>>> (also Ivan), I tried this:
> >>>>>
> >>>>> a <- readLines ("hangouts-conversation-6.csv.txt")
> >>>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> >>>>> c <- gsub(b, "\\1<\\2> ", a)
> >>>>>> head (c)
> >>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> >>>>> [2] "2016-01-27 09:15:20 <Jane Doe>
> >>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf";
> >>>>> [3] "2016-01-27 09:15:20 <Jane Doe> Hey "
> >>>>> [4] "2016-01-27 09:15:22 <John Doe>  ended a video chat"
> >>>>> [5] "2016-01-27 21:07:11 <Jane Doe>  started a video chat"
> >>>>> [6] "2016-01-27 21:26:57 <John Doe>  ended a video chat"
> >>>>>
> >>>>> The "" is still there, since I forgot to do what Ivan had suggested, 
> >>>>> namely,
> >>>>>
> >>>>> a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding
> >>>>> = "UTF-8")); close(con); rm(con)
> >>>>>
> >>>>> But then the new code is still turning out only NAs when I apply
> >>>>> strcapture (). This was what happened next:
> >>>>>
> >>>>>> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>> +                 c, proto=data.frame(stringsAsFactors=FALSE, When="", 
> >>>>> Who="",
> >>>>> +                                     What=""))
> >>>>>> head (d)
> >>>>> When  Who What
> >>>>> 1 <NA> <NA> <NA>
> >>>>> 2 <NA> <NA> <NA>
> >>>>> 3 <NA> <NA> <NA>
> >>>>> 4 <NA> <NA> <NA>
> >>>>> 5 <NA> <NA> <NA>
> >>>>> 6 <NA> <NA> <NA>
> >>>>>
> >>>>> I've been reading up on regular expressions, too, so this code seems
> >>>>> spot on. What's going wrong?
> >>>>>
> >>>>> Michael
> >>>>>
> >>>>> On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.ste...@utoronto.ca> 
> >>>>> wrote:
> >>>>>>
> >>>>>> Don't start putting in extra commas and then reading this as csv. That 
> >>>>>> approach is broken. The correct approach is what Bill outlined: read 
> >>>>>> everything with readLines(), and then use a proper regular expression 
> >>>>>> with strcapture().
> >>>>>>
> >>>>>> You need to pre-process the object that readLines() gives you: replace 
> >>>>>> the contents of the videochat lines, and make it conform to the format 
> >>>>>> of the other lines before you process it into your data frame.
> >>>>>>
> >>>>>> Approximately something like
> >>>>>>
> >>>>>> # read the raw data
> >>>>>> tmp <- readLines("hangouts-conversation-6.csv.txt")
> >>>>>>
> >>>>>> # process all video chat lines
> >>>>>> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "  # (year time 
> >>>>>> )*** (word word)
> >>>>>> tmp <- gsub(patt, "\\1<\\2> ", tmp)
> >>>>>>
> >>>>>> # next, use strcapture()
> >>>>>>
> >>>>>> Note that this makes the assumption that your names are always exactly 
> >>>>>> two words containing only letters. If that assumption is not true, 
> >>>>>> more thought needs to go into the regex. But you can test that:
> >>>>>>
> >>>>>> patt <- " <\\w+ \\w+> "   #" <word word> "
> >>>>>> sum( ! grepl(patt, tmp))
> >>>>>>
> >>>>>> ... will give the number of lines that remain in your file that do not 
> >>>>>> have a tag that can be interpreted as "Who"
> >>>>>>
> >>>>>> Once that is fine, use Bill's approach - or a regular expression of 
> >>>>>> your own design - to create your data frame.
> >>>>>>
> >>>>>> Hope this helps,
> >>>>>> Boris
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On 2019-05-17, at 16:18, Michael Boulineau 
> >>>>>>> <michael.p.boulin...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Very interesting. I'm sure I'll be trying to get rid of the byte order
> >>>>>>> mark eventually. But right now, I'm more worried about getting the
> >>>>>>> character vector into either a csv file or data.frame; that way, I can
> >>>>>>> be able to work with the data neatly tabulated into four columns:
> >>>>>>> date, time, person, comment. I assume it's a write.csv function, but I
> >>>>>>> don't know what arguments to put in it. header=FALSE? fill=T?
> >>>>>>>
> >>>>>>> Michael
> >>>>>>>
> >>>>>>> On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller 
> >>>>>>> <jdnew...@dcn.davis.ca.us> wrote:
> >>>>>>>>
> >>>>>>>> If byte order mark is the issue then you can specify the file 
> >>>>>>>> encoding as "UTF-8-BOM" and it won't show up in your data any more.
> >>>>>>>>
> >>>>>>>> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help 
> >>>>>>>> <r-help@r-project.org> wrote:
> >>>>>>>>> The pattern I gave worked for the lines that you originally showed 
> >>>>>>>>> from
> >>>>>>>>> the
> >>>>>>>>> data file ('a'), before you put commas into them.  If the name is
> >>>>>>>>> either of
> >>>>>>>>> the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed 
> >>>>>>>>> so
> >>>>>>>>> something like "(<[^>]*>|[*]{3})".
> >>>>>>>>>
> >>>>>>>>> The " " at the start of the imported data may come from the byte
> >>>>>>>>> order
> >>>>>>>>> mark that Windows apps like to put at the front of a text file in 
> >>>>>>>>> UTF-8
> >>>>>>>>> or
> >>>>>>>>> UTF-16 format.
> >>>>>>>>>
> >>>>>>>>> Bill Dunlap
> >>>>>>>>> TIBCO Software
> >>>>>>>>> wdunlap tibco.com
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <
> >>>>>>>>> michael.p.boulin...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> This seemed to work:
> >>>>>>>>>>
> >>>>>>>>>>> a <- readLines ("hangouts-conversation-6.csv.txt")
> >>>>>>>>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
> >>>>>>>>>>> b [1:84]
> >>>>>>>>>>
> >>>>>>>>>> And the first 85 lines looks like this:
> >>>>>>>>>>
> >>>>>>>>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
> >>>>>>>>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>>>>>
> >>>>>>>>>> Then they transition to the commas:
> >>>>>>>>>>
> >>>>>>>>>>> b [84:100]
> >>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>>>>> [2] "2016-07-01,02:50:35,<John Doe>,hey"
> >>>>>>>>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
> >>>>>>>>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
> >>>>>>>>>>
> >>>>>>>>>> Even the strange bit on line 6347 was caught by this:
> >>>>>>>>>>
> >>>>>>>>>>> b [6346:6348]
> >>>>>>>>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
> >>>>>>>>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
> >>>>>>>>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
> >>>>>>>>>>
> >>>>>>>>>> Perhaps most awesomely, the code catches spaces that are interposed
> >>>>>>>>>> into the comment itself:
> >>>>>>>>>>
> >>>>>>>>>>> b [4]
> >>>>>>>>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
> >>>>>>>>>>> b [85]
> >>>>>>>>>> [1] "2016-07-01,02:50:35,<John Doe>,hey"
> >>>>>>>>>>
> >>>>>>>>>> Notice whether there is a space after the "hey" or not.
> >>>>>>>>>>
> >>>>>>>>>> These are the first two lines:
> >>>>>>>>>>
> >>>>>>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
> >>>>>>>>>> [2] "2016-01-27,09:15:20,<Jane
> >>>>>>>>>> Doe>,
> >>>>>>>>>>
> >>>>>>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf
> >>>>>>>>>> "
> >>>>>>>>>>
> >>>>>>>>>> So, who knows what happened with the "" at the beginning of [1]
> >>>>>>>>>> directly above. But notice how there are no commas in [1] but there
> >>>>>>>>>> appear in [2]. I don't see why really long ones like [2] directly
> >>>>>>>>>> above would be a problem, were they to be translated into a csv or
> >>>>>>>>>> data frame column.
> >>>>>>>>>>
> >>>>>>>>>> Now, with the commas in there, couldn't we write this into a csv 
> >>>>>>>>>> or a
> >>>>>>>>>> data.frame? Some of this data will end up being garbage, I imagine.
> >>>>>>>>>> Like in [2] directly above. Or with [83] and [84] at the top of 
> >>>>>>>>>> this
> >>>>>>>>>> discussion post/email. Embarrassingly, I've been trying to convert
> >>>>>>>>>> this into a data.frame or csv but I can't manage to. I've been 
> >>>>>>>>>> using
> >>>>>>>>>> the write.csv function, but I don't think I've been getting the
> >>>>>>>>>> arguments correct.
> >>>>>>>>>>
> >>>>>>>>>> At the end of the day, I would like a data.frame and/or csv with 
> >>>>>>>>>> the
> >>>>>>>>>> following four columns: date, time, person, comment.
> >>>>>>>>>>
> >>>>>>>>>> I tried this, too:
> >>>>>>>>>>
> >>>>>>>>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>>>>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>>>>>>> +                 a, proto=data.frame(stringsAsFactors=FALSE,
> >>>>>>>>> When="",
> >>>>>>>>>> Who="",
> >>>>>>>>>> +                                     What=""))
> >>>>>>>>>>
> >>>>>>>>>> But all I got was this:
> >>>>>>>>>>
> >>>>>>>>>>> c [1:100, ]
> >>>>>>>>>> When  Who What
> >>>>>>>>>> 1   <NA> <NA> <NA>
> >>>>>>>>>> 2   <NA> <NA> <NA>
> >>>>>>>>>> 3   <NA> <NA> <NA>
> >>>>>>>>>> 4   <NA> <NA> <NA>
> >>>>>>>>>> 5   <NA> <NA> <NA>
> >>>>>>>>>> 6   <NA> <NA> <NA>
> >>>>>>>>>>
> >>>>>>>>>> It seems to have caught nothing.
> >>>>>>>>>>
> >>>>>>>>>>> unique (c)
> >>>>>>>>>> When  Who What
> >>>>>>>>>> 1 <NA> <NA> <NA>
> >>>>>>>>>>
> >>>>>>>>>> But I like that it converted into columns. That's a really great
> >>>>>>>>>> format. With a little tweaking, it'd be a great code for this data
> >>>>>>>>>> set.
> >>>>>>>>>>
> >>>>>>>>>> Michael
> >>>>>>>>>>
> >>>>>>>>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help
> >>>>>>>>>> <r-help@r-project.org> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Consider using readLines() and strcapture() for reading such a
> >>>>>>>>> file.
> >>>>>>>>>> E.g.,
> >>>>>>>>>>> suppose readLines(files) produced a character vector like
> >>>>>>>>>>>
> >>>>>>>>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
> >>>>>>>>>>>       "2016-10-21 10:56:29 <John Doe> John_Doe",
> >>>>>>>>>>>       "2016-10-21 10:56:37 <John Doe> Admit#8242",
> >>>>>>>>>>>       "October 23, 1819 12:34 <Jane Eyre> I am not an angel")
> >>>>>>>>>>>
> >>>>>>>>>>> Then you can make a data.frame with columns When, Who, and What by
> >>>>>>>>>>> supplying a pattern containing three parenthesized capture
> >>>>>>>>> expressions:
> >>>>>>>>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> >>>>>>>>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> >>>>>>>>>>>          x, proto=data.frame(stringsAsFactors=FALSE, When="",
> >>>>>>>>> Who="",
> >>>>>>>>>>> What=""))
> >>>>>>>>>>>> str(z)
> >>>>>>>>>>> 'data.frame':   4 obs. of  3 variables:
> >>>>>>>>>>> $ When: chr  "2016-10-21 10:35:36" "2016-10-21 10:56:29"
> >>>>>>>>> "2016-10-21
> >>>>>>>>>>> 10:56:37" NA
> >>>>>>>>>>> $ Who : chr  "<Jane Doe>" "<John Doe>" "<John Doe>" NA
> >>>>>>>>>>> $ What: chr  "What's your login" "John_Doe" "Admit#8242" NA
> >>>>>>>>>>>
> >>>>>>>>>>> Lines that don't match the pattern result in NA's - you might make
> >>>>>>>>> a
> >>>>>>>>>> second
> >>>>>>>>>>> pass over the corresponding elements of x with a new pattern.
> >>>>>>>>>>>
> >>>>>>>>>>> You can convert the When column from character to time with
> >>>>>>>>> as.POSIXct().
> >>>>>>>>>>>
> >>>>>>>>>>> Bill Dunlap
> >>>>>>>>>>> TIBCO Software
> >>>>>>>>>>> wdunlap tibco.com
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius
> >>>>>>>>> <dwinsem...@comcast.net>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote:
> >>>>>>>>>>>>> OK. So, I named the object test and then checked the 6347th
> >>>>>>>>> item
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> test <- readLines ("hangouts-conversation.txt)
> >>>>>>>>>>>>>> test [6347]
> >>>>>>>>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Perhaps where it was getting screwed up is, since the end of
> >>>>>>>>> this is
> >>>>>>>>>> a
> >>>>>>>>>>>>> number (8242), then, given that there's no space between the
> >>>>>>>>> number
> >>>>>>>>>>>>> and what ought to be the next row, R didn't know where to draw
> >>>>>>>>> the
> >>>>>>>>>>>>> line. Sure enough, it looks like this when I go to the original
> >>>>>>>>> file
> >>>>>>>>>>>>> and control f "#8242"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login
> >>>>>>>>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe
> >>>>>>>>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> An octothorpe is an end of line signifier and is interpreted as
> >>>>>>>>>> allowing
> >>>>>>>>>>>> comments. You can prevent that interpretation with suitable
> >>>>>>>>> choice of
> >>>>>>>>>>>> parameters to `read.table` or `read.csv`. I don't understand why
> >>>>>>>>> that
> >>>>>>>>>>>> should cause any error or a failure to match that pattern.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Again, it doesn't look like that in the file. Gmail
> >>>>>>>>> automatically
> >>>>>>>>>>>>> formats it like that when I paste it in. More to the point, it
> >>>>>>>>> looks
> >>>>>>>>>>>>> like
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21
> >>>>>>>>> 10:56:29
> >>>>>>>>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe>
> >>>>>>>>>> Admit#82422016-10-21
> >>>>>>>>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Notice Admit#82422016. So there's that.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Then I built object test2.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4",
> >>>>>>>>> test)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This worked for 84 lines, then this happened.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It may have done something but as you later discovered my first
> >>>>>>>>> code
> >>>>>>>>>> for
> >>>>>>>>>>>> the pattern was incorrect. I had tested it (and pasted in the
> >>>>>>>>> results
> >>>>>>>>>> of
> >>>>>>>>>>>> the test) . The way to refer to a capture class is with
> >>>>>>>>> back-slashes
> >>>>>>>>>>>> before the numbers, not forward-slashes. Try this:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)",
> >>>>>>>>> "\\1,\\2,\\3,\\4",
> >>>>>>>>>> chrvec)
> >>>>>>>>>>>>> newvec
> >>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey"
> >>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened,
> >>>>>>>>> not
> >>>>>>>>>> really"
> >>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast,
> >>>>>>>>> didn't
> >>>>>>>>>> sleep"
> >>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or
> >>>>>>>>> where I am
> >>>>>>>>>>>> really"
> >>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good
> >>>>>>>>> eay"
> >>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
> >>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
> >>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little
> >>>>>>>>> more
> >>>>>>>>>>>> rigorous..."
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I made note of the fact that the 10th and 11th lines had no
> >>>>>>>>> commas.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> test2 [84]
> >>>>>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> >>>>>>>>>>>>
> >>>>>>>>>>>> That line didn't have any "<" so wasn't matched.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> You could remove all none matching lines for pattern of
> >>>>>>>>>>>>
> >>>>>>>>>>>> dates<space>times<space>"<"<name>">"<space><anything>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> with:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)]
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Do read:
> >>>>>>>>>>>>
> >>>>>>>>>>>> ?read.csv
> >>>>>>>>>>>>
> >>>>>>>>>>>> ?regex
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>>
> >>>>>>>>>>>> David
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> test2 [85]
> >>>>>>>>>>>>> [1] "//1,//2,//3,//4"
> >>>>>>>>>>>>>> test [85]
> >>>>>>>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Notice how I toggled back and forth between test and test2
> >>>>>>>>> there. So,
> >>>>>>>>>>>>> whatever happened with the regex, it happened in the switch
> >>>>>>>>> from 84
> >>>>>>>>>> to
> >>>>>>>>>>>>> 85, I guess. It went on like
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [990] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [991] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [992] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [993] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [994] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [995] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [996] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [997] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [998] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [999] "//1,//2,//3,//4"
> >>>>>>>>>>>>> [1000] "//1,//2,//3,//4"
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> up until line 1000, then I reached max.print.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Michael
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius <
> >>>>>>>>>> dwinsem...@comcast.net>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> >>>>>>>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure and
> >>>>>>>>> not do
> >>>>>>>>>>>> that again.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I tried the read.fwf from the foreign package, with a code
> >>>>>>>>> like
> >>>>>>>>>> this:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> d <- read.fwf("hangouts-conversation.txt",
> >>>>>>>>>>>>>>>               widths= c(10,10,20,40),
> >>>>>>>>>>>>>>>
> >>>>>>>>> col.names=c("date","time","person","comment"),
> >>>>>>>>>>>>>>>               strip.white=TRUE)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> But it threw this error:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote =
> >>>>>>>>> quote,
> >>>>>>>>>> dec
> >>>>>>>>>>>> = dec,  :
> >>>>>>>>>>>>>>> line 6347 did not have 4 elements
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So what does line 6347 look like? (Use `readLines` and print
> >>>>>>>>> it
> >>>>>>>>>> out.)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Interestingly, though, the error only happened when I
> >>>>>>>>> increased the
> >>>>>>>>>>>>>>> width size. But I had to increase the size, or else I
> >>>>>>>>> couldn't
> >>>>>>>>>> "see"
> >>>>>>>>>>>>>>> anything.  The comment was so small that nothing was being
> >>>>>>>>>> captured by
> >>>>>>>>>>>>>>> the size of the column. so to speak.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It seems like what's throwing me is that there's no comma
> >>>>>>>>> that
> >>>>>>>>>>>>>>> demarcates the end of the text proper. For example:
> >>>>>>>>>>>>>> Not sure why you thought there should be a comma. Lines
> >>>>>>>>> usually end
> >>>>>>>>>>>>>> with  <cr> and or a <lf>.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Once you have the raw text in a character vector from
> >>>>>>>>> `readLines`
> >>>>>>>>>> named,
> >>>>>>>>>>>>>> say, 'chrvec', then you could selectively substitute commas
> >>>>>>>>> for
> >>>>>>>>>> spaces
> >>>>>>>>>>>>>> with regex. (Now that you no longer desire to remove the dates
> >>>>>>>>> and
> >>>>>>>>>>>> times.)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This will not do any replacements when the pattern is not
> >>>>>>>>> matched.
> >>>>>>>>>> See
> >>>>>>>>>>>>>> this test:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)",
> >>>>>>>>> "\\1,\\2,\\3,\\4",
> >>>>>>>>>>>> chrvec)
> >>>>>>>>>>>>>>> newvec
> >>>>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey"
> >>>>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>>>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>>>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>>>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>>>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>>>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>>>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>>>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >>>>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
> >>>>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
> >>>>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You should probably remove the "empty comment" lines.
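A minimal sketch of that last step, using a tiny stand-in for the `chrvec` that `readLines` would produce; the pattern is the one quoted above, and the filtering step is one assumption about how to "remove" the empty-comment lines:

```r
# Tiny stand-in for the 'chrvec' built with readLines() earlier in the thread.
chrvec <- c("2016-07-01 02:50:35 <john> hey",
            "2016-07-01 02:58:56 <jone>")          # an "empty comment" line

pat    <- "^(.{10}) (.{8}) (<.+>) (.+$)"
newvec <- sub(pat, "\\1,\\2,\\3,\\4", chrvec)

# Lines that never matched pass through sub() unchanged, so they can be
# dropped by keeping only the elements where the pattern actually matched:
newvec <- newvec[grepl(pat, chrvec)]
# newvec is now just "2016-07-01,02:50:35,<john>,hey"
```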
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> David.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a
> >>>>>>>>>>>>>>> starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's
> >>>>>>>>>>>>>>> interesting2016-07-01 15:35:09 <Jane Doe> You must want
> >>>>>>>>>>>>>>> coffees2016-07-01 15:35:25 <John Doe> There was lots of
> >>>>>>>>>>>>>>> Starbucks in my day2016-07-01 15:35:47
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It was interesting, too, when I pasted the text into the
> >>>>>>>>>>>>>>> email, it self-formatted into the way I wanted it to look. I
> >>>>>>>>>>>>>>> had to manually make it look like it does above, since that's
> >>>>>>>>>>>>>>> the way it looks in the txt file. I wonder if it's being
> >>>>>>>>>>>>>>> organized by XML or something.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Anyways, there's always a space between the two sideways
> >>>>>>>>>>>>>>> carrots, just like there is right now: <John Doe> See. Space.
> >>>>>>>>>>>>>>> And there's always a space between the date and time. Like
> >>>>>>>>>>>>>>> this. 2016-07-01 15:34:30 See. Space. But there's never a
> >>>>>>>>>>>>>>> space between the end of the comment and the next date. Like
> >>>>>>>>>>>>>>> this: We were in a starbucks2016-07-01 15:35:02 See.
> >>>>>>>>>>>>>>> starbucks and 2016 are smooshed together.
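Since the records run together with no separator, one possible repair (a sketch, not from the thread; `txt`, `stamp`, and `records` are illustrative names) is to break the text before each timestamp, assuming every record starts with a full "YYYY-MM-DD HH:MM:SS" stamp:

```r
# One smooshed stretch of the export, as described above.
txt <- "Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm"

# Insert a newline before every full timestamp, then split on it.
stamp   <- "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})"
marked  <- gsub(stamp, "\n\\1", txt)
records <- strsplit(marked, "\n", fixed = TRUE)[[1]]
# records[2] is "2016-07-01 15:35:02 <Jane Doe> Hmm"
```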
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This code is also on the table right now.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> a <- read.table("E:/working
> >>>>>>>>>>>>>>> directory/-189/hangouts-conversation2.txt", quote="\"",
> >>>>>>>>>>>>>>> comment.char="", fill=TRUE)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>> h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9])
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> aa<-gsub("[^[:digit:]]","",h)
> >>>>>>>>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Those last lines are a work in progress. I wish I could
> >>>>>>>>>>>>>>> import a picture of what it looks like when it's translated
> >>>>>>>>>>>>>>> into a data frame. The fill=TRUE helped to get the data into a
> >>>>>>>>>>>>>>> table that kind of sort of works, but the comments keep
> >>>>>>>>>>>>>>> bleeding into the date and time column. It's like
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
> >>>>>>>>>>>>>>> over               there
> >>>>>>>>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> And then, maybe, the "seriously" will be in a column all to
> >>>>>>>>>>>>>>> itself, as will be the "I've" and the "never" etc.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I will use a regular expression if I have to, but it would be
> >>>>>>>>>>>>>>> nice to keep the dates and times on there. Originally, I
> >>>>>>>>>>>>>>> thought they were meaningless, but I've since changed my mind
> >>>>>>>>>>>>>>> on that count. The time of day isn't so important. But,
> >>>>>>>>>>>>>>> especially since, say, Gmail itself knows how to quickly
> >>>>>>>>>>>>>>> recognize what it is, I know it can be done. I know this data
> >>>>>>>>>>>>>>> has structure to it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Michael
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius
> >>>>>>>>>>>>>>> <dwinsem...@comcast.net> wrote:
> >>>>>>>>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>>>>>>>>>>>>>>>> I have a wild and crazy text file, the head of which looks
> >>>>>>>>>>>>>>>>> like this:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
> >>>>>>>>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>>>>>>>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>>>>>>>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>>>>>>>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>>>>>>>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
> >>>>>>>>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>>>>>>>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>>>>>>>> 2016-07-01 02:58:56 <jone>
> >>>>>>>>>>>>>>>>> 2016-07-01 02:59:34 <jane>
> >>>>>>>>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >>>>>>>>>>>>>>>> Looks entirely not-"crazy". Typical log file format.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Two possibilities: 1) Use `read.fwf` (in base R's utils
> >>>>>>>>>>>>>>>> package); 2) Use regex (i.e. the sub function) to strip
> >>>>>>>>>>>>>>>> everything up to the "<". Read `?regex`. Since "<" is not a
> >>>>>>>>>>>>>>>> metacharacter, you could use the pattern ".+<" and replace
> >>>>>>>>>>>>>>>> with "".
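A sketch of that suggestion, with one caveat: replacing ".+<" with "" also consumes the "<" itself, and a greedy ".+" runs to the last "<" on a line. Matching only non-"<" characters and putting the "<" back in the replacement keeps the speaker tag intact:

```r
x <- c("2016-07-01 02:50:35 <john> hey",
       "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh")

# "[^<]+" stops at the first "<"; re-inserting "<" preserves the tag.
sub("^[^<]+<", "<", x)
# [1] "<john> hey"  "<jane> waiting for plane to Edinburgh"
```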
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> And do read the Posting Guide. Cross-posting to StackOverflow
> >>>>>>>>>>>>>>>> and Rhelp, at least within hours of each other, is considered
> >>>>>>>>>>>>>>>> poor manners.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> David.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> It goes on for a while. It's a big file. But I feel like
> >>>>>>>>>>>>>>>>> it's going to be difficult to annotate with the coreNLP
> >>>>>>>>>>>>>>>>> library or package. I'm doing natural language processing.
> >>>>>>>>>>>>>>>>> In other words, I'm curious as to how I would shave off the
> >>>>>>>>>>>>>>>>> dates, that is, to make it look like:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> <john> hey
> >>>>>>>>>>>>>>>>> <jane> waiting for plane to Edinburgh
> >>>>>>>>>>>>>>>>> <john> thinking about my boo
> >>>>>>>>>>>>>>>>> <jane> nothing crappy has happened, not really
> >>>>>>>>>>>>>>>>> <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>>>>>>>> <jane> no idea what time it is or where I am really
> >>>>>>>>>>>>>>>>> <john> just know it's london
> >>>>>>>>>>>>>>>>> <jane> you are probably asleep
> >>>>>>>>>>>>>>>>> <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>>>>>>>> <jone>
> >>>>>>>>>>>>>>>>> <jane>
> >>>>>>>>>>>>>>>>> <john> British security is a little more rigorous...
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by
> >>>>>>>>>>>>>>>>> writing a regular expression, such that I create a new
> >>>>>>>>>>>>>>>>> object with no numbers or dates.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Michael
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> ______________________________________________
> >>>>>>>>>>>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>>>>>>>>>>> PLEASE do read the posting guide
> >>>>>>>>>>>>>>>>> http://www.R-project.org/posting-guide.html
> >>>>>>>>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.