Inline


> On 2019-05-19, at 18:11, Michael Boulineau <michael.p.boulin...@gmail.com> 
> wrote:
> 
> For context:
> 
>> In gsub(b, "\\1<\\2> ", a) the work is done by the backreferences \\1 and 
>> \\2. The expression says:
>> Substitute ALL of the match with the first captured expression, then "<", 
>> then the second captured expression, then "> ". The rest of the line is not 
>> substituted and appears as-is.
> 
> Back to me: I guess what's giving me trouble is where to draw the line
> in terms of the end or edge of the expression. Given the code, then,
> 
>> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8")
>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
>> c <- gsub(b, "\\1<\\2> ", a)
> 
> to me, it would seem as though this is the first captured expression,
> that is, as though \\1 refers back to ^([0-9-]{10} [0-9:]{8} ), since
> there are parenthesis around it, or since [0-9-]{10} [0-9:]{8} is
> enclosed in parentheses.

That's correct: parentheses in regular expressions delimit captured substrings.
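For example (a quick sketch with an invented line in the shape of your transcript):

```r
x <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"

# Each "(...)" in the pattern defines one captured substring, numbered
# by its opening parenthesis from left to right: \1 is the date, \2 the time.
sub("^([0-9-]{10}) ([0-9:]{8})", "date=\\1 time=\\2", x)
# [1] "date=2016-03-20 time=19:29:37 *** Jane Doe started a video chat"
```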



> Then it would seem as though [*]{3} is the
> second expression, and (\\w+ \\w+) is the third.

Note that "[*]{3}" has no parentheses, so it is not captured and is not 
accounted for in the back-references.

\\1 and \\2 refer only to the captured substrings; everything else 
contributes to whether the regex matches at all, but is no longer considered 
after the match.
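You can see this with your own pattern on one made-up line:

```r
x <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"
b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"

# "[*]{3}" helps the pattern match, but it is not captured, so the "***"
# is consumed by the match and never reappears in the replacement.
gsub(b, "\\1<\\2> ", x)
# [1] "2016-03-20 19:29:37 <Jane Doe>  started a video chat"
# (Two spaces before "started": one from the trailing space in the
#  replacement, one left over in the unmatched remainder of the line.)
```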

> According to this
> (admittedly wrong) logic, it would seem as though the <> would go
> around the date--like

No: it goes around \\2, which is (\\w+ \\w+).

> 
>> 2016-03-20 <19:29:37> *** Jane Doe started a video chat
> 
> The back references here recalls Davis's code earlier:
> 
>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
> 
> There, commas were put around everything, and there you can see the
> edge of the expression very well. ^(.{10}) = first. (.{8}) = second.
> (<.+>) = third. (.+$) = fourth. So, by the same logic, it would seem
> as though in
> 
>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
> 
> that ^([0-9-]{10} [0-9:]{8} ) is first, that [*]{3} is second, and
> that  (\\w+ \\w+) is third.
> 
> But, if Boris is to be right, and he is, obviously, then it would have
> to be the case that this entire thing, namely, ^([0-9-]{10} [0-9:]{8}
> )[*]{3}, is the first expression,

Actually, "[*]{3}" is not part of the first captured expression: it is 
matched, but discarded from the back-references, because it is not in 
parentheses.

> since only if that were true would
> the <> be able to go around the names, as in
> 
> [3] "2016-01-27 09:15:20 <Jane Doe> Hey "
> 
> Again, so 2016-01-27 09:15:20 would have to be an entire unit, an
> expression.

The word "expression" has a different technical meaning, but colloquially you 
are right.


> So I guess what I don't understand is how ^([0-9-]{10}
> [0-9:]{8} )[*]{3} can be an entire expression, although my hunch would
> be that it has something to do with the ^ or with the space after the
> } and before the (, as in
> 
>> {3} (\\w+
> 

No. Just the parentheses.
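If in doubt, you can let base R show you exactly what each group captured, for instance with regexec() and regmatches() on one invented line:

```r
x <- "2016-03-20 19:29:37 *** Jane Doe started a video chat"
m <- regexec("^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)", x)

# regmatches() returns the whole match first, then one element per group;
# "^", literal spaces, and "[*]{3}" never get numbers of their own.
r <- regmatches(x, m)[[1]]
r[1]  # whole match: "2016-03-20 19:29:37 *** Jane Doe"
r[2]  # \1:          "2016-03-20 19:29:37 "
r[3]  # \2:          "Jane Doe"
```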


> Back to earlier:
> 
>> The rest of the line is not substituted and appears as-is.
> 
> Is that due to the space after the \\2? in
> 
>> "\\1<\\2> 

No, that is because the substitution in gsub() targets only the match of the 
regex, and the rest of the string, beyond the end of the match, is left 
untouched.
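A tiny illustration, unrelated to your data:

```r
# sub() splices the replacement in place of the matched region only;
# text after the match is carried over verbatim.
sub("^(\\w+) (\\w+)", "\\2 \\1", "hello world and more text")
# [1] "world hello and more text"
```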


Cheers,
Boris

> Notice space after > and before "
> 
> Michael
> 
> On Sun, May 19, 2019 at 2:31 PM Boris Steipe <boris.ste...@utoronto.ca> wrote:
>> 
>> Inline ...
>> 
>>> On 2019-05-19, at 13:56, Michael Boulineau <michael.p.boulin...@gmail.com> 
>>> wrote:
>>> 
>>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
>>> 
>>> so the ^ signals that the regex BEGINS with a number (that could be
>>> any number, 0-9) that is only 10 characters long (then there's the
>>> dash in there, too, with the 0-9-, which I assume enabled the regex to
>>> grab the - that's between the numbers in the date)
>> 
>> That's right. Note that within a "character class" the hyphen can have two 
>> meanings: normally it defines a range of characters, but if it appears as 
>> the last character before "]" it is a literal hyphen.
>> 
>>> , followed by a
>>> single space, followed by a unit that could be any number, again, but
>>> that is only 8 characters long this time. For that one, it will
>>> include the colon, hence the 9:, although for that one ([0-9:]{8} ),
>> 
>> Right.
>> 
>> 
>>> I
>>> don't get why the space is on the inside in that one, after the {8},
>> 
>> The space needs to be preserved between the time and the name. I wrote
>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)" # space in the first 
>> captured expression
>> c <- gsub(b, "\\1<\\2> ", a)
>> ... but I could have written
>> b <- "^([0-9-]{10} [0-9:]{8})[*]{3} (\\w+ \\w+)"
>> c <- gsub(b, "\\1 <\\2> ", a)  # space in the substituted string
>> ... same result
>> 
>> 
>>> whereas the space is on the outside with the other one ^([0-9-]{10} ,
>>> directly after the {10}. Why is that?
>> 
>> In the second case, I capture without a space, because I don't want the 
>> space in the results, after the time.
>> 
>> 
>>> 
>>> Then three *** [*]{3}, then the (\\w+ \\w+)", which Boris explained so
>>> well above. I guess I still don't get why this one seemed to have
>>> deleted the *** out of the mix, plus I still don't know why it didn't
>>> remove the *** from the first one.
>> 
>> Because the entire first line was not matched since it had a malformed 
>> character preceding the date.
>> 
>>> 
>>> 2016-03-20 19:29:37 *** Jane Doe started a video chat
>>> 2016-03-20 19:30:35 *** John Doe ended a video chat
>>> 2016-04-02 12:59:36 *** Jane Doe started a video chat
>>> 2016-04-02 13:00:43 *** John Doe ended a video chat
>>> 2016-04-02 13:01:08 *** Jane Doe started a video chat
>>> 2016-04-02 13:01:41 *** John Doe ended a video chat
>>> 2016-04-02 13:03:51 *** John Doe started a video chat
>>> 2016-04-02 13:06:35 *** John Doe ended a video chat
>>> 
>>> This is a random sample from the beginning of the txt file with no
>>> edits. The ***s were deleted, all but the first one, the one that had
>>> the  but that was taken out by the encoding = "UTF-8". I know that
>>> the function was c <- gsub(b, "\\1<\\2> ", a), so it had a gsub () on
>>> there, the point of which is to do substitution work.
>>> 
>>> Oh, I get it, I think. The \\1<\\2> in the gsub () puts the <> around
>>> the names, so that it's consistent with the rest of the data, so that
>>> the names in the text about that aren't enclosed in the <> are
>>> enclosed like the rest of them. But I still don't get why or how the
>>> gsub () replaced the *** with the <>...
>> 
>> In gsub(b, "\\1<\\2> ", a) the work is done by the backreferences \\1 and 
>> \\2. The expression says:
>> Substitute ALL of the match with the first captured expression, then "<", 
>> then the second captured expression, then "> ". The rest of the line is not 
>> substituted and appears as-is.
>> 
>> 
>>> 
>>> This one is more straightforward.
>>> 
>>>> d <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
>>> 
>>> any number with - for 10 characters, followed by a space. Oh, there's
>>> no space in this one ([0-9:]{8}), after the {8}. Hu. So, then, any
>>> number with : for 8 characters, followed by any two words separated by
>>> a space and enclosed in <>. And then the \\s* is followed by a single
>>> space? Or maybe it puts space on both sides (on the side of the #s to
>>> the left, and then the comment to the right). The (.+)$ is anything
>>> whatsoever until the end.
>> 
>> \s is the metacharacter for "whitespace"; \s* means zero or more whitespace 
>> characters. I'm matching that OUTSIDE of the captured expression, to remove 
>> any leading spaces from the data that goes into the data frame.
>> 
>> 
>> Cheers,
>> Boris
>> 
>> 
>> 
>> 
>>> 
>>> Michael
>>> 
>>> 
>>> On Sun, May 19, 2019 at 4:37 AM Boris Steipe <boris.ste...@utoronto.ca> 
>>> wrote:
>>>> 
>>>> Inline
>>>> 
>>>> 
>>>> 
>>>>> On 2019-05-18, at 20:34, Michael Boulineau 
>>>>> <michael.p.boulin...@gmail.com> wrote:
>>>>> 
>>>>> It appears to have worked, although there were three little quirks.
>>>>> The ; close(con); rm(con) didn't work for me; the first row of the
>>>>> data.frame was all NAs, when all was said and done;
>>>> 
>>>> You will get NAs for lines that can't be matched to the regular 
>>>> expression. That's a good thing, it allows you to test whether your 
>>>> assumptions were valid for the entire file:
>>>> 
>>>> # number of failed strcapture()
>>>> sum(is.na(e$date))
>>>> 
>>>> 
>>>>> and then there
>>>>> were still three *** on the same line where the  was apparently
>>>>> deleted.
>>>> 
>>>> This is a sign that something else happened with the line that prevented 
>>>> the regex from matching. In that case you need to investigate more. I see 
>>>> an invalid multibyte character at the beginning of the line you posted 
>>>> below.
>>>> 
>>>>> 
>>>>>> a <- readLines ("hangouts-conversation-6.txt", encoding = "UTF-8")
>>>>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
>>>>>> c <- gsub(b, "\\1<\\2> ", a)
>>>>>> head (c)
>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
>>>>> [2] "2016-01-27 09:15:20 <Jane Doe>
>>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf";
>>>> 
>>>> [...]
>>>> 
>>>>> But, before I do anything else, I'm going to study the regex in this
>>>>> particular code. For example, I'm still not sure why there has to the
>>>>> second \\w+ in the (\\w+ \\w+). Little things like that.
>>>> 
>>>> \w is the metacharacter for alphanumeric characters, \w+ designates 
>>>> something we could call a word. Thus \w+ \w+ are two words separated by a 
>>>> single blank. This corresponds to your example, but, as I wrote 
>>>> previously, you need to think very carefully whether this covers all 
>>>> possible cases (Could there be only one word? More than one blank? Could 
>>>> letters be separated by hyphens or periods?) In most cases we could have 
>>>> more robustly matched everything between "<" and ">" (taking care to test 
>>>> what happens if the message contains those characters). But for the video 
>>>> chat lines we need to make an assumption about what is name and what is 
>>>> not. If "started a video chat" is the only possibility in such lines, you 
>>>> can use this information instead. If there are other possibilities, you 
>>>> need a different strategy. In NLP there is no one-approach-fits-all.
>>>> 
>>>> To validate the structure of the names in your transcripts, you can look at
>>>> 
>>>> patt <- " <.+?> "   # " <any string, not greedy> "
>>>> m <- regexpr(patt, c)
>>>> unique(regmatches(c, m))
>>>> 
>>>> 
>>>> 
>>>> B.
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> Michael
>>>>> 
>>>>> 
>>>>> On Sat, May 18, 2019 at 4:30 PM Boris Steipe <boris.ste...@utoronto.ca> 
>>>>> wrote:
>>>>>> 
>>>>>> This works for me:
>>>>>> 
>>>>>> # sample data
>>>>>> c <- character()
>>>>>> c[1] <- "2016-01-27 09:14:40 <Jane Doe> started a video chat"
>>>>>> c[2] <- "2016-01-27 09:15:20 <Jane Doe> 
>>>>>> https://lh3.googleusercontent.com/";
>>>>>> c[3] <- "2016-01-27 09:15:20 <Jane Doe> Hey "
>>>>>> c[4] <- "2016-01-27 09:15:22 <John Doe>  ended a video chat"
>>>>>> c[5] <- "2016-01-27 21:07:11 <Jane Doe>  started a video chat"
>>>>>> c[6] <- "2016-01-27 21:26:57 <John Doe>  ended a video chat"
>>>>>> 
>>>>>> 
>>>>>> # regex  ^(year)       (time)      <(word word)>\\s*(string)$
>>>>>> patt <- "^([0-9-]{10}) ([0-9:]{8}) <(\\w+ \\w+)>\\s*(.+)$"
>>>>>> proto <- data.frame(date = character(),
>>>>>>                  time = character(),
>>>>>>                  name = character(),
>>>>>>                  text = character(),
>>>>>>                  stringsAsFactors = TRUE)
>>>>>> d <- strcapture(patt, c, proto)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>      date     time     name                               text
>>>>>> 1 2016-01-27 09:14:40 Jane Doe               started a video chat
>>>>>> 2 2016-01-27 09:15:20 Jane Doe https://lh3.googleusercontent.com/
>>>>>> 3 2016-01-27 09:15:20 Jane Doe                               Hey
>>>>>> 4 2016-01-27 09:15:22 John Doe                 ended a video chat
>>>>>> 5 2016-01-27 21:07:11 Jane Doe               started a video chat
>>>>>> 6 2016-01-27 21:26:57 John Doe                 ended a video chat
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> B.
>>>>>> 
>>>>>> 
>>>>>>> On 2019-05-18, at 18:32, Michael Boulineau 
>>>>>>> <michael.p.boulin...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Going back and thinking through what Boris and William were saying
>>>>>>> (also Ivan), I tried this:
>>>>>>> 
>>>>>>> a <- readLines ("hangouts-conversation-6.csv.txt")
>>>>>>> b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
>>>>>>> c <- gsub(b, "\\1<\\2> ", a)
>>>>>>>> head (c)
>>>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
>>>>>>> [2] "2016-01-27 09:15:20 <Jane Doe>
>>>>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf";
>>>>>>> [3] "2016-01-27 09:15:20 <Jane Doe> Hey "
>>>>>>> [4] "2016-01-27 09:15:22 <John Doe>  ended a video chat"
>>>>>>> [5] "2016-01-27 21:07:11 <Jane Doe>  started a video chat"
>>>>>>> [6] "2016-01-27 21:26:57 <John Doe>  ended a video chat"
>>>>>>> 
>>>>>>> The  is still there, since I forgot to do what Ivan had suggested, 
>>>>>>> namely,
>>>>>>> 
>>>>>>> a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding
>>>>>>> = "UTF-8")); close(con); rm(con)
>>>>>>> 
>>>>>>> But then the new code is still turning out only NAs when I apply
>>>>>>> strcapture (). This was what happened next:
>>>>>>> 
>>>>>>>> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
>>>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>>>>>>> +                 c, proto=data.frame(stringsAsFactors=FALSE, When="", 
>>>>>>> Who="",
>>>>>>> +                                     What=""))
>>>>>>>> head (d)
>>>>>>> When  Who What
>>>>>>> 1 <NA> <NA> <NA>
>>>>>>> 2 <NA> <NA> <NA>
>>>>>>> 3 <NA> <NA> <NA>
>>>>>>> 4 <NA> <NA> <NA>
>>>>>>> 5 <NA> <NA> <NA>
>>>>>>> 6 <NA> <NA> <NA>
>>>>>>> 
>>>>>>> I've been reading up on regular expressions, too, so this code seems
>>>>>>> spot on. What's going wrong?
>>>>>>> 
>>>>>>> Michael
>>>>>>> 
>>>>>>> On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.ste...@utoronto.ca> 
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Don't start putting in extra commas and then reading this as csv. That 
>>>>>>>> approach is broken. The correct approach is what Bill outlined: read 
>>>>>>>> everything with readLines(), and then use a proper regular expression 
>>>>>>>> with strcapture().
>>>>>>>> 
>>>>>>>> You need to pre-process the object that readLines() gives you: replace 
>>>>>>>> the contents of the videochat lines, and make it conform to the format 
>>>>>>>> of the other lines before you process it into your data frame.
>>>>>>>> 
>>>>>>>> Approximately something like
>>>>>>>> 
>>>>>>>> # read the raw data
>>>>>>>> tmp <- readLines("hangouts-conversation-6.csv.txt")
>>>>>>>> 
>>>>>>>> # process all video chat lines
>>>>>>>> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "  # (year time 
>>>>>>>> )*** (word word)
>>>>>>>> tmp <- gsub(patt, "\\1<\\2> ", tmp)
>>>>>>>> 
>>>>>>>> # next, use strcapture()
>>>>>>>> 
>>>>>>>> Note that this makes the assumption that your names are always exactly 
>>>>>>>> two words containing only letters. If that assumption is not true, 
>>>>>>>> more thought needs to go into the regex. But you can test that:
>>>>>>>> 
>>>>>>>> patt <- " <\\w+ \\w+> "   # " <word word> "
>>>>>>>> sum( ! grepl(patt, tmp))
>>>>>>>> 
>>>>>>>> ... will give the number of lines that remain in your file that do not 
>>>>>>>> have a tag that can be interpreted as "Who".
>>>>>>>> 
>>>>>>>> Once that is fine, use Bill's approach - or a regular expression of 
>>>>>>>> your own design - to create your data frame.
>>>>>>>> 
>>>>>>>> Hope this helps,
>>>>>>>> Boris
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 2019-05-17, at 16:18, Michael Boulineau 
>>>>>>>>> <michael.p.boulin...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Very interesting. I'm sure I'll be trying to get rid of the byte order
>>>>>>>>> mark eventually. But right now, I'm more worried about getting the
>>>>>>>>> character vector into either a csv file or data.frame; that way, I can
>>>>>>>>> be able to work with the data neatly tabulated into four columns:
>>>>>>>>> date, time, person, comment. I assume it's a write.csv function, but I
>>>>>>>>> don't know what arguments to put in it. header=FALSE? fill=T?
>>>>>>>>> 
>>>>>>>>> Michael
>>>>>>>>> 
>>>>>>>>> On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller 
>>>>>>>>> <jdnew...@dcn.davis.ca.us> wrote:
>>>>>>>>>> 
>>>>>>>>>> If byte order mark is the issue then you can specify the file 
>>>>>>>>>> encoding as "UTF-8-BOM" and it won't show up in your data any more.
>>>>>>>>>> 
>>>>>>>>>> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help 
>>>>>>>>>> <r-help@r-project.org> wrote:
>>>>>>>>>>> The pattern I gave worked for the lines that you originally showed 
>>>>>>>>>>> from
>>>>>>>>>>> the
>>>>>>>>>>> data file ('a'), before you put commas into them.  If the name is
>>>>>>>>>>> either of
>>>>>>>>>>> the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed 
>>>>>>>>>>> so
>>>>>>>>>>> something like "(<[^>]*>|[*]{3})".
>>>>>>>>>>> 
>>>>>>>>>>> The " " at the start of the imported data may come from the byte
>>>>>>>>>>> order
>>>>>>>>>>> mark that Windows apps like to put at the front of a text file in 
>>>>>>>>>>> UTF-8
>>>>>>>>>>> or
>>>>>>>>>>> UTF-16 format.
>>>>>>>>>>> 
>>>>>>>>>>> Bill Dunlap
>>>>>>>>>>> TIBCO Software
>>>>>>>>>>> wdunlap tibco.com
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <
>>>>>>>>>>> michael.p.boulin...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> This seemed to work:
>>>>>>>>>>>> 
>>>>>>>>>>>>> a <- readLines ("hangouts-conversation-6.csv.txt")
>>>>>>>>>>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
>>>>>>>>>>>>> b [1:84]
>>>>>>>>>>>> 
>>>>>>>>>>>> And the first 85 lines looks like this:
>>>>>>>>>>>> 
>>>>>>>>>>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
>>>>>>>>>>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>>>>>>>>>> 
>>>>>>>>>>>> Then they transition to the commas:
>>>>>>>>>>>> 
>>>>>>>>>>>>> b [84:100]
>>>>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>>>>>>>>>> [2] "2016-07-01,02:50:35,<John Doe>,hey"
>>>>>>>>>>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
>>>>>>>>>>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
>>>>>>>>>>>> 
>>>>>>>>>>>> Even the strange bit on line 6347 was caught by this:
>>>>>>>>>>>> 
>>>>>>>>>>>>> b [6346:6348]
>>>>>>>>>>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
>>>>>>>>>>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
>>>>>>>>>>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
>>>>>>>>>>>> 
>>>>>>>>>>>> Perhaps most awesomely, the code catches spaces that are interposed
>>>>>>>>>>>> into the comment itself:
>>>>>>>>>>>> 
>>>>>>>>>>>>> b [4]
>>>>>>>>>>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
>>>>>>>>>>>>> b [85]
>>>>>>>>>>>> [1] "2016-07-01,02:50:35,<John Doe>,hey"
>>>>>>>>>>>> 
>>>>>>>>>>>> Notice whether there is a space after the "hey" or not.
>>>>>>>>>>>> 
>>>>>>>>>>>> These are the first two lines:
>>>>>>>>>>>> 
>>>>>>>>>>>> [1] "2016-01-27 09:14:40 *** Jane Doe started a video chat"
>>>>>>>>>>>> [2] "2016-01-27,09:15:20,<Jane
>>>>>>>>>>>> Doe>,
>>>>>>>>>>>> 
>>>>>>>>>>> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf
>>>>>>>>>>>> "
>>>>>>>>>>>> 
>>>>>>>>>>>> So, who knows what happened with the  at the beginning of [1]
>>>>>>>>>>>> directly above. But notice how there are no commas in [1] but there
>>>>>>>>>>>> appear in [2]. I don't see why really long ones like [2] directly
>>>>>>>>>>>> above would be a problem, were they to be translated into a csv or
>>>>>>>>>>>> data frame column.
>>>>>>>>>>>> 
>>>>>>>>>>>> Now, with the commas in there, couldn't we write this into a csv 
>>>>>>>>>>>> or a
>>>>>>>>>>>> data.frame? Some of this data will end up being garbage, I imagine.
>>>>>>>>>>>> Like in [2] directly above. Or with [83] and [84] at the top of 
>>>>>>>>>>>> this
>>>>>>>>>>>> discussion post/email. Embarrassingly, I've been trying to convert
>>>>>>>>>>>> this into a data.frame or csv but I can't manage to. I've been 
>>>>>>>>>>>> using
>>>>>>>>>>>> the write.csv function, but I don't think I've been getting the
>>>>>>>>>>>> arguments correct.
>>>>>>>>>>>> 
>>>>>>>>>>>> At the end of the day, I would like a data.frame and/or csv with 
>>>>>>>>>>>> the
>>>>>>>>>>>> following four columns: date, time, person, comment.
>>>>>>>>>>>> 
>>>>>>>>>>>> I tried this, too:
>>>>>>>>>>>> 
>>>>>>>>>>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
>>>>>>>>>>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>>>>>>>>>>>> +                 a, proto=data.frame(stringsAsFactors=FALSE,
>>>>>>>>>>> When="",
>>>>>>>>>>>> Who="",
>>>>>>>>>>>> +                                     What=""))
>>>>>>>>>>>> 
>>>>>>>>>>>> But all I got was this:
>>>>>>>>>>>> 
>>>>>>>>>>>>> c [1:100, ]
>>>>>>>>>>>> When  Who What
>>>>>>>>>>>> 1   <NA> <NA> <NA>
>>>>>>>>>>>> 2   <NA> <NA> <NA>
>>>>>>>>>>>> 3   <NA> <NA> <NA>
>>>>>>>>>>>> 4   <NA> <NA> <NA>
>>>>>>>>>>>> 5   <NA> <NA> <NA>
>>>>>>>>>>>> 6   <NA> <NA> <NA>
>>>>>>>>>>>> 
>>>>>>>>>>>> It seems to have caught nothing.
>>>>>>>>>>>> 
>>>>>>>>>>>>> unique (c)
>>>>>>>>>>>> When  Who What
>>>>>>>>>>>> 1 <NA> <NA> <NA>
>>>>>>>>>>>> 
>>>>>>>>>>>> But I like that it converted into columns. That's a really great
>>>>>>>>>>>> format. With a little tweaking, it'd be a great code for this data
>>>>>>>>>>>> set.
>>>>>>>>>>>> 
>>>>>>>>>>>> Michael
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help
>>>>>>>>>>>> <r-help@r-project.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Consider using readLines() and strcapture() for reading such a
>>>>>>>>>>> file.
>>>>>>>>>>>> E.g.,
>>>>>>>>>>>>> suppose readLines(files) produced a character vector like
>>>>>>>>>>>>> 
>>>>>>>>>>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
>>>>>>>>>>>>>      "2016-10-21 10:56:29 <John Doe> John_Doe",
>>>>>>>>>>>>>      "2016-10-21 10:56:37 <John Doe> Admit#8242",
>>>>>>>>>>>>>      "October 23, 1819 12:34 <Jane Eyre> I am not an angel")
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Then you can make a data.frame with columns When, Who, and What by
>>>>>>>>>>>>> supplying a pattern containing three parenthesized capture
>>>>>>>>>>> expressions:
>>>>>>>>>>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
>>>>>>>>>>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>>>>>>>>>>>>>         x, proto=data.frame(stringsAsFactors=FALSE, When="",
>>>>>>>>>>> Who="",
>>>>>>>>>>>>> What=""))
>>>>>>>>>>>>>> str(z)
>>>>>>>>>>>>> 'data.frame':   4 obs. of  3 variables:
>>>>>>>>>>>>> $ When: chr  "2016-10-21 10:35:36" "2016-10-21 10:56:29"
>>>>>>>>>>> "2016-10-21
>>>>>>>>>>>>> 10:56:37" NA
>>>>>>>>>>>>> $ Who : chr  "<Jane Doe>" "<John Doe>" "<John Doe>" NA
>>>>>>>>>>>>> $ What: chr  "What's your login" "John_Doe" "Admit#8242" NA
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Lines that don't match the pattern result in NA's - you might make
>>>>>>>>>>> a
>>>>>>>>>>>> second
>>>>>>>>>>>>> pass over the corresponding elements of x with a new pattern.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You can convert the When column from character to time with
>>>>>>>>>>> as.POSIXct().
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Bill Dunlap
>>>>>>>>>>>>> TIBCO Software
>>>>>>>>>>>>> wdunlap tibco.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius
>>>>>>>>>>> <dwinsem...@comcast.net>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote:
>>>>>>>>>>>>>>> OK. So, I named the object test and then checked the 6347th
>>>>>>>>>>> item
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> test <- readLines ("hangouts-conversation.txt)
>>>>>>>>>>>>>>>> test [6347]
>>>>>>>>>>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Perhaps where it was getting screwed up is, since the end of
>>>>>>>>>>> this is
>>>>>>>>>>>> a
>>>>>>>>>>>>>>> number (8242), then, given that there's no space between the
>>>>>>>>>>> number
>>>>>>>>>>>>>>> and what ought to be the next row, R didn't know where to draw
>>>>>>>>>>> the
>>>>>>>>>>>>>>> line. Sure enough, it looks like this when I go to the original
>>>>>>>>>>> file
>>>>>>>>>>>>>>> and control f "#8242"
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login
>>>>>>>>>>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe
>>>>>>>>>>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> An octothorpe can be interpreted as starting a comment that runs 
>>>>>>>>>>>>>> to the end of the line. You can prevent that interpretation with 
>>>>>>>>>>>>>> a suitable choice of parameters to `read.table` or `read.csv`. I 
>>>>>>>>>>>>>> don't understand why that should cause an error or a failure to 
>>>>>>>>>>>>>> match that pattern.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Again, it doesn't look like that in the file. Gmail
>>>>>>>>>>> automatically
>>>>>>>>>>>>>>> formats it like that when I paste it in. More to the point, it
>>>>>>>>>>> looks
>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21
>>>>>>>>>>> 10:56:29
>>>>>>>>>>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe>
>>>>>>>>>>>> Admit#82422016-10-21
>>>>>>>>>>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Notice Admit#82422016. So there's that.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Then I built object test2.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4",
>>>>>>>>>>> test)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This worked for 84 lines, then this happened.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It may have done something but as you later discovered my first
>>>>>>>>>>> code
>>>>>>>>>>>> for
>>>>>>>>>>>>>> the pattern was incorrect. I had tested it (and pasted in the
>>>>>>>>>>> results
>>>>>>>>>>>> of
>>>>>>>>>>>>>> the test) . The way to refer to a capture class is with
>>>>>>>>>>> back-slashes
>>>>>>>>>>>>>> before the numbers, not forward-slashes. Try this:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)",
>>>>>>>>>>> "\\1,\\2,\\3,\\4",
>>>>>>>>>>>> chrvec)
>>>>>>>>>>>>>>> newvec
>>>>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey"
>>>>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>>>>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>>>>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened,
>>>>>>>>>>> not
>>>>>>>>>>>> really"
>>>>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast,
>>>>>>>>>>> didn't
>>>>>>>>>>>> sleep"
>>>>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or
>>>>>>>>>>> where I am
>>>>>>>>>>>>>> really"
>>>>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london"
>>>>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>>>>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good
>>>>>>>>>>> eay"
>>>>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
>>>>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
>>>>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little
>>>>>>>>>>> more
>>>>>>>>>>>>>> rigorous..."
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I made note of the fact that the 10th and 11th lines had no
>>>>>>>>>>> commas.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> test2 [84]
>>>>>>>>>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> That line didn't have any "<" so wasn't matched.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You could remove all none matching lines for pattern of
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> dates<space>times<space>"<"<name>">"<space><anything>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> with:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)]
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Do read:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ?read.csv
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ?regex
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> David
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> test2 [85]
>>>>>>>>>>>>>>> [1] "//1,//2,//3,//4"
>>>>>>>>>>>>>>>> test [85]
>>>>>>>>>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey"
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Notice how I toggled back and forth between test and test2
>>>>>>>>>>> there. So,
>>>>>>>>>>>>>>> whatever happened with the regex, it happened in the switch
>>>>>>>>>>> from 84
>>>>>>>>>>>> to
>>>>>>>>>>>>>>> 85, I guess. It went on like
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [990] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> [991] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> [992] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> [993] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> [994] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> [995] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> [996] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> [997] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> [998] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> [999] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> [1000] "//1,//2,//3,//4"
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> up until line 1000, then I reached max.print.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Michael
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius <
>>>>>>>>>>>> dwinsem...@comcast.net>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote:
>>>>>>>>>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure and
>>>>>>>>>>> not do
>>>>>>>>>>>>>> that again.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I tried the read.fwf from the foreign package, with a code
>>>>>>>>>>> like
>>>>>>>>>>>> this:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> d <- read.fwf("hangouts-conversation.txt",
>>>>>>>>>>>>>>>>>              widths= c(10,10,20,40),
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> col.names=c("date","time","person","comment"),
>>>>>>>>>>>>>>>>>              strip.white=TRUE)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> But it threw this error:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote =
>>>>>>>>>>> quote,
>>>>>>>>>>>> dec
>>>>>>>>>>>>>> = dec,  :
>>>>>>>>>>>>>>>>> line 6347 did not have 4 elements
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> So what does line 6347 look like? (Use `readLines` and print
>>>>>>>>>>> it
>>>>>>>>>>>> out.)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Interestingly, though, the error only happened when I
>>>>>>>>>>> increased the
>>>>>>>>>>>>>>>>> width size. But I had to increase the size, or else I
>>>>>>>>>>> couldn't
>>>>>>>>>>>> "see"
>>>>>>>>>>>>>>>>> anything.  The comment was so small that nothing was being
>>>>>>>>>>>> captured by
>>>>>>>>>>>>>>>>> the size of the column. so to speak.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It seems like what's throwing me is that there's no comma
>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> demarcates the end of the text proper. For example:
>>>>>>>>>>>>>>>> Not sure why you thought there should be a comma. Lines usually 
>>>>>>>>>>>>>>>> end with a <cr> and/or a <lf>.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Once you have the raw text in a character vector from `readLines` named, say, 'chrvec', then you could selectively substitute commas for spaces with regex. (Now that you no longer desire to remove the dates and times.)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This will not do any replacements when the pattern is not matched. See this test:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>>>>>>>>>>>>>>>>> newvec
>>>>>>>>>>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey"
>>>>>>>>>>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>>>>>>>>>>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>>>>>>>>>>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>>>>>>>>>>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>>>>>>>>>>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>>>>>>>>>>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london"
>>>>>>>>>>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>>>>>>>>>>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>>>>>>>>>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
>>>>>>>>>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
>>>>>>>>>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> You should probably remove the "empty comment" lines.
>>>>>>>>>>>>>>>> 
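One way to act on that advice can be sketched as follows. The sample lines and the four-group pattern are taken from the thread; the trick of reusing `grepl()` with the same pattern to flag unmatched lines is an illustration, not the only option.

```r
# Sketch: drop the "empty comment" lines by keeping only elements where the
# four-group pattern actually matched. Sample data reproduced from the thread.
chrvec <- c("2016-07-01 02:50:35 <john> hey",
            "2016-07-01 02:58:56 <jone>")          # no comment after <jone>
pat <- "^(.{10}) (.{8}) (<.+>) (.+$)"
newvec <- sub(pat, "\\1,\\2,\\3,\\4", chrvec)
keep <- grepl(pat, chrvec)    # FALSE where sub() left the line unchanged
newvec[keep]                  # "2016-07-01,02:50:35,<john>,hey"
```

Because `sub()` returns unmatched elements unchanged, filtering with `grepl()` on the same pattern removes exactly the lines that gained no commas.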
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> David.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was lots of Starbucks in my day2016-07-01 15:35:47
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It was interesting, too, when I pasted the text into the email, it self-formatted into the way I wanted it to look. I had to manually make it look like it does above, since that's the way that it looks in the txt file. I wonder if it's being organized by XML or something.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Anyways, there's always a space between the two sideways carrots, just like there is right now: <John Doe> See. Space. And there's always a space between the date and time. Like this. 2016-07-01 15:34:30 See. Space. But there's never a space between the end of the comment and the next date. Like this: We were in a starbucks2016-07-01 15:35:02 See. starbucks and 2016 are smooshed together.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> This code is also on the table right now.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
>>>>>>>>>>>>>>>>>                 quote = "\"", comment.char = "", fill = TRUE)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> h <- cbind(hangouts.conversation2[, 1:2], hangouts.conversation2[, 3:5], hangouts.conversation2[, 6:9])
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> aa <- gsub("[^[:digit:]]", "", h)
>>>>>>>>>>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))  # str_extract() is from stringr
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Those last lines are a work in progress. I wish I could import a picture of what it looks like when it's translated into a data frame. The fill=TRUE helped to get the data into a table that kind of sort of works, but the comments keep bleeding into the date and time column. It's like
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over               there
>>>>>>>>>>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> And then, maybe, the "seriously" will be in a column all to itself, as will the "I've" and the "never" etc.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I will use a regular expression if I have to, but it would be nice to keep the dates and times on there. Originally, I thought they were meaningless, but I've since changed my mind on that count. The time of day isn't so important. But, especially since, say, Gmail itself knows how to quickly recognize what it is, I know it can be done. I know this data has structure to it.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Michael
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsem...@comcast.net> wrote:
>>>>>>>>>>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
>>>>>>>>>>>>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:58:56 <jone>
>>>>>>>>>>>>>>>>>>> 2016-07-01 02:59:34 <jane>
>>>>>>>>>>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>>>>>>>>>>>>>>>>>> Looks entirely not-"crazy". Typical log file format.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Two possibilities: 1) use `read.fwf` from pkg foreign; 2) use regex (i.e. the sub function) to strip everything up to the "<". Read `?regex`. Since "<" is not a metacharacter, you could use the pattern ".+<" and replace with "".
>>>>>>>>>>>>>>>>>> 
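Option 2) can be sketched in two lines. One detail worth hedging: the literal suggestion (replace with "") also consumes the "<" itself; replacing with "<" instead keeps the bracketed speaker tag intact.

```r
# Sketch of the regex option: strip everything up to the "<" that opens the
# speaker tag. Replacing the match with "<" (instead of "") keeps the bracket.
x <- "2016-07-01 02:50:35 <john> hey"
sub(".+<", "<", x)   # "<john> hey"
sub(".+<", "", x)    # "john> hey" -- the literal suggestion drops the "<"
```

Since each sample line contains only one "<", the greedy ".+" is harmless here; on lines that might contain several "<" characters, a non-greedy form like "^.+?<" would be safer.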
>>>>>>>>>>>>>>>>>> And do read the Posting Guide. Cross-posting to StackOverflow and R-help, at least within hours of each, is considered poor manners.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> David.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> It goes on for a while. It's a big file. But I feel like it's going to be difficult to annotate with the coreNLP library or package. I'm doing natural language processing. In other words, I'm curious as to how I would shave off the dates, that is, to make it look like:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> <john> hey
>>>>>>>>>>>>>>>>>>> <jane> waiting for plane to Edinburgh
>>>>>>>>>>>>>>>>>>> <john> thinking about my boo
>>>>>>>>>>>>>>>>>>> <jane> nothing crappy has happened, not really
>>>>>>>>>>>>>>>>>>> <john> plane went by pretty fast, didn't sleep
>>>>>>>>>>>>>>>>>>> <jane> no idea what time it is or where I am really
>>>>>>>>>>>>>>>>>>> <john> just know it's london
>>>>>>>>>>>>>>>>>>> <jane> you are probably asleep
>>>>>>>>>>>>>>>>>>> <jane> I hope fish was fishy in a good eay
>>>>>>>>>>>>>>>>>>> <jone>
>>>>>>>>>>>>>>>>>>> <jane>
>>>>>>>>>>>>>>>>>>> <john> British security is a little more rigorous...
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by writing a regular expression, such that I create a new object with no numbers or dates.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Michael
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> ______________________________________________
>>>>>>>>>>>>>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>>>>>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>>>>>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.