Hi Dan: Thank you. I was not aware of the details of the tidyverse packages.
What a difference. I was able to use your suggestion; it needed a minor tweak
to insert a regular expression instead of str_extract_all, but
unnest_longer just did it right.
I appreciate it. Best, Sid
Siddhartha Dalal, Professor of Practice
Columbia University
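
For readers of the archive, here is a minimal sketch of how the suggestion quoted below might be applied to the dataframe from the original question. The thread does not say which regular expression was actually used, so the pattern here is an assumption, as are the column names (adapted to valid R names), the object names `nodes`, `edges` and `g`, and the choice of a directed graph. It also uses tidyr::unnest on two list-columns at once rather than unnest_longer, so that each output node stays paired with its edge weight.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(igraph)

# Toy data: the first three rows of the table in the original question.
# Column names are adapted to valid R names (an assumption).
nodes <- tibble(
  input_node   = c("11347-5", "116228-0", "39644-0"),
  output_nodes = c("['64837-1', '116228-0']", "['14328-3']", "['116228-0']"),
  edge_weights = c("[0.01001617, 0.01778383]", "[0.3505]", "[0.10184362]"),
  id_attr      = c(82249852, 82283186, 82273700),
  attribute    = c(372856, 372892, 372878)
)

# Assumed pattern: runs of characters that are not brackets, quotes,
# commas or spaces. Unlike boundary("word"), this keeps hyphenated
# ids such as "64837-1" in one piece.
item <- "[^\\[\\]', ]+"

edges <- nodes %>%
  mutate(
    # One character vector of output ids per row.
    output_node = str_extract_all(output_nodes, item),
    # One numeric vector of weights per row.
    weight = lapply(str_extract_all(edge_weights, item), as.numeric)
  ) %>%
  # Unnest both list-columns in parallel: one row per (input, output) pair.
  unnest(cols = c(output_node, weight)) %>%
  # igraph reads the first two columns as the edge endpoints.
  select(input_node, output_node, weight, id_attr, attribute)

# Remaining columns (weight, id_attr, attribute) become edge attributes.
g <- graph_from_data_frame(edges, directed = TRUE)
```

After this, `E(g)$weight` holds the numeric edge weights and `E(g)$id_attr` and `E(g)$attribute` hold the per-row attributes, repeated for every edge that came from the same row.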

On Wed, Apr 8, 2020 at 10:28 PM Dan Suthers <[email protected]> wrote:

> Use tidyverse packages for data manipulation. They are excellent at this
> sort of thing.
>
> I had a similar problem. I used readr::read_delim to read a .csv from
> Twint's representation of twitter data into tibble 'tweets'. Each tweet
> mentions several users in the tweets$mentions field, in the same format as
> yours but as a string, for example "['repadamschiff', 'realdonaldtrump']"
>
> I used stringr::str_extract_all to turn this string into a list, and then
> tidyr::unnest_longer to turn the single row into one row per value of
> this list:
>
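> library(dplyr)    # mutate, %>%
> library(stringr)  # str_extract_all, boundary
> library(tidyr)    # unnest_longer, drop_na
>
> mention_edges <-
>   tweets %>%
>   # Extract lists of mentioned users from the string representation.
>   # Inside mutate() the column is referred to directly, not as tweets$mentions.
>   mutate(mentioned_user = str_extract_all(mentions,
>                                           boundary("word"))) %>%
>   # Unnest each mention into its own row
>   unnest_longer(mentioned_user) %>%
>   # Drop tweets that don't mention anyone
>   drop_na(mentioned_user) %>%
>   ... continues with other processing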
>
> It is done in memory, but I have been able to run this on a fairly large
> data set.
>
> -- Dan
> On 4/8/20 2:41 AM, Siddhartha R Dalal wrote:
>
> I have many large dataframes of the following structure, with 1 input node
> in each row and multiple output nodes and edge weights.
>   input_node  output_nodes             edge-weights              id-attr   attribute
> 1 11347-5     ['64837-1', '116228-0']  [0.01001617, 0.01778383]  82249852  372856
> 2 116228-0    ['14328-3']              [0.3505]                  82283186  372892
> 3 39644-0     ['116228-0']             [0.10184362]              82273700  372878
> 4 116228-0    ['116228-0']             [0.21326264]              82278451  372887
> 5 116228-0    ['64827-1', '116228-0']  [0.02947139, 0.08275262]  82249816  372855
>
> For example, rows 1 and 5 have 1 input node, 2 output nodes, the
> corresponding 2 edge weights (they are numbers), and a few attributes; rows 2
> through 4 have 1 input and 1 output, etc.
> How do I read this dataframe into igraph to make a graph while retaining the
> attributes? Typically igraph asks for the first 2 columns of the dataframe
> to be individual input and output nodes. This is a large dataframe where
> the number of output nodes could be large in some rows.
> I can imagine doing this with a "for" loop and regex, but that would be too
> slow and the new dataframe would require more memory. I would appreciate any
> suggestions.
> Thank you. Sid
>
> _______________________________________________
> igraph-help mailing list
> [email protected]
> https://lists.nongnu.org/mailman/listinfo/igraph-help
>
> --
> Dan Suthers
>
> Professor and Graduate Program Chair
> Dept. of Information and Computer Sciences
> University of Hawaii at Manoa
> 1680 East West Road, POST 309, Honolulu, HI 96822
> (808) 956-3890 office
> Personal: http://www2.hawaii.edu/~suthers/
> Lab: http://lilt.ics.hawaii.edu/
> Department: http://www.ics.hawaii.edu/
>
>