Hi Dan:

Thank you. I was not aware of the details of the tidyverse packages. What a difference. I was able to use your suggestion. It needed only a minor tweak to use a regular expression instead of str_extract_all, but unnest_longer did exactly what I needed. I appreciate it.

Best,
Sid

Siddhartha Dalal, Professor of Practice
Columbia University
On Wed, Apr 8, 2020 at 10:28 PM Dan Suthers <[email protected]> wrote:

> Use tidyverse packages for data manipulation. They are excellent at this
> sort of thing.
>
> I had a similar problem. I used readr::read_delim to read a .csv from
> Twint's representation of twitter data into tibble 'tweets'. Each tweet
> mentions several users in the tweets$mentions field, in the same format as
> yours but as a string, for example "['repadamschiff', 'realdonaldtrump']"
>
> I used stringr::str_extract_all to turn this string into a list, and then
> tidyr::unnest_longer to turn the single row into one row per each value of
> this list:
>
>   mention_edges <-
>     tweets %>%
>     # Extract lists of mentioned users from the string representation.
>     mutate(mentioned_user = str_extract_all(tweets$mentions,
>                                             boundary("word"))) %>%
>     # Unnest each mention into its own row
>     unnest_longer(mentioned_user) %>%
>     # drop tweets that don't mention anyone
>     drop_na(mentioned_user) %>%
>     ... continues with other processing
>
> It is done in memory, but I have been able to run this on a fairly large
> data set.
>
> -- Dan
>
> On 4/8/20 2:41 AM, Siddhartha R Dalal wrote:
>
> I have many large dataframes of the following structure with 1 input node
> in each row and multiple output nodes and edge weights.
>
>    input_node  output_nodes             edge-weights              id-attr   attribute
>  1 11347-5     ['64837-1', '116228-0']  [0.01001617, 0.01778383]  82249852  372856
>  2 116228-0    ['14328-3']              [0.3505]                  82283186  372892
>  3 39644-0     ['116228-0']             [0.10184362]              82273700  372878
>  4 116228-0    ['116228-0']             [0.21326264]              82278451  372887
>  5 116228-0    ['64827-1', '116228-0']  [0.02947139, 0.08275262]  82249816  372855
>
> For example, rows 1 and 5 have 1 input node, 2 output nodes, the
> corresponding 2 edge weights (they are numbers), and few attributes; rows 2
> through 4 have 1 input, and 1 output, etc.
>
> How do I read this dataframe in igraph to make a graph while retaining
> attributes. Typically igraph asks for the dataframe to have the first 2
> columns to be individual and output nodes. This is a large dataframe where,
> the # of output nodes could be large in some rows.
>
> I can imagine doing this by a "for" loop and regex. But, that would be too
> slow and the new dataframe would require more memory. Would appreciate any
> suggestions.
>
> Thank you.
> Sid
>
> _______________________________________________
> igraph-help mailing list
> [email protected]
> https://lists.nongnu.org/mailman/listinfo/igraph-help
>
> --
> Dan Suthers
>
> Professor and Graduate Program Chair
> Dept. of Information and Computer Sciences
> University of Hawaii at Manoa
> 1680 East West Road, POST 309, Honolulu, HI 96822
> (808) 956-3890 office
> Personal: http://www2.hawaii.edu/~suthers/
> Lab: http://lilt.ics.hawaii.edu/
> Department: http://www.ics.hawaii.edu/
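
For readers of the archive, here is a minimal sketch of how the approach discussed in this thread might be applied to the data in the original question. It is not the code either poster actually ran. It assumes the output_nodes and edge-weights columns arrive as strings (as in Dan's Twint example), renames the columns to edge_weights and id_attr for convenience, uses hypothetical regular expressions chosen to fit the sample values, and treats the graph as directed. It also uses tidyr::unnest rather than unnest_longer so that the two list-columns are expanded in parallel.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(igraph)

# Toy data: the first two rows of the table in the original question.
# Assumption: the list-like columns are stored as strings.
nodes <- tibble::tibble(
  input_node   = c("11347-5", "116228-0"),
  output_nodes = c("['64837-1', '116228-0']", "['14328-3']"),
  edge_weights = c("[0.01001617, 0.01778383]", "[0.3505]"),
  id_attr      = c(82249852, 82283186),
  attribute    = c(372856, 372892)
)

edges <- nodes %>%
  mutate(
    # Node ids contain hyphens, so match runs of digits and hyphens
    # instead of splitting on word boundaries.
    output_node = str_extract_all(output_nodes, "[0-9-]+"),
    # Extract each number as text, then convert to numeric.
    weight = lapply(str_extract_all(edge_weights, "[0-9.eE+-]+"), as.numeric)
  ) %>%
  # Expand both list-columns together: one row per (output node, weight) pair.
  unnest(cols = c(output_node, weight)) %>%
  # igraph reads the first two columns as edge endpoints; the remaining
  # columns become edge attributes.
  select(input_node, output_node, weight, id_attr, attribute)

# Assumption: edges run from input node to output node.
g <- graph_from_data_frame(edges, directed = TRUE)
```

Because the column is named weight, igraph functions that look for edge weights will pick it up by default; id_attr and attribute are carried along as ordinary edge attributes.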
_______________________________________________ igraph-help mailing list [email protected] https://lists.nongnu.org/mailman/listinfo/igraph-help
