Re: [R] Fwd: Re: transpose and split dataframe

Matthew Wed, 01 May 2019 16:05:24 -0700

Thank you very much, David and Jim for your work and solutions.

I have been working through both of them to better learn R. They both 
proceed through a similar logic, except that David's starts with a character 
matrix and Jim's with a dataframe, and both end with equivalent 
dataframes (identical(tmmdf, TF2list2) returns TRUE). They have 
both been very helpful. However, there is one attribute of my intended 
final dataframe that is missing.


Looking at part of the final dataframe:

  head(tmmdf)
    AT1G69490 AT1G29860 AT1G29860.1 AT4G18170 AT4G18170.1 AT5G46350
  1 AT4G31950 AT4G31950   AT5G64905 AT4G31950   AT5G64905 AT4G31950
  2 AT5G24110 AT5G24110   AT1G21120 AT5G24110   AT1G14540 AT5G24110
  3 AT1G26380 AT1G05675   AT1G07160 AT1G05675   AT1G21120 AT1G05675

Row 1 has AT4G31950 in columns 1, 2, 4 and 6, but AT5G64905 in columns 3 
and 5. What I was aiming at is that each row would hold a single unique 
entry, so that AT4G31950 is in row 1, columns 1, 2, 4 and 6, and NA is in 
row 1, columns 3 and 5; AT5G64905 is in row 2, columns 3 and 5, and NA is 
in row 2, columns 1, 2, 4 and 6. So, it would look like this:

  head(intended_df)
    AT1G69490 AT1G29860 AT1G29860.1 AT4G18170 AT4G18170.1 AT5G46350
  1 AT4G31950 AT4G31950          NA AT4G31950          NA AT4G31950
  2        NA        NA   AT5G64905        NA   AT5G64905        NA

I have been trying to adjust the code to get my intended result 
basically by trying to build a dataframe one column at a time from each 
entry in the character matrix, but have not got anything near working yet.
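One possible way to get that layout, sketched here on a shortened toy copy of TF2list (the three rows below are abbreviated from the data quoted later in this thread, and the names hit_list, all_hits and aligned are made up for illustration): collect every distinct hit once, then give each Regulator column either that hit or NA, so that a given row only ever holds one hit.

```r
# Toy stand-in for TF2list: three regulators with shortened hit strings
TF2list <- data.frame(
  Regulator = c("AT1G69490", "AT1G29860", "AT1G29860"),
  hits = c("AT4G31950,AT5G24110,AT1G26380",
           "AT4G31950,AT5G24110,AT1G05675",
           "AT5G64905,AT1G21120,AT1G07160"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a character vector of hits
hit_list <- strsplit(as.character(TF2list$hits), ",", fixed = TRUE)
# Repeated regulators get a numeric suffix, as data.frame() would give them
names(hit_list) <- make.unique(as.character(TF2list$Regulator))

# Every distinct hit, in order of first appearance: one row per hit
all_hits <- unique(unlist(hit_list, use.names = FALSE))

# For each regulator, keep the hit where it is present and NA elsewhere
aligned <- lapply(hit_list, function(h) {
  ifelse(all_hits %in% h, all_hits, NA_character_)
})
intended_df <- as.data.frame(aligned, stringsAsFactors = FALSE)

head(intended_df)
```

The number of rows then equals the number of distinct hits across all regulators rather than the length of the longest hit list.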

Matthew

On 4/30/2019 6:29 PM, David L Carlson wrote:
> If you read the data frame with read.csv() or one of the other read.*() 
> functions, use the as.is=TRUE argument to prevent conversion to factors. If 
> not, do the conversion first:
>
> # Convert factors to characters
> DataMatrix <- sapply(TF2list, as.character)
> # Split the vector of hits
> DataList <- sapply(DataMatrix[, 2], strsplit, split=",")
> # Use the values in Regulator to name the parts of the list
> names(DataList) <- DataMatrix[,"Regulator"]
>
> # Now create a data frame
> # How long is the longest list of hits?
> mx <- max(sapply(DataList, length))
> # Now add NAs to vectors shorter than mx
> DataList2 <- lapply(DataList, function(x) c(x, rep(NA, mx-length(x))))
> # Finally convert back to a data frame
> TF2list2 <- do.call(data.frame, DataList2)
>
> Try this on a portion of the list, say 25 lines and print each object to see 
> what is happening.
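As a small illustration of the advice above, here is a sketch that reads a made-up two-line CSV with as.is = TRUE and then runs the same split-and-pad steps on it. The csv_text contents are hypothetical and shortened; the real file layout is not shown in this thread.

```r
# Hypothetical file contents; the hits field is quoted because it contains commas
csv_text <- 'Regulator,hits
AT1G69490,"AT4G31950,AT5G24110,AT1G26380,AT1G05675"
AT1G29860,"AT4G31950,AT5G24110"'

# as.is = TRUE keeps both columns as character instead of factor
TF2list <- read.csv(text = csv_text, as.is = TRUE)
str(TF2list)

# Split the hits and name each piece after its regulator
DataList <- strsplit(TF2list$hits, ",", fixed = TRUE)
names(DataList) <- TF2list$Regulator

# Pad the shorter vectors with NA so all columns have equal length
mx <- max(lengths(DataList))
DataList2 <- lapply(DataList, function(x) c(x, rep(NA, mx - length(x))))
TF2list2 <- do.call(data.frame, c(DataList2, stringsAsFactors = FALSE))
print(TF2list2)
```

Printing TF2list, DataList, DataList2 and TF2list2 in turn shows what each step contributes.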
>
> ----------------------------------------
> David L Carlson
> Department of Anthropology
> Texas A&M University
> College Station, TX 77843-4352
>
>
>
>
>
> -----Original Message-----
> From: R-help <r-help-boun...@r-project.org> On Behalf Of Matthew
> Sent: Tuesday, April 30, 2019 4:31 PM
> To: r-help@r-project.org
> Subject: [R] Fwd: Re: transpose and split dataframe
>
> Thanks for your reply. I was trying to simplify it a little, but must
> have got it wrong. Here is the real dataframe, TF2list:
>
>    str(TF2list)
> 'data.frame':    152 obs. of  2 variables:
>    $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54
> 54 82 82 82 82 82 ...
>    $ hits     : Factor w/ 97 levels
> "AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
> __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...
>
>      And the first few lines resulting from dput(head(TF2list)):
>
> dput(head(TF2list))
> structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
> 82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380",
> "AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300",
> "AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600",
> "AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ...
>
> This is another way of looking at the first 4 entries (Regulator is
> tab-separated from hits):
>
>      Regulator    hits
> 1    AT1G69490    AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
> 2    AT1G29860    AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
> 3    AT1G29860    AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830
>
>      So, the goal would be to
>
> first: Transpose the existing dataframe so that the factor Regulator
> becomes a column name (column 1 name = AT1G69490, column 2 name =
> AT1G29860, etc.) and the hits associated with each Regulator become
> rows. Hits is a comma-separated 'list' (I do not know if
> technically it is an R list), so it would have to be comma
> 'unseparated' with each entry becoming a row (col 1 row 1 = AT4G31950,
> col 1 row 2 = AT5G24110, etc.), like this:
>
> AT1G69490
> AT4G31950
> AT5G24110
> AT1G05675
> AT5G64905
>
> ... (I did not include all the rows)
>
> I think it would be best to actually make the first entry a separate
> dataframe (1 column with name = AT1G69490 and number of rows depending
> on the number of hits), then make the second column (column name =
> AT1G29860, and number of rows depending on the number of hits) into a
> new dataframe and do a full join of the two dataframes; continue by
> making the third column (column name = AT1G29860) into a dataframe and
> full join it with the previous; continue for the 152 observations so
> that the end result is a dataframe with 152 columns and number of rows
> depending on the entry with the greatest number of hits. The full joins
> I can do with dplyr, but getting up to that point seems rather difficult.
>
> This would get me what my ultimate goal would be; each Regulator is a
> column name (152 columns) and a given row has either NA or the same hit.
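The join-based plan described above can be sketched roughly as follows. This is only an illustration: it uses base R's merge(..., all = TRUE) as a stand-in for dplyr's full join, a made-up helper one_col(), a made-up key column named hit, and a shortened toy list of hits in place of the 152 real entries.

```r
# Toy hits per regulator (shortened; the duplicate regulator carries a ".1" suffix)
hits <- list(
  AT1G69490   = c("AT4G31950", "AT5G24110", "AT1G26380"),
  AT1G29860   = c("AT4G31950", "AT5G24110", "AT1G05675"),
  AT1G29860.1 = c("AT5G64905", "AT1G21120")
)

# Hypothetical helper: a two-column data frame per regulator,
# with a shared key column `hit` and a column named after the regulator
one_col <- function(name) {
  df <- data.frame(hit = hits[[name]], stringsAsFactors = FALSE)
  df[[name]] <- hits[[name]]
  df
}
pieces <- lapply(names(hits), one_col)

# Full (outer) join of all pieces on the key, one pair at a time
joined <- Reduce(function(a, b) merge(a, b, by = "hit", all = TRUE), pieces)
print(joined)
```

Because every piece is joined on the hit itself, each row of joined holds one hit, repeated under the regulators that have it and NA under the rest. The key column can be dropped afterwards with joined[-1] if only the regulator columns are wanted.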
>
>      This seems very difficult to me, but I appreciate any attempt.
>
> Matthew
>
> On 4/30/2019 4:34 PM, David L Carlson wrote:
>> I think we need more information. Can you give us the structure of the data 
>> with str(YourDataFrame). Alternatively you could copy a small piece into 
>> your email message by copying and pasting the results of the following code:
>>
>> dput(head(YourDataFrame))
>>
>> The data frame you present could not be a data frame since you say "hits" is 
>> a factor with a variable number of elements. If each value of "hits" was a 
>> single character string, it would only have 2 factor levels not 6 and your 
>> efforts to parse the string would make more sense. Transposing to a data 
>> frame would only be possible if each column was padded with NAs to make them 
>> equal in length. Since your example tries to use the name TF2list, it is 
>> possible that you do not have a data frame but a list and you have no factor 
>> levels, just character vectors.
>>
>> If you are not familiar with R, it may be helpful to tell us what your 
>> overall goal is rather than an intermediate step. Very likely R can easily 
>> handle what you want by doing things a different way.
>>
>> ----------------------------------------
>> David L Carlson
>> Department of Anthropology
>> Texas A&M University
>> College Station, TX 77843-4352
>>
>>
>>
>> -----Original Message-----
>> From: R-help<r-help-boun...@r-project.org>  On Behalf Of Matthew
>> Sent: Tuesday, April 30, 2019 2:25 PM
>> To: r-help (r-help@r-project.org)<r-help@r-project.org>
>> Subject: [R] transpose and split dataframe
>>
>> I have a data frame that is a lot bigger, but for simplicity's sake we can
>> say it looks like this:
>>
>> Regulator    hits
>> AT1G69490    AT4G31950,AT5G24110,AT1G26380,AT1G05675
>> AT2G55980    AT2G85403,AT4G89223
>>
>>       In other words:
>>
>> data.frame : 2 obs. of 2 variables
>> $Regulator: Factor w/ 2 levels
>> $hits         : Factor w/ 6 levels
>>
>>      I want to transpose it so that Regulator is now the column headings
>> and each of the AGI numbers now separated by commas is a row. So,
>> AT1G69490 is now the header of the first column and AT4G31950 is row 1
>> of column 1, AT5G24110 is row 2 of column 1, etc. AT2G55980 is header of
>> column 2 and AT2G85403 is row 1 of column 2, etc.
>>
>>      I have tried playing around with strsplit(TF2list[2:2]) and
>> strsplit(as.character(TF2list[2:2])), but I am getting nowhere.
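A likely reason those attempts stall, shown on the two-row example from this message (the object names bad and good are made up for illustration): TF2list[2:2] is still a one-column data frame rather than a character vector, and strsplit() only accepts character input. Extracting the column itself and converting the factor to character gives strsplit() something it can work with.

```r
# The simplified two-row data frame from the message, with factor columns
TF2list <- data.frame(
  Regulator = c("AT1G69490", "AT2G55980"),
  hits = c("AT4G31950,AT5G24110,AT1G26380,AT1G05675", "AT2G85403,AT4G89223"),
  stringsAsFactors = TRUE
)

# TF2list[2:2] is a data frame, so strsplit() raises an error here
bad <- tryCatch(strsplit(TF2list[2:2], ","), error = function(e) "error")

# TF2list$hits is the factor itself; as.character() recovers the strings
good <- strsplit(as.character(TF2list$hits), ",", fixed = TRUE)
names(good) <- as.character(TF2list$Regulator)
print(good)
```

The result is a named list with one character vector of hits per regulator, which is the starting point for padding or joining into columns.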
>>
>> Matthew
>>
>> ______________________________________________
>> R-help@r-project.org  mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
