[R] string splitting and testing for enrichment

Iain Gallagher Sat, 20 Jun 2009 07:31:34 -0700

Hi List

I have data in the following form:


Gene    TFBS
NUDC     PPARA(1) HNF4(20) HNF4(96) AHRARNT(104) CACBINDINGPROTEIN(149) T3R(167) HLF(191)
RPA2     STAT4(57) HEB(251) 
TAF12     PAX3(53) YY1(92) BRCA(99) GLI(101) 
EIF3I     NERF(10) P300(10) 
TRAPPC3     HIC1(3) PAX5(17) PAX5(110) NRF1(119) HIC1(122) 
TRAPPC3     EGR(26) ZNF219(27) SP3(32) EGR(32) NFKAPPAB65(89) NFKAPPAB(89) RFX(121) ZTA(168)
NDUFS5     WHN(14) ATF(57) EGR3(59) PAX5(99) SF1(108) NRSE(146) 
TIE1     NRSE(129) 

I would like to test the 2nd column (each value has letters followed by numbers 
in brackets) here for enrichment via fisher.test.

To that end I am trying to create two factors made up of column 1 (Gene) and 
column 2 (TFBS) where each Gene would have several entries matching each TFBS.

My main problem just now is that I can't split the TFBS column into separate 
strings (at the moment that 2nd column is all one string for each Gene).

Here's where I am just now:

test <- as.character(dataIn[,2]) # convert the 2nd column from factor to character
# split the first element into individual strings (only the first element
# just now because I'm just trying to get things working)
test2 <- unlist(strsplit(test[1], ' '))
test3 <- unlist(strsplit(test2, '\\([0-9]\\)')) # get rid of numbers and brackets

Now this does not behave as I hoped. It gives me:

> test3
[1] "PPARA"                  "HNF4(20)"               "HNF4(96)"              
[4] "AHRARNT(104)"           "CACBINDINGPROTEIN(149)" "T3R(167)"              
[7] "HLF(191)"  

i.e. it only removes the numbers and brackets from the first entry and not from
the others.

Could someone point out my mistake please?
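
A likely cause, offered as a sketch rather than a definitive answer: in the pattern '\\([0-9]\\)' the class [0-9] matches exactly one digit, so only "(1)" matches and entries such as "(20)" are left untouched. Using [0-9]+ (one or more digits), and deleting the match with sub() rather than splitting on it, should give the bare names. The variable names tfbs, sites and names_only are illustrative and do not come from the post.

```r
# First row of the TFBS column from the example data above
tfbs <- "PPARA(1) HNF4(20) HNF4(96) AHRARNT(104) CACBINDINGPROTEIN(149) T3R(167) HLF(191) "

# Split on runs of whitespace; a trailing separator adds no empty element
sites <- unlist(strsplit(tfbs, "[[:space:]]+"))

# "[0-9]+" matches one or more digits, so "(20)", "(104)" etc. now match too.
# sub() removes the trailing "(number)" from each element.
names_only <- sub("\\([0-9]+\\)$", "", sites)
names_only
```

Because sub() is vectorised, the same call works on every element at once without unlist() around it.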

Once I have all the TFBS (letters only) for each Gene I would then count how
often each TFBS occurs and use these counts in a fisher.test, testing for
enrichment of TFBS in the list I have. I'm rather muddled here though and would
appreciate advice on whether this is the right approach.
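
One possible shape for this, sketched under several assumptions: the whole TFBS column is split into a long Gene/TFBS table, occurrences are counted with table(), and each TFBS is tested with a 2x2 fisher.test against a background gene set. The post does not describe a background, so bg_with and bg_total are made-up numbers used only for illustration. The helper enrich_test is hypothetical, and counting genes (rather than individual sites) with a one-sided test is an assumption, not the only reasonable choice.

```r
# Example data from the post (wrapped rows joined)
dataIn <- data.frame(
  Gene = c("NUDC", "RPA2", "TAF12", "EIF3I", "TRAPPC3", "TRAPPC3", "NDUFS5", "TIE1"),
  TFBS = c("PPARA(1) HNF4(20) HNF4(96) AHRARNT(104) CACBINDINGPROTEIN(149) T3R(167) HLF(191) ",
           "STAT4(57) HEB(251) ",
           "PAX3(53) YY1(92) BRCA(99) GLI(101) ",
           "NERF(10) P300(10) ",
           "HIC1(3) PAX5(17) PAX5(110) NRF1(119) HIC1(122) ",
           "EGR(26) ZNF219(27) SP3(32) EGR(32) NFKAPPAB65(89) NFKAPPAB(89) RFX(121) ZTA(168) ",
           "WHN(14) ATF(57) EGR3(59) PAX5(99) SF1(108) NRSE(146) ",
           "NRSE(129) "),
  stringsAsFactors = FALSE
)

# Split every row, strip "(position)", and repeat each Gene once per site
sites <- strsplit(dataIn$TFBS, "[[:space:]]+")
long <- data.frame(
  Gene = rep(dataIn$Gene, times = sapply(sites, length)),
  TFBS = sub("\\([0-9]+\\)$", "", unlist(sites)),
  stringsAsFactors = FALSE
)

# How often each TFBS occurs in the list (counting every site)
counts <- table(long$TFBS)

# Number of distinct genes carrying each TFBS at least once
genes_with <- sapply(split(long$Gene, long$TFBS), function(g) length(unique(g)))
n_list <- length(unique(long$Gene))

# Hypothetical helper: one-sided 2x2 Fisher test for a single TFBS.
# bg_with / bg_total are ASSUMED background inputs (genes with the site /
# all genes in a background set that does not include the list).
enrich_test <- function(list_with, list_total, bg_with, bg_total) {
  tab <- matrix(c(list_with, list_total - list_with,
                  bg_with,   bg_total - bg_with),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("list", "background"), c("with", "without")))
  fisher.test(tab, alternative = "greater")
}

# Made-up background numbers, for illustration only
res <- enrich_test(genes_with[["PAX5"]], n_list, bg_with = 40, bg_total = 1000)
res$p.value
```

If many TFBS are tested this way, the resulting p-values would normally be adjusted for multiple testing, for example with p.adjust().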

Thanks

Iain

> sessionInfo()
R version 2.9.0 (2009-04-17) 
x86_64-pc-linux-gnu 

locale:
LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     







