Re: [R] Writing a summary file in R

2011-08-03 Thread a217
gt;=1)/nrow(x) )) ) )
> # You often need the t() function when working with apply functions
>  > df
>  chr strt end frac
> chr1 100   chr1  100 1591
> chr1 200   chr1  200 2601
> chr1 500   chr1  500 7501
> chr11 679 chr11  679 687  0.5
> chr22 100 chr22  100 2001
> chr22 300 chr22  300 4000
> chr3 450   chr3  450 7001
> chr4 100   chr4  100 300  0.5
> chr7 350   chr7  350 600  0.5
> chr9 100   chr9  100 1251
> 
>  > as.data.frame(t(sapply(splinp, function(x) summary(x 
> $methylation )) ) )
>Min. 1st Qu. Median  Mean 3rd Qu. Max.
> chr1 100  0.04  0.0425  0.045 0.045  0.0475 0.05
> chr1 200  0.12  0.1200  0.120 0.120  0.1200 0.12
> chr1 500  0.09  0.0900  0.090 0.090  0.0900 0.09
> chr11 679 0.00  0.0175  0.035 0.035  0.0525 0.07
> chr22 100 0.03  0.0425  0.055 0.055  0.0675 0.08
> chr22 300 0.00  0.  0.000 0.000  0. 0.00
> chr3 450  0.03  0.0300  0.030 0.030  0.0300 0.03
> chr4 100  0.00  0.0125  0.025 0.025  0.0375 0.05
> chr7 350  0.00  0.0150  0.030 0.030  0.0450 0.06
> chr9 100  0.10  0.1000  0.100 0.100  0.1000 0.10
> 
> # The coup de grace: bind the columns together
> 
>  > df.summ <- as.data.frame(t(sapply(splinp, function(x) summary(x 
> $methylation )) ) )
>  > cbind(df, df.summ)
>  chr strt end frac Min. 1st Qu. Median  Mean 3rd Qu. Max.
> chr1 100   chr1  100 1591 0.04  0.0425  0.045 0.045  0.0475 0.05
> chr1 200   chr1  200 2601 0.12  0.1200  0.120 0.120  0.1200 0.12
> chr1 500   chr1  500 7501 0.09  0.0900  0.090 0.090  0.0900 0.09
> chr11 679 chr11  679 687  0.5 0.00  0.0175  0.035 0.035  0.0525 0.07
> chr22 100 chr22  100 2001 0.03  0.0425  0.055 0.055  0.0675 0.08
> chr22 300 chr22  300 4000 0.00  0.  0.000 0.000  0. 0.00
> chr3 450   chr3  450 7001 0.03  0.0300  0.030 0.030  0.0300 0.03
> chr4 100   chr4  100 300  0.5 0.00  0.0125  0.025 0.025  0.0375 0.05
> chr7 350   chr7  350 600  0.5 0.00  0.0150  0.030 0.030  0.0450 0.06
> chr9 100   chr9  100 1251 0.10  0.1000  0.100 0.100  0.1000 0.10
> 
> -- 
> Best;
> 
> David.
> 
>>
>> On Wed, Jul 27, 2011 at 4:02 PM, a217 <aj...@case.edu> wrote:
>>> Hello,
>>>
>>> I have an input file:
>>> http://r.789695.n4.nabble.com/file/n3700031/testOut.txt testOut.txt
>>>
>>> where col 1 is chromosome, column2 is start of region, column 3 is  
>>> end of
>>> region, column 4 and 5 is base position, column 6 is total reads,  
>>> column 7
>>> is methylation data, and column 8 is the strand.
>>>
>>>
>>> I would like a summary output file such as:
>>> http://r.789695.n4.nabble.com/file/n3700031/out.summary.txt  
>>> out.summary.txt
>>>
>>> where column 1 is chromosome, column 2 is start of region, column 3  
>>> is end
>>> of region, column 4 is total reads in general, column 5 is total  
>>> reads >=1,
>>> column 6 is (col4/col5) or the percentage, and at the end I'd like  
>>> to list 6
>>> more columns based on summary results from summary() function in R.
>>>
>>> The summary() function will be used to analyze all of the  
>>> methylation data
>>> (col7 from input) for each region (bounded by col2 and col3).
>>>
>>> For example for chr1 100 159 summary() gives:
>>>  Min. 1st Qu.  MedianMean 3rd Qu.Max.
>>>  0.0400  0.0425  0.0450  0.0450  0.0475  0.0500
>>>
>>> which is simply the methylation data input into summary() only in  
>>> the region
>>> of chr1 100 159.
>>>
>>> I know how to perform all of the required functions line-by-line,  
>>> but the
>>> hard part for me is essentially taking the input data with multiple
>>> positions in each region and assigning all of the summary results  
>>> to one
>>> line identified by the region.
>>>
>>> If any of you have any suggestions I would appreciate it.
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3700031.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> __
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius, MD
> West Hartford, CT
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


--
View this message in context: 
http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3716837.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Counting rows given conditional

2011-08-04 Thread a217
Hello,

I have an input file that contains multiple columns, but the column I'm
concerned about looks like:

"TR"
5
0
4
1
0
2
0

To count all of the rows in the column I know how to do NROW(x$TR) which
gives 7.

However, I would also like to count only the number of rows with values >=1
(i.e. not 0). I've tried NROW(x$TR>=1) which did not give the intended
output.

Do any of you have any suggestions as to where I'm going wrong?



--
View this message in context: 
http://r.789695.n4.nabble.com/Counting-rows-given-conditional-tp3718541p3718541.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Loops for repetitive task

2011-08-09 Thread a217
Hello,

I have an R script that I use as a template to perform a task for multiple
files (in this case, multiple chromosomes).

What I would like to do is to utilize a simple loop to parse through each
chromosome number so that I don't have to type the same code over and over
again in the R console.

I've tried using:

for(i in 1:22){
etc..
}

and replacing each chromosome number with [[i]], but that did not seem to
work.

Below is the script I have. Basically everywhere you see a '2' I would like
there to be an 'i' so that the script can be applied in a general sense.
Code###

chr2.data<-read.table(file="chr2.out.txt", header=F)
colnames(chr2.data)<-c("chr","start","end","base1","base2","totalreads","methylation","strand")
splc2<-split(chr2.data, paste(chr2.data$chr))
chr2.df<-as.data.frame(t(sapply(splc2, function(x)
list(TR=NROW(x[['totalreads']]),RG1=sum(x[['totalreads']]>=1),
percent=(NROW(x[['totalreads']]>=1)/sum(x[['totalreads']]))
chr2.df.summ<-as.data.frame(t(sapply(splc2, function(x)
summary(x$methylation
chr2.summ<-cbind(chr2.df,chr2.df.summ)

##


Here are some sample input files in case you'd like to test the code:
##
# chr1.out.txt
##
chr1100 159 104 104 1   0.05+
chr1100 159 145 145 1   0.04+
chr1200 260 205 205 1   0.12+
chr1500 750 600 600 1   0.09+

##
# chr2.out.txt
##
chr2100 200 105 105 1   0.03+
chr2100 200 110 110 1   0.08+
chr2300 400 350 350 0   0   +


The code works perfectly fine just typing everything out by hand, but that
is very inefficient given that there are 24 chromosomes for each dataset. I
am just looking for any suggestions as to how I can write a general version
of this code.


--
View this message in context: 
http://r.789695.n4.nabble.com/Loops-for-repetitive-task-tp3732022p3732022.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Generating a histogram with R

2011-07-12 Thread a217
Hello,

I have a sample file:

chr22   100 150 125 21  0.145   +
chr22   200 300 212 13  0.05+
chr22   345 365 351 12  0.09+
chr22   500 750 510 15  0.10+
chr22   500 750 642 9   0.02+
chr22   800 900 850 10  0.05+


where I need to generate a histogram from the data in column 6 (i.e. 0.145,
0.05, etc.). To make it easier to read, I would plot the data as 1-0.05=0.95
for all of the data in column 6.

What I would like to know is how to generate a histogram with the data from
one file? Also, would I be able to generate one histogram from multiple
files as well (with the same format)?

For example, I have multiple files in the same format as the sample file
above, and I would like to make one histogram for all column six data in all
files.

Thank you,
a217

--
View this message in context: 
http://r.789695.n4.nabble.com/Generating-a-histogram-with-R-tp3663350p3663350.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Writing a summary file in R

2011-07-27 Thread a217
Hello,

I have an input file: 
http://r.789695.n4.nabble.com/file/n3700031/testOut.txt testOut.txt 

where col 1 is chromosome, column2 is start of region, column 3 is end of
region, column 4 and 5 is base position, column 6 is total reads, column 7
is methylation data, and column 8 is the strand.


I would like a summary output file such as: 
http://r.789695.n4.nabble.com/file/n3700031/out.summary.txt out.summary.txt 

where column 1 is chromosome, column 2 is start of region, column 3 is end
of region, column 4 is total reads in general, column 5 is total reads >=1,
column 6 is (col4/col5) or the percentage, and at the end I'd like to list 6
more columns based on summary results from summary() function in R.

The summary() function will be used to analyze all of the methylation data
(col7 from input) for each region (bounded by col2 and col3).

For example for chr1 100 159 summary() gives:
 Min. 1st Qu.  MedianMean 3rd Qu.Max. 
 0.0400  0.0425  0.0450  0.0450  0.0475  0.0500

which is simply the methylation data input into summary() only in the region
of chr1 100 159.

I know how to perform all of the required functions line-by-line, but the
hard part for me is essentially taking the input data with multiple
positions in each region and assigning all of the summary results to one
line identified by the region. 

If any of you have any suggestions I would appreciate it.

--
View this message in context: 
http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3700031.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Writing a summary file in R

2011-07-27 Thread a217
Yes, that is the general objective. I'll look-into aggregates in R and see if
anything helps.

Thanks,
a217

--
View this message in context: 
http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3700071.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Writing a summary file in R

2011-07-27 Thread a217
Thank you both very much! The codes are pretty slick and should greatly help
me in my task.

--
View this message in context: 
http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3700382.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Histogram for each ID value

2011-10-17 Thread a217
I have a dataframe in the general format:

chr1 0.5
chr1 0
chr1 0.75
chr2 0
chr2 0
chr3 1
chr3 1
chr3 0.5
chr7 0.75
chr9 1
chr9 1
chr22 0.5
chr22 0.5

where the first column is the chromosome location and the second column is
some value. What I'd like to do is have a histogram created for each chr
location (i.e. a separate histogram for chr1, chr2, chr3, chr7, chr9, and
chr22). I am just having a hard time getting everything to work out and am
hoping for some suggestions.

--
View this message in context: 
http://r.789695.n4.nabble.com/Histogram-for-each-ID-value-tp3911791p3911791.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] replacing percentage of values in data frame

2011-10-19 Thread a217
I've been looking for how to change a certain percentage of values in a data
frame, but I've been struggling to find information in R.

For example:

#example data##
> data
  V1V2V3  V4 V5  V6 V7
1   chr1   500   500 CHH  0 0.5  +
2   chr1   550   550 CHH  0 0.0  +
3   chr2   700   700 CHH  0 0.0  +
4   chr2  1000  1000 CHH  0 0.0  +
5   chr3   100   100 CHH  0 0.0  +
6   chr4   450   450  CG  0 0.0  +
7   chr5   450   450 CHH  0 0.0  +
8   chr5 50034 50034 CHG  0 0.0  +
9   chr7 50055 50055 CHG  0 0.0  +
10 chr10 50063 50063 CHH  0 0.0  +

> dput(data)
structure(list(V1 = structure(c(1L, 1L, 3L, 3L, 4L, 5L, 6L, 6L, 
7L, 2L), .Label = c("chr1", "chr10", "chr2", "chr3", "chr4", 
"chr5", "chr7"), class = "factor"), V2 = c(500L, 550L, 700L, 
1000L, 100L, 450L, 450L, 50034L, 50055L, 50063L), V3 = c(500L, 
550L, 700L, 1000L, 100L, 450L, 450L, 50034L, 50055L, 50063L), 
V4 = structure(c(3L, 3L, 3L, 3L, 3L, 1L, 3L, 2L, 2L, 3L), .Label =
c("CG", 
"CHG", "CHH"), class = "factor"), V5 = c(0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L), V6 = c(0.5, 0, 0, 0, 0, 0, 0, 0, 
0, 0), V7 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), .Label = "+", class = "factor")), .Names = c("V1", "V2", 
"V3", "V4", "V5", "V6", "V7"), class = "data.frame", row.names = c(NA, 
-10L))
> 


Say for instance, I'd like to change 20% of values in column 6 to '1'
instead of zero or whatever value may be currently present. How would I
approach this?

I am working with a large data frame and I need to replace values in one of
the columns for 10-20% of the entire dataset. I hope what I am trying to
convey is understandable to you.

--
View this message in context: 
http://r.789695.n4.nabble.com/replacing-percentage-of-values-in-data-frame-tp3920484p3920484.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Help with increasing the speed of script

2011-10-27 Thread a217
I actually have two questions regarding the same script:

#
data <- vector('list', 24)
splc <- vector('list',24)
df.summ <- vector('list',24)

for (i in 1:length(chrData)) 
{
  data[[i]] <- read.table(file=paste('chr',i,'.nonCG.covered.out',sep=''),
header=F)
  colnames(chrData[[i]])<-c("chr","start","end","tot","methy")

  splc[[i]]<-split(data[[i]], paste(data[[i]]$chr, data[[i]]$start,
data[[i]]$end))

  df.summ[[i]]<-as.data.frame(t(sapply(splc[[i]], function(x)
summary(1-x$methylation

  cat("finished reading",paste('chr',i),date(),"\n")
}
#

1) I'll start off first with perhaps the easier question. The above script
is a portion of r code that I run from batch through a pipeline. I've
decided that any process that takes longer than 5 minutes to complete should
include some measure of progress so that the user doesn't just sit there and
wonder if the code is working or not.

So I've tried the command: 

R CMD BATCH -q test.R /dev/tty

which gives the same output as .Rout file would. What I am hoping to
accomplish is to only print the "cat()" results from batch instead of the
entire script.

2) The second issue I have is with the speed of the script. Specifically,
the slow point of the script is:

splc[[i]]<-split(data[[i]], paste(data[[i]]$chr, data[[i]]$start,
data[[i]]$end))

It may operate in quadratic or exponential time (I haven't specifically run
tests) because as the input data increases in size, the time the script
takes longer than a constant-time script would run.

Perhaps the more experienced programmers among you could give me a few
hints/suggestions so that I can head in the right direction.

--
View this message in context: 
http://r.789695.n4.nabble.com/Help-with-increasing-the-speed-of-script-tp3946731p3946731.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.