Waverley wrote:
Thanks for the advice. My question is more about how to do this.
Let me use a gene-expression analysis example to illustrate.
In biology, there are always some housekeeping genes whose expression
differs little even under pathological conditions.
We know that external factors affect the measurements differently
from batch to batch. For example, overall signal intensity might
differ because of lab reagents.
A simplified picture:
Day 1: Using control samples, I measured genes #1 to #110 and got data.
Day 2: Using disease samples, I measured genes #1 to #110 again
and got data.
Comparing the two data sets, I noticed that the signal intensity of
every gene is higher on Day 1 than on Day 2.
I know from the biological literature that genes 101 to 110 are
"housekeeping" genes and should not change much between disease and
control.
My question is, technically, how do I use the values of genes 101 to
110 to adjust the signals of genes 1 to 100 so that the batch effect
is corrected? I want the differences revealed by the comparative
analysis of genes 1 to 100 between disease and control to be due to
biology rather than lab artifacts.
So the question is how to do that mathematically. If I had only one
housekeeping gene, I could divide every gene by it to normalize and
then compare. But now I have 10 genes that can be used for
normalization. I assume that, in this context, the more reference
genes used, the better.
Can you help again?
Thanks very much in advance.
That is an inappropriate experimental design that has caused major
problems in the biomedical research literature (look up the famous
Petricoin fiasco - google for petricoin baggerly; Baggerly discovered
the error). You have day and disease completely confounded and no model
can correct for that (day and disease are completely collinear). Once
you randomize the order of samples to be run and analyzed, you can
include day as a blocking factor to adjust for any day effect. If
analyzing log intensity, the regression adjustment for day will involve
a ratio correction on the original scale.
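To make the last sentence concrete, here is a minimal sketch in Python
(standard library only) of why a day adjustment on the log scale amounts
to a ratio correction on the original scale. The intensities, the
balanced two-day layout, and the function names are all made up for
illustration; with a balanced additive design the fitted day effect is
just the difference of the day means of log intensity, which is what the
sketch uses in place of a full regression fit.

```python
import math
from statistics import mean

# Hypothetical intensities for ONE gene in a balanced, randomized design:
# each day contains both a control and a disease sample.
samples = [
    # (day, group, intensity)
    (1, "control", 1000.0),
    (1, "disease", 2000.0),
    (2, "control", 500.0),
    (2, "disease", 1000.0),
]

def day_effect_log(samples):
    """Day-2 minus day-1 mean of log intensity.

    Equals the additive-model day effect only when the design is
    balanced, as it is in this made-up example.
    """
    d1 = [math.log(y) for d, _, y in samples if d == 1]
    d2 = [math.log(y) for d, _, y in samples if d == 2]
    return mean(d2) - mean(d1)

def adjust_for_day(samples):
    """Subtract the day effect on the log scale, then back-transform."""
    shift = day_effect_log(samples)
    adjusted = []
    for d, g, y in samples:
        log_adj = math.log(y) - (shift if d == 2 else 0.0)
        adjusted.append((d, g, math.exp(log_adj)))
    return adjusted

shift = day_effect_log(samples)   # additive shift on the log scale
ratio = math.exp(shift)           # the same effect as a multiplicative ratio
adjusted = adjust_for_day(samples)

# Subtracting `shift` from log(y) is the same as dividing y by `ratio`.
for (d, g, y), (_, _, y_adj) in zip(samples, adjusted):
    divisor = ratio if d == 2 else 1.0
    print(d, g, y, round(y_adj, 6), round(y / divisor, 6))
```

In the confounded layout from the question, every day-1 sample is a
control and every day-2 sample is diseased, so this same difference of
means would be the only available estimate of both the day effect and
the disease effect, and the two cannot be separated.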
If you are completely correct that the housekeeping genes cannot be
disease-related, there is hope for some kind of internal control if you
make a strong assumption about the time effect being the same for
housekeeping genes as for other genes. But why not just do the proper
design?
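For what it is worth, one way such an internal control could be sketched
is below. This is an illustration only, not a recommendation and not a
method stated in this thread: it assumes, as described above, that the
day effect is the same multiplicative factor for housekeeping genes and
for all other genes, and it estimates that factor as the geometric mean
of the day-2/day-1 ratios of the housekeeping genes. All numbers and
function names are hypothetical.

```python
import math
from statistics import mean

def housekeeping_log_shift(hk_day1, hk_day2):
    """Mean log ratio (day 2 / day 1) across the housekeeping genes."""
    return mean(math.log(b / a) for a, b in zip(hk_day1, hk_day2))

def normalize_day2(target_day2, hk_day1, hk_day2):
    """Divide day-2 target intensities by the estimated day factor.

    The factor is the geometric mean of the housekeeping ratios, i.e.
    exp(mean log ratio). This is only valid under the strong assumption
    that the same factor applies to every gene.
    """
    factor = math.exp(housekeeping_log_shift(hk_day1, hk_day2))
    return [y / factor for y in target_day2]

# Made-up intensities: three housekeeping genes measured on both days,
# and two genes of interest measured on day 2.
hk_day1 = [100.0, 200.0, 400.0]
hk_day2 = [50.0, 100.0, 200.0]
target_day2 = [30.0, 80.0]

factor = math.exp(housekeeping_log_shift(hk_day1, hk_day2))
normalized = normalize_day2(target_day2, hk_day1, hk_day2)
print(factor, normalized)
```

Averaging on the log scale means that no single housekeeping gene
dominates the factor just because its raw intensity is large, which is
one reason to prefer it over dividing by a plain sum of the reference
genes.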
Frank
Waverley wrote:
Hi,
I have a question about how to normalize data sets against a set of
internal measurements.
For example, I have performed two batches of experiments contrasting
two different conditions (positive versus negative), one condition
at a time.
1. In each experiment, I measure the signals of variables v1 to v100.
I want to understand how v1 to v100 change under these two contrasting
conditions.
2. I also know that a different set of variables, v101 to v110 (10 in
total), although they differ from each other, should have the same or
similar values under the two contrasting conditions.
3. How do I do the internal normalization? How can I use the values
of v101 to v110 to normalize the measurements of v1 to v100 under
either the positive or the negative condition to minimize the batch
effect? I hope this makes the comparison of v1 to v100 between the two
conditions more accurate and more robust to external noise.
In general, I have a couple of matrices of the same dimensions and a
matrix of reference values to normalize them against. How should I do
that?
I don't understand your problem well, but in general internal
normalization is by and large an attempt to avoid appropriate modeling
(e.g., incorporating block effects or certain covariates in a regression
model), and results in overstated confidence of the final estimates by
not taking into account the imprecision in the normalizing factors.
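A rough numerical illustration of the imprecision point, under
assumptions that are mine rather than anything stated in this thread:
the normalizing shift is taken to be the mean log ratio of a few
reference variables between two batches, those log ratios are treated as
independent draws, and the shift is treated as independent of the
measured variable. The numbers and function names are made up.

```python
import math
from statistics import mean, stdev

def shift_with_se(ref_batch1, ref_batch2):
    """Mean log ratio (batch 2 / batch 1) and its standard error."""
    log_ratios = [math.log(b / a) for a, b in zip(ref_batch1, ref_batch2)]
    n = len(log_ratios)
    return mean(log_ratios), stdev(log_ratios) / math.sqrt(n)

def adjusted_log_diff_se(se_target, se_shift):
    """SE of (target log difference - shift), assuming independence."""
    return math.sqrt(se_target ** 2 + se_shift ** 2)

# Made-up reference measurements in two batches.
ref_batch1 = [100.0, 200.0, 400.0, 800.0]
ref_batch2 = [55.0, 90.0, 210.0, 380.0]

shift, se_shift = shift_with_se(ref_batch1, ref_batch2)

# Hypothetical SE of one variable's between-batch log difference.
se_target = 0.10
se_total = adjusted_log_diff_se(se_target, se_shift)

# Treating the shift as a known constant would report se_target;
# carrying its uncertainty forward gives the larger se_total.
print(round(shift, 4), round(se_shift, 4), round(se_total, 4))
```

An analysis that plugs in the normalizing factor as if it were known
exactly would report only the smaller of the two standard errors, which
is the overstated confidence described above.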
Frank
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University