Re: [R] Measure Difference Between Two Distributions

Rainer M Krug Sat, 25 Sep 2010 10:12:22 -0700

On Sat, Sep 25, 2010 at 6:24 PM, Lorenzo Isella <lorenzo.ise...@gmail.com> wrote:


>> ld represent the distance as the proportion of maximum possible
>> distance, i.e. scaling it to be between 0 and 1.
>>
>> An example:
>> A and B have the same length (x), and you calculate the emd(A, B), which
>> is d.
>> Now you have to determine the maximum distance between these two:
>> remembering the analogy of moving earth, the biggest distance between
>> the two distributions would occur if, in A, all elements were in A(1)
>> and all others were zero, and in B all elements were zero except
>> for B(x). Now you can calculate the distance between these two, and you
>> get dmax.
>> The last step is to divide d by dmax, i.e. scaling to a value between 0
>> and 1.
>>
>> this value then can be compared with the same ratio obtained from C and
>> D with length y.
>>
>> One important point to keep in mind when using the emd: if sum(A) is
>> not the same as sum(B), emd(A,B) is NOT EQUAL to emd(B,A). If this
>> applies to your case, you have to decide what to do, but one option is
>> to standardise A and B so that their sums are the same (effectively
>> comparing the SHAPES and not the actual values).
>>
>
> OK, I see. The standardization part is not a terrible problem, I guess.
> The other bit is less clear (to me). What are A(1) and B(x)? Am I piling up
> all the elements in A and B in a single bin?
> Cheers
>

OK. Some code:

> set.seed(13)
> A <- sample(1:10, 10)
> B <- sample(1:10, 10)
> A
 [1]  8  3  4  1  6  7  9 10  2  5
> B
 [1]  7  8  9  4 10  2  5  6  3  1
> # pile all of A into its first element
> A[1] <- sum(A)
> A[-1] <- 0
> # pile all of B into its last element
> B[length(B)] <- sum(B)
> B[-length(B)] <- 0
> A
 [1] 55  0  0  0  0  0  0  0  0  0
> B
 [1]  0  0  0  0  0  0  0  0  0 55

And now you can calculate emd(A, B), which is then the maximum distance
between A and B. Imagine: the distance is the work you have to do to convert
A into B. Work equals the distance times the mass you have to move. To
maximise the work, you therefore have to maximise both the distance over
which you carry the earth and the amount you carry. Piling everything up in
the first element of A and in the last element of B gives you the most work
you have to do, which equals the largest distance.
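
To make the "work = mass times distance" argument concrete, here is a minimal
sketch. It assumes 1-D vectors of equal length with equal sums and a spacing
of 1 between neighbouring bins. The helper emd1d() is a hypothetical name
used only for illustration; it is NOT the emd() of the package, which may
scale its result differently.

```r
# Hypothetical helper, NOT the package's emd(): 1-D earth mover's distance
# for two equal-length vectors with equal sums and unit spacing between bins.
emd1d <- function(a, b) {
  stopifnot(length(a) == length(b), isTRUE(all.equal(sum(a), sum(b))))
  # cumsum(a - b)[i] is the net mass that has to cross from bin i to bin i + 1
  sum(abs(cumsum(a - b)))
}

# The two vectors printed in the session above, typed in directly
A <- c(8, 3, 4, 1, 6, 7, 9, 10, 2, 5)
B <- c(7, 8, 9, 4, 10, 2, 5, 6, 3, 1)
d <- emd1d(A, B)

# Extreme case: all of A in the first bin, all of B in the last bin
Amax <- c(sum(A), rep(0, length(A) - 1))
Bmax <- c(rep(0, length(B) - 1), sum(B))
dmax <- emd1d(Amax, Bmax)  # mass * distance = sum(A) * (length(A) - 1)

d / dmax  # scaled distance, between 0 and 1
```

Under these assumptions dmax is simply the total mass (55) moved across the
9 steps from the first bin to the last.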

Even though it is rather straightforward, I should probably add a function
to the package which gives you the largest distance between two
distributions - I'll think about it.
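
Until then, something along these lines might do as a stand-in. This is only
a sketch under stated assumptions: scaled_emd() is a hypothetical name, not a
function of the package; it handles 1-D, equal-length, non-negative vectors
with unit spacing between bins, and it standardises both vectors to sum to 1
so that only the shapes are compared.

```r
# Hypothetical sketch, not part of any package: scaled 1-D EMD in [0, 1].
# Assumes equal-length, non-negative vectors with positive sums and unit
# spacing between bins.
scaled_emd <- function(a, b) {
  stopifnot(length(a) == length(b), length(a) > 1,
            all(a >= 0), all(b >= 0), sum(a) > 0, sum(b) > 0)
  # Standardise so that both sum to 1 (compare shapes, not totals)
  a <- a / sum(a)
  b <- b / sum(b)
  # Work needed to turn a into b: net mass crossing each bin boundary
  d <- sum(abs(cumsum(a - b)))
  # Largest possible work: a total mass of 1 carried from first to last bin
  dmax <- length(a) - 1
  d / dmax
}

scaled_emd(c(1, 2, 3), c(3, 2, 1))
scaled_emd(c(5, 0, 0, 0), c(0, 0, 0, 2))  # extreme case: first bin vs last bin
```

Because both vectors are standardised first, the result does not depend on
the order of the arguments, and ratios from pairs of different lengths can be
compared directly.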

Hope this helps,

Cheers,

Rainer



> Lorenzo
>



-- 
NEW GERMAN FAX NUMBER!!!

Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology,
UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Natural Sciences Building
Office Suite 2039
Stellenbosch University
Main Campus, Merriman Avenue
Stellenbosch
South Africa

Cell:           +27 - (0)83 9479 042
Fax:            +27 - (0)86 516 2782
Fax:            +49 - (0)321 2125 2244
email:          rai...@krugs.de

Skype:          RMkrug
Google:         r.m.k...@gmail.com
