[Numpy-discussion] making np.gradient support unevenly spaced data

2017-01-10 Thread aleba...@gmail.com
Hi all,

I have implemented a proposed enhancement for the np.gradient function that
allows computing the gradient on non-uniform grids (PR:
https://github.com/numpy/numpy/pull/8446).
The proposed implementation has a behaviour/signature similar to that of
Matlab/Octave. As arguments it can take:
1. A single scalar to specify a sample distance for all dimensions.
2. N scalars to specify a constant sample distance for each dimension.
   i.e. `dx`, `dy`, `dz`, ...
3. N arrays to specify the coordinates of the values along each
   dimension of F. The length of each array must match the size of
   the corresponding dimension.
4. Any combination of N scalars/arrays with the meanings of 2. and 3.

e.g., you can do the following:

>>> f = np.array([[1, 2, 6], [3, 4, 5]], dtype=float)
>>> dx = 2.
>>> y = [1., 1.5, 3.5]
>>> np.gradient(f, dx, y)
[array([[ 1. ,  1. , -0.5], [ 1. ,  1. , -0.5]]),
 array([[ 2. ,  2. ,  2. ], [ 2. ,  1.7,  0.5]])]
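For what it's worth, the interior values above are consistent with a second-order finite-difference stencil for unevenly spaced points (my assumption about the scheme the PR uses; `interior_gradient` below is an illustrative helper, not part of NumPy):

```python
def interior_gradient(f0, f1, f2, hs, hd):
    # Second-order accurate estimate of df/dx at the middle point x1,
    # where hs = x1 - x0 and hd = x2 - x1 are the uneven spacings.
    return (hs**2 * f2 + (hd**2 - hs**2) * f1 - hd**2 * f0) / (hs * hd * (hs + hd))

# Middle point of the second row of f, with coordinates y = [1., 1.5, 3.5]:
print(interior_gradient(3., 4., 5., 0.5, 2.0))  # 1.7, matching the output above
```

With uniform spacing (hs == hd == h) the formula reduces to the familiar central difference (f2 - f0) / (2*h).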

It should not break any existing code, since as of 1.12 only scalars or
lists of scalars are allowed.
A possible alternative API could be to pass arrays of sampling steps instead
of the coordinates.
On the one hand, this would have the advantage of using "differences" both
in the scalar case and in the array case.
On the other hand, if you are dealing with non-uniformly spaced data (e.g.,
data mapped on a grid, or a time series), in most cases you already
have the coordinates/timestamps. Therefore, with differences as arguments,
you would almost always need a call to np.diff before np.gradient.
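As an illustration of the difference between the two options, here is a sketch assuming the coordinates signature from the PR; the np.diff line is the extra step a differences-based API would force on the caller:

```python
import numpy as np

f = np.array([[1, 2, 6], [3, 4, 5]], dtype=float)
t = np.array([1.0, 1.5, 3.5])  # timestamps you typically already have

# Coordinates option (as proposed): pass the coordinates directly.
grad = np.gradient(f, 2.0, t)

# A differences-based API would instead require an extra step on every call:
dt = np.diff(t)  # array([0.5, 2.0])
```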

In the end, I would prefer the coordinates option since IMHO it is
handier, I don't think it would be too "surprising", and it is what
Matlab already does. Also, it cannot easily lead to "silly" mistakes,
since the length of each array has to match the size of the corresponding
dimension.

What do you think?

Thanks

Alessandro

-- 
--
NOTICE: Dlgs 196/2003 this e-mail and any attachments thereto may contain
confidential information and are intended for the sole use of the
recipient(s) named above. If you are not the intended recipient of this
message you are hereby notified that any dissemination or copying of this
message is strictly prohibited. If you have received this e-mail in error,
please notify the sender either by telephone or by e-mail and delete the
material from any computer. Thank you.
--
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-17 Thread aleba...@gmail.com
Hi Nadav,

I may be wrong, but I think that the result of the current implementation
is actually the expected one.
Using your example: the probabilities for items 1, 2 and 3 are 0.2, 0.4 and 0.4.

P([1,2]) = P([2] | 1st=[1]) P([1]) + P([1] | 1st=[2]) P([2])

Now, P([1]) = 0.2 and P([2]) = 0.4. However:
P([2] | 1st=[1]) = 0.5 (2 and 3 have the same sampling probability)
P([1] | 1st=[2]) = 1/3 (1 and 3 have probabilities 0.2 and 0.4 which, once
normalised, become 1/3 and 2/3 respectively)
Therefore P([1,2]) = 0.7/3 ≈ 0.2333
Similarly, P([1,3]) = 0.7/3 ≈ 0.2333 and P([2,3]) = 1.6/3 ≈ 0.5333
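As a sanity check, these pair probabilities can be verified empirically (a quick simulation sketch; items are labelled 0, 1, 2 and the seed and sample size are arbitrary):

```python
import numpy as np
from collections import Counter

np.random.seed(0)
p = [0.2, 0.4, 0.4]
n = 50_000
counts = Counter()
for _ in range(n):
    # One call to numpy's without-replacement sampler per drawn pair.
    pair = frozenset(np.random.choice(3, size=2, replace=False, p=p))
    counts[pair] += 1

for pair, c in sorted(counts.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), c / n)
# frequencies come out near 0.7/3 ≈ 0.233, 0.7/3 ≈ 0.233 and 1.6/3 ≈ 0.533
```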

What am I missing?

Alessandro


2017-01-17 13:00 GMT+01:00 :

> Hi, I'm looking for a way to find a random sample of C different items out
> of N items, with some desired probability Pi for each item i.
>
> I saw that numpy has a function that supposedly does this,
> numpy.random.choice (with replace=False and a probabilities array), but
> looking at the algorithm actually implemented, I am wondering in what sense
> the probabilities Pi are actually obeyed...
>
> To me, the code doesn't seem to be doing the right thing... Let me explain:
>
> Consider a simple numerical example: we have 3 items, and need to pick 2
> different ones randomly. Let's assume the desired probabilities for items 1,
> 2 and 3 are 0.2, 0.4 and 0.4.
>
> Working out the equations, there is exactly one solution here: the random
> outcome of numpy.random.choice in this case should be [1,2] at probability
> 0.2, [1,3] at probability 0.2, and [2,3] at probability 0.6. That is indeed
> a solution for the desired probabilities because it yields item 1 in
> P([1,2]) + P([1,3]) = 0.2 + 0.2 = 0.4 = 2*P1 of the trials, item 2 in
> P([1,2]) + P([2,3]) = 0.2 + 0.6 = 0.8 = 2*P2, etc.
>
> However, the algorithm behind numpy.random.choice's replace=False
> generates, if I understand correctly, different probabilities for the
> outcomes: I believe in this case it generates [1,2] at probability 0.2333,
> [1,3] also at 0.2333, and [2,3] at probability 0.5333.
>
> My question is: how does this result fit the desired probabilities?
>
> If we get [1,2] at probability 0.2333 and [1,3] at probability 0.2333,
> then the expected number of "1" results we'll get per drawing is 0.2333 +
> 0.2333 = 0.4666, and similarly for "2" the expected number is 0.7666, and
> for "3" it is 0.7666. As you can see, the proportions are off: item 2 is
> NOT twice as common as item 1, as we originally desired (we asked for
> probabilities 0.2, 0.4, 0.4 for the individual items!).
>
>
> --
> Nadav Har'El
> n...@scylladb.com





Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-17 Thread aleba...@gmail.com
2017-01-17 22:13 GMT+01:00 Nadav Har'El :

>
> On Tue, Jan 17, 2017 at 7:18 PM, aleba...@gmail.com 
> wrote:
>
>> Hi Nadav,
>>
>> I may be wrong, but I think that the result of the current implementation
>> is actually the expected one.
>> Using you example: probabilities for item 1, 2 and 3 are: 0.2, 0.4 and 0.4
>>
>> P([1,2]) = P([2] | 1st=[1]) P([1]) + P([1] | 1st=[2]) P([2])
>>
>
> Yes, this formula does fit well with the actual algorithm in the code.
> But, my question is *why* we want this formula to be correct:
>
Just a note: this formula is correct and it is one of the fundamental laws
of statistics: https://en.wikipedia.org/wiki/Law_of_total_probability +
https://en.wikipedia.org/wiki/Bayes%27_theorem
Thus, the result we get from random.choice IMHO definitely makes sense. Of
course, I think we could always discuss implementing other sampling
methods if they are useful to some applications.


>
>> Now, P([1]) = 0.2 and P([2]) = 0.4. However:
>> P([2] | 1st=[1]) = 0.5 (2 and 3 have the same sampling probability)
>> P([1] | 1st=[2]) = 1/3 (1 and 3 have probabilities 0.2 and 0.4 which,
>> once normalised, become 1/3 and 2/3 respectively)
>> Therefore P([1,2]) = 0.7/3 ≈ 0.2333
>> Similarly, P([1,3]) = 0.7/3 ≈ 0.2333 and P([2,3]) = 1.6/3 ≈ 0.5333
>>
>
> Right, these are the numbers that the algorithm in the current code, and
> the formula above, produce:
>
> P([1,2]) = P([1,3]) = 0.2333
> P([2,3]) = 0.5333
>
> What I'm puzzled about is that these probabilities do not really fulfil
> the given probability vector 0.2, 0.4, 0.4...
> Let me try to explain:
>
> Why did the user choose the probabilities 0.2, 0.4, 0.4 for the three
> items in the first place?
>
> One reasonable interpretation is that the user wants his random picks to
> contain item 1 half as often as item 2 or 3.
> For example, maybe item 1 costs twice as much as item 2 or 3, so picking
> it half as often will result in an equal expenditure on each item.
>
> If the user randomly picks the items individually (a single item at a
> time), he indeed gets exactly this distribution: 0.2 of the time item 1,
> 0.4 of the time item 2, 0.4 of the time item 3.
>
> Now, what happens if he picks not individual items but pairs of different
> items, using numpy.random.choice with two items and replace=False?
> Suddenly, the distribution of the individual items in the results gets
> skewed: if we look at the expected number of times we'll see each item in
> one draw of a random pair, we get:
>
> E(1) = P([1,2]) + P([1,3]) = 0.4666
> E(2) = P([1,2]) + P([2,3]) = 0.7666
> E(3) = P([1,3]) + P([2,3]) = 0.7666
>
> Or renormalizing by dividing by 2:
>
> P(1) = 0.2333
> P(2) = 0.3833
> P(3) = 0.3833
>
> As you can see, these are not quite the probabilities we wanted (which were
> 0.2, 0.4, 0.4)! In the random pairs we picked, item 1 was used a bit more
> often than we wanted, and items 2 and 3 were used a bit less often!
>

p is not the probability of the output but that of the source finite
population. I think that if you want to preserve that distribution, as
Josef pointed out, you have to make the extractions independent, that is,
either sample with replacement or approximate an infinite population (which
is basically the same thing). But of course in this case you will also end
up with events [X,X].
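The point about independent extractions can be checked with a quick sketch: sampling with replacement does preserve the requested marginals, at the cost of repeated items (seed and sample size are arbitrary):

```python
import numpy as np

np.random.seed(0)
p = [0.2, 0.4, 0.4]
# Draw many pairs *with* replacement: the two extractions are independent.
draws = np.random.choice(3, size=(100_000, 2), replace=True, p=p)
freq = np.bincount(draws.ravel(), minlength=3) / draws.size
print(freq)  # close to the requested [0.2, 0.4, 0.4]
# ...but repeated pairs [X,X] now occur:
print(np.mean(draws[:, 0] == draws[:, 1]))  # close to 0.2**2 + 0.4**2 + 0.4**2 = 0.36
```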


> So that brought my question of why we consider these numbers right.
>
> In this example, it's actually possible to get the right item
> distribution, if we pick the pair outcomes with the following probabilities:
>
>    P([1,2]) = 0.2   (not 0.2333 as above)
>    P([1,3]) = 0.2
>    P([2,3]) = 0.6   (not 0.5333 as above)
>
> Then, we get exactly the right P(1), P(2), P(3): 0.2, 0.4, 0.4
>
> Interestingly, fixing things as I suggest is not always possible.
> Consider a different probability vector for the three items - 0.99,
> 0.005, 0.005. Now, no matter which algorithm we use for randomly picking
> pairs from these three items, *each* returned pair will inevitably contain
> one of the two very-low-probability items, so each of those items will
> appear in roughly half the pairs, instead of in a vanishingly small
> percentage as we hoped.
>
> But for other choices of probabilities (like the one in my original
> example), there is a solution. For 2-out-of-3 sampling we can actually
> write a system of three linear equations in three variables, so there is
> always one solution, but if this solution has components that are not valid
> probabilities (not in [0,1]) we end up with no solution - as happens in the
> 0.99, 0.005, 0.005 example.
>
>
>
>> What am I missing?
>>
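Nadav's feasibility argument is easy to check numerically: for 2-out-of-3 sampling the pair probabilities must satisfy three linear equations, which can be solved directly (`pair_probs` below is an illustrative helper, not a NumPy function):

```python
import numpy as np

def pair_probs(p1, p2, p3):
    """Solve for pair probabilities [q12, q13, q23] such that each item i
    appears in a fraction 2*pi of the drawn pairs (2-out-of-3 sampling)."""
    A = np.array([[1., 1., 0.],   # q12 + q13 = 2*p1
                  [1., 0., 1.],   # q12 + q23 = 2*p2
                  [0., 1., 1.]])  # q13 + q23 = 2*p3
    return np.linalg.solve(A, 2 * np.array([p1, p2, p3]))

print(pair_probs(0.2, 0.4, 0.4))      # approximately [0.2, 0.2, 0.6]: feasible
print(pair_probs(0.99, 0.005, 0.005)) # approximately [0.99, 0.99, -0.98]:
                                      # a negative entry, so infeasible
```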

Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-18 Thread aleba...@gmail.com
2017-01-18 9:35 GMT+01:00 Nadav Har'El :

>
> On Wed, Jan 18, 2017 at 1:58 AM, aleba...@gmail.com 
> wrote:
>
>>
>>
>> 2017-01-17 22:13 GMT+01:00 Nadav Har'El :
>>
>>>
>>> On Tue, Jan 17, 2017 at 7:18 PM, aleba...@gmail.com 
>>> wrote:
>>>
>>>> Hi Nadav,
>>>>
>>>> I may be wrong, but I think that the result of the current
>>>> implementation is actually the expected one.
>>>> Using your example: the probabilities for items 1, 2 and 3 are 0.2, 0.4
>>>> and 0.4
>>>>
>>>> P([1,2]) = P([2] | 1st=[1]) P([1]) + P([1] | 1st=[2]) P([2])
>>>>
>>>
>>> Yes, this formula does fit well with the actual algorithm in the code.
>>> But, my question is *why* we want this formula to be correct:
>>>
>> Just a note: this formula is correct and it is one of the fundamental
>> laws of statistics: https://en.wikipedia.org/wiki/Law_of_total_probability
>> + https://en.wikipedia.org/wiki/Bayes%27_theorem
>>
>
> Hi,
>
> Yes, of course the formula is correct, but it doesn't mean we're not
> applying it in the wrong context.
>
> I'll be honest here: I came to numpy.random.choice after I actually coded
> a similar algorithm (with the same results) myself, because, like you, I
> thought this was the "obvious" and correct algorithm. Only then did I
> realize that its output doesn't actually produce the desired probabilities
> specified by the user - even in the cases where that is possible. And I
> started wondering whether existing libraries - like numpy - do this
> differently. It turns out numpy does it (basically) the same way as my
> algorithm.
>
>
>>
>> Thus, the result we get from random.choice IMHO definitely makes sense.
>>
>
> Let's look at what the user asked of this function, and what it returns:
>
> User asks: please give me random pairs of the three items, where item 1
> has probability 0.2, item 2 has 0.4, and 3 has 0.4.
>
> Function returns: random pairs where, if you generate many random returned
> results (as in the law of large numbers) and look at the items they
> contain, item 1 is 0.2333 of the items, item 2 is 0.3833, and item 3 is
> 0.3833.
> These are not (quite) the probabilities the user asked for...
>
> Can you explain a sense in which the user's requested probabilities (0.2,
> 0.4, 0.4) are actually adhered to in the results which random.choice
> returns?
>

I think that the question the user is asking by specifying p is a slightly
different one:
 "please give me random pairs of the three items, extracted from a
population of 3 items where item 1 has probability 0.2 of being extracted,
item 2 has 0.4, and 3 has 0.4. Also, please remove items once they are
extracted."


> Thanks,
> Nadav Har'El.
>




Re: [Numpy-discussion] Question about numpy.random.choice with probabilties

2017-01-23 Thread aleba...@gmail.com
2017-01-23 15:33 GMT+01:00 Robert Kern :

> On Mon, Jan 23, 2017 at 6:27 AM, Anne Archibald 
> wrote:
> >
> > On Wed, Jan 18, 2017 at 4:13 PM Nadav Har'El  wrote:
> >>
> >> On Wed, Jan 18, 2017 at 4:30 PM,  wrote:
> >>>
>  Having more sampling schemes would be useful, but it's not possible
> to implement sampling schemes with impossible properties.
> >>>
> >>> BTW: sampling 3 out of 3 without replacement is even worse
> >>>
> >>> No matter what sampling scheme and what selection probabilities we
> use, we always have every element with probability 1 in the sample.
> >>
> >> I agree. The random-sample function of the type I envisioned would be
> able to reproduce the desired probabilities in some cases (like the example
> I gave) but not in others. Because doing this correctly involves a set of n
> linear equations in comb(n,k) variables, it can have no solution, or many
> solutions, depending on n and k and the desired probabilities. A
> function of this sort could return an error if it can't achieve the desired
> probabilities.
> >
> > It seems to me that the basic problem here is that the
> numpy.random.choice docstring fails to explain what the function actually
> does when called with weights and without replacement. Clearly there are
> different expectations; I think numpy.random.choice chose one that is easy
> to explain and implement but not necessarily what everyone expects. So the
> docstring should be clarified. Perhaps a Notes section:
> >
> > When numpy.random.choice is called with replace=False and non-uniform
> probabilities, the resulting distribution of samples is not obvious.
> numpy.random.choice effectively follows the procedure: when choosing the
> kth element in a set, the probability of element i occurring is p[i]
> divided by the total probability of all not-yet-chosen (and therefore
> eligible) elements. This approach is always possible as long as the sample
> size is no larger than the population, but it means that the probability
> that element i occurs in the sample is not exactly p[i].
>
> I don't object to some Notes, but I would probably phrase it more like we
> are providing the standard definition of the jargon term "sampling without
> replacement" in the case of non-uniform probabilities. To my mind (or more
> accurately, with my background), "replace=False" obviously picks out the
> implemented procedure, and I would have been incredibly surprised if it did
> anything else. If the option were named "unique=True", then I would have
> needed some more documentation to let me know exactly how it was
> implemented.
>
FWIW, I totally agree with Robert.
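
For reference, the procedure described in the proposed Notes section can be sketched in a few lines (`choice_no_replace` is an illustrative re-implementation, not NumPy's actual code):

```python
import numpy as np

def choice_no_replace(items, k, p):
    """Sequential sampling without replacement: at each step, pick among
    the not-yet-chosen items with their probabilities renormalised."""
    items = list(items)
    p = np.asarray(p, dtype=float).copy()
    picked = []
    for _ in range(k):
        i = np.random.choice(len(items), p=p / p.sum())
        picked.append(items[i])
        del items[i]          # remove the chosen item...
        p = np.delete(p, i)   # ...and its probability mass
    return picked
```

Running this many times with p = [0.2, 0.4, 0.4] and k = 2 reproduces the pair frequencies discussed earlier in the thread (roughly 0.2333, 0.2333, 0.5333).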



