[Numpy-discussion] ANN: numpydoc 0.6.0 released

2016-02-13 Thread Ralf Gommers
Hi all,

I'm pleased to announce the release of numpydoc 0.6.0. The main new feature
is support for the Yields section in numpy-style docstrings. This is
described in
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
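
For readers unfamiliar with the new section, here is a minimal sketch of a
numpy-style docstring using Yields; the function below is hypothetical and
only illustrates the layout described in the HOWTO:

```python
def running_sums(values):
    """Yield the running sum of a sequence.

    Parameters
    ----------
    values : iterable of float
        Numbers to accumulate.

    Yields
    ------
    total : float
        The cumulative sum after each element.
    """
    total = 0.0
    for v in values:
        total += v
        yield total
```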

Numpydoc can be installed from PyPI: https://pypi.python.org/pypi/numpydoc

Cheers,
Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fwd: Windows wheels for testing

2016-02-13 Thread R Schumacher
Have you all conferred with C Gohlke on his Windows build bot? I've never
seen a description of his recipes.

The MKL linking aside, his binaries always seem to work flawlessly.

- Ray


At 11:16 PM 2/12/2016, you wrote:
AFAIK the vcvarsall.bat error occurs when your MSVC directories aren't
properly linked in your system registry, so Python cannot find the file.
This is not a numpy-specific issue, so I certainly would agree that that
failure is not blocking.

Other than that, this build contains the mingw32.lib bug that I fixed
here, but other than that, everything else passes on relevant Python
versions for 32-bit!


On Sat, Feb 13, 2016 at 4:23 AM, Matthew Brett 
<matthew.br...@gmail.com> wrote:
On Fri, Feb 12, 2016 at 8:18 PM, R Schumacher 
<r...@blue-cove.com> wrote:

> At 03:45 PM 2/12/2016, you wrote:
>>
>> PS C:\tmp> c:\Python35\python -m venv np-testing
>> PS C:\tmp> .\np-testing\Scripts\Activate.ps1
>> (np-testing) PS C:\tmp> pip install -f
>> https://nipy.bic.berkeley.edu/scipy_installers/atlas_builds numpy nose

>
>
> C:\Python34\Scripts>pip install  "D:\Python
> distros\numpy-1.10.4-cp34-none-win_amd64.whl"
> Unpacking d:\python distros\numpy-1.10.4-cp34-none-win_amd64.whl
> Installing collected packages: numpy
> Successfully installed numpy
> Cleaning up...
>
> C:\Python34\Scripts>..\python
> Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC v.1600 64 bit
> (AMD64)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import numpy
>>>> numpy.test()
> Running unit tests for numpy
> NumPy version 1.10.4
> NumPy relaxed strides checking option: False
> NumPy is installed in C:\Python34\lib\site-packages\numpy
> Python version 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC
> v.1600 64 bit (AMD64)]
> nose version 1.3.7
> ...FS...
> .S..
> ..C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> non-integer number instead of an integer will result in an error in the future
>    callable_obj(*args, **kwargs)
> C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> non-integer number instead of an integer will result in an error in the future
>    callable_obj(*args, **kwargs)
> C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> non-integer number instead of an integer will result in an error in the future
>    callable_obj(*args, **kwargs)
> ...S
> ..C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> non-integer number instead of an integer will result in an error in the future
>    callable_obj(*args, **kwargs)
> ..C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> non-integer number instead of an integer will result in an error in the future
>    callable_obj(*args, **kwargs)
> C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> non-integer number instead of an integer will result in an error in the future
>    callable_obj(*args, **kwargs)
> C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> non-integer number instead of an integer will result in an error in the future
>    callable_obj(*args, **kwargs)
> C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> non-integer number instead of an integer will result in an error in the future
>    callable_obj(*args, **kwargs)
Re: [Numpy-discussion] Fwd: Windows wheels for testing

2016-02-13 Thread G Young
I've actually had test failures on occasion (i.e. when I run
"numpy.test()") with his builds but overall, they are quite good.  Speaking
of MKL, for anyone who uses conda, does anyone know if it is possible to
link the "mkl" package to the numpy source?  My first guess is no, since the
description appears to imply that the package provides runtime libraries
and not the static libraries that numpy would need, but perhaps someone who
knows better can illuminate.

On Sat, Feb 13, 2016 at 3:42 PM, R Schumacher  wrote:

> Have you all conferred with C Gohlke on his Windows build bot? I've never
> seen a description of his recipes.
> The MKL linking aside, his binaries always seem to work flawlessly.
>
> - Ray
>
>
> At 11:16 PM 2/12/2016, you wrote:
>
> AFAIK the vcvarsall.bat error occurs when your MSVC directories aren't
> properly linked in your system registry, so Python cannot find the file.
> This is not a numpy-specific issue, so I certainly would agree that that
> failure is not blocking.
>
> Other than that, this build contains the mingw32.lib bug that I fixed
> here, but other than that, everything else passes on relevant Python
> versions for 32-bit!
>
> On Sat, Feb 13, 2016 at 4:23 AM, Matthew Brett 
> wrote:
> On Fri, Feb 12, 2016 at 8:18 PM, R Schumacher  wrote:
> > At 03:45 PM 2/12/2016, you wrote:
> >>
> >> PS C:\tmp> c:\Python35\python -m venv np-testing
> >> PS C:\tmp> .\np-testing\Scripts\Activate.ps1
> >> (np-testing) PS C:\tmp> pip install -f
> >> https://nipy.bic.berkeley.edu/scipy_installers/atlas_builds numpy nose
> >
> >
> > C:\Python34\Scripts>pip install  "D:\Python
> > distros\numpy-1.10.4-cp34-none-win_amd64.whl"
> > Unpacking d:\python distros\numpy-1.10.4-cp34-none-win_amd64.whl
> > Installing collected packages: numpy
> > Successfully installed numpy
> > Cleaning up...
> >
> > C:\Python34\Scripts>..\python
> > Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC v.1600 64 bit
> > (AMD64)] on win32
> > Type "help", "copyright", "credits" or "license" for more information.
> >>>> import numpy
> >>>> numpy.test()
> > Running unit tests for numpy
> > NumPy version 1.10.4
> > NumPy relaxed strides checking option: False
> > NumPy is installed in C:\Python34\lib\site-packages\numpy
> > Python version 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC
> > v.1600 64 bit (AMD64)]
> > nose version 1.3.7
> >
> > ...FS...
> > .S..
> > ..C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> > non-integer number instead of an integer will result in an error in the future
> >    callable_obj(*args, **kwargs)
> > C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> > non-integer number instead of an integer will result in an error in the future
> >    callable_obj(*args, **kwargs)
> > C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> > non-integer number instead of an integer will result in an error in the future
> >    callable_obj(*args, **kwargs)
> > ...S
> > ..C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> > non-integer number instead of an integer will result in an error in the future
> >    callable_obj(*args, **kwargs)
> > ..C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> > non-integer number instead of an integer will result in an error in the future
> >    callable_obj(*args, **kwargs)
> > C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> > non-integer number instead of an integer will result in an error in the future
> >    callable_obj(*args, **kwargs)
> > C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> > non-integer number instead of an integer will result in an error in the future
> >    callable_obj(*args, **kwargs)
> > C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
> > non-integer number instead of an integer will result in an error in the future
> >    callable_obj(*args, **kwargs)
> ..

[Numpy-discussion] Modulus (remainder) function corner cases

2016-02-13 Thread Charles R Harris
Hi All,

I'm curious as to what folks think about some choices in the computation of
the remainder function. As an example where different choices can be made:

In [2]: -1e-64 % 1.
Out[2]: 1.0

In [3]: float64(-1e-64) % 1.
Out[3]: 0.99989

The first is Python, the second is in my branch. The first is more accurate
numerically, but the modulus is of the same magnitude as the divisor. The
second maintains the convention that the result must have smaller magnitude
than the divisor. There are other corner cases along the same lines. So the
question is, which is more desirable: maintaining numerical accuracy or
enforcing mathematical convention? The differences are on the order of an
ulp, but there will be a skew in the distribution of the errors if
convention is maintained.

The Fortran modulo function, which is the same basic function as in my
branch, does not specify any bounds on the result for floating-point
numbers, but gives only the formula, modulus(a, b) = a - b*floor(a/b),
which has the advantage of being simple and well defined ;)
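
The competing conventions can be seen with plain Python floats. This is an
editorial sketch (not from Chuck's branch); `modulo_formula` is just the
Fortran definition quoted above:

```python
import math

def modulo_formula(a, b):
    # The Fortran-style definition: result has the divisor's sign.
    return a - b * math.floor(a / b)

# Python's % and the formula follow the divisor's sign;
# math.fmod follows the dividend's sign instead.
print(-7 % 3)                 # same sign as the divisor
print(modulo_formula(-7, 3))  # the formula agrees
print(math.fmod(-7, 3))       # same sign as the dividend
print(-1e-64 % 1.0)           # Python's result from In [2] above
```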

Chuck


Re: [Numpy-discussion] Modulus (remainder) function corner cases

2016-02-13 Thread Charles R Harris
On Sat, Feb 13, 2016 at 9:31 AM, Charles R Harris  wrote:

> Hi All,
>
> I'm curious as to what folks think about some choices in the computation of
> the remainder function. As an example where different choices can be made:
>
> In [2]: -1e-64 % 1.
> Out[2]: 1.0
>
> In [3]: float64(-1e-64) % 1.
> Out[3]: 0.99989
>
> The first is Python, the second is in my branch. The first is more accurate
> numerically, but the modulus is of the same magnitude as the divisor. The
> second maintains the convention that the result must have smaller
> magnitude than the divisor. There are other corner cases along the same
> lines. So the question is, which is more desirable: maintaining numerical
> accuracy or enforcing mathematical convention? The differences are on the
> order of an ulp, but there will be a skew in the distribution of the
> errors if convention is maintained.
>
> The Fortran modulo function, which is the same basic function as in my
> branch, does not specify any bounds on the result for floating numbers, but
> gives only the formula,  modulus(a, b) = a - b*floor(a/b), which has the
> advantage of being simple and well defined ;)
>

Note that the other enforced bound is that the result have the same sign as
the divisor. Python enforces that by adjusting the integer part, I enforce
it by adjusting the remainder.

Chuck


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Allan Haldane
I've had a pretty similar idea for a new indexing function 
'split_classes' which would help in your case, which essentially does


def split_classes(c, v):
    return [v[c == u] for u in unique(c)]

Your example could be coded as

>>> [sum(c) for c in split_classes(label, data)]
[9, 12, 15]

I feel I've come across the need for such a function often enough that 
it might be generally useful to people as part of numpy. The 
implementation of split_classes above has pretty poor performance 
because it creates many temporary boolean arrays, so my plan for a PR 
was to have a speedy version of it that uses a single pass through v.

(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea. bad idea?) I'd love 
to hear.


I have some further notes and examples here: 
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21


Allan

On 02/12/2016 09:40 AM, Sérgio wrote:

Hello,

This is my first e-mail, I will try to make the idea simple.

Similar to masked array it would be interesting to use a label array to
guide operations.

Ex.:
 >>> x
labelled_array(data =
  [[0 1 2]
  [3 4 5]
  [6 7 8]],
 label =
  [[0 1 2]
  [0 1 2]
  [0 1 2]])

 >>> sum(x)
array([9, 12, 15])

The operations would create a new axis for label indexing.

You could think of it as a collection of masks, one for each label.

I don't know a way to make something like this efficiently without a
loop. Just wondering...

Sérgio.
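
A loop-free way to get the sums in the example above, using plain numpy
rather than a new labelled_array type (an editorial sketch, not part of the
proposal, via np.bincount with weights):

```python
import numpy as np

data = np.arange(9).reshape(3, 3)       # [[0 1 2] [3 4 5] [6 7 8]]
label = np.tile(np.arange(3), (3, 1))   # [[0 1 2] [0 1 2] [0 1 2]]

# Sum the data entries sharing a label in a single pass, no Python loop.
sums = np.bincount(label.ravel(), weights=data.ravel())
print(sums)  # sums for labels 0, 1, 2 -> 9, 12, 15
```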




Re: [Numpy-discussion] Fwd: Windows wheels for testing

2016-02-13 Thread Jonathan Helmus



On 2/12/16 10:23 PM, Matthew Brett wrote:

On Fri, Feb 12, 2016 at 8:18 PM, R Schumacher  wrote:

At 03:45 PM 2/12/2016, you wrote:

PS C:\tmp> c:\Python35\python -m venv np-testing
PS C:\tmp> .\np-testing\Scripts\Activate.ps1
(np-testing) PS C:\tmp> pip install -f
https://nipy.bic.berkeley.edu/scipy_installers/atlas_builds numpy nose


C:\Python34\Scripts>pip install  "D:\Python
distros\numpy-1.10.4-cp34-none-win_amd64.whl"
Unpacking d:\python distros\numpy-1.10.4-cp34-none-win_amd64.whl
Installing collected packages: numpy
Successfully installed numpy
Cleaning up...

C:\Python34\Scripts>..\python
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC v.1600 64 bit
(AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

>>> import numpy
>>> numpy.test()

Running unit tests for numpy
NumPy version 1.10.4
NumPy relaxed strides checking option: False
NumPy is installed in C:\Python34\lib\site-packages\numpy
Python version 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC
v.1600 64 bit (AMD64)]
nose version 1.3.7
...FS...
.S..
..C:\Python34\lib\unittest\case.
py:162: DeprecationWarning: using a non-integer number instead of an integer
will result in an error in the future
   callable_obj(*args, **kwargs)
C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
non-integer number instead of an integer will
result in an error in the future
   callable_obj(*args, **kwargs)
C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
non-integer number instead of an integer will result i
n an error in the future
   callable_obj(*args, **kwargs)
...S

..C:\Python34\lib\unittest\case.py:162:
Deprecat
ionWarning: using a non-integer number instead of an integer will result in
an error in the future
   callable_obj(*args, **kwargs)
..C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
non-integer number instead of an integer will result
  in an error in the future
   callable_obj(*args, **kwargs)
C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
non-integer number instead of an integer will result i
n an error in the future
   callable_obj(*args, **kwargs)
C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
non-integer number instead of an integer will result i
n an error in the future
   callable_obj(*args, **kwargs)
C:\Python34\lib\unittest\case.py:162: DeprecationWarning: using a
non-integer number instead of an integer will result i
n an error in the future
   callable_obj(*args, **kwargs)

...K.C:\Python34\lib\site-packages\numpy\ma\core.py:989: RuntimeWarning:
invalid value encountered in multiply
   masked_da = umath.multiply(m, da)
C:\Python34\lib\site-packages\numpy\ma\core.py:989: RuntimeWarning: invalid
value encountered in multiply
   masked_da = umath.multiply(m, da)
.

Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Allan Haldane
Sorry, to reply to myself here, but looking at it with fresh eyes maybe 
the performance of the naive version isn't too bad. Here's a comparison 
of the naive vs a better implementation:


def split_classes_naive(c, v):
    return [v[c == u] for u in unique(c)]

def split_classes(c, v):
    perm = c.argsort()
    csrt = c[perm]
    div = where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in split(perm, div)]

>>> c = randint(0,32,size=10)
>>> v = arange(10)
>>> %timeit split_classes_naive(c,v)
100 loops, best of 3: 8.4 ms per loop
>>> %timeit split_classes(c,v)
100 loops, best of 3: 4.79 ms per loop

In any case, maybe it is useful to Sergio or others.

Allan

On 02/13/2016 12:11 PM, Allan Haldane wrote:

I've had a pretty similar idea for a new indexing function
'split_classes' which would help in your case, which essentially does

 def split_classes(c, v):
     return [v[c == u] for u in unique(c)]

Your example could be coded as

 >>> [sum(c) for c in split_classes(label, data)]
 [9, 12, 15]

I feel I've come across the need for such a function often enough that
it might be generally useful to people as part of numpy. The
implementation of split_classes above has pretty poor performance
because it creates many temporary boolean arrays, so my plan for a PR
was to have a speedy version of it that uses a single pass through v.
(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea. bad idea?) I'd love
to hear.

I have some further notes and examples here:
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

Allan

On 02/12/2016 09:40 AM, Sérgio wrote:

Hello,

This is my first e-mail, I will try to make the idea simple.

Similar to masked array it would be interesting to use a label array to
guide operations.

Ex.:
 >>> x
labelled_array(data =
  [[0 1 2]
  [3 4 5]
  [6 7 8]],
 label =
  [[0 1 2]
  [0 1 2]
  [0 1 2]])

 >>> sum(x)
array([9, 12, 15])

The operations would create a new axis for label indexing.

You could think of it as a collection of masks, one for each label.

I don't know a way to make something like this efficiently without a
loop. Just wondering...

Sérgio.




Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Nathaniel Smith
I believe this is basically a groupby, which is one of pandas's core
competencies... even if numpy were to add some utilities for this kind of
thing, then I doubt we'd do as well as them, so you might check whether
pandas works for you first :-)
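
For concreteness, the pandas route applied to the example from the proposal
(an editorial sketch, with the labels flattened into a column):

```python
import pandas as pd

# The labelled_array example, as a label column plus a data column.
df = pd.DataFrame({"label": [0, 1, 2, 0, 1, 2, 0, 1, 2],
                   "data":  range(9)})

# groupby does the per-label reduction in one call.
sums = df.groupby("label")["data"].sum()
print(sums.tolist())  # sums for labels 0, 1, 2 -> 9, 12, 15
```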
On Feb 12, 2016 6:40 AM, "Sérgio"  wrote:

> Hello,
>
> This is my first e-mail, I will try to make the idea simple.
>
> Similar to masked array it would be interesting to use a label array to
> guide operations.
>
> Ex.:
> >>> x
> labelled_array(data =
>  [[0 1 2]
>  [3 4 5]
>  [6 7 8]],
> label =
>  [[0 1 2]
>  [0 1 2]
>  [0 1 2]])
>
> >>> sum(x)
> array([9, 12, 15])
>
> The operations would create a new axis for label indexing.
>
> You could think of it as a collection of masks, one for each label.
>
> I don't know a way to make something like this efficiently without a loop.
> Just wondering...
>
> Sérgio.
>


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread josef.pktd
On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane 
wrote:

> Sorry, to reply to myself here, but looking at it with fresh eyes maybe
> the performance of the naive version isn't too bad. Here's a comparison of
> the naive vs a better implementation:
>
> def split_classes_naive(c, v):
>     return [v[c == u] for u in unique(c)]
>
> def split_classes(c, v):
>     perm = c.argsort()
>     csrt = c[perm]
>     div = where(csrt[1:] != csrt[:-1])[0] + 1
>     return [v[x] for x in split(perm, div)]
>
> >>> c = randint(0,32,size=10)
> >>> v = arange(10)
> >>> %timeit split_classes_naive(c,v)
> 100 loops, best of 3: 8.4 ms per loop
> >>> %timeit split_classes(c,v)
> 100 loops, best of 3: 4.79 ms per loop
>

The use cases I recently started to target for similar things are 1 million
or more rows and 1 uniques in the labels.
The second version should be faster for a large number of uniques, I guess.

Overall numpy is falling far behind pandas in terms of simple groupby
operations. bincount and histogram (IIRC) worked for some cases but are
rather limited.

reduceat looks nice for cases where it applies.
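
np.add.reduceat can indeed cover grouped sums once the data is sorted by
label. A small editorial sketch (not from the thread), along the lines of
Allan's split_classes:

```python
import numpy as np

c = np.array([2, 0, 1, 0, 2, 1])              # labels
v = np.array([10., 20., 30., 40., 50., 60.])  # values

perm = c.argsort(kind="mergesort")            # stable sort by label
csrt = c[perm]
# Start index of each label's run in the sorted order.
starts = np.r_[0, np.flatnonzero(csrt[1:] != csrt[:-1]) + 1]
group_sums = np.add.reduceat(v[perm], starts)
# group_sums -> 60., 90., 60. for labels 0, 1, 2
```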

In contrast to the full sized labels in the original post, I only know of
applications where the labels are 1-D corresponding to rows or columns.

Josef



>
> In any case, maybe it is useful to Sergio or others.
>
> Allan
>
>
> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>
>> I've had a pretty similar idea for a new indexing function
>> 'split_classes' which would help in your case, which essentially does
>>
>>  def split_classes(c, v):
>>      return [v[c == u] for u in unique(c)]
>>
>> Your example could be coded as
>>
>>  >>> [sum(c) for c in split_classes(label, data)]
>>  [9, 12, 15]
>>
>> I feel I've come across the need for such a function often enough that
>> it might be generally useful to people as part of numpy. The
>> implementation of split_classes above has pretty poor performance
>> because it creates many temporary boolean arrays, so my plan for a PR
>> was to have a speedy version of it that uses a single pass through v.
>> (I often wanted to use this function on large datasets).
>>
>> If anyone has any comments on the idea (good idea. bad idea?) I'd love
>> to hear.
>>
>> I have some further notes and examples here:
>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>
>> Allan
>>
>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>
>>> Hello,
>>>
>>> This is my first e-mail, I will try to make the idea simple.
>>>
>>> Similar to masked array it would be interesting to use a label array to
>>> guide operations.
>>>
>>> Ex.:
>>>  >>> x
>>> labelled_array(data =
>>>   [[0 1 2]
>>>   [3 4 5]
>>>   [6 7 8]],
>>>  label =
>>>   [[0 1 2]
>>>   [0 1 2]
>>>   [0 1 2]])
>>>
>>>  >>> sum(x)
>>> array([9, 12, 15])
>>>
>>> The operations would create a new axis for label indexing.
>>>
>>> You could think of it as a collection of masks, one for each label.
>>>
>>> I don't know a way to make something like this efficiently without a
>>> loop. Just wondering...
>>>
>>> Sérgio.
>>>
>>>


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Jeff Reback
In [10]: pd.options.display.max_rows=10

In [13]: np.random.seed(1234)

In [14]: c = np.random.randint(0,32,size=10)

In [15]: v = np.arange(10)

In [16]: df = DataFrame({'v' : v, 'c' : c})

In [17]: df
Out[17]:
c  v
0  15  0
1  19  1
2   6  2
3  21  3
4  12  4
........
5   7  5
6   2  6
7  27  7
8  28  8
9   7  9

[10 rows x 2 columns]

In [19]: df.groupby('c').count()
Out[19]:
   v
c
0   3136
1   3229
2   3093
3   3121
4   3041
..   ...
27  3128
28  3063
29  3147
30  3073
31  3090

[32 rows x 1 columns]

In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop

In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop

In [22]: df.groupby('c').mean()
Out[22]:
   v
c
0   49883.384885
1   50233.692165
2   48634.116069
3   50811.743992
4   50505.368629
..   ...
27  49715.349425
28  50363.501469
29  50485.395933
30  50190.155223
31  50691.041748

[32 rows x 1 columns]


On Sat, Feb 13, 2016 at 1:29 PM,  wrote:

>
>
> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane 
> wrote:
>
>> Sorry, to reply to myself here, but looking at it with fresh eyes maybe
>> the performance of the naive version isn't too bad. Here's a comparison of
>> the naive vs a better implementation:
>>
>> def split_classes_naive(c, v):
>>     return [v[c == u] for u in unique(c)]
>>
>> def split_classes(c, v):
>>     perm = c.argsort()
>>     csrt = c[perm]
>>     div = where(csrt[1:] != csrt[:-1])[0] + 1
>>     return [v[x] for x in split(perm, div)]
>>
>> >>> c = randint(0,32,size=10)
>> >>> v = arange(10)
>> >>> %timeit split_classes_naive(c,v)
>> 100 loops, best of 3: 8.4 ms per loop
>> >>> %timeit split_classes(c,v)
>> 100 loops, best of 3: 4.79 ms per loop
>>
>
> The usecases I recently started to target for similar things is 1 Million
> or more rows and 1 uniques in the labels.
> The second version should be faster for large number of uniques, I guess.
>
> Overall numpy is falling far behind pandas in terms of simple groupby
> operations. bincount and histogram (IIRC) worked for some cases but are
> rather limited.
>
> reduce_at looks nice for cases where it applies.
>
> In contrast to the full sized labels in the original post, I only know of
> applications where the labels are 1-D corresponding to rows or columns.
>
> Josef
>
>
>
>>
>> In any case, maybe it is useful to Sergio or others.
>>
>> Allan
>>
>>
>> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>>
>>> I've had a pretty similar idea for a new indexing function
>>> 'split_classes' which would help in your case, which essentially does
>>>
>>>  def split_classes(c, v):
>>>      return [v[c == u] for u in unique(c)]
>>>
>>> Your example could be coded as
>>>
>>>  >>> [sum(c) for c in split_classes(label, data)]
>>>  [9, 12, 15]
>>>
>>> I feel I've come across the need for such a function often enough that
>>> it might be generally useful to people as part of numpy. The
>>> implementation of split_classes above has pretty poor performance
>>> because it creates many temporary boolean arrays, so my plan for a PR
>>> was to have a speedy version of it that uses a single pass through v.
>>> (I often wanted to use this function on large datasets).
>>>
>>> If anyone has any comments on the idea (good idea. bad idea?) I'd love
>>> to hear.
>>>
>>> I have some further notes and examples here:
>>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>>
>>> Allan
>>>
>>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>>
>>>> Hello,
>>>>
>>>> This is my first e-mail, I will try to make the idea simple.
>>>>
>>>> Similar to masked array it would be interesting to use a label array to
>>>> guide operations.
>>>>
>>>> Ex.:
>>>>  >>> x
>>>> labelled_array(data =
>>>>   [[0 1 2]
>>>>   [3 4 5]
>>>>   [6 7 8]],
>>>>  label =
>>>>   [[0 1 2]
>>>>   [0 1 2]
>>>>   [0 1 2]])
>>>>
>>>>  >>> sum(x)
>>>> array([9, 12, 15])
>>>>
>>>> The operations would create a new axis for label indexing.
>>>>
>>>> You could think of it as a collection of masks, one for each label.
>>>>
>>>> I don't know a way to make something like this efficiently without a
>>>> loop. Just wondering...
>>>>
>>>> Sérgio.

Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Jeff Reback
These operations get slower as the number of groups increases, but with a
faster function (e.g. the standard ones, which are cythonized), the constant
on the increase is pretty low.

In [23]: c = np.random.randint(0,1,size=10)

In [24]: df = DataFrame({'v' : v, 'c' : c})

In [25]: %timeit df.groupby('c').count()
100 loops, best of 3: 3.18 ms per loop

In [26]: len(df.groupby('c').count())
Out[26]: 1

In [27]: df.groupby('c').count()
Out[27]:
   v
c
0  9
1 11
2  7
3  8
4 16
...   ..
9995  11
9996  13
9997  13
9998   7
  10

[1 rows x 1 columns]


On Sat, Feb 13, 2016 at 1:39 PM, Jeff Reback  wrote:

> In [10]: pd.options.display.max_rows=10
>
> In [13]: np.random.seed(1234)
>
> In [14]: c = np.random.randint(0,32,size=10)
>
> In [15]: v = np.arange(10)
>
> In [16]: df = DataFrame({'v' : v, 'c' : c})
>
> In [17]: df
> Out[17]:
> c  v
> 0  15  0
> 1  19  1
> 2   6  2
> 3  21  3
> 4  12  4
> ........
> 5   7  5
> 6   2  6
> 7  27  7
> 8  28  8
> 9   7  9
>
> [10 rows x 2 columns]
>
> In [19]: df.groupby('c').count()
> Out[19]:
>v
> c
> 0   3136
> 1   3229
> 2   3093
> 3   3121
> 4   3041
> ..   ...
> 27  3128
> 28  3063
> 29  3147
> 30  3073
> 31  3090
>
> [32 rows x 1 columns]
>
> In [20]: %timeit df.groupby('c').count()
> 100 loops, best of 3: 2 ms per loop
>
> In [21]: %timeit df.groupby('c').mean()
> 100 loops, best of 3: 2.39 ms per loop
>
> In [22]: df.groupby('c').mean()
> Out[22]:
>v
> c
> 0   49883.384885
> 1   50233.692165
> 2   48634.116069
> 3   50811.743992
> 4   50505.368629
> ..   ...
> 27  49715.349425
> 28  50363.501469
> 29  50485.395933
> 30  50190.155223
> 31  50691.041748
>
> [32 rows x 1 columns]
>
>
> On Sat, Feb 13, 2016 at 1:29 PM,  wrote:
>
>>
>>
>> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane 
>> wrote:
>>
>>> Sorry, to reply to myself here, but looking at it with fresh eyes maybe
>>> the performance of the naive version isn't too bad. Here's a comparison of
>>> the naive vs a better implementation:
>>>
>>> def split_classes_naive(c, v):
>>>     return [v[c == u] for u in unique(c)]
>>>
>>> def split_classes(c, v):
>>>     perm = c.argsort()
>>>     csrt = c[perm]
>>>     div = where(csrt[1:] != csrt[:-1])[0] + 1
>>>     return [v[x] for x in split(perm, div)]
>>>
>>> >>> c = randint(0,32,size=10)
>>> >>> v = arange(10)
>>> >>> %timeit split_classes_naive(c,v)
>>> 100 loops, best of 3: 8.4 ms per loop
>>> >>> %timeit split_classes(c,v)
>>> 100 loops, best of 3: 4.79 ms per loop
>>>
>>
>> The usecases I recently started to target for similar things is 1 Million
>> or more rows and 1 uniques in the labels.
>> The second version should be faster for large number of uniques, I guess.
>>
>> Overall numpy is falling far behind pandas in terms of simple groupby
>> operations. bincount and histogram (IIRC) worked for some cases but are
>> rather limited.
>>
>> reduce_at looks nice for cases where it applies.
>>
>> In contrast to the full sized labels in the original post, I only know of
>> applications where the labels are 1-D corresponding to rows or columns.
>>
>> Josef
>>
>>
>>
>>>
>>> In any case, maybe it is useful to Sergio or others.
>>>
>>> Allan
>>>
>>>
>>> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>>>
 I've had a pretty similar idea for a new indexing function
 'split_classes' which would help in your case, which essentially does

  def split_classes(c, v):
      return [v[c == u] for u in unique(c)]

 Your example could be coded as

  >>> [sum(c) for c in split_classes(label, data)]
  [9, 12, 15]

 I feel I've come across the need for such a function often enough that
 it might be generally useful to people as part of numpy. The
 implementation of split_classes above has pretty poor performance
 because it creates many temporary boolean arrays, so my plan for a PR
 was to have a speedy version of it that uses a single pass through v.
 (I often wanted to use this function on large datasets).

 If anyone has any comments on the idea (good idea. bad idea?) I'd love
 to hear.

 I have some further notes and examples here:
 https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

 Allan

 On 02/12/2016 09:40 AM, Sérgio wrote:

> Hello,
>
> This is my first e-mail, I will try to make the idea simple.
>
> Similar to masked array it would be interesting to use a label array to
> guide operations.
>
> Ex.:
>  >>> x
> labelled_array(data =
>   [[0 1 2]
>   [3 4 5]
>   [6 7 8]],
>  label =
>   [[0 1 2]
>   [0 1 2]
>   [0 1 2]])
>
>  >>> sum(x)
> array([9, 12, 15])
>
> The operations would create a new axis for label indexing.
>
> You could think of it

Re: [Numpy-discussion] Subclassing ma.masked_array, code broken after version 1.9

2016-02-13 Thread Jonathan Helmus



On 2/12/16 6:06 PM, Gutenkunst, Ryan N - (rgutenk) wrote:

Hello all,

In 2009 I developed an application that uses a subclass of masked arrays as a 
central data object. My subclass Spectrum possesses additional attributes along 
with many custom methods. It was very convenient to be able to use standard 
numpy functions for doing arithmetic on these objects. However, my code broke 
with numpy 1.10. I've finally had a chance to track down the problem, and I am 
hoping someone can suggest a workaround.

See below for an example, which is as minimal as I could concoct. In this case, I have a 
Spectrum object that I'd like to take the logarithm of using numpy.ma.log, while 
preserving the value of the "folded" attribute. Up to numpy 1.9, this worked as 
expected, but in numpy 1.10 and 1.11 the attribute is not preserved.

The change in behavior appears to be driven by a commit made on Jun 16th, 2015 
by Marten van Kerkwijk. In particular, the commit changed 
_MaskedUnaryOperation.__call__ so that the result array's update_from method is 
no longer called with the input array as the argument, but rather the result of 
the numpy UnaryOperation (old line 889, new line 885). Because that 
UnaryOperation doesn't carry my new attribute, it's not present for update_from 
to access. I notice that similar changes were made to MaskedBinaryOperation, 
although I haven't tested those. It's not clear to me from the commit message 
why this particular change was made, so I don't know whether this new behavior 
is intentional.

I know that subclassing arrays isn't widely encouraged, but it has been very 
convenient in my code. Is it still possible to subclass masked_array in such a 
way that functions like numpy.ma.log preserve additional attributes? If so, can 
someone point me in the right direction?

Thanks!
Ryan

*** Begin example

import numpy
print 'Working with numpy {0}'.format(numpy.__version__)

class Spectrum(numpy.ma.masked_array):
 def __new__(cls, data, mask=numpy.ma.nomask, data_folded=None):
 subarr = numpy.ma.masked_array(data, mask=mask, keep_mask=True,
shrink=True)
 subarr = subarr.view(cls)
 subarr.folded = data_folded

 return subarr

 def __array_finalize__(self, obj):
 if obj is None:
 return
 numpy.ma.masked_array.__array_finalize__(self, obj)
 self.folded = getattr(obj, 'folded', 'unspecified')

 def _update_from(self, obj):
 print('Input to update_from: {0}'.format(repr(obj)))
 numpy.ma.masked_array._update_from(self, obj)
 self.folded = getattr(obj, 'folded', 'unspecified')

 def __repr__(self):
 return 'Spectrum(%s, folded=%s)'\
 % (str(self), str(self.folded))

fs1 = Spectrum([2,3,4.], data_folded=True)
fs2 = numpy.ma.log(fs1)
print('fs2.folded status: {0}'.format(fs2.folded))
print('Expectation is True, achieved with numpy 1.9')

*** End example

--
Ryan Gutenkunst
Assistant Professor
Molecular and Cellular Biology
University of Arizona
phone: (520) 626-0569, office LSS 325
http://gutengroup.mcb.arizona.edu
Latest paper: "Computationally efficient composite likelihood statistics for 
demographic inference"
Molecular Biology and Evolution; http://dx.doi.org/10.1093/molbev/msv255

Ryan,

I'm not sure if you will be able to get this to work as in NumPy 1.9, 
but the __array_wrap__ method is intended to be the mechanism for 
subclasses to set their return type, adjust metadata, etc [1].  
Unfortunately, the numpy.ma.log function does not seem to make a call 
to  __array_wrap__ (at least in NumPy 1.10.2) although numpy.log does:


from __future__ import print_function
import numpy
print('Working with numpy {0}'.format(numpy.__version__))


class Spectrum(numpy.ma.masked_array):
def __new__(cls, data, mask=numpy.ma.nomask, data_folded=None):
subarr = numpy.ma.masked_array(data, mask=mask, keep_mask=True,
   shrink=True)
subarr = subarr.view(cls)
subarr.folded = data_folded

return subarr

def __array_finalize__(self, obj):
if obj is None:
return
numpy.ma.masked_array.__array_finalize__(self, obj)
self.folded = getattr(obj, 'folded', 'unspecified')

def __array_wrap__(self, out_arr, context=None):
print('__array_wrap__ called')
return numpy.ndarray.__array_wrap__(self, out_arr, context)

def __repr__(self):
return 'Spectrum(%s, folded=%s)'\
% (str(self), str(self.folded))

fs1 = Spectrum([2,3,4.], data_folded=True)

print('numpy.ma.log:')
fs2 = numpy.ma.log(fs1)
print('fs2 type:', type(fs2))
print('fs2.folded status: {0}'.format(fs2.folded))

print('numpy.log:')
fs3 = numpy.log(fs1)
print('fs3 type:', type(fs3))
print('fs3.folded status: {0}'.format(fs3.folded))


$ python example.py
Working with numpy 1.10.2
numpy.ma.log:
fs2 type: 
fs2.folded status: unspecified
num

Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread josef.pktd
On Sat, Feb 13, 2016 at 1:42 PM, Jeff Reback  wrote:

> These operations get slower as the number of groups increase, but with a
> faster function (e.g. the standard ones which are cythonized), the
> constant on
> the increase is pretty low.
>
> In [23]: c = np.random.randint(0,1,size=10)
>
> In [24]: df = DataFrame({'v' : v, 'c' : c})
>
> In [25]: %timeit df.groupby('c').count()
> 100 loops, best of 3: 3.18 ms per loop
>
> In [26]: len(df.groupby('c').count())
> Out[26]: 1
>
> In [27]: df.groupby('c').count()
> Out[27]:
>v
> c
> 0  9
> 1 11
> 2  7
> 3  8
> 4 16
> ...   ..
> 9995  11
> 9996  13
> 9997  13
> 9998   7
>   10
>
> [1 rows x 1 columns]
>
>
One other difference across usecases is whether this is a single operation,
or we want to optimize the data format for a large number of different
calculations.  (We have both cases in statsmodels.)

In the latter case it's worth spending some extra computational effort on
rearranging the data to be either sorted or in lists of arrays, (I guess
without having done any timings).

Josef




>
> On Sat, Feb 13, 2016 at 1:39 PM, Jeff Reback  wrote:
>
>> In [10]: pd.options.display.max_rows=10
>>
>> In [13]: np.random.seed(1234)
>>
>> In [14]: c = np.random.randint(0,32,size=10)
>>
>> In [15]: v = np.arange(10)
>>
>> In [16]: df = DataFrame({'v' : v, 'c' : c})
>>
>> In [17]: df
>> Out[17]:
>> c  v
>> 0  15  0
>> 1  19  1
>> 2   6  2
>> 3  21  3
>> 4  12  4
>> ........
>> 5   7  5
>> 6   2  6
>> 7  27  7
>> 8  28  8
>> 9   7  9
>>
>> [10 rows x 2 columns]
>>
>> In [19]: df.groupby('c').count()
>> Out[19]:
>>v
>> c
>> 0   3136
>> 1   3229
>> 2   3093
>> 3   3121
>> 4   3041
>> ..   ...
>> 27  3128
>> 28  3063
>> 29  3147
>> 30  3073
>> 31  3090
>>
>> [32 rows x 1 columns]
>>
>> In [20]: %timeit df.groupby('c').count()
>> 100 loops, best of 3: 2 ms per loop
>>
>> In [21]: %timeit df.groupby('c').mean()
>> 100 loops, best of 3: 2.39 ms per loop
>>
>> In [22]: df.groupby('c').mean()
>> Out[22]:
>>v
>> c
>> 0   49883.384885
>> 1   50233.692165
>> 2   48634.116069
>> 3   50811.743992
>> 4   50505.368629
>> ..   ...
>> 27  49715.349425
>> 28  50363.501469
>> 29  50485.395933
>> 30  50190.155223
>> 31  50691.041748
>>
>> [32 rows x 1 columns]
>>
>>
>> On Sat, Feb 13, 2016 at 1:29 PM,  wrote:
>>
>>>
>>>
>>> On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane 
>>> wrote:
>>>
 Sorry, to reply to myself here, but looking at it with fresh eyes maybe
 the performance of the naive version isn't too bad. Here's a comparison of
 the naive vs a better implementation:

 def split_classes_naive(c, v):
 return [v[c == u] for u in unique(c)]

 def split_classes(c, v):
 perm = c.argsort()
 csrt = c[perm]
 div = where(csrt[1:] != csrt[:-1])[0] + 1
 return [v[x] for x in split(perm, div)]

 >>> c = randint(0,32,size=10)
 >>> v = arange(10)
 >>> %timeit split_classes_naive(c,v)
 100 loops, best of 3: 8.4 ms per loop
 >>> %timeit split_classes(c,v)
 100 loops, best of 3: 4.79 ms per loop

>>>
>>> The usecases I recently started to target for similar things is 1
>>> Million or more rows and 1 uniques in the labels.
>>> The second version should be faster for large number of uniques, I guess.
>>>
>>> Overall numpy is falling far behind pandas in terms of simple groupby
>>> operations. bincount and histogram (IIRC) worked for some cases but are
>>> rather limited.
>>>
>>> reduce_at looks nice for cases where it applies.
>>>
>>> In contrast to the full sized labels in the original post, I only know
>>> of applications where the labels are 1-D corresponding to rows or columns.
>>>
>>> Josef
>>>
>>>
>>>

 In any case, maybe it is useful to Sergio or others.

 Allan


 On 02/13/2016 12:11 PM, Allan Haldane wrote:

> I've had a pretty similar idea for a new indexing function
> 'split_classes' which would help in your case, which essentially does
>
>  def split_classes(c, v):
>  return [v[c == u] for u in unique(c)]
>
> Your example could be coded as
>
>  >>> [sum(c) for c in split_classes(label, data)]
>  [9, 12, 15]
>
> I feel I've come across the need for such a function often enough that
> it might be generally useful to people as part of numpy. The
> implementation of split_classes above has pretty poor performance
> because it creates many temporary boolean arrays, so my plan for a PR
> was to have a speedy version of it that uses a single pass through v.
> (I often wanted to use this function on large datasets).
>
> If anyone has any comments on the idea (good idea. bad idea?) I'd love
> to hear.
>
> I have some further notes and examples here:
> https://gist

Re: [Numpy-discussion] ANN: numpydoc 0.6.0 released

2016-02-13 Thread josef.pktd
On Sat, Feb 13, 2016 at 10:03 AM, Ralf Gommers 
wrote:

> Hi all,
>
> I'm pleased to announce the release of numpydoc 0.6.0. The main new
> feature is support for the Yields section in numpy-style docstrings. This
> is described in
> https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
>
> Numpydoc can be installed from PyPi: https://pypi.python.org/pypi/numpydoc
>


Thanks,

BTW: the status section in the howto still refers to the documentation
editor, which has been retired AFAIK.

Josef



>
>
> Cheers,
> Ralf
>
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: pandas v0.18.0rc1 - RELEASE CANDIDATE

2016-02-13 Thread Jeff Reback
Hi,

I'm pleased to announce the availability of the first release candidate of
Pandas 0.18.0.
Please try this RC and report any issues here: Pandas Issues

We will be releasing officially in 1-2 weeks or so.

**RELEASE CANDIDATE 1**

This is a major release from 0.17.1 and includes a small number of API
changes, several new features,
enhancements, and performance improvements along with a large number of bug
fixes. We recommend that all
users upgrade to this version.

Highlights include:

   - pandas >= 0.18.0 will no longer support compatibility with Python
   version 2.6 (GH7718) or version 3.3 (GH11273)
   - Moving and expanding window functions are now methods on Series and
   DataFrame similar to .groupby like objects, see here.
   - Adding support for a RangeIndex as a specialized form of the Int64Index
   for memory savings, see here.
   - API breaking .resample changes to make it more .groupby like, see here.
   - Removal of support for positional indexing with floats, which was
   deprecated since 0.14.0. This will now raise a TypeError, see here.
   - The .to_xarray() function has been added for compatibility with the
   xarray package, see here.
   - Addition of the .str.extractall() method, and API changes to the
   .str.extract() method and the .str.cat() method
   - pd.test() top-level nose test runner is available (GH4327)

See the Whatsnew for much more information.
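As a small illustration of the window-function highlight, a sketch run
against the RC: moving and expanding windows are now methods, so the old
pd.rolling_mean(s, window=2) function form becomes a .rolling(...) chain
(the data here is arbitrary example input):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5, dtype=float))

# method chaining replaces pd.rolling_mean(s, window=2);
# the first element has no full window, so it is NaN
print(s.rolling(window=2).mean().tolist())  # [nan, 0.5, 1.5, 2.5, 3.5]
```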

Best way to get this is to install via conda from our development channel.
Builds for osx-64, linux-64, win-64 for Python 2.7 and Python 3.5 are all
available.

conda install pandas=v0.18.0rc1 -c pandas

Thanks to all who made this release happen. It is a very large release!

Jeff
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Suggestion: special-case np.array(range(...)) to be faster

2016-02-13 Thread Antony Lee
Compare (on Python3 -- for Python2, read "xrange" instead of "range"):

In [2]: %timeit np.array(range(100), np.int64)
10 loops, best of 3: 156 ms per loop

In [3]: %timeit np.arange(100, dtype=np.int64)
1000 loops, best of 3: 853 µs per loop


Note that while iterating over a range is not very fast, it is still much
better than the array creation:

In [4]: from collections import deque

In [5]: %timeit deque(range(100), 1)
10 loops, best of 3: 25.5 ms per loop


On one hand, special cases are awful. On the other hand, the range builtin
is probably important enough to deserve a special case to make this
construction faster. Or not? I initially opened this as
https://github.com/numpy/numpy/issues/7233 but it was suggested there that
this should be discussed on the ML first.

(The real issue which prompted this suggestion: I was building sparse
matrices using scipy.sparse.csc_matrix with some indices specified using
range, and that construction step turned out to take a significant portion
of the time because of the calls to np.array).

Antony
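
For what it's worth, the special case can also be done at the user level: a
range object carries its start/stop/step, so it can be turned into an
equivalent np.arange call instead of being iterated element by element.
A minimal sketch (asarray_fast is a hypothetical helper, not part of numpy):

```python
import numpy as np

def asarray_fast(obj, dtype=None):
    """Convert obj to an ndarray, special-casing range for speed."""
    if isinstance(obj, range):
        # range exposes start/stop/step, so no per-element iteration needed
        return np.arange(obj.start, obj.stop, obj.step, dtype=dtype)
    return np.asarray(obj, dtype=dtype)

a = asarray_fast(range(2, 20, 3), dtype=np.int64)
print(a)  # [ 2  5  8 11 14 17]
```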
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Suggestion: special-case np.array(range(...)) to be faster

2016-02-13 Thread josef.pktd
On Sat, Feb 13, 2016 at 8:57 PM, Antony Lee  wrote:

> Compare (on Python3 -- for Python2, read "xrange" instead of "range"):
>
> In [2]: %timeit np.array(range(100), np.int64)
> 10 loops, best of 3: 156 ms per loop
>
> In [3]: %timeit np.arange(100, dtype=np.int64)
> 1000 loops, best of 3: 853 µs per loop
>
>
> Note that while iterating over a range is not very fast, it is still much
> better than the array creation:
>
> In [4]: from collections import deque
>
> In [5]: %timeit deque(range(100), 1)
> 10 loops, best of 3: 25.5 ms per loop
>
>
> On one hand, special cases are awful. On the other hand, the range builtin
> is probably important enough to deserve a special case to make this
> construction faster. Or not? I initially opened this as
> https://github.com/numpy/numpy/issues/7233 but it was suggested there
> that this should be discussed on the ML first.
>
> (The real issue which prompted this suggestion: I was building sparse
> matrices using scipy.sparse.csc_matrix with some indices specified using
> range, and that construction step turned out to take a significant portion
> of the time because of the calls to np.array).
>


IMO: I don't see a reason why this should be supported. There is np.arange
after all for this use case, and np.fromiter.
range and the other guys are iterators, and in several cases we can use
larange = list(range(...)) as a shortcut to get a python list for python 2/3
compatibility.

I think this might be partially a learning effect in the python 2 to 3
transition. After using almost only python 3 for maybe a year, I don't
think it's difficult to remember the differences when writing code that is
py 2.7 and py 3.x compatible.


It's just **another** thing to watch out for if milliseconds matter in your
application.

Josef
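
For reference, the np.fromiter route mentioned above looks like this (a
sketch; passing count lets numpy preallocate the output when the length is
known up front, which range provides via len):

```python
import numpy as np

# np.fromiter consumes any iterable of scalars; count=n preallocates
# the output buffer instead of growing it as elements arrive.
n = 1000
a = np.fromiter(range(n), dtype=np.int64, count=n)
print(a.shape, a[0], a[-1])  # (1000,) 0 999
```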


>
> Antony
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Suggestion: special-case np.array(range(...)) to be faster

2016-02-13 Thread josef.pktd
On Sat, Feb 13, 2016 at 9:43 PM,  wrote:

>
>
> On Sat, Feb 13, 2016 at 8:57 PM, Antony Lee 
> wrote:
>
>> Compare (on Python3 -- for Python2, read "xrange" instead of "range"):
>>
>> In [2]: %timeit np.array(range(100), np.int64)
>> 10 loops, best of 3: 156 ms per loop
>>
>> In [3]: %timeit np.arange(100, dtype=np.int64)
>> 1000 loops, best of 3: 853 µs per loop
>>
>>
>> Note that while iterating over a range is not very fast, it is still much
>> better than the array creation:
>>
>> In [4]: from collections import deque
>>
>> In [5]: %timeit deque(range(100), 1)
>> 10 loops, best of 3: 25.5 ms per loop
>>
>>
>> On one hand, special cases are awful. On the other hand, the range
>> builtin is probably important enough to deserve a special case to make this
>> construction faster. Or not? I initially opened this as
>> https://github.com/numpy/numpy/issues/7233 but it was suggested there
>> that this should be discussed on the ML first.
>>
>> (The real issue which prompted this suggestion: I was building sparse
>> matrices using scipy.sparse.csc_matrix with some indices specified using
>> range, and that construction step turned out to take a significant portion
>> of the time because of the calls to np.array).
>>
>
>
> IMO: I don't see a reason why this should be supported. There is np.arange
> after all for this use case, and np.fromiter.
> range and the other guys are iterators, and in several cases we can use
> larange = list(range(...)) as a shortcut to get a python list for python 2/3
> compatibility.
>
> I think this might be partially a learning effect in the python 2 to 3
> transition. After using almost only python 3 for maybe a year, I don't
> think it's difficult to remember the differences when writing code that is
> py 2.7 and py 3.x compatible.
>
>
> It's just **another** thing to watch out for if milliseconds matter in
> your application.
>


side question: Is there a simple way to distinguish an iterator or generator
from an iterable data structure?

Josef
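
One simple answer to the side question: an iterator returns itself from
iter(), so "iter(x) is x" separates iterators (and generators) from
re-iterable containers; on Python 3, collections.abc makes the same check
explicit. A quick sketch:

```python
from collections.abc import Iterator, Iterable

def is_iterator(x):
    # An iterator is its own iterator; a container (list, range, dict, ...)
    # returns a fresh iterator object from each call to iter().
    return iter(x) is x

print(is_iterator([1, 2, 3]))                # False: re-iterable container
print(is_iterator(range(3)))                 # False: lazy sequence, not an iterator
print(is_iterator(iter([1, 2, 3])))          # True
print(is_iterator(x * 2 for x in range(3)))  # True: generators are iterators

# the abc route gives the same split
print(isinstance(range(3), Iterator))        # False
print(isinstance(range(3), Iterable))        # True
```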



>
> Josef
>
>
>>
>> Antony
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [Suggestion] Labelled Array

2016-02-13 Thread Allan Haldane

Impressive!

Possibly there's still a case for including a 'groupby' function in 
numpy itself since it's a generally useful operation, but I do see less 
of a need given the nice pandas functionality.


At least, next time someone asks a stackoverflow question like the ones 
below someone should tell them to use pandas!


(copied from my gist for future list reference).

http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy
http://stackoverflow.com/questions/31483912/split-numpy-array-according-to-values-in-the-array-a-condition/31484134#31484134
http://stackoverflow.com/questions/31863083/python-split-numpy-array-based-on-values-in-the-array
http://stackoverflow.com/questions/28599405/splitting-an-array-into-two-smaller-arrays-in-python
http://stackoverflow.com/questions/7662458/how-to-split-an-array-according-to-a-condition-in-numpy

Allan
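
To make the comparison concrete, here is a sketch of what such a numpy
'groupby' could look like, using the sort-plus-np.add.reduceat approach
discussed upthread (groupby_sum is a hypothetical helper, not an existing
numpy function):

```python
import numpy as np

def groupby_sum(labels, values):
    """Per-label sums via one stable sort and one np.add.reduceat pass."""
    order = np.argsort(labels, kind='mergesort')  # stable sort by label
    slab, sval = labels[order], values[order]
    # index of the first element of each run of equal labels
    starts = np.concatenate(([0], np.nonzero(slab[1:] != slab[:-1])[0] + 1))
    return slab[starts], np.add.reduceat(sval, starts)

# the example from the original post: labels 0,1,2 repeated per row
label = np.tile([0, 1, 2], 3)
data = np.arange(9)
uniques, sums = groupby_sum(label, data)
print(uniques.tolist(), sums.tolist())  # [0, 1, 2] [9, 12, 15]
```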


On 02/13/2016 01:39 PM, Jeff Reback wrote:

In [10]: pd.options.display.max_rows=10

In [13]: np.random.seed(1234)

In [14]: c = np.random.randint(0,32,size=10)

In [15]: v = np.arange(10)

In [16]: df = DataFrame({'v' : v, 'c' : c})

In [17]: df
Out[17]:
 c  v
0  15  0
1  19  1
2   6  2
3  21  3
4  12  4
........
5   7  5
6   2  6
7  27  7
8  28  8
9   7  9

[10 rows x 2 columns]

In [19]: df.groupby('c').count()
Out[19]:
v
c
0   3136
1   3229
2   3093
3   3121
4   3041
..   ...
27  3128
28  3063
29  3147
30  3073
31  3090

[32 rows x 1 columns]

In [20]: %timeit df.groupby('c').count()
100 loops, best of 3: 2 ms per loop

In [21]: %timeit df.groupby('c').mean()
100 loops, best of 3: 2.39 ms per loop

In [22]: df.groupby('c').mean()
Out[22]:
v
c
0   49883.384885
1   50233.692165
2   48634.116069
3   50811.743992
4   50505.368629
..   ...
27  49715.349425
28  50363.501469
29  50485.395933
30  50190.155223
31  50691.041748

[32 rows x 1 columns]


On Sat, Feb 13, 2016 at 1:29 PM, josef.p...@gmail.com wrote:



On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane
allanhald...@gmail.com wrote:

Sorry, to reply to myself here, but looking at it with fresh
eyes maybe the performance of the naive version isn't too bad.
Here's a comparison of the naive vs a better implementation:

def split_classes_naive(c, v):
 return [v[c == u] for u in unique(c)]

def split_classes(c, v):
 perm = c.argsort()
 csrt = c[perm]
 div = where(csrt[1:] != csrt[:-1])[0] + 1
 return [v[x] for x in split(perm, div)]

>>> c = randint(0,32,size=10)
>>> v = arange(10)
>>> %timeit split_classes_naive(c,v)
100 loops, best of 3: 8.4 ms per loop
>>> %timeit split_classes(c,v)
100 loops, best of 3: 4.79 ms per loop


The usecases I recently started to target for similar things is 1
Million or more rows and 1 uniques in the labels.
The second version should be faster for large number of uniques, I
guess.

Overall numpy is falling far behind pandas in terms of simple
groupby operations. bincount and histogram (IIRC) worked for some
cases but are rather limited.

reduce_at looks nice for cases where it applies.

In contrast to the full sized labels in the original post, I only
know of applications where the labels are 1-D corresponding to rows
or columns.

Josef


In any case, maybe it is useful to Sergio or others.

Allan


On 02/13/2016 12:11 PM, Allan Haldane wrote:

I've had a pretty similar idea for a new indexing function
'split_classes' which would help in your case, which
essentially does

  def split_classes(c, v):
  return [v[c == u] for u in unique(c)]

Your example could be coded as

  >>> [sum(c) for c in split_classes(label, data)]
  [9, 12, 15]

I feel I've come across the need for such a function often
enough that
it might be generally useful to people as part of numpy. The
implementation of split_classes above has pretty poor
performance
because it creates many temporary boolean arrays, so my plan
for a PR
was to have a speedy version of it that uses a single pass
through v.
(I often wanted to use this function on large datasets).

If anyone has any comments on the idea (good idea. bad
idea?) I'd love
to hear.

I have some further notes and examples here:
https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21

Allan

On 02/12/2016 09:40 AM, Sérgio wrote:

Hello,

This is my first e-mail, I will try to make the idea simple.