Re: [Numpy-discussion] Proposed Roadmap Overview

2012-02-20 Thread Francesc Alted
On Feb 20, 2012, at 6:18 PM, Dag Sverre Seljebotn wrote:
> You need at least a slightly different Python API to get anywhere, so 
> numexpr/Theano is the right place to work on an implementation of this 
> idea. Of course it would be nice if numexpr/Theano offered something as 
> convenient as
> 
> with lazy:
>     arr = A + B + C # with all of these NumPy arrays
>     # compute upon exiting…

Hmm, that would be cute indeed.  Do you have an idea on how the code in the 
with context could be passed to the Python AST compiler (à la 
numexpr.evaluate("A + B + C"))?

-- Francesc Alted





Re: [Numpy-discussion] ndarray and lazy evaluation (was: Proposed Roadmap Overview)

2012-02-20 Thread Francesc Alted
On Feb 20, 2012, at 6:46 PM, Dag Sverre Seljebotn wrote:

> On 02/20/2012 09:24 AM, Olivier Delalleau wrote:
>> Hi Dag,
>> 
>> Would you mind elaborating a bit on that example you mentioned at the
>> end of your email? I don't quite understand what behavior you would like
>> to achieve
> 
> Sure, see below. I think we should continue discussion on numpy-discuss.
> 
> I wrote:
> 
>> You need at least a slightly different Python API to get anywhere, so
>> numexpr/Theano is the right place to work on an implementation of this
>> idea. Of course it would be nice if numexpr/Theano offered something as
>> convenient as
>> 
>> with lazy:
>>     arr = A + B + C # with all of these NumPy arrays
>>     # compute upon exiting...
> 
> More information:
> 
> The disadvantage today of using Theano (or numexpr) is that they require 
> using a different API, so that one has to learn and use Theano "from the 
> ground up", rather than just slap it on in an optimization phase.
> 
> The alternative would require extensive changes to NumPy, so I guess 
> Theano authors or Francesc would need to push for this.
> 
> The alternative would be (with A, B, C ndarray instances):
> 
> with theano.lazy:
>     arr = A + B + C
> 
> On __enter__, the context manager would hook into NumPy to override its 
> arithmetic operators. Then it would build a Theano symbolic tree instead 
> of performing computations right away.
> 
> In addition to providing support for overriding arithmetic operators, 
> slicing etc., it would be necessary for "arr" to be an ndarray instance 
> which is "not yet computed" (data-pointer set to NULL, and store a 
> compute-me callback and some context information).
> 
> Finally, the __exit__ would trigger computation. For other operations 
> which need the data pointer (e.g., single element lookup) one could 
> either raise an exception or trigger computation.
> 
> This is just a rough sketch. It is not difficult "in principle", but of 
> course there's really a massive amount of work involved in building support 
> for this into the NumPy APIs.
> 
> Probably, we're talking a NumPy 3.0 thing, after the current round of 
> refactorings have settled...
> 
> Please: Before discussing this further one should figure out if there's 
> manpower available for it; no sense in hashing out a castle in the sky 
> in detail.

I see.  Mark Wiebe already suggested the same thing some time ago:

https://github.com/numpy/numpy/blob/master/doc/neps/deferred-ufunc-evaluation.rst
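
For illustration, a toy sketch of the deferred-evaluation idea (the names `lazy` 
and `LazyArray` below are made up; a real implementation would hook NumPy's own 
operators, as Dag describes, instead of wrapping the inputs):

import numpy as np
from contextlib import contextmanager

_pending = []                        # expression nodes built inside the context

class LazyArray:
    """Toy deferred node: records the operation instead of computing it."""
    def __init__(self, compute):
        self._compute = compute      # zero-argument callable returning an ndarray
        self.result = None           # filled in when the context exits

    def __add__(self, other):
        other_c = other._compute if isinstance(other, LazyArray) else (lambda: other)
        node = LazyArray(lambda: self._compute() + other_c())
        _pending.append(node)
        return node

@contextmanager
def lazy(*arrays):
    # Real support would hook NumPy's own operators; here we just wrap the inputs.
    yield tuple(LazyArray(lambda a=a: a) for a in arrays)
    for node in _pending:            # "compute upon exiting"
        node.result = node._compute()
    del _pending[:]

A, B, C = (np.arange(5.0) for _ in range(3))
with lazy(A, B, C) as (a, b, c):
    arr = a + b + c                  # builds an expression tree; nothing computed yet
print(arr.result)                    # [ 0.  3.  6.  9. 12.]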

> Also it would be better to talk in person about this if 
> possible (I'm in Berkeley now and will attend PyData and PyCon).

Nice.  Most of the Continuum crew (me included) will be attending both 
conferences.  Mark W. will make PyCon only, but it will be a good occasion to 
discuss this further.

See you,

-- Francesc Alted





Re: [Numpy-discussion] Proposed Roadmap Overview

2012-02-20 Thread Francesc Alted
On Feb 20, 2012, at 7:08 PM, Dag Sverre Seljebotn wrote:

> On 02/20/2012 09:34 AM, Christopher Jordan-Squire wrote:
>> On Mon, Feb 20, 2012 at 9:18 AM, Dag Sverre Seljebotn
>>   wrote:
>>> On 02/20/2012 08:55 AM, Sturla Molden wrote:
>>>> Den 20.02.2012 17:42, skrev Sturla Molden:
>>>>> There are still other options than C or C++ that are worth considering.
>>>>> One would be to write NumPy in Python. E.g. we could use LLVM as a
>>>>> JIT-compiler and produce the performance critical code we need on the fly.
>>>>> 
>>>>> 
>>>> 
>>>> LLVM and its C/C++ frontend Clang are BSD licenced. It compiles faster
>>>> than GCC and often produces better machine code. They can therefore be
>>>> used inside an array library. It would give a faster NumPy, and we could
>>>> keep most of it in Python.
>>> 
>>> I think it is moot to focus on improving NumPy performance as long as in
>>> practice all NumPy operations are memory bound due to the need to take a
>>> trip through system memory for almost any operation. C/C++ is simply
>>> "good enough". JIT is when you're chasing a 2x improvement or so, but
>>> today NumPy can be 10-20x slower than a Cython loop.
>>> 
>> 
>> I don't follow this. Could you expand a bit more? (Specifically, I
>> wasn't aware that numpy could be 10-20x slower than a cython loop, if
>> we're talking about the base numpy library--so core operations. I'm
> 
> The problem with NumPy is the temporaries needed -- if you want to compute
> 
> A + B + np.sqrt(D)
> 
> then, if the arrays are larger than cache size (a couple of megabytes), 
> then each of those operations will first transfer the data in and out 
> over the memory bus. I.e. first you compute an element of sqrt(D), then 
> the result of that is put in system memory, then later the same number 
> is read back in order to add it to an element in B, and so on.
> 
> The compute-to-bandwidth ratio of modern CPUs is between 30:1 and 
> 60:1... so in extreme cases it's cheaper to do 60 additions than to 
> transfer a single number from system memory.
> 
> It is much faster to only transfer an element (or small block) from each 
> of A, B, and D to CPU cache, then do the entire expression, then 
> transfer the result back. This is easy to code in Cython/Fortran/C and 
> impossible with NumPy/Python.
> 
> This is why numexpr/Theano exists.

Well, I can't speak for Theano (it is quite a bit more general than numexpr, and more 
geared towards using GPUs, right?), but this was certainly the issue that made 
David Cooke create numexpr.  A more in-depth explanation of this problem 
can be found at:

http://www.euroscipy.org/talk/1657

which includes some graphical explanations.
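
For reference, this is the kind of expression where numexpr avoids the 
temporaries by evaluating blockwise (the array shapes below are arbitrary):

import numpy as np
import numexpr as ne

A, B, D = (np.random.rand(1000, 1000) for _ in range(3))

r1 = A + B + np.sqrt(D)              # plain NumPy: builds full-size temporaries
r2 = ne.evaluate("A + B + sqrt(D)")  # numexpr: works on cache-sized blocks
assert np.allclose(r1, r2)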

-- Francesc Alted





Re: [Numpy-discussion] np.longlong casts to int

2012-02-23 Thread Francesc Alted
On Feb 23, 2012, at 3:06 AM, Pierre Haessig wrote:

> Hi,
> Le 23/02/2012 02:24, Matthew Brett a écrit :
>> Luckily I was in fact using longdouble in the live code,
> I had never used "exotic" floating point precision, so thanks for your post,
> which made me take a look at the docstring and documentation.
> 
> If I got it right from the docstring, 'np.longdouble', 'np.longfloat'
> are all in fact 'np.float128'.
> (numpy 1.5)

That in fact depends on the platform you are using.  Typically, on 32-bit 
platforms 'np.longfloat' and 'np.longdouble' are bound to 'np.float96', while 
on 64-bit platforms they are bound to 'np.float128'.

> However, I was surprised that float128 is not mentioned in the array of
> available types in the user guide.
> http://docs.scipy.org/doc/numpy/user/basics.types.html
> Is there a specific reason for this absence, or is just about visiting
> the documentation wiki ;-) ?

The reason is most probably that you cannot get a float96 or float128 whenever 
you want (it depends on your architecture), so adding these types to the manual 
could be misleading.  However, I'd advocate documenting them while warning 
about platform portability issues.
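
A quick way to check what you actually get on a given platform (just an 
illustrative snippet; the output depends on the architecture and build):

import numpy as np

# What does long double map to here?  float96 on 32-bit x86, float128 on
# 64-bit Linux, and a plain 64-bit double on MSVC builds.
print(np.longdouble)
print(np.finfo(np.longdouble))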

> Additionally, I don't know what the writing guidelines of the user
> guide are, but would it make sense to add some "new numpy 1.x" messages as
> in the Python docs?  I'm thinking here of np.float16.  I know it exists
> from messages on this mailing list, but my 1.5 doesn't have it.

float16 was introduced in NumPy 1.6, IIRC.

> PS : I found float128 mentionned in the reference
> http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#built-in-scalar-types
> However, it is not as easily readable as the user guide (which makes
> sense !).
> 
> Do the following statements mean that those types are not available on
> all platforms?
> float96 96 bits, platform?  
> float128 128 bits, platform?

Exactly.  I'd update this to read:

float96   96 bits.  Only available on 32-bit (i386) platforms.
float128  128 bits.  Only available on 64-bit (AMD64) platforms.

-- Francesc Alted





Re: [Numpy-discussion] np.longlong casts to int

2012-02-23 Thread Francesc Alted

On Feb 23, 2012, at 5:43 AM, Nathaniel Smith wrote:

> On Thu, Feb 23, 2012 at 11:40 AM, Francesc Alted  
> wrote:
>> Exactly.  I'd update this to read:
>> 
>> float96   96 bits.  Only available on 32-bit (i386) platforms.
>> float128  128 bits.  Only available on 64-bit (AMD64) platforms.
> 
> Except float96 is actually 80 bits. (Usually?) Plus some padding…

Good point.  The thing is that they actually use 96 bits for storage purposes 
(this is due to alignment requirements).

Another quirk related to this is that MSVC automatically maps long double to 
64-bit doubles:

http://msdn.microsoft.com/en-us/library/9cx8xs15.aspx

Not sure why they did that (portability issues?).
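
A small illustration of the storage-versus-precision distinction (assuming an 
x86 machine; the values in the comments are what one typically sees there):

import numpy as np

info = np.finfo(np.longdouble)
print(np.dtype(np.longdouble).itemsize * 8)  # storage bits: 96 (i386) or 128 (x86-64)
print(info.nmant)                            # stored mantissa bits: 63 for x87 extended
print(info.eps)                              # 2**-63 ~ 1.08e-19 in both cases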

-- Francesc Alted





Re: [Numpy-discussion] np.longlong casts to int

2012-02-23 Thread Francesc Alted
On Feb 23, 2012, at 6:06 AM, Francesc Alted wrote:
> On Feb 23, 2012, at 5:43 AM, Nathaniel Smith wrote:
> 
>> On Thu, Feb 23, 2012 at 11:40 AM, Francesc Alted  
>> wrote:
>>> Exactly.  I'd update this to read:
>>> 
>>> float96   96 bits.  Only available on 32-bit (i386) platforms.
>>> float128  128 bits.  Only available on 64-bit (AMD64) platforms.
>> 
>> Except float96 is actually 80 bits. (Usually?) Plus some padding…
> 
> Good point.  The thing is that they actually use 96 bit for storage purposes 
> (this is due to alignment requirements).
> 
> Another quirk related with this is that MSVC automatically maps long double 
> to 64-bit doubles:
> 
> http://msdn.microsoft.com/en-us/library/9cx8xs15.aspx
> 
> Not sure on why they did that (portability issues?).

Hmm, yet another quirk (this time in NumPy itself).  On 32-bit platforms:

In [16]: np.longdouble
Out[16]: numpy.float96

In [17]: np.finfo(np.longdouble).eps
Out[17]: 1.084202172485504434e-19

while on 64-bit ones:

In [8]: np.longdouble
Out[8]: numpy.float128

In [9]: np.finfo(np.longdouble).eps
Out[9]: 1.084202172485504434e-19

i.e. NumPy is saying that the eps (machine epsilon) is the same on both 
platforms, despite the fact that one uses 80-bit precision and the other 
128-bit precision.  For the 80-bit format, the eps should be:

In [5]: 1 / 2**63.
Out[5]: 1.0842021724855044e-19

[http://en.wikipedia.org/wiki/Extended_precision]

which is correctly stated by NumPy, while for 128-bit (quad precision), eps 
should be:

In [6]: 1 / 2**113.
Out[6]: 9.62964972193618e-35

[http://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format]

If nobody objects, I'll file a bug about this.

-- Francesc Alted





Re: [Numpy-discussion] np.longlong casts to int

2012-02-23 Thread Francesc Alted
On Feb 23, 2012, at 10:26 AM, Matthew Brett wrote:

> Hi,
> 
> On Thu, Feb 23, 2012 at 4:23 AM, Francesc Alted  wrote:
>> On Feb 23, 2012, at 6:06 AM, Francesc Alted wrote:
>>> On Feb 23, 2012, at 5:43 AM, Nathaniel Smith wrote:
>>> 
>>>> On Thu, Feb 23, 2012 at 11:40 AM, Francesc Alted  
>>>> wrote:
>>>>> Exactly.  I'd update this to read:
>>>>> 
>>>>> float96   96 bits.  Only available on 32-bit (i386) platforms.
>>>>> float128  128 bits.  Only available on 64-bit (AMD64) platforms.
>>>> 
>>>> Except float96 is actually 80 bits. (Usually?) Plus some padding…
>>> 
>>> Good point.  The thing is that they actually use 96 bit for storage 
>>> purposes (this is due to alignment requirements).
>>> 
>>> Another quirk related with this is that MSVC automatically maps long double 
>>> to 64-bit doubles:
>>> 
>>> http://msdn.microsoft.com/en-us/library/9cx8xs15.aspx
>>> 
>>> Not sure on why they did that (portability issues?).
>> 
>> Hmm, yet another quirk (this time in NumPy itself).  On 32-bit platforms:
>> 
>> In [16]: np.longdouble
>> Out[16]: numpy.float96
>> 
>> In [17]: np.finfo(np.longdouble).eps
>> Out[17]: 1.084202172485504434e-19
>> 
>> while on 64-bit ones:
>> 
>> In [8]: np.longdouble
>> Out[8]: numpy.float128
>> 
>> In [9]: np.finfo(np.longdouble).eps
>> Out[9]: 1.084202172485504434e-19
>> 
>> i.e. NumPy is saying that the eps (machine epsilon) is the same on both 
>> platforms, despite the fact that one uses 80-bit precision and the other 
>> 128-bit precision.  For the 80-bit, the eps should be ():
>> 
>> In [5]: 1 / 2**63.
>> Out[5]: 1.0842021724855044e-19
>> 
>> [http://en.wikipedia.org/wiki/Extended_precision]
>> 
>> which is correctly stated by NumPy, while for 128-bit (quad precision), eps 
>> should be:
>> 
>> In [6]: 1 / 2**113.
>> Out[6]: 9.62964972193618e-35
>> 
>> [http://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format]
>> 
>> If nobody objects, I'll file a bug about this.
> 
> There was half a proposal for renaming these guys in the interests of clarity:
> 
> http://mail.scipy.org/pipermail/numpy-discussion/2011-October/058820.html

Oh, my bad.  Thanks for pointing this out!

> I'd be happy to write this up as a NEP.

Or even better, adapt the docs to say something like:

float96   96 bits storage, 80-bit precision.  Only available on 32-bit 
platforms.
float128  128 bits storage, 80-bit precision.  Only available on 64-bit 
platforms.

-- Francesc Alted





Re: [Numpy-discussion] mkl usage

2012-02-23 Thread Francesc Alted
On Feb 23, 2012, at 1:33 PM, Neal Becker wrote:

> Is mkl only used for linear algebra?  Will it speed up e.g., elementwise 
> transcendental functions?

Yes, MKL comes with VML that has this type of optimizations:

http://software.intel.com/sites/products/documentation/hpc/mkl/vml/vmldata.htm

Also, see some speedups in a numexpr linked against MKL here:

http://code.google.com/p/numexpr/wiki/NumexprVML

See also how the native multi-threading implementation in numexpr beats MKL's 
(at least for this particular example).
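
A rough way to measure this on your own box (just a sketch, no particular 
figures implied; numexpr only uses VML if it was built against MKL):

import timeit
import numpy as np
import numexpr as ne

x = np.random.rand(5 * 1000 * 1000)

print(timeit.timeit(lambda: np.sin(x), number=10))              # plain NumPy
print(timeit.timeit(lambda: ne.evaluate("sin(x)"), number=10))  # numexpr (VML if linked)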

-- Francesc Alted





Re: [Numpy-discussion] mkl usage

2012-02-23 Thread Francesc Alted
On Feb 23, 2012, at 2:19 PM, Neal Becker wrote:

> Pauli Virtanen wrote:
> 
>> 23.02.2012 20:44, Francesc Alted kirjoitti:
>>> On Feb 23, 2012, at 1:33 PM, Neal Becker wrote:
>>> 
>>>> Is mkl only used for linear algebra?  Will it speed up e.g., elementwise
>>>> transcendental functions?
>>> 
>>> Yes, MKL comes with VML that has this type of optimizations:
>> 
>> And also no, in the sense that Numpy and Scipy don't use VML.
>> 
> 
> My question is:
> 
> "Should I purchase MKL?"
> 
> To what extent will it speed up my existing python code, without my having to 
> exert (much) effort?
> 
> So that would be numpy/scipy.

Pauli already answered you.  If you are restricted to using numpy/scipy and your 
aim is to accelerate the evaluation of transcendental functions, then there is 
no point in purchasing MKL.  If you can broaden your options and use numexpr, 
then I think it is worth considering.

-- Francesc Alted





Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Francesc Alted
On Feb 26, 2012, at 1:16 PM, Warren Weckesser wrote:
> For anyone benchmarking software like this, be sure to clear the disk cache 
> before each run.  In linux:
> 
> $ sync
> $ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
> 

It is also a good idea to run a disk-cache-enabled test too, just to better see 
how things can be improved in your code.  The disk subsystem is pretty slow, and 
during development you can get much better feedback by looking at load times 
from memory, not from disk (also, the tests run much faster, so you save a lot 
of development time).

> In Mac OSX:
> 
> $ purge

Now that I switched to a Mac, this is good to know.  Thanks!

-- Francesc Alted





Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Francesc Alted
On Feb 26, 2012, at 1:49 PM, Nathaniel Smith wrote:

> On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser
>  wrote:
>> On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith  wrote:
>>> For this kind of benchmarking, you'd really rather be measuring the
>>> CPU time, or reading byte streams that are already in memory. If you
>>> can process more MB/s than the drive can provide, then your code is
>>> effectively perfectly fast. Looking at this number has a few
>>> advantages:
>>>  - You get more repeatable measurements (no disk buffers and stuff
>>> messing with you)
>>>  - If your code can go faster than your drive, then the drive won't
>>> make your benchmark look bad
>>>  - There are probably users out there that have faster drives than you
>>> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
>>> array), so it's nice to be able to measure optimizations even after
>>> they stop mattering on your equipment.
>> 
>> 
>> For anyone benchmarking software like this, be sure to clear the disk cache
>> before each run.  In linux:
> 
> Err, my argument was that you should do exactly the opposite, and just
> worry about hot-cache times (or time reading a big in-memory buffer,
> to avoid having to think about the OS's caching strategies).
> 
> Clearing the disk cache is very important for getting meaningful,
> repeatable benchmarks in code where you know that the cache will
> usually be cold and where hitting the disk will have unpredictable
> effects (i.e., pretty much anything doing random access, like
> databases, which have complicated locality patterns, you may or may
> not trigger readahead, etc.). But here we're talking about pure
> sequential reads, where the disk just goes however fast it goes, and
> your code can either keep up or not.

Exactly.

> One minor point where the OS interface could matter: it's good to set
> up your code so it can use mmap() instead of read(), since this can
> reduce overhead. read() has to copy the data from the disk into OS
> memory, and then from OS memory into your process's memory; mmap()
> skips the second step.

Cool.  Nice trick!
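
A minimal sketch of the two approaches from Python (the file name and dtype 
are made up):

import numpy as np

# read()-style: bytes are copied disk -> OS page cache -> process memory
a = np.fromfile("data.bin", dtype=np.float64)

# mmap()-style: pages are mapped straight into the process, skipping that copy
b = np.memmap("data.bin", dtype=np.float64, mode="r")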

-- Francesc Alted





Re: [Numpy-discussion] [Numpy] quadruple precision

2012-02-29 Thread Francesc Alted
On Feb 29, 2012, at 11:52 AM, Pierre Haessig wrote:

> Hi,
> 
> Le 29/02/2012 16:22, Paweł Biernat a écrit :
>> Is there any way to interact with Fortran's real(16) (supported by gcc
>> and Intel's ifort) data type from numpy? By real(16) I mean the
>> binary128 type as in IEEE 754. (In C this data type is experimentally
>> supported as __float128 (gcc) and _Quad (Intel's icc).) 
> I googled a bit this "__float128". It seems a fairly new addition (GCC
> 4.6, released March 2011).
> The related point in the changelog [1] is :
> 
> "GCC now ships with the LGPL-licensed libquadmath library, which
> provides quad-precision mathematical functions for targets with a
> __float128 datatype. __float128 is available for targets on 32-bit x86,
> x86-64 and Itanium architectures. The libquadmath library is
> automatically built on such targets when building the Fortran compiler."

Great find!

> It seems this __float128 is newcomer in the "picture of data types" that
> Matthew just mentioned.
> As David says, arithmetic with such a 128 bits data type is probably not
> "hardwired" in most processors (I mean Intel & friends) which are
> limited to 80 bits ("long doubles") so it may be a bit slow. However,
> this GCC implementation with libquadmath seems to create some level of
> abstraction. Maybe this is one acceptably good way for a real "IEEE
> float 128" dtype in numpy ?

That would be really nice.  The problem here is two-fold:

* Backwards compatibility.  float128 should represent a different data type 
than before, so we probably should find a new name (and charcode!) for 
quad precision.  Maybe quad128?

* Compiler dependency.  The new type will only be available on platforms that 
have GCC 4.6 or above.  Again, using the new name for this should be fine.  On 
platforms/compilers not supporting the quad128 thing, it should not be defined.

Uh, I foresee many portability problems for people using this, but perhaps it 
is worth the mess.
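
For the record, a small check that tells the two cases apart on a given 
platform (illustrative only):

import numpy as np

info = np.finfo(np.longdouble)
if info.nmant == 112:
    print("long double is true IEEE binary128 (quad precision)")
elif info.nmant == 63:
    print("long double is x87 80-bit extended, padded to 96/128 bits of storage")
print(info.eps)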

-- Francesc Alted





Re: [Numpy-discussion] subclassing array in c

2012-03-08 Thread Francesc Alted
Sure.  Check the memcheck tool of Valgrind:

 http://valgrind.org/info/tools.html#memcheck

It is a really amazing tool.

Francesc

On Mar 8, 2012, at 11:00 PM, Christoph Gohle wrote:

> 
> Hi again,
> 
> I don't want to look as if I want other people do my work, so I would like to 
> ask if there is a simple way of tracing memory leaks (without recompiling the 
> python interpreter)?
> 
> Cheers,
> Christoph
> Am 09.03.2012 um 01:22 schrieb Christoph Gohle:
> 
>> 
>> Hi
>> Am 08.03.2012 um 20:39 schrieb Pauli Virtanen:
>> 
>>> 08.03.2012 17:37, Christoph Gohle kirjoitti:
>>>> thanks for testing. I have now tried on different platforms. I get
>>>> all kinds of crashes on os x (now with numpy 1.6.1) and windows
>>>> with numpy 1.6.0. On Ubuntu with numpy 1.3.0 I get a hughe memory
>>>> leak...
>>>> 
>>>> Any hints would be welcome.
>>> 
>>> The type object inherits `tp_alloc` from Numpy. This routine always
>>> allocates memory of size NPY_SIZEOF_PYARRAYOBJECT for the
>>> PyArrayObject. Therefore, the write to new->unit in your
>>> __array_finalize__ goes to unallocated memory.
>>> 
>>> This is probably a bug in Numpy --- arrayobject.c:array_alloc should
>>> respect the size specified by the subtype.
>>> 
>>> A workaround is probably to specify a suitable tp_alloc routine yourself:
>>> 
>>>     PyType_GenericAlloc,    /* tp_alloc */
>>>     unitArray_new,          /* tp_new */
>>>     _PyObject_Del           /* tp_free */
>>> 
>> OK, I did that. And I get no more segfaults as far as I can tell. But there 
>> is still a memory leak:
>> 
>> In [1]: import spampub
>> 
>> In [2]: a=[spampub.UnitArray(i,{'s':i}) for i in xrange(10)]
>> 
>> In [3]: del a
>> 
>> after the last two statements, python uses ~60MB more memory than before.
>> 
>> Thanks for your help
>> Christoph
>>> -- 
>>> Pauli Virtanen
>>> 
>>> ___
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> 
>> 
>> Christoph Gohle
>> - --
>> Max-Planck-Institut für Quantenoptik
>> Abteilung Quantenvielteilchensysteme
>> Hans-Kopfermann-Strasse 1
>> 85748 Garching
>> 
>> christoph.go...@mpq.mpg.de
>> tel: +49 89 32905 283
>> fax: +49 89 32905 313
>> 
>> 
>> 
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> 
> 
> Christoph Gohle
> - --
> Max-Planck-Institut für Quantenoptik
> Abteilung Quantenvielteilchensysteme
> Hans-Kopfermann-Strasse 1
> 85748 Garching
> 
> christoph.go...@mpq.mpg.de
> tel: +49 89 32905 283
> fax: +49 89 32905 313
> 
> 
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

-- Francesc Alted





Re: [Numpy-discussion] numpy videos

2012-03-12 Thread Francesc Alted
On Mar 12, 2012, at 5:23 PM, Abhishek Pratap wrote:

> Super awesome. I love how the python community in general keeps the
> recordings available for free.
> 
> @Adam : I do have some problems that I can hit numpy with, mainly
> bigData based. So in summary I have millions/billions of rows of
> biological data on which I want to run some computation but at the
> same time have a capability to do quick lookup. I am not sure if numpy
> will be applicable for quick lookups  by a string based key right ??

PyTables does precisely that.  It allows you to do out-of-core operations with large 
arrays, store tables with an unlimited number of rows on disk and, by using its 
integrated indexing engine (OPSI), perform quick lookups based on 
strings (or any other type).  Look into these examples:

http://www.pytables.org/moin/HowToUse#Selectingvalues
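
A minimal sketch of that kind of indexed lookup (file and table names, and the 
sizes, are made up; the spelling follows the recent PyTables API, e.g. 
open_file, while older versions use camelCase names):

import tables as tb

class Record(tb.IsDescription):
    key = tb.StringCol(8)           # string key, e.g. b"key500"
    value = tb.Int32Col()

with tb.open_file("lookup.h5", "w") as f:
    t = f.create_table("/", "data", Record)
    t.append([(("key%d" % i).encode(), i) for i in range(100000)])
    t.flush()
    t.cols.key.create_index()       # OPSI index on the key column
    # index-assisted, in-kernel query by string key
    print(t.read_where("key == skey", condvars={"skey": b"key500"}))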

HTH,

-- Francesc Alted





Re: [Numpy-discussion] numpy videos

2012-03-13 Thread Francesc Alted
On Mar 13, 2012, at 7:31 AM, Sturla Molden wrote:

> On 12.03.2012 23:23, Abhishek Pratap wrote:
>> Super awesome. I love how the python community in general keeps the
>> recordings available for free.
>> 
>> @Adam : I do have some problems that I can hit numpy with, mainly
>> bigData based. So in summary I have millions/billions of rows of
>> biological data on which I want to run some computation but at the
>> same time have a capability to do quick lookup. I am not sure if numpy
>> will be applicable for quick lookups  by a string based key right ??
> 
> 
> Jason Kinser's book on Python for bioinformatics might be of interest. Though 
> I don't always agree with his NumPy coding style.
> 
> As for "big data", it is a problem regardless of language. The HDF5 library 
> might be of help (cf. PyTables or h5py, I actually prefer the latter).

Yes, however IMO PyTables adapts better to the OP's lookup use case.  For 
example, let's suppose a very simple key-value problem, where we need to locate 
a certain value by using a key.  Using h5py I get:

In [1]: import numpy as np

In [2]: N = 100*1000

In [3]: sa = np.fromiter((('key'+str(i), i) for i in xrange(N)), dtype="S8,i4")

In [4]: import h5py

In [5]: f = h5py.File('h5py.h5', 'w')

In [6]: d = f.create_dataset('sa', data=sa)

In [7]: time [val for val in d if val[0] == 'key500']
CPU times: user 28.34 s, sys: 0.06 s, total: 28.40 s
Wall time: 29.25 s
Out[7]: [('key500', 500)]

Another option is to use fancy selection:

In [8]: time d[d['f0']=='key500']
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
Out[8]: 
array([('key500', 500)], 
      dtype=[('f0', 'S8'), ('f1', '<i4')])

See:

http://pytables.github.com/usersguide/optimization.html#accelerating-your-searches

for a more detailed rationale and benchmarks with big datasets.

-- Francesc Alted





Re: [Numpy-discussion] numpy + MKL problems

2012-03-16 Thread Francesc Alted
On Mar 16, 2012, at 8:00 PM, Glen Jenness wrote:

> Dear users,
> I was playing around with my numpy configuration, and it is now no longer 
> working.  Whenever I try to run anything with it, I get:
> 
> ...MKL FATAL ERROR: /opt/intel/mkl/10.0.3.020/lib/em64t/: cannot read file 
> data: Is a directory
> 
> My site.cfg file is (in case it helps!):
> [DEFAULT]
> library_dirs = /usr/lib
> include_dirs = /usr/include
> [fftw]
> libraries = fftw3
> [mkl]
> library_dirs = /opt/intel/mkl/10.0.3.020/lib/em64t
> include_dirs = /opt/intel/mkl/10.0.3.020/include
> mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core
> 
> I have followed the directions for installing it from source, and it was 
> working fine earlier today.  But I'm not really sure what I had changed so it 
> would go from working to not working.

This might be related to:

http://projects.scipy.org/numpy/ticket/993

which was fixed in the last few hours.  Could you please bisect 
(http://webchick.net/node/99) and tell us which commit is the bad one?

Thanks!

-- Francesc Alted





Re: [Numpy-discussion] numpy + MKL problems

2012-03-16 Thread Francesc Alted
On Mar 16, 2012, at 8:28 PM, Glen Jenness wrote:

> Francesc,
> I don't think that's the problem-I'm working with a version of numpy I had 
> downloaded back in Dec/Jan and have not updated it since.
> 
> However, doing a "LD_PRELOAD" seems to have fixed the problem, but now 
> whenever I'm running a script I get:
> 
>  python 2.4.3 GCC 4.1.2 20070626 (Red Hat 4.1.2-14) 64bit ELF on Linux x86_64 
> redhat 5.2 Final
> python: symbol lookup error: 
> /opt/intel/mkl/10.0.3.020/lib/em64t/libmkl_lapack.so: undefined symbol: 
> mkl_lapack_dgetrf

So, if numpy has not changed, then something else has, right?  Have you 
upgraded MKL?  GCC?  Installed the Intel C compiler?

-- Francesc Alted





Re: [Numpy-discussion] Looking for people interested in helping with Python compiler to LLVM

2012-03-20 Thread Francesc Alted
On Mar 20, 2012, at 12:49 PM, mark florisson wrote:
>> Cython and Numba certainly overlap.  However, Cython requires:
>> 
>>1) learning another language
>>2) creating an extension module --- loading bit-code files and 
>> dynamically executing (even on a different machine from the one that 
>> initially created them) can be a powerful alternative for run-time 
>> compilation and distribution of code.
>> 
>> These aren't show-stoppers obviously.   But, I think some users would prefer 
>> an even simpler approach to getting fast-code than Cython (which currently 
>> doesn't do enought type-inference and requires building a dlopen extension 
>> module).
> 
> Dag and I have been discussing this at PyCon, and here is my take on
> it (at this moment :).
> 
> Definitely, if you can avoid Cython then that is easier and more
> desirable in many ways. So perhaps we can create a third project
> called X (I'm not very creative, maybe ArrayExprOpt), that takes an
> abstract syntax tree in a rather simple form, performs code
> optimizations such as rewriting loops with array accesses to vector
> expressions, fusing vector expressions and loops, etc, and spits out a
> transformed AST containing these optimizations. If runtime information
> is given such as actual shape and stride information the
> transformations could figure out there and then whether to do things
> like collapsing, axes swapping, blocking (as in, introducing more axes
> or loops to retain discontiguous blocks in the cache), blocked memory
> copies to contiguous chunks, etc. The AST could then also say whether
> the final expressions are vectorizable. Part of this functionality is
> already in numpy's nditer, except that this would be implicit and do
> more (and hopefully with minimal overhead).
> 
> So numba, Cython and maybe numexpr could use the functionality, simply
> by building the AST from Python and converting back (if necessary) to
> its own AST. As such, the AST optimizer would be only part of any
> (runtime) compiler's pipeline, and it should be very flexible to
> retain any information (metadata regarding actual types, control flow
> information, etc) provided by the original AST. It would not do
> control flow analysis, type inference or promotion, etc, but only deal
> with abstract types like integers, reals and arrays (C, Fortran or
> partly contiguous or strided). It would not deal with objects, but
> would allow to insert nodes like UnreorderableNode and SideEffectNode
> wrapping parts of the original AST. In short, it should be as easy as
> possible to convert from an original AST to this project's AST and
> back again afterwards.

I think this is a very interesting project, and certainly projects like numba 
can benefit from it.  So, in order for us to have an idea of what you are after, can 
we assume that your project (call it X) would be a kind of compiler optimizer, 
and that the produced, optimized code could be fed into numba for optimized 
LLVM code generation (which, in turn, can be run on top of CPUs or GPUs or a 
combination)?  Is that correct?

Given that my interpretation above is correct, it is a bit more difficult for me 
to see how your X project could benefit numexpr.  In fact, I actually 
see it the other way round: once the optimizer has discovered the 
vectorizable parts, it could go one step further and generate code that uses 
numexpr automatically (call this vectorization through numexpr).  Is this what 
you mean, or am I missing something?

> As the project matures many optimizations may be added that deal with
> all sorts of loop restructuring and ways to efficiently utilize the
> cache as well as enable vectorization and possibly parallelism.
> Perhaps it could even generate a different AST depending on whether
> execution target the CPU or the GPU (with optionally available
> information such as cache sizes, GPU shared/local memory sizes, etc).
> 
> Seeing that this would be a part of my master dissertation, my
> supervisor would require me to write the code, so at least until
> August I think I would have to write (at least the bulk of) this.
> Otherwise I can also make other parts of my dissertation's project
> more prominent to make up for it. Anyway, my question is, is there
> interest from at least the numba and numexpr projects (if code can be
> transformed into vector operations, it makes sense to use numexpr for
> that, I'm not sure what numba's interest is in that).

I'm definitely interested in the numexpr part.  It is just that I'm still 
struggling to see the big picture on this.  But the general idea is really 
appealing.

Thanks,

-- Francesc Alted





Re: [Numpy-discussion] Looking for people interested in helping with Python compiler to LLVM

2012-03-20 Thread Francesc Alted
On Mar 20, 2012, at 2:29 PM, Dag Sverre Seljebotn wrote:
> Francesc Alted  wrote:
> 
>> On Mar 20, 2012, at 12:49 PM, mark florisson wrote:
>>>> Cython and Numba certainly overlap.  However, Cython requires:
>>>> 
>>>>   1) learning another language
>>>>   2) creating an extension module --- loading bit-code files
>> and dynamically executing (even on a different machine from the one
>> that initially created them) can be a powerful alternative for run-time
>> compilation and distribution of code.
>>>> 
>>>> These aren't show-stoppers obviously.   But, I think some users
>> would prefer an even simpler approach to getting fast-code than Cython
>> (which currently doesn't do enought type-inference and requires
>> building a dlopen extension module).
>>> 
>>> Dag and I have been discussing this at PyCon, and here is my take on
>>> it (at this moment :).
>>> 
>>> Definitely, if you can avoid Cython then that is easier and more
>>> desirable in many ways. So perhaps we can create a third project
>>> called X (I'm not very creative, maybe ArrayExprOpt), that takes an
>>> abstract syntax tree in a rather simple form, performs code
>>> optimizations such as rewriting loops with array accesses to vector
>>> expressions, fusing vector expressions and loops, etc, and spits out
>> a
>>> transformed AST containing these optimizations. If runtime
>> information
>>> is given such as actual shape and stride information the
>>> transformations could figure out there and then whether to do things
>>> like collapsing, axes swapping, blocking (as in, introducing more
>> axes
>>> or loops to retain discontiguous blocks in the cache), blocked memory
>>> copies to contiguous chunks, etc. The AST could then also say whether
>>> the final expressions are vectorizable. Part of this functionality is
>>> already in numpy's nditer, except that this would be implicit and do
>>> more (and hopefully with minimal overhead).
>>> 
>>> So numba, Cython and maybe numexpr could use the functionality,
>> simply
>>> by building the AST from Python and converting back (if necessary) to
>>> its own AST. As such, the AST optimizer would be only part of any
>>> (runtime) compiler's pipeline, and it should be very flexible to
>>> retain any information (metadata regarding actual types, control flow
>>> information, etc) provided by the original AST. It would not do
>>> control flow analysis, type inference or promotion, etc, but only
>> deal
>>> with abstract types like integers, reals and arrays (C, Fortran or
>>> partly contiguous or strided). It would not deal with objects, but
>>> would allow to insert nodes like UnreorderableNode and SideEffectNode
>>> wrapping parts of the original AST. In short, it should be as easy as
>>> possible to convert from an original AST to this project's AST and
>>> back again afterwards.
>> 
>> I think this is a very interesting project, and certainly projects like
>> numba can benefit of it.  So, in order to us have an idea on what you
>> are after, can we assume that your project (call it X) would be kind of
>> an compiler optimizer, and then the produced, optimized code could be
>> feed into numba for optimized LLVM code generation (that on its turn,
>> can be run on top of CPUs or GPUs or a combination)?  Is that correct?
> 
> I think so. Another way of thinking about it is that it is a reimplementation 
> of the logic in the (good and closed source) Fortran 90 compilers, in a 
> reusable component for inclusion in various compilers.
> 
> Various c++ metaprogramming libraries (like Blitz++) are similar too.

Aha, thanks.

>> Giving that my interpretation above is correct, it is bit more
>> difficult to me to see how your X project could be of benefit for
>> numexpr.  In fact, I actually see this the other way round: once the
>> optimizer has discovered the vectorization parts, then go one step
>> further and generate code that uses numexpr automatically (call this,
>> vectorization through numexpr).  This is what you mean, or I'm missing
>> something?
> 
> No. I think in some ways this is a competitor to numexpr -- you would gut out 
> the middle of numexpr and keep the frontend and backend, but use this to 
> optimize iteration order and blocking strategies.

I see.  Yes, I can easily see Mark's project X + numba more as a competitor 
(killer?) to numexpr too.

> 
> I think the goal is for higher performa

Re: [Numpy-discussion] *****SPAM***** Re: *****SPAM***** Re: Numpy for IronPython 2.7 DLR app?

2012-04-02 Thread Francesc Alted
On 4/2/12 10:46 AM, William Johnston wrote:
> Hello,
>
> My email server went down.
>
> Did anyone respond to this post?

You can check the mail archive here:

http://mail.scipy.org/pipermail/numpy-discussion


-- 
Francesc Alted



Re: [Numpy-discussion] Why is numpy.abs so much slower on complex64 than complex128 under windows 32-bit?

2012-04-10 Thread Francesc Alted

On 4/10/12 6:44 AM, Henry Gomersall wrote:
Here is the body of a post I made on stackoverflow, but it seems to be 
a non-obvious issue. I was hoping someone here might be able to shed 
light on it...


On my 32-bit Windows Vista machine I notice a significant (5x) 
slowdown when taking the absolute values of a fairly large 
numpy.complex64 array when compared to a numpy.complex128 array.


>>> import numpy
>>> a = numpy.random.randn(256, 2048) + 1j*numpy.random.randn(256, 2048)
>>> b = numpy.complex64(a)
>>> timeit c = numpy.float32(numpy.abs(a))
10 loops, best of 3: 27.5 ms per loop
>>> timeit c = numpy.abs(b)
1 loops, best of 3: 143 ms per loop

Obviously, the outputs in both cases are the same (to operating 
precision).


I do not notice the same effect on my Ubuntu 64-bit machine (indeed, 
as one might expect, the double precision array operation is a bit 
slower).


Is there a rational explanation for this?

Is this something that is common to all windows?



I cannot tell for sure, but it looks like the windows version of NumPy 
is casting complex64 to complex128 internally.  I'm guessing here, but 
numexpr lacks the complex64 type, so it has to internally do the upcast, 
and I'm seeing kind of the same slowdown:


In [6]: timeit numpy.abs(a)
100 loops, best of 3: 10.7 ms per loop

In [7]: timeit numpy.abs(b)
100 loops, best of 3: 8.51 ms per loop

In [8]: timeit numexpr.evaluate("abs(a)")
100 loops, best of 3: 1.67 ms per loop

In [9]: timeit numexpr.evaluate("abs(b)")
100 loops, best of 3: 4.96 ms per loop

In my case I'm seeing only a 3x slowdown, but this is because numexpr is 
not re-casting the outcome to complex64, while windows might be doing 
this.  Just to make sure, can you run this:


In [10]: timeit c = numpy.complex64(numpy.abs(numpy.complex128(b)))
100 loops, best of 3: 12.3 ms per loop

In [11]: timeit c = numpy.abs(b)
100 loops, best of 3: 8.45 ms per loop

in your windows box and see if they raise similar results?

In a related note of confusion, the times above are notably (and 
consistently) different (shorter) from what I get doing a naive `st = 
time.time(); numpy.abs(a); print time.time()-st`. Is this to be expected?




This happens a lot, yes, especially when your code is memory-bottlenecked 
(a very common situation).  The explanation is simple: when your 
datasets are small enough to fit in CPU cache, the first pass of the timing 
loop brings the whole working set into cache, so on the following passes 
the computation does not have to fetch data from main 
memory, and by the time you run the loop 10 times or more, you are 
discarding any memory effect.  However, when you run the loop only once, 
you are including the memory fetch time too (which is often much more 
realistic).
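
An illustrative way to see both effects (just a sketch; no particular figures 
implied):

import time
import timeit
import numpy as np

a = np.random.randn(256, 2048) + 1j*np.random.randn(256, 2048)

# one-shot timing: includes pulling the working set in from main memory
t0 = time.time(); np.abs(a); print(time.time() - t0)

# best-of-repeated timing: after the first pass the data may already sit in
# cache, so the reported per-call time can be noticeably shorter
print(min(timeit.repeat(lambda: np.abs(a), number=10, repeat=3)) / 10)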


--
Francesc Alted



Re: [Numpy-discussion] Why is numpy.abs so much slower on complex64 than complex128 under windows 32-bit?

2012-04-10 Thread Francesc Alted
On 4/10/12 9:55 AM, Henry Gomersall wrote:
> On 10/04/2012 16:36, Francesc Alted wrote:
>> In [10]: timeit c = numpy.complex64(numpy.abs(numpy.complex128(b)))
>> 100 loops, best of 3: 12.3 ms per loop
>>
>> In [11]: timeit c = numpy.abs(b)
>> 100 loops, best of 3: 8.45 ms per loop
>>
>> in your windows box and see if they raise similar results?
>>
> No, the results are somewhat the same as before - ~40ms for the first
> (upcast/downcast) case and ~150ms for the direct case (both *much*
> slower than yours!). This is versus ~28ms for operating directly on
> double precisions.

Okay, so it seems that something is going wrong with the performance 
of pure complex64 abs() for Windows.

>
> I'm using numexpr in the end, but this is slower than numpy.abs under linux.

Oh, you mean the Windows version of abs(complex64) in numexpr is slower 
than a pure numpy.abs(complex64) under Linux?  That's weird, because 
numexpr's implementation of the complex operations is independent of the 
NumPy machinery.  Here is how abs() is implemented in numexpr:

static void
nc_abs(cdouble *x, cdouble *r)
{
 r->real = sqrt(x->real*x->real + x->imag*x->imag);
 r->imag = 0;
}

[as I said, only the double precision version is implemented, so you 
have to add here the cost of the cast too]

Hmm, considering all of these facts, it might be that sqrtf() on windows 
is under-performing?  Can you try this:

In [68]: a = numpy.linspace(0, 1, 1e6)

In [69]: b = numpy.float32(a)

In [70]: timeit c = numpy.sqrt(a)
100 loops, best of 3: 5.64 ms per loop

In [71]: timeit c = numpy.sqrt(b)
100 loops, best of 3: 3.77 ms per loop

and tell us the results for windows?

PD: if you are using numexpr on windows, you may want to use the MKL 
linked version, which uses the abs of MKL, that should have considerably 
better performance.

-- 
Francesc Alted



Re: [Numpy-discussion] Why is numpy.abs so much slower on complex64 than complex128 under windows 32-bit?

2012-04-10 Thread Francesc Alted
On 4/10/12 11:43 AM, Henry Gomersall wrote:
> On 10/04/2012 17:57, Francesc Alted wrote:
>>> I'm using numexpr in the end, but this is slower than numpy.abs under linux.
>> Oh, you mean the windows version of abs(complex64) in numexpr is slower
>> than a pure numpy.abs(complex64) under linux?  That's weird, because
>> numexpr has an independent implementation of the complex operations from
>> NumPy machinery.  Here it is how abs() is implemented in numexpr:
>>
>> static void
>> nc_abs(cdouble *x, cdouble *r)
>> {
>>r->real = sqrt(x->real*x->real + x->imag*x->imag);
>>r->imag = 0;
>> }
>>
>> [as I said, only the double precision version is implemented, so you
>> have to add here the cost of the cast too]
> hmmm, I can't seem to reproduce that assertion, so ignore it.
>
>> Hmm, considering all of these facts, it might be that sqrtf() on windows
>> is under-performing?  Can you try this:
>>
>> In [68]: a = numpy.linspace(0, 1, 1e6)
>>
>> In [69]: b = numpy.float32(a)
>>
>> In [70]: timeit c = numpy.sqrt(a)
>> 100 loops, best of 3: 5.64 ms per loop
>>
>> In [71]: timeit c = numpy.sqrt(b)
>> 100 loops, best of 3: 3.77 ms per loop
>>
>> and tell us the results for windows?
> In [18]: timeit c = numpy.sqrt(a)
> 100 loops, best of 3: 21.4 ms per loop
>
> In [19]: timeit c = numpy.sqrt(b)
> 100 loops, best of 3: 12.5 ms per loop
>
> So, all sensible there it seems.
>
> Taking this to the next stage...
>
> In [95]: a = numpy.random.randn(256,2048) + 1j*numpy.random.randn(256,2048)
>
> In [96]: b = numpy.complex64(a)
>
> In [97]: timeit numpy.sqrt(a*numpy.conj(a))
> 10 loops, best of 3: 61.9 ms per loop
>
> In [98]: timeit numpy.sqrt(b*numpy.conj(b))
> 10 loops, best of 3: 27.2 ms per loop
>
> In [99]: timeit numpy.abs(a)  # for comparison
> 10 loops, best of 3: 30 ms per loop
>
> In [100]: timeit numpy.abs(b)  # and again (slow slow slow)
> 1 loops, best of 3: 153 ms per loop
>
> I can confirm the results are correct. So, it really is in numpy.abs.

Yup, it definitely seems an issue with numpy.abs for complex64 on Windows.  
Could you file a ticket on this, please?

>> PD: if you are using numexpr on windows, you may want to use the MKL
>> linked version, which uses the abs of MKL, that should have considerably
>> better performance.
> I'd love to - I presume this would mean me buying an MKL license? If
> not, where do I find the MKL linked version?

Well, depending on what you do, you may want to use Gohlke's version:

http://www.lfd.uci.edu/~gohlke/pythonlibs/

where some of the packages come with MKL included (in particular 
NumPy/numexpr).

However, after having a look at numexpr sources, I found that the abs() 
version is not using MKL (apparently due to some malfunction of the 
latter; maybe this has been solved already).
So, don't expect a speedup by using MKL in this case.

-- 
Francesc Alted



Re: [Numpy-discussion] sparse array data

2012-05-02 Thread Francesc Alted
, count=N)

In [32]: ct
Out[32]:
ctable((100,), |V16) nbytes: 15.26 MB; cbytes: 1.76 MB; ratio: 8.67
   cparams := cparams(clevel=5, shuffle=True)
[(0, 0, 0.0), (0, 1, 1.0), (1, 0, 4.0), ..., (48, 1, 
9409.0), (49, 0, 9604.0), (49, 1, 9801.0)]

That is saying that the ctable object requires just 1.76 MB (compare 
this with the 8 MB that the equivalent dense NumPy array requires).

One inconvenience of this approach is that it is generally much slower 
than using a dense array representation:

In [30]: time [ sum(row.value for row in ct.where('(dim1==%d)' % (i,))) 
for i in range(2) ]
CPU times: user 1.80 s, sys: 0.00 s, total: 1.81 s
Wall time: 1.81 s
Out[30]: [1.6165056e+17, 1.7674e+17]

In [33]: t = np.fromiter((i*i for i in xrange(N)), 
dtype='f8').reshape(N/2,2)

In [34]: time t.sum(axis=0)
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
Out[34]: array([  1.6167e+17,   1.6667e+17])

Probably, implementing a native sum() operation on top of ctable objects 
would help improve performance here.  Alternatively, you could 
accelerate these operations by using the Table object in PyTables [2] 
and indexing the dimensions, which gives much better speed for 
accessing elements in big sparse arrays.  Using a table in a relational 
database (indexed on the dimensions) could be an option too.

[2] https://github.com/PyTables/PyTables
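
For illustration, a minimal coordinate-format sketch using plain NumPy 
structured arrays (field names and values are made up), showing the two 
operations asked about: an elementwise exp() on the stored values and a sum 
over the last dimension:

import numpy as np

# one row per nonzero element: (dim0, dim1, dim2, value)
dt = np.dtype([('dim0', 'i4'), ('dim1', 'i4'), ('dim2', 'i4'), ('value', 'f8')])
sp = np.array([(0, 0, 0, 1.0), (0, 10, 100, 2.0), (0, 10, 101, 4.0),
               (20, 5, 202, 3.0)], dtype=dt)

# elementwise function, applied to the stored (nonzero) values only
sp['value'] = np.exp(sp['value'] / 7.0)

# sum over the last dimension: group rows that share (dim0, dim1)
key = sp['dim0'].astype(np.int64) * 10**6 + sp['dim1']   # assumes dim1 < 10**6
uniq, inv = np.unique(key, return_inverse=True)
sums = np.bincount(inv, weights=sp['value'])
print(sums)            # one partial sum per distinct (dim0, dim1) pair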

Hope this helps,

-- 
Francesc Alted



Re: [Numpy-discussion] sparse array data

2012-05-02 Thread Francesc Alted
On 5/2/12 4:07 PM, Stéfan van der Walt wrote:
> Hi Francesc
>
> On Wed, May 2, 2012 at 1:53 PM, Francesc Alted  wrote:
>> and add another one for the actual values of the array.  For a 3-D
>> sparse array, this looks like:
>>
>> dim0 | dim1 | dim2 | value
>> ==========================
>>    0 |   0  |   0  | val0
>>    0 |  10  | 100  | val1
>>   20 |   5  | 202  | val2
> What's the distinction between this and a coo_matrix?

Well, as the OP said, coo_matrix does not support dimensions larger than 
2, right?

In [4]: coo_matrix((3,4), dtype=np.int8).todense()
Out[4]:
matrix([[0, 0, 0, 0],
 [0, 0, 0, 0],
 [0, 0, 0, 0]], dtype=int8)


In [5]: coo_matrix((2,3,2), dtype=np.int8).todense()
---
TypeError Traceback (most recent call last)
/Users/faltet/ in ()
> 1 coo_matrix((2,3,2), dtype=np.int8).todense()

/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/scipy/sparse/coo.py
 
in __init__(self, arg1, shape, dtype, copy)
 127 obj, ij = arg1
 128 except:
--> 129 raise TypeError('invalid input format')
 130
 131 try:

TypeError: invalid input format

-- 
Francesc Alted



Re: [Numpy-discussion] sparse array data

2012-05-02 Thread Francesc Alted
On 5/2/12 4:20 PM, Nathaniel Smith wrote:
> On Wed, May 2, 2012 at 9:53 PM, Francesc Alted  wrote:
>> On 5/2/12 11:16 AM, Wolfgang Kerzendorf wrote:
>>> Hi all,
>>>
>>> I'm currently writing a code that needs three dimensional data (for the 
>>> physicists it's dimensions are atom, ion, level). The problem is that not 
>>> all combinations do exist (a sparse array). Sparse matrices in scipy only 
>>> deal with two dimensions. The operations that I need to do on those are 
>>> running functions like exp(item/constant) on all of the items. I also want 
>>> to sum them up in the last dimension. What's the best way to make a class 
>>> that takes this kind of data and does the required operations fast. Maybe 
>>> some phycisists have implemented these things already. Any thoughts?
>> Curiously enough, I have recently been discussing with Travis O. about
>> how to represent sparse matrices with complete generality.  One of the
>> possibilities is to use what Travis call "synthetic dimensions".  The
>> idea behind it is easy: use a table with as many columns as dimensions,
>> and add another one for the actual values of the array.  For a 3-D
>> sparse array, this looks like:
>>
>> dim0 | dim1 | dim2 | value
>> ==========================
>>    0 |   0  |   0  | val0
>>    0 |  10  | 100  | val1
>>   20 |   5  | 202  | val2
> This coordinate format is also what's used by the MATLAB Tensor
> Toolbox. They have a paper justifying this choice and describing some
> tricks for how to work with them:
>http://epubs.siam.org/sisc/resource/1/sjoce3/v30/i1/p205_s1
> (Spoiler: you use a lot of sort operations. Conveniently, timsort
> appears to be perfectly adapted for their algorithmic requirements.)

Uh, I do not have access to the article, but yes, sorting makes a lot of 
sense for these scenarios (this is why I was suggesting using indexes, 
which is sort of sorting too).

>
> I'm not sure why one would make up a new term like "synthetic
> dimensions" though, it's just standard coordinate format...

Yeah, this seems a bit weird.  Perhaps Travis was referring to other 
things and I mixed concepts.  Anyways.

-- 
Francesc Alted



Re: [Numpy-discussion] sparse array data

2012-05-02 Thread Francesc Alted
On 5/2/12 5:28 PM, Stéfan van der Walt wrote:
> On Wed, May 2, 2012 at 3:20 PM, Francesc Alted  wrote:
>> On 5/2/12 4:07 PM, Stéfan van der Walt wrote:
>> Well, as the OP said, coo_matrix does not support dimensions larger than
>> 2, right?
> That's just an implementation detail, I would imagine--I'm trying to
> figure out if there is a new principle behind "synthetic dimensions"?

No, no new things under the sun.  We were just talking about many 
different things and this suddenly appeared in our talk.  Nothing really 
serious.

> By the way, David Cournapeau mentioned using b-trees for sparse ops a
> while ago; did you ever talk to him about those ideas?

Yup, the b-tree idea fits very well for indexing the coordinates.  
Although one problem with b-trees is that they do not compress well in 
general.

-- 
Francesc Alted



Re: [Numpy-discussion] question about in-place operations

2012-05-22 Thread Francesc Alted
On 5/22/12 8:47 PM, Dag Sverre Seljebotn wrote:
> On 05/22/2012 04:54 PM, Massimo DiPierro wrote:
>> For now I will be doing this:
>>
>> import numpy
>> import time
>>
>> a=numpy.zeros(200)
>> b=numpy.zeros(200)
>> c=1.0
>>
>> # naive solution
>> t0 = time.time()
>> for i in xrange(len(a)):
>>   a[i] += c*b[i]
>> print time.time()-t0
>>
>> # possible solution
>> n=10
>> t0 = time.time()
>> for i in xrange(0,len(a),n):
>>   a[i:i+n] += c*b[i:i+n]
>> print time.time()-t0
>>
>> the second "possible" solution appears 1000x faster then the former in my 
>> tests and uses little extra memory. It is only 2x slower than b*=c.
>>
>> Any reason not to do it?
> No, this is perfectly fine, you just manually did what numexpr does.

Yeah.  You basically re-discovered the blocking technique.  For a more 
general example on how to apply the blocking technique with NumPy see 
the section "CPU vs Memory Benchmark" in:

https://python.g-node.org/python-autumnschool-2010/materials/starving_cpus

Of course numexpr has less overhead (and can use multiple cores) than 
using plain NumPy.
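
For comparison, a sketch of the same in-place update done through numexpr, 
which does the chunking internally (and can use several threads); the sizes 
are arbitrary:

import numpy as np
import numexpr as ne

a = np.zeros(2 * 1000 * 1000)
b = np.ones(2 * 1000 * 1000)
c = 1.0

# evaluate "a + c*b" block by block and write the result straight into `a`,
# so no full-size temporary is created
ne.evaluate("a + c*b", out=a)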

-- 
Francesc Alted



Re: [Numpy-discussion] question about in-place operations

2012-05-24 Thread Francesc Alted
On 5/22/12 9:08 PM, Massimo Di Pierro wrote:
> This problem is linear so probably Ram IO bound. I do not think I
> would benefit much for multiple cores. But I will give it a try. In
> the short term this is good enough for me.

Yeah, this is what common sense seems to indicate: that RAM I/O bound 
problems do not benefit from using multiple cores.  But reality is 
different:

 >>> import numpy as np
 >>> a = np.arange(1e8)
 >>> c = 1.0
 >>> time a*c
CPU times: user 0.22 s, sys: 0.20 s, total: 0.43 s
Wall time: 0.43 s
array([  0.e+00,   1.e+00,   2.e+00, ...,
  9.9970e+07,   9.9980e+07,   9.9990e+07])

Using numexpr with 1 thread:

 >>> import numexpr as ne
 >>> ne.set_num_threads(1)
8
 >>> time ne.evaluate("a*c")
CPU times: user 0.20 s, sys: 0.25 s, total: 0.45 s
Wall time: 0.45 s
array([  0.00000000e+00,   1.00000000e+00,   2.00000000e+00, ...,
         9.99999970e+07,   9.99999980e+07,   9.99999990e+07])

while using 8 threads (the machine has 8 physical cores):

 >>> ne.set_num_threads(8)
1
 >>> time ne.evaluate("a*c")
CPU times: user 0.39 s, sys: 0.68 s, total: 1.07 s
Wall time: 0.14 s
array([  0.00000000e+00,   1.00000000e+00,   2.00000000e+00, ...,
         9.99999970e+07,   9.99999980e+07,   9.99999990e+07])

which is 3x faster than using 1 single thread (look at wall time figures).

To be clear, this is purely due to the fact that several cores can move 
data between memory and the CPU faster than one single core can.  I have 
seen this behavior lots of times; for example, in 
slide 21 of this presentation:

http://pydata.org/pycon2012/numexpr-cython/slides.pdf

one can see how using several cores can speed up not only a polynomial 
computation, but also the simple expression "y = x", which is 
essentially a memory copy.

Another example where this effect can be seen is the Blosc compressor.  
For example, in:

http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks

the first point on each of the plots means that Blosc is at compression 
level 0, that is, it does not compress at all and basically copies 
data from the origin to the destination buffer.  Still, one can see that 
using several threads can accelerate this copy well beyond memcpy speed.

So, definitely, several cores can make your memory I/O bounded 
computations go faster.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-14 Thread Francesc Alted
On 6/13/12 8:12 PM, Nathaniel Smith wrote:
>>> I'm also worried that I still don't see any signs that you're working
>>> with the downstream libraries that this functionality is intended to
>>> be useful for, like the various HDF5 libraries and pandas. I really
>>> don't think this functionality can be merged to numpy until we have
>>> affirmative statements from those developers that they are excited
>>> about it and will use it, and since they're busy people, it's pretty
>>> much your job to track them down and make sure that your code will
>>> solve their problems.
>> Francesc is certainly aware of this work, and I emailed Wes earlier this
>> week, I probably should have mentioned that, though. Hopefully they will
>> have time to contribute their thoughts. I also imagine Travis can speak
>> on behalf of the users he has interacted with over the last several
>> years that have requested this feature that don't happen to follow
>> mailing lists.
> I'm glad Francesc and Wes are aware of the work, but my point was that
> that isn't enough. So if I were in your position and hoping to get
> this code merged, I'd be trying to figure out how to get them more
> actively on board?

Sorry to chime in late.  Yes, I am aware of the improvements that Bryan 
(and Mark) are proposing.  My position here is that I'm very open to 
this (at least from a functional point of view; I have to admit that 
I have not had a look at the code yet).

The current situation for the HDF5 wrappers (at least the PyTables ones) is 
that, due to the lack of support for enums in NumPy itself, we had to 
come up with a specific solution for this.  Our approach was pretty simple: 
basically providing an exhaustive set or list of possible, named values 
mapped to different integers.  And although I'm not familiar with the 
implementation details (it was Ivan Vilata who implemented this part), I 
think we used an internal dictionary for doing the translation when 
PyTables presents the enums to the user.
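
Something along these lines (just a rough sketch, not the actual PyTables 
code, and the names are made up):

import numpy as np

# exhaustive set of named values backed by plain integers
colors = {'red': 0, 'green': 1, 'blue': 2}
names = dict((v, k) for k, v in colors.items())  # reverse mapping

# the column itself is stored as a small integer type
data = np.array([colors['red'], colors['blue'], colors['red']], dtype='i1')

# the translation happens only when presenting values to the user
print([names[i] for i in data])  # ['red', 'blue', 'red']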

Bryan is implementing much more complete (and probably more efficient) 
support for enums in NumPy.  As this is new functionality, and PyTables 
does not rely on it, there is no immediate danger (i.e. no backward 
incompatibility) in introducing the new enums in NumPy.  But they could 
be used by future PyTables versions (and other HDF5 wrappers), which is 
a good thing indeed.

My 2 cents,

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] SSE Optimization

2012-07-10 Thread Francesc Alted
On 7/10/12 5:07 PM, Fode wrote:
> I am interested in adding SSE optimizations to numpy, where should I 
> start?

Well, to my knowledge there is not much open source code (Intel MKL and 
AMD ACML do not count here) that uses SSE, but a good 
start could be:

http://gruntthepeon.free.fr/ssemath/

I'd say that NumPy could benefit a lot from integrating optimized versions 
of the transcendental functions (like the ones in the link above).

Good luck!

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] [ANN] carray 0.5 released

2012-08-21 Thread Francesc Alted
Announcing carray 0.5
=

What's new
--

carray 0.5 supports completely transparent storage on-disk in addition
to memory.  That means that everything that can be done with an
in-memory container can be done using the disk instead.

The advantage of a disk-based container is that your addressable space
is much larger than just your available memory.  Also, because carray
uses a chunked and compressed data layout built on the super-fast
Blosc compression library, and thanks to the different cache levels in
both modern operating systems and the internal carray machinery, the
data access speed is very good.

The format chosen for the persistence layer is based on the 'bloscpack'
library (thanks to Valentin Haenel for his inspiration) and described in
'persistence.rst', although not everything has been implemented yet.
You may want to contribute by proposing enhancements to it.  See:
https://github.com/FrancescAlted/carray/wiki/PersistenceProposal

CAVEAT: The bloscpack format is still evolving, so don't rely on
forward compatibility of the format, at least until 1.0, when the
internal format will be declared frozen.

For more detailed info, see the release notes in:
https://github.com/FrancescAlted/carray/wiki/Release-0.5


What it is
--

carray is a chunked container for numerical data.  Chunking allows for
efficient enlarging/shrinking of the data container.  In addition, it can
also be compressed for reducing memory/disk needs.  The compression
process is carried out internally by Blosc, a high-performance
compressor that is optimized for binary data.

carray can use numexpr internally so as to accelerate many vector and
query operations (although it can use pure NumPy for doing so too).
numexpr can optimize the memory usage and use several cores for
doing the computations, so it is blazing fast.  Moreover, with the
introduction of a carray/ctable disk-based container (in version 0.5),
it can be used for seamlessly performing out-of-core computations.
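
A quick sketch of the intended usage (the `rootdir` keyword for the
disk-based container is my reading of the persistence proposal above;
please check the manual for the exact spelling):

import numpy as np
import carray as ca

a = np.arange(1e7)

cmem = ca.carray(a)                       # compressed, in-memory container
print(cmem)                               # repr shows sizes and compression ratio

cdisk = ca.carray(a, rootdir='a.carray')  # same data, persisted on disk (new in 0.5)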

carray comes with an exhaustive test suite and fully supports both
32-bit and 64-bit platforms.  Also, it is typically tested on both UNIX
and Windows operating systems.

Resources
-

Visit the main carray site repository at:
http://github.com/FrancescAlted/carray

You can download a source package from:
http://carray.pytables.org/download

Manual:
http://carray.pytables.org/docs/manual

Home of Blosc compressor:
http://blosc.pytables.org

User's mail list:
car...@googlegroups.com
http://groups.google.com/group/carray



Enjoy!

-- Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: python-blosc 1.0.4 released

2012-09-14 Thread Francesc Alted
=
Announcing python-blosc 1.0.4
=

What is it?
===

A Python wrapper for the Blosc compression library.

Blosc (http://blosc.pytables.org) is a high performance compressor
optimized for binary data.  It has been designed to transmit data to
the processor cache faster than the traditional, non-compressed,
direct memory fetch approach via a memcpy() OS call.

Blosc works well for compressing numerical arrays that contain data
with relatively low entropy, like sparse data, time series, grids with
regular-spaced values, etc.

python-blosc is a Python package that wraps it.
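
A minimal usage sketch (argument names are from my recollection of the
API; see the guide below for details):

import numpy as np
import blosc

a = np.linspace(0, 100, int(1e7))

packed = blosc.compress(a.tostring(), typesize=a.dtype.itemsize)
print(len(packed), a.nbytes)   # compressed size vs. original size

b = np.fromstring(blosc.decompress(packed), dtype=a.dtype)
assert (a == b).all()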

What is new?


Optimized the amount of data copied during compression (using
_PyBytes_Resize() now instead of previous PyBytes_FromStringAndSize()).

This leads to improvements in compression speed ranging from 1.2x for
highly compressible chunks up to 7x for mostly uncompressible data.
Thanks to Valentin Haenel for this nice contribution.

For more info, you can see the release notes in:

https://github.com/FrancescAlted/python-blosc/wiki/Release-notes

More docs and examples are available in the Quick User's Guide wiki page:

https://github.com/FrancescAlted/python-blosc/wiki/Quick-User's-Guide

Download sources


Go to:

http://github.com/FrancescAlted/python-blosc

and download the most recent release from there.

Blosc is distributed using the MIT license, see LICENSES/BLOSC.txt for
details.

Mailing list


There is an official mailing list for Blosc at:

bl...@googlegroups.com
http://groups.google.es/group/blosc


-- Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] [ANN] python-blosc 1.0.5 released

2012-09-16 Thread Francesc Alted
=
Announcing python-blosc 1.0.5
=

What is it?
===

A Python wrapper for the Blosc compression library.

Blosc (http://blosc.pytables.org) is a high performance compressor
optimized for binary data.  It has been designed to transmit data to
the processor cache faster than the traditional, non-compressed,
direct memory fetch approach via a memcpy() OS call.

Blosc works well for compressing numerical arrays that contain data
with relatively low entropy, like sparse data, time series, grids with
regular-spaced values, etc.

python-blosc is a Python package that wraps it.

What is new?


- Upgraded to latest Blosc 1.1.4.

- Better handling of error conditions, and improved memory releasing in
   case of errors (thanks to Valentin Haenel and Han Genuit).

- Better handling of types (should compile without warning now, at least
   with GCC).

For more info, you can see the release notes in:

https://github.com/FrancescAlted/python-blosc/wiki/Release-notes

More docs and examples are available in the Quick User's Guide wiki page:

https://github.com/FrancescAlted/python-blosc/wiki/Quick-User's-Guide

Download sources


Go to:

http://github.com/FrancescAlted/python-blosc

and download the most recent release from there.

Blosc is distributed using the MIT license, see LICENSES/BLOSC.txt for
details.

Mailing list


There is an official mailing list for Blosc at:

bl...@googlegroups.com
http://groups.google.es/group/blosc

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Francesc Alted
On 11/7/12 8:41 PM, Neal Becker wrote:
> Would you expect numexpr without MKL to give a significant boost?

Yes.  Have a look at how numexpr's own multi-threaded virtual machine 
compares with numexpr using VML:

http://code.google.com/p/numexpr/wiki/NumexprVML

As can be seen, the best results are obtained by using the 
multi-threaded VM in numexpr in combination with a single-threaded VML 
engine.  Caution: I did these benchmarks some time ago (a couple of 
years?), so multi-threaded VML may well have improved by now.  If 
performance is critical, some experiments should be done first so as to 
find the optimal configuration.
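
For the record, these are the knobs involved (a minimal sketch; it assumes
a numexpr built with VML support, otherwise set_vml_num_threads() is not
available):

import numpy as np
import numexpr as ne

a = np.linspace(0, 1, int(1e7))

ne.set_num_threads(8)        # numexpr's own multi-threaded VM
ne.set_vml_num_threads(1)    # keep the VML engine single-threaded

b = ne.evaluate("exp(a)")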

At any rate, VML will let you optimally leverage the SIMD 
instructions in the cores, allowing you to compute, for example, exp() in 1 
or 2 clock cycles per element (depending on the vector length, the number 
of cores in your system and the data precision):

http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/exp.html

Pretty amazing.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Francesc Alted
On 11/8/12 12:35 AM, Chris Barker wrote:
> On Wed, Nov 7, 2012 at 11:41 AM, Neal Becker  wrote:
>> Would you expect numexpr without MKL to give a significant boost?
> It can, depending on the use case:
>   -- It can remove a lot of uneccessary temporary creation.
>   -- IIUC, it works on blocks of data at a time, and thus can keep
> things in cache more when working with large data sets.

Well, the temporaries are still created, but the thing is that, by 
working with small blocks at a time, these temporaries fit in the CPU cache, 
preventing copies into main memory.  I like to call this the 'blocking 
technique', as explained in slide 26 (and following) of:

https://python.g-node.org/wiki/_media/starving_cpu/starving-cpu.pdf

A better technique is to reduce the block size to its minimal expression 
(1 element), so temporaries are stored in CPU registers instead of 
small blocks in cache, hence avoiding copies even in *cache*.  Numba 
(https://github.com/numba/numba) follows this approach, which is pretty 
optimal, as can be seen in slide 37 of the lecture above.

>-- It can (optionally) use multiple threads for easy parallelization.

No, the *total* number of cores detected in the system is the default in 
numexpr; if you want fewer, you will need to use the 
set_num_threads(nthreads) function.  But agreed, sometimes using too 
many threads can effectively be counterproductive.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numexpr question

2012-11-08 Thread Francesc Alted
On 11/8/12 1:37 PM, Neal Becker wrote:
> I'm interested in trying numexpr, but have a question (not sure where's the 
> best
> forum to ask).
>
> The examples I see use
>
> ne.evaluate ("some string...")
>
> When used within a loop, I would expect the compilation from the string form 
> to
> add significant overhead.  I would have thought a pre-compiled form would be
> available, similar to a precompiled regexp.  yes/no?

numexpr comes with an internal cache for recent expressions, so if 
ne.evaluate() is in a loop, the compiled expression will be re-used 
without problems.  So you don't have to worry about caching it yourself.
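
A tiny illustration (sizes made up): the string below is compiled on the
first pass and then fetched from the cache on the following iterations:

import numpy as np
import numexpr as ne

a = np.linspace(0, 1, int(1e6))
out = np.empty_like(a)

for scale in (1.0, 2.0, 3.0):
    # same expression string each time -> compiled once, re-used afterwards
    ne.evaluate("scale * a + 1", out=out)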

The best forum for discussing numexpr is this:

https://groups.google.com/forum/?fromgroups#!forum/numexpr

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Francesc Alted
On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
> On 11/07/2012 08:41 PM, Neal Becker wrote:
>> Would you expect numexpr without MKL to give a significant boost?
> If you need higher performance than what numexpr can give without using
> MKL, you could look at code such as this:
>
> https://github.com/herumi/fmath/blob/master/fmath.hpp#L480

Hey, that's cool.  I was a bit disappointed not to have found this sort of 
work out in the open before.  It seems that this lacks threading support, but 
that should be easy to implement by using OpenMP directives.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Francesc Alted
On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
> On 11/08/2012 06:06 PM, Francesc Alted wrote:
>> On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
>>> On 11/07/2012 08:41 PM, Neal Becker wrote:
>>>> Would you expect numexpr without MKL to give a significant boost?
>>> If you need higher performance than what numexpr can give without using
>>> MKL, you could look at code such as this:
>>>
>>> https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
>> Hey, that's cool.  I was a bit disappointed not finding this sort of
>> work in open space.  It seems that this lacks threading support, but
>> that should be easy to implement by using OpenMP directives.
> IMO this is the wrong place to introduce threading; each thread should
> call expd_v on its chunks. (Which I think is how you said numexpr
> currently uses VML anyway.)

Oh sure, but then you need a blocked engine for performing the 
computations too.  And yes, by default numexpr uses its own threading 
code rather than the existing one in VML (but that can be changed by 
playing with set_num_threads/set_vml_num_threads).  It always struck 
me as a little strange that the internal threading in numexpr was more 
efficient than the VML one, but I suppose this is because the latter is more 
optimized to deal with large blocks rather than the medium-sized (4K) ones 
used in numexpr.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Francesc Alted
On 11/8/12 7:55 PM, Dag Sverre Seljebotn wrote:
> On 11/08/2012 06:59 PM, Francesc Alted wrote:
>> On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
>>> On 11/08/2012 06:06 PM, Francesc Alted wrote:
>>>> On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
>>>>> On 11/07/2012 08:41 PM, Neal Becker wrote:
>>>>>> Would you expect numexpr without MKL to give a significant boost?
>>>>> If you need higher performance than what numexpr can give without using
>>>>> MKL, you could look at code such as this:
>>>>>
>>>>> https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
>>>> Hey, that's cool.  I was a bit disappointed not finding this sort of
>>>> work in open space.  It seems that this lacks threading support, but
>>>> that should be easy to implement by using OpenMP directives.
>>> IMO this is the wrong place to introduce threading; each thread should
>>> call expd_v on its chunks. (Which I think is how you said numexpr
>>> currently uses VML anyway.)
>> Oh sure, but then you need a blocked engine for performing the
>> computations too.  And yes, by default numexpr uses its own threading
> I just meant that you can use a chunked OpenMP for-loop wherever in your
> code that you call expd_v. A "five-line blocked engine", if you like :-)
>
> IMO that's the right location since entering/exiting OpenMP blocks takes
> some time.

Yes, I meant precisely this first hand.

>> code rather than the existing one in VML (but that can be changed by
>> playing with set_num_threads/set_vml_num_threads).  It always stroked to
>> me as a little strange that the internal threading in numexpr was more
>> efficient than VML one, but I suppose this is because the latter is more
>> optimized to deal with large blocks instead of those of medium size (4K)
>> in numexpr.
> I don't know enough about numexpr to understand this :-)
>
> I guess I just don't see the motivation to use VML threading or why it
> should be faster? If you pass a single 4K block to a threaded VML call
> then I could easily see lots of performance problems: a)
> starting/stopping threads or signalling the threads of a pool is a
> constant overhead per "parallel section", b) unless you're very careful
> to only have VML touch the data, and VML always schedules elements in
> the exact same way, you're going to have the cache lines of that 4K
> block shuffled between L1 caches of different cores for different
> operations...
>
> As I said, I'm mostly ignorant about how numexpr works, that's probably
> showing :-)

No, on the contrary, you rather hit the core of the issue (or part of 
it).  On the one hand, VML needs large blocks in order to maximize the 
performance of the pipeline, and on the other hand numexpr tries to 
minimize the block size in order to make temporaries as small as possible 
(so avoiding the use of the higher-level caches).  From this tension 
(and some benchmarking work) the size of 4K (btw, this is the number of 
*elements*, so the size is actually either 16 KB or 32 KB for single 
and double precision respectively) was derived.  Incidentally, for 
numexpr with no VML support, the size is reduced to 1K elements (and 
perhaps it could be reduced a bit more, but anyways).

Anyway, this is way too low level to be discussed here, although we can 
continue on the numexpr list if you are interested in more details.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy's policy for releasing memory

2012-11-13 Thread Francesc Alted
On 11/13/12 10:27 AM, Austin Bingham wrote:
> OK, if numpy is just subject to Python's behavior then what I'm seeing 
> must be due to the vagaries of Python. I've noticed that things like 
> removing a particular line of code or reordering seemingly unrelated 
> calls (unrelated to the memory issue, that is) can affect when memory 
> is reported as free. I'll just assume that everything is in order and 
> carry on. Thanks!

Profiling memory can be tricky because freed memory may not be 
returned to the operating system *immediately*, and that might mislead you in 
some situations.  So do not expect memory profilers to be too 
exact, and rather focus on the big picture (i.e. is my app holding a 
lot of memory for a long amount of time?  If yes, then start worrying, 
but not before).

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Crash using "reshape"...

2012-11-21 Thread Francesc Alted
On 11/21/12 10:12 AM, Terry J. Ligocki wrote:
> I am having a problem with "reshape" crashing:
>
> > python
> Python 2.6.4 (r264:75706, Jan 16 2010, 21:11:47)
> [GCC 4.3.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy
> >>> numpy.version.version
> '1.6.2'
> >>> npData = numpy.ones([701,701,7899],dtype=numpy.dtype('b'))
> >>> npDataSubset = npData[[slice(0,700),slice(0,700),slice(0,5000)]]
> >>> npDataOutput = npDataSubset.reshape([700*700*5000],order='F')
> Segmentation fault
>
> If I change the "5000" to a "4000", everything is fine.  I'm not 
> running out of memory - my system had 48 GB of memory and nothing else 
> is using a significant portion of this memory.
>
> Note:  700x700x4000 = 1,960,000,000 < 2^31 and 700x700x5000 = 
> 2,450,000,000 > 2^31.  I suspect somewhere in the underlying code there 
> is a signed 32-bit integer being used for an index/pointer offset 
> (this is running on a 64-bit machine).

Yes, looks like a 32-bit issue.  Sometimes you can have 32-bit software 
installed in 64-bit machines, so that might be your problem.  What's the 
equivalent of numpy.intp in your machine?  Mine is:

In []: import numpy as np

In []: np.intp
Out[]: numpy.int64

If you see 'numpy.int32' here then that is the problem.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Crash using "reshape"...

2012-11-21 Thread Francesc Alted
On 11/21/12 11:25 AM, Terry J. Ligocki wrote:
> I just checked, "numpy.intp" is "<type 'numpy.int64'>" in my 
> installation of Python and NumPy.  It was a good thing to check but it 
> looks like there still may be a signed 32-bit integer somewhere in 
> the code (or my build(s))...

Okay.  I can reproduce that too (using 1.6.1).  Could you please file a 
ticket for this?  Smells like a bug to me.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] the difference between "+" and np.add?

2012-11-22 Thread Francesc Alted
On 11/22/12 1:41 PM, Chao YUE wrote:
> Dear all,
>
> if I have two ndarray arr1 and arr2 (with the same shape), is there 
> some difference when I do:
>
> arr = arr1 + arr2
>
> and
>
> arr = np.add(arr1, arr2),
>
> and then if I have more than 2 arrays: arr1, arr2, arr3, arr4, arr5, 
> then I cannot use np.add anymore as it only recieves 2 arguments.
> then what's the best practice to add these arrays? should I do
>
> arr = arr1 + arr2 + arr3 + arr4 + arr5
>
> or I do
>
> arr = np.sum(np.array([arr1, arr2, arr3, arr4, arr5]), axis=0)?
>
> because I just noticed recently that there are functions like np.add, 
> np.divide, np.substract... before I am using all like directly 
> arr1/arr2, rather than np.divide(arr1,arr2).

As Nathaniel said, there is no difference in terms of *what* is 
computed.  However, the methods that you suggested actually differ in 
*how* they are computed, and that has dramatic effects on the time 
used.  For example:

In []: arr1, arr2, arr3, arr4, arr5 = [np.arange(1e7) for x in range(5)]

In []: %time arr1 + arr2 + arr3 + arr4 + arr5
CPU times: user 0.05 s, sys: 0.10 s, total: 0.14 s
Wall time: 0.15 s
Out[]:
array([  0.00000000e+00,   5.00000000e+00,   1.00000000e+01, ...,
         4.99999850e+07,   4.99999900e+07,   4.99999950e+07])

In []: %time np.sum(np.array([arr1, arr2, arr3, arr4, arr5]), axis=0)
CPU times: user 2.98 s, sys: 0.15 s, total: 3.13 s
Wall time: 3.14 s
Out[]:
array([  0.00000000e+00,   5.00000000e+00,   1.00000000e+01, ...,
         4.99999850e+07,   4.99999900e+07,   4.99999950e+07])

The difference is how memory is used.  In the first case, the additional 
memory is just a temporary with the size of the operands, while in the 
second case a big temporary (the stacked array) has to be created, so the 
difference in speed is pretty large.

There are also ways to minimize the size of temporaries, and numexpr is 
one of the simplests:

In []: import numexpr as ne

In []: %time ne.evaluate('arr1 + arr2 + arr3 + arr4 + arr5')
CPU times: user 0.04 s, sys: 0.04 s, total: 0.08 s
Wall time: 0.04 s
Out[]:
array([  0.00000000e+00,   5.00000000e+00,   1.00000000e+01, ...,
         4.99999850e+07,   4.99999900e+07,   4.99999950e+07])

Again, the computations are the same, but how you manage memory is critical.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] the difference between "+" and np.add?

2012-11-28 Thread Francesc Alted
On 11/23/12 8:00 PM, Chris Barker - NOAA Federal wrote:
> On Thu, Nov 22, 2012 at 6:20 AM, Francesc Alted  wrote:
>> As Nathaniel said, there is not a difference in terms of *what* is
>> computed.  However, the methods that you suggested actually differ on
>> *how* they are computed, and that has dramatic effects on the time
>> used.  For example:
>>
>> In []: arr1, arr2, arr3, arr4, arr5 = [np.arange(1e7) for x in range(5)]
>>
>> In []: %time arr1 + arr2 + arr3 + arr4 + arr5
>> CPU times: user 0.05 s, sys: 0.10 s, total: 0.14 s
>> Wall time: 0.15 s
>> There are also ways to minimize the size of temporaries, and numexpr is
>> one of the simplests:
> but you can also use np.add (and friends) to reduce the number of
> temporaries. It can make a difference:
>
> In [11]: def add_5_arrays(arr1, arr2, arr3, arr4, arr5):
> : result = arr1 + arr2
> : np.add(result, arr3, out=result)
> : np.add(result, arr4, out=result)
> : np.add(result, arr5, out=result)
>
> In [13]: timeit arr1 + arr2 + arr3 + arr4 + arr5
> 1 loops, best of 3: 528 ms per loop
>
> In [17]: timeit add_5_arrays(arr1, arr2, arr3, arr4, arr5)
> 1 loops, best of 3: 293 ms per loop
>
> (don't have numexpr on this machine for a comparison)

Yes, you are right.  However, numexpr still can beat this:

In [8]: timeit arr1 + arr2 + arr3 + arr4 + arr5
10 loops, best of 3: 138 ms per loop

In [9]: timeit add_5_arrays(arr1, arr2, arr3, arr4, arr5)
10 loops, best of 3: 74.3 ms per loop

In [10]: timeit ne.evaluate("arr1 + arr2 + arr3 + arr4 + arr5")
10 loops, best of 3: 20.8 ms per loop

The reason is that numexpr is multithreaded (using 6 cores above), and 
for memory-bounded problems like this one, fetching data in different 
threads is more efficient than using a single thread:

In [12]: timeit arr1.copy()
10 loops, best of 3: 41 ms per loop

In [13]: ne.set_num_threads(1)
Out[13]: 6

In [14]: timeit ne.evaluate("arr1")
10 loops, best of 3: 30.7 ms per loop

In [15]: ne.set_num_threads(6)
Out[15]: 1

In [16]: timeit ne.evaluate("arr1")
100 loops, best of 3: 13.4 ms per loop

I.e., the joy of multi-threading is that it not only buys you CPU speed, 
but can also bring your data from memory faster.  So yeah, modern 
applications *do* need multi-threading for getting good performance.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Conditional update of recarray field

2012-11-28 Thread Francesc Alted
On 11/28/12 1:47 PM, Bartosz wrote:
> Hi,
>
> I try to update values in a single field of numpy record array based on
> a condition defined in another array. I found that that the result
> depends on the order in which I apply the boolean indices/field names.
>
> For example:
>
> cond = np.zeros(5, dtype=np.bool)
> cond[2:] = True
> X = np.rec.fromarrays([np.arange(5)], names='a')
> X[cond]['a'] = -1
> print X
>
> returns:  [(0,) (1,) (2,) (3,) (4,)] (the values were not updated)
>
> X['a'][cond] = -1
> print X
>
> returns: [(0,) (1,) (-1,) (-1,) (-1,)] (it worked this time).
>
> I find this behaviour very confusing. Is it expected?

Yes, it is.  In the first idiom, X[cond] is a fancy indexing operation 
and the result is not a view, so what you are doing is basically 
modifying the temporary object that results from the indexing.  In the 
second idiom, X['a'] returns a *view* of the original object, which 
is why it works.
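
A quick way to check it (using essentially the arrays from your message;
np.may_share_memory() tells whether an intermediate shares data with X):

import numpy as np

cond = np.zeros(5, dtype=bool)
cond[2:] = True
X = np.rec.fromarrays([np.arange(5)], names='a')

print(np.may_share_memory(X[cond], X))  # False: fancy indexing returns a copy
print(np.may_share_memory(X['a'], X))   # True: field access returns a view

X[cond]['a'] = -1   # writes into the temporary copy; X is unchanged
X['a'][cond] = -1   # writes through the view; X is updated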

>   Would it be
> possible to emit a warning message in the case of "faulty" assignments?

The only solution that I can see for this would be for fancy indexing 
to return a view, and not a different object, but NumPy containers 
are not prepared for this.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Conditional update of recarray field

2012-11-28 Thread Francesc Alted
Hey Bartosz,

On 11/28/12 3:26 PM, Bartosz wrote:
> Thanks for answer, Francesc.
>
> I understand now that fancy indexing returns a copy of a recarray. Is
> it also true for standard ndarrays? If so, I do not understand why
> X['a'][cond]=-1 should work.

Yes, that's a good question.  No, in this case the boolean array `cond` 
is passed to the __setitem__() of the original view, which is why this 
works.  The first idiom chains the fancy indexing with another 
indexing operation, and NumPy needs to create a temporary for executing 
this, so the second indexing operation acts on a copy, not a view.

And yes, fancy indexing returning a copy is standard for all ndarrays.

Hope it is clearer now (although admittedly it is a bit strange at first 
sight),

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Byte aligned arrays

2012-12-19 Thread Francesc Alted
On 12/19/12 5:47 PM, Henry Gomersall wrote:
> On Wed, 2012-12-19 at 15:57 +, Nathaniel Smith wrote:
>> Not sure which interface is more useful to users. On the one hand,
>> using funny dtypes makes regular non-SIMD access more cumbersome, and
>> it forces your array size to be a multiple of the SIMD word size,
>> which might be inconvenient if your code is smart enough to handle
>> arbitrary-sized arrays with partial SIMD acceleration (i.e., using
>> SIMD for most of the array, and then a slow path to handle any partial
>> word at the end). OTOH, if your code *is* that smart, you should
>> probably just make it smart enough to handle a partial word at the
>> beginning as well and then you won't need any special alignment in the
>> first place, and representing each SIMD word as a single numpy scalar
>> is an intuitively appealing model of how SIMD works. OTOOH, just
>> adding a single argument np.array() is a much simpler to explain than
>> some elaborate scheme involving the creation of special custom dtypes.
> If it helps, my use-case is in wrapping the FFTW library. This _is_
> smart enough to deal with unaligned arrays, but it just results in a
> performance penalty. In the case of an FFT, there are clearly going to
> be issues with the powers of two indices in the array not lying on a
> suitable n-byte boundary (which would be the case with a misaligned
> array), but I imagine it's not unique.
>
> The other point is that it's easy to create a suitable power of two
> array that should always bypass any special case unaligned code (e.g.
> with floats, any multiple of 4 array length will fill every 16-byte
> word).
>
> Finally, I think there is significant value in auto-aligning the array
> based on an appropriate inspection of the cpu capabilities (or
> alternatively, a function that reports back the appropriate SIMD
> alignment). Again, this makes it easier to wrap libraries that may
> function with any alignment, but benefit from optimum alignment.

Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on 
the systems I have tried (Linux and Mac OSX):

In []: np.empty(1).data
Out[]: <read-write buffer for 0x..., size 8, offset 0 at 0x...0>

In []: np.empty(1).data
Out[]: <read-write buffer for 0x..., size 8, offset 0 at 0x...0>

In []: np.empty(1).data
Out[]: <read-write buffer for 0x..., size 8, offset 0 at 0x...0>

In []: np.empty(1).data
Out[]: <read-write buffer for 0x..., size 8, offset 0 at 0x...0>

[Check that the last digit in the addresses above is always 0]

The only scenario I see where this would create unaligned arrays is 
for machines having AVX.  But given that the Intel architecture is 
making great strides in fetching unaligned data, I'd be surprised if 
the difference in performance were even noticeable.

Can you tell us what difference in performance you are seeing between an 
AVX-aligned array and one that is not AVX-aligned?  Just curious.
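
For what it is worth, here is a minimal sketch (not part of NumPy, just 
the usual over-allocate-and-offset trick) for getting data aligned to an 
arbitrary boundary:

import numpy as np

def aligned_empty(shape, dtype, alignment=32):
    """Return an empty array whose data pointer is aligned to `alignment` bytes."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = np.empty(nbytes + alignment, dtype=np.uint8)  # over-allocate
    offset = (-buf.ctypes.data) % alignment             # bytes to next boundary
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

a = aligned_empty((1000, 1000), "f8", alignment=32)
assert a.ctypes.data % 32 == 0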

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Byte aligned arrays

2012-12-20 Thread Francesc Alted
On 12/20/12 9:53 AM, Henry Gomersall wrote:
> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
>> The only scenario that I see that this would create unaligned arrays
>> is
>> for machines having AVX.  But provided that the Intel architecture is
>> making great strides in fetching unaligned data, I'd be surprised
>> that
>> the difference in performance would be even noticeable.
>>
>> Can you tell us which difference in performance are you seeing for an
>> AVX-aligned array and other that is not AVX-aligned?  Just curious.
> Further to this point, from an Intel article...
>
> http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
>
> "Aligning data to vector length is always recommended. When using Intel
> SSE and Intel SSE2 instructions, loaded data should be aligned to 16
> bytes. Similarly, to achieve best results use Intel AVX instructions on
> 32-byte vectors that are 32-byte aligned. The use of Intel AVX
> instructions on unaligned 32-byte vectors means that every second load
> will be across a cache-line split, since the cache line is 64 bytes.
> This doubles the cache line split rate compared to Intel SSE code that
> uses 16-byte vectors. A high cache-line split rate in memory-intensive
> code is extremely likely to cause performance degradation. For that
> reason, it is highly recommended to align the data to 32 bytes for use
> with Intel AVX."
>
> Though it would be nice to put together a little example of this!

Indeed, an example is what I was looking for.  So, given that I have 
access to an AVX-capable machine (having 6 physical cores), and that MKL 
10.3 has support for AVX, I have made some comparisons using the 
Anaconda Python distribution (it ships with most packages linked against 
MKL 10.3).

Here it is a first example using a DGEMM operation.  First using a NumPy 
that is not turbo-loaded with MKL:

In [34]: a = np.linspace(0,1,1e7)

In [35]: b = a.reshape(1000, 10000)

In [36]: c = a.reshape(10000, 1000)

In [37]: time d = np.dot(b,c)
CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s
Wall time: 7.63 s

In [38]: time d = np.dot(c,b)
CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s
Wall time: 78.89 s

This is getting around 2.6 GFlop/s.  Now, with a MKL 10.3 NumPy and 
AVX-unaligned data:

In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p))
Out[7]: '0x7fcdef3b4010'  # 16 bytes alignment

In [8]: a = np.ndarray(1e7, "f8", p)

In [9]: a[:] = np.linspace(0,1,1e7)

In [10]: b = a.reshape(1000, 10000)

In [11]: c = a.reshape(10000, 1000)

In [37]: %timeit d = np.dot(b,c)
10 loops, best of 3: 164 ms per loop

In [38]: %timeit d = np.dot(c,b)
1 loops, best of 3: 1.65 s per loop

That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX).

Now, using MKL 10.3 and AVX-aligned data:

In [21]: p2 = ctypes.create_string_buffer(int(8e7+16)); 
hex(ctypes.addressof(p2))
Out[21]: '0x7f8cb9598010'

In [22]: a2 = np.ndarray(1e7+2, "f8", p2)[2:]  # skip the first 16 bytes 
(now is 32-bytes aligned)

In [23]: a2[:] = np.linspace(0,1,1e7)

In [24]: b2 = a2.reshape(1000, 10000)

In [25]: c2 = a2.reshape(10000, 1000)

In [35]: %timeit d2 = np.dot(b2,c2)
10 loops, best of 3: 163 ms per loop

In [36]: %timeit d2 = np.dot(c2,b2)
1 loops, best of 3: 1.67 s per loop

So, again, around 120 GFlop/s, and the difference wrt to unaligned AVX 
data is negligible.

One may argue that DGEMM is CPU-bounded and that memory access plays 
little role here, and that is certainly true.  So, let's go with a more 
memory-bounded problem, like computing a transcendental function with 
numexpr.  First with NumPy and a numexpr with no MKL support:

In [8]: a = np.linspace(0,1,1e8)

In [9]: %time b = np.sin(a)
CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s
Wall time: 1.42 s

In [10]: import numexpr as ne

In [12]: %time b = ne.evaluate("sin(a)")
CPU times: user 1.42 s, sys: 0.27 s, total: 1.69 s
Wall time: 0.37 s

This is around 4x faster than the regular 'sin' in libc, and about the 
same speed as a memcpy():

In [13]: %time c = a.copy()
CPU times: user 0.19 s, sys: 0.20 s, total: 0.39 s
Wall time: 0.39 s

Now, with a MKL-aware numexpr and non-AVX alignment:

In [8]: p = ctypes.create_string_buffer(int(8e8)); hex(ctypes.addressof(p))
Out[8]: '0x7fce435da010'  # 16 bytes alignment

In [9]: a = np.ndarray(1e8, "f8", p)

In [10]: a[:] = np.linspace(0,1,1e8)

In [11]: %time b = ne.evaluate("sin(a)")
CPU times: user 0.44 s, sys: 0.27 s, total: 0.71 s
Wall time: 0.15 s

That is, more than 2x faster than a memcpy() on this system, meaning 
that the problem is truly memory-bounded.  So now, with an AVX-aligned 
buffer:

In [14]: a2 = a[2:]  # skip the first 16 bytes

In [15]: %time b = ne.evaluate("sin(a2)")

Re: [Numpy-discussion] Byte aligned arrays

2012-12-21 Thread Francesc Alted
On 12/20/12 7:35 PM, Henry Gomersall wrote:
> On Thu, 2012-12-20 at 15:23 +0100, Francesc Alted wrote:
>> On 12/20/12 9:53 AM, Henry Gomersall wrote:
>>> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
>>>> The only scenario that I see that this would create unaligned
>> arrays
>>>> is
>>>> for machines having AVX.  But provided that the Intel architecture
>> is
>>>> making great strides in fetching unaligned data, I'd be surprised
>>>> that
>>>> the difference in performance would be even noticeable.
>>>>
>>>> Can you tell us which difference in performance are you seeing for
>> an
>>>> AVX-aligned array and other that is not AVX-aligned?  Just curious.
>>> Further to this point, from an Intel article...
>>>
>>>
>> http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
>>> "Aligning data to vector length is always recommended. When using
>> Intel
>>> SSE and Intel SSE2 instructions, loaded data should be aligned to 16
>>> bytes. Similarly, to achieve best results use Intel AVX instructions
>> on
>>> 32-byte vectors that are 32-byte aligned. The use of Intel AVX
>>> instructions on unaligned 32-byte vectors means that every second
>> load
>>> will be across a cache-line split, since the cache line is 64 bytes.
>>> This doubles the cache line split rate compared to Intel SSE code
>> that
>>> uses 16-byte vectors. A high cache-line split rate in
>> memory-intensive
>>> code is extremely likely to cause performance degradation. For that
>>> reason, it is highly recommended to align the data to 32 bytes for
>> use
>>> with Intel AVX."
>>>
>>> Though it would be nice to put together a little example of this!
>> Indeed, an example is what I was looking for.  So provided that I
>> have
>> access to an AVX capable machine (having 6 physical cores), and that
>> MKL
>> 10.3 has support for AVX, I have made some comparisons using the
>> Anaconda Python distribution (it ships with most packages linked
>> against
>> MKL 10.3).
> 
>
>> All in all, it is not clear that AVX alignment would have an
>> advantage,
>> even for memory-bounded problems.  But of course, if Intel people are
>> saying that AVX alignment is important is because they have use cases
>> for asserting this.  It is just that I'm having a difficult time to
>> find
>> these cases.
> Thanks for those examples, they were very interesting. I managed to
> temporarily get my hands on a machine with AVX and I have shown some
> speed-up with aligned arrays.
>
> FFT (using my wrappers) gives about a 15% speedup.
>
> Also this convolution code:
> https://github.com/hgomersall/SSE-convolution/blob/master/convolve.c
>
> Shows a small but repeatable speed-up (a few %) when using some aligned
> loads (as many as I can work out to use!).

Okay, so a 15% speedup is significant, yes.  I'm still wondering why I did not 
get any speedup at all using MKL, but probably the reason is that it 
handles the unaligned edges of the datasets first, and then uses 
aligned access for the rest of the data (but I am just guessing here).

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Byte aligned arrays

2012-12-21 Thread Francesc Alted
On 12/21/12 11:58 AM, Henry Gomersall wrote:
> On Fri, 2012-12-21 at 11:34 +0100, Francesc Alted wrote:
>>> Also this convolution code:
>>> https://github.com/hgomersall/SSE-convolution/blob/master/convolve.c
>>>
>>> Shows a small but repeatable speed-up (a few %) when using some
>> aligned
>>> loads (as many as I can work out to use!).
>> Okay, so a 15% is significant, yes.  I'm still wondering why I did
>> not
>> get any speedup at all using MKL, but probably the reason is that it
>> manages the unaligned corners of the datasets first, and then uses an
>> aligned access for the rest of the data (but just guessing here).
> With SSE in that convolution code example above (in which all alignments
> need be considered for each output element), I note a significant
> speedup by creating 4 copies of the float input array using memcopy,
> each shifted by 1 float (so the 5th element is aligned again). Despite
> all the extra copies its still quicker than using an unaligned load.
> However, when one tries the same trick with 8 copies for AVX it's
> actually slower than the SSE case.
>
> The fastest AVX (and any) implementation I have so far is with
> 16-aligned arrays (made with 4 copies as with SSE), with alternate
> aligned and unaligned loads (which is always at worst 16-byte aligned).
>
> Fascinating stuff!

Yes, to say the least.  And it supports the fact that, when fine-tuning 
memory access performance, there is no substitute for experimentation 
(which often goes in weird ways :)

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Byte aligned arrays

2012-12-21 Thread Francesc Alted
On 12/21/12 1:35 PM, Dag Sverre Seljebotn wrote:
> On 12/20/2012 03:23 PM, Francesc Alted wrote:
>> On 12/20/12 9:53 AM, Henry Gomersall wrote:
>>> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
>>>> The only scenario that I see that this would create unaligned arrays
>>>> is
>>>> for machines having AVX.  But provided that the Intel architecture is
>>>> making great strides in fetching unaligned data, I'd be surprised
>>>> that
>>>> the difference in performance would be even noticeable.
>>>>
>>>> Can you tell us which difference in performance are you seeing for an
>>>> AVX-aligned array and other that is not AVX-aligned?  Just curious.
>>> Further to this point, from an Intel article...
>>>
>>> http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
>>>
>>> "Aligning data to vector length is always recommended. When using Intel
>>> SSE and Intel SSE2 instructions, loaded data should be aligned to 16
>>> bytes. Similarly, to achieve best results use Intel AVX instructions on
>>> 32-byte vectors that are 32-byte aligned. The use of Intel AVX
>>> instructions on unaligned 32-byte vectors means that every second load
>>> will be across a cache-line split, since the cache line is 64 bytes.
>>> This doubles the cache line split rate compared to Intel SSE code that
>>> uses 16-byte vectors. A high cache-line split rate in memory-intensive
>>> code is extremely likely to cause performance degradation. For that
>>> reason, it is highly recommended to align the data to 32 bytes for use
>>> with Intel AVX."
>>>
>>> Though it would be nice to put together a little example of this!
>> Indeed, an example is what I was looking for.  So provided that I have
>> access to an AVX capable machine (having 6 physical cores), and that MKL
>> 10.3 has support for AVX, I have made some comparisons using the
>> Anaconda Python distribution (it ships with most packages linked against
>> MKL 10.3).
>>
>> Here it is a first example using a DGEMM operation.  First using a NumPy
>> that is not turbo-loaded with MKL:
>>
>> In [34]: a = np.linspace(0,1,1e7)
>>
>> In [35]: b = a.reshape(1000, 10000)
>>
>> In [36]: c = a.reshape(10000, 1000)
>>
>> In [37]: time d = np.dot(b,c)
>> CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s
>> Wall time: 7.63 s
>>
>> In [38]: time d = np.dot(c,b)
>> CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s
>> Wall time: 78.89 s
>>
>> This is getting around 2.6 GFlop/s.  Now, with a MKL 10.3 NumPy and
>> AVX-unaligned data:
>>
>> In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p))
>> Out[7]: '0x7fcdef3b4010'  # 16 bytes alignment
>>
>> In [8]: a = np.ndarray(1e7, "f8", p)
>>
>> In [9]: a[:] = np.linspace(0,1,1e7)
>>
>> In [10]: b = a.reshape(1000, 10000)
>>
>> In [11]: c = a.reshape(10000, 1000)
>>
>> In [37]: %timeit d = np.dot(b,c)
>> 10 loops, best of 3: 164 ms per loop
>>
>> In [38]: %timeit d = np.dot(c,b)
>> 1 loops, best of 3: 1.65 s per loop
>>
>> That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX).
>>
>> Now, using MKL 10.3 and AVX-aligned data:
>>
>> In [21]: p2 = ctypes.create_string_buffer(int(8e7+16));
>> hex(ctypes.addressof(p2))
>> Out[21]: '0x7f8cb9598010'
>>
>> In [22]: a2 = np.ndarray(1e7+2, "f8", p2)[2:]  # skip the first 16 bytes
>> (now is 32-bytes aligned)
>>
>> In [23]: a2[:] = np.linspace(0,1,1e7)
>>
>> In [24]: b2 = a2.reshape(1000, 10000)
>>
>> In [25]: c2 = a2.reshape(10000, 1000)
>>
>> In [35]: %timeit d2 = np.dot(b2,c2)
>> 10 loops, best of 3: 163 ms per loop
>>
>> In [36]: %timeit d2 = np.dot(c2,b2)
>> 1 loops, best of 3: 1.67 s per loop
>>
>> So, again, around 120 GFlop/s, and the difference wrt to unaligned AVX
>> data is negligible.
>>
>> One may argue that DGEMM is CPU-bounded and that memory access plays
>> little role here, and that is certainly true.  So, let's go with a more
>> memory-bounded problem, like computing a transcendental function with
>> numexpr.  First with a with NumPy and numexpr with no MKL support:
>>
>> In [8]: a = np.linspace(0,1,1e8)
>>
>> In [9]: %time b = np.sin(a)
>> CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s
>> Wall time: 1.42 s

Re: [Numpy-discussion] ANN: NumPy 1.7.0 release

2013-02-10 Thread Francesc Alted
Exciting stuff. Thanks a lot to you and everybody implied in the release
for an amazing job.

Francesc
On 10/02/2013 2:25, "Ondřej Čertík"  wrote:

> Hi,
>
> I'm pleased to announce the availability of the final release of
> NumPy 1.7.0.
>
> Sources and binary installers can be found at
> https://sourceforge.net/projects/numpy/files/NumPy/1.7.0/
>
> This release is equivalent to the 1.7.0rc2 release, since no more problems
> were found. For release notes see below.
>
> I would like to thank everybody who contributed to this release.
>
> Cheers,
> Ondrej
>
>
> =
> NumPy 1.7.0 Release Notes
> =
>
> This release includes several new features as well as numerous bug fixes
> and
> refactorings. It supports Python 2.4 - 2.7 and 3.1 - 3.3 and is the last
> release that supports Python 2.4 - 2.5.
>
> Highlights
> ==
>
> * ``where=`` parameter to ufuncs (allows the use of boolean arrays to
> choose
>   where a computation should be done)
> * ``vectorize`` improvements (added 'excluded' and 'cache' keyword, general
>   cleanup and bug fixes)
> * ``numpy.random.choice`` (random sample generating function)
>
>
> Compatibility notes
> ===
>
> In a future version of numpy, the functions np.diag, np.diagonal, and the
> diagonal method of ndarrays will return a view onto the original array,
> instead of producing a copy as they do now. This makes a difference if you
> write to the array returned by any of these functions. To facilitate this
> transition, numpy 1.7 produces a FutureWarning if it detects that you may
> be attempting to write to such an array. See the documentation for
> np.diagonal for details.
>
> Similar to np.diagonal above, in a future version of numpy, indexing a
> record array by a list of field names will return a view onto the original
> array, instead of producing a copy as they do now. As with np.diagonal,
> numpy 1.7 produces a FutureWarning if it detects that you may be attempting
> to write to such an array. See the documentation for array indexing for
> details.
>
> In a future version of numpy, the default casting rule for UFunc out=
> parameters will be changed from 'unsafe' to 'same_kind'. (This also applies
> to in-place operations like a += b, which is equivalent to np.add(a, b,
> out=a).) Most usages which violate the 'same_kind' rule are likely bugs, so
> this change may expose previously undetected errors in projects that depend
> on NumPy. In this version of numpy, such usages will continue to succeed,
> but will raise a DeprecationWarning.
>
> Full-array boolean indexing has been optimized to use a different,
> optimized code path.   This code path should produce the same results,
> but any feedback about changes to your code would be appreciated.
>
> Attempting to write to a read-only array (one with ``arr.flags.writeable``
> set to ``False``) used to raise either a RuntimeError, ValueError, or
> TypeError inconsistently, depending on which code path was taken. It now
> consistently raises a ValueError.
>
> The .reduce functions evaluate some reductions in a different order
> than in previous versions of NumPy, generally providing higher performance.
> Because of the nature of floating-point arithmetic, this may subtly change
> some results, just as linking NumPy to a different BLAS implementations
> such as MKL can.
>
> If upgrading from 1.5, then generally in 1.6 and 1.7 there have been
> substantial code added and some code paths altered, particularly in the
> areas of type resolution and buffered iteration over universal functions.
> This might have an impact on your code particularly if you relied on
> accidental behavior in the past.
>
> New features
> 
>
> Reduction UFuncs Generalize axis= Parameter
> ---
>
> Any ufunc.reduce function call, as well as other reductions like sum, prod,
> any, all, max and min support the ability to choose a subset of the axes to
> reduce over. Previously, one could say axis=None to mean all the axes or
> axis=# to pick a single axis.  Now, one can also say axis=(#,#) to pick a
> list of axes for reduction.
>
> Reduction UFuncs New keepdims= Parameter
> 
>
> There is a new keepdims= parameter, which if set to True, doesn't throw
> away the reduction axes but instead sets them to have size one.  When this
> option is set, the reduction result will broadcast correctly to the
> original operand which was reduced.
>
> Datetime support
> 
>
> .. note:: The datetime API is *experimental* in 1.7.0, and may undergo
> changes
>in future versions of NumPy.
>
> There have been a lot of fixes and enhancements to datetime64 compared
> to NumPy 1.6:
>
> * the parser is quite strict about only accepting ISO 8601 dates, with a
> few
>   convenience extensions
> * converts between units correctly
> * datetime arithmetic works correctly
> * business day functionality (allows t

Re: [Numpy-discussion] pip install numpy throwing a lot of output.

2013-02-12 Thread Francesc Alted
On 2/12/13 1:37 PM, Daπid wrote:
> I have just upgraded numpy with pip on Linux 64 bits with Python 2.7,
> and I got *a lot* of output, so much it doesn't fit in the terminal.
> Most of it are gcc commands, but there are many different errors
> thrown by the compiler. Is this expected?

Yes, I think that's expected. Just to make sure, can you send some 
excerpts of the errors that you are getting?

>
> I am not too worried as the test suite passes, but pip is supposed to
> give only meaningful output (or at least, this is what the creators
> intended).

Well, pip needs to compile the libraries prior to installing them, so 
compile messages are meaningful. Another question would be whether to reduce 
the amount of compile messages by default in NumPy, but I don't think this 
is realistic (or even desirable).

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] pip install numpy throwing a lot of output.

2013-02-12 Thread Francesc Alted
On 2/12/13 3:18 PM, Daπid wrote:
> On 12 February 2013 14:58, Francesc Alted  wrote:
>> Yes, I think that's expected. Just to make sure, can you send some
>> excerpts of the errors that you are getting?
> Actually the errors are at the beginning of the process, so they are
> out of the reach of my terminal right now. Seems like pip doesn't keep
> a log in case of success.

Well, I think these errors are part of the auto-discovery process for 
the functions supported by the libraries in the host OS (a kind of 
`autoconf` for Python), so they can be considered 'normal'.

>
> The ones I can see are mostly warnings of unused variables and
> functions, maybe this is the expected behaviour for a library? This
> errors come from a complete reinstall instead of the original upgrade
> (the cat closed the terminal, worst excuse ever!):
[clip]

These are not errors, but warnings. While it would be desirable to 
avoid any warnings during the compilation process, not many libraries 
achieve this (but patches for removing them are accepted).

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] GSOC 2013

2013-03-06 Thread Francesc Alted
On 3/5/13 7:14 PM, Kurt Smith wrote:
> On Tue, Mar 5, 2013 at 1:45 AM, Eric Firing  wrote:
>> On 2013/03/04 9:01 PM, Nicolas Rougier wrote:
>>>>> This made me think of a serious performance limitation of structured 
>>>>> dtypes: a
>>>>> structured dtype is always "packed", which may lead to terrible byte 
>>>>> alignment
>>>>> for common types.  For instance, `dtype([('a', 'u1'), ('b',
>>>>> 'u8')]).itemsize == 9`,
>>>>> meaning that the 8-byte integer is not aligned as an equivalent C-struct's
>>>>> would be, leading to all sorts of horrors at the cache and register level.
>> Doesn't the "align" kwarg of np.dtype do what you want?
>>
>> In [2]: dt = np.dtype(dict(names=['a', 'b'], formats=['u1', 'u8']),
>> align=True)
>>
>> In [3]: dt.itemsize
>> Out[3]: 16
> Thanks!  That's what I get for not checking before posting.
>
> Consider this my vote to make `aligned=True` the default.

I would not rush into this.  The example above takes 9 bytes to host the 
structure, while `aligned=True` will take 16 bytes.  I'd rather leave 
the default as it is, and in case performance is critical, you can 
always copy the unaligned field to a new (homogeneous) array.
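
That is (a minimal sketch of the workaround, with made-up sizes):

import numpy as np

packed_dt = np.dtype([('a', 'u1'), ('b', 'u8')])  # itemsize 9, 'b' unaligned
arr = np.zeros(1000000, dtype=packed_dt)

b = arr['b'].copy()   # contiguous, naturally aligned copy of the hot column
# ... do the heavy number crunching on `b` instead of arr['b'] ...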

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] aligned / unaligned structured dtype behavior (was: GSOC 2013)

2013-03-07 Thread Francesc Alted
On 3/6/13 7:42 PM, Kurt Smith wrote:
> And regarding performance, doing simple timings shows a 30%-ish
> slowdown for unaligned operations:
>
> In [36]: %timeit packed_arr['b']**2
> 100 loops, best of 3: 2.48 ms per loop
>
> In [37]: %timeit aligned_arr['b']**2
> 1000 loops, best of 3: 1.9 ms per loop

Hmm, that clearly depends on the architecture.  On my machine:

In [1]: import numpy as np

In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)

In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)

In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)

In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)

In [6]: baligned = aligned_arr['b']

In [7]: bpacked = packed_arr['b']

In [8]: %timeit baligned**2
1000 loops, best of 3: 1.96 ms per loop

In [9]: %timeit bpacked**2
100 loops, best of 3: 7.84 ms per loop

That is, the unaligned column is 4x slower (!).  numexpr gives somewhat 
better results:

In [11]: %timeit numexpr.evaluate('baligned**2')
1000 loops, best of 3: 1.13 ms per loop

In [12]: %timeit numexpr.evaluate('bpacked**2')
1000 loops, best of 3: 865 us per loop

Yes, in this case, the unaligned array goes faster (as much as 30%).  I 
think the reason is that numexpr optimizes the unaligned access by doing 
a copy of the different chunks into internal buffers that fit in the L1 
cache.  Apparently this is very beneficial in this case (not sure why, 
though).

>
> Whereas summing shows just a 10%-ish slowdown:
>
> In [38]: %timeit packed_arr['b'].sum()
> 1000 loops, best of 3: 1.29 ms per loop
>
> In [39]: %timeit aligned_arr['b'].sum()
> 1000 loops, best of 3: 1.14 ms per loop

On my machine:

In [14]: %timeit baligned.sum()
1000 loops, best of 3: 1.03 ms per loop

In [15]: %timeit bpacked.sum()
100 loops, best of 3: 3.79 ms per loop

Again, the 4x slowdown is here.  Using numexpr:

In [16]: %timeit numexpr.evaluate('sum(baligned)')
100 loops, best of 3: 2.16 ms per loop

In [17]: %timeit numexpr.evaluate('sum(bpacked)')
100 loops, best of 3: 2.08 ms per loop

Again, the unaligned case is slightly better.  In this case numexpr is 
a bit slower than NumPy because sum() is not parallelized internally.  
Hmm, given that, I'm wondering if some internal copies to L1 in NumPy 
could help improve unaligned performance. Worth a try?

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] aligned / unaligned structured dtype behavior

2013-03-07 Thread Francesc Alted
On 3/7/13 6:47 PM, Francesc Alted wrote:
> On 3/6/13 7:42 PM, Kurt Smith wrote:
>> And regarding performance, doing simple timings shows a 30%-ish
>> slowdown for unaligned operations:
>>
>> In [36]: %timeit packed_arr['b']**2
>> 100 loops, best of 3: 2.48 ms per loop
>>
>> In [37]: %timeit aligned_arr['b']**2
>> 1000 loops, best of 3: 1.9 ms per loop
>
> Hmm, that clearly depends on the architecture.  On my machine:
>
> In [1]: import numpy as np
>
> In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
>
> In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
>
> In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)
>
> In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)
>
> In [6]: baligned = aligned_arr['b']
>
> In [7]: bpacked = packed_arr['b']
>
> In [8]: %timeit baligned**2
> 1000 loops, best of 3: 1.96 ms per loop
>
> In [9]: %timeit bpacked**2
> 100 loops, best of 3: 7.84 ms per loop
>
> That is, the unaligned column is 4x slower (!).  numexpr allows 
> somewhat better results:
>
> In [11]: %timeit numexpr.evaluate('baligned**2')
> 1000 loops, best of 3: 1.13 ms per loop
>
> In [12]: %timeit numexpr.evaluate('bpacked**2')
> 1000 loops, best of 3: 865 us per loop

Just for completeness, here it is what Theano gets:

In [18]: import theano

In [20]: a = theano.tensor.vector()

In [22]: f = theano.function([a], a**2)

In [23]: %timeit f(baligned)
100 loops, best of 3: 7.74 ms per loop

In [24]: %timeit f(bpacked)
100 loops, best of 3: 12.6 ms per loop

So yeah, Theano is also slower for the unaligned case (but less than 2x 
in this case).

>
> Yes, in this case, the unaligned array goes faster (as much as 30%).  
> I think the reason is that numexpr optimizes the unaligned access by 
> doing a copy of the different chunks in internal buffers that fits in 
> L1 cache.  Apparently this is very beneficial in this case (not sure 
> why, though).
>
>>
>> Whereas summing shows just a 10%-ish slowdown:
>>
>> In [38]: %timeit packed_arr['b'].sum()
>> 1000 loops, best of 3: 1.29 ms per loop
>>
>> In [39]: %timeit aligned_arr['b'].sum()
>> 1000 loops, best of 3: 1.14 ms per loop
>
> On my machine:
>
> In [14]: %timeit baligned.sum()
> 1000 loops, best of 3: 1.03 ms per loop
>
> In [15]: %timeit bpacked.sum()
> 100 loops, best of 3: 3.79 ms per loop
>
> Again, the 4x slowdown is here.  Using numexpr:
>
> In [16]: %timeit numexpr.evaluate('sum(baligned)')
> 100 loops, best of 3: 2.16 ms per loop
>
> In [17]: %timeit numexpr.evaluate('sum(bpacked)')
> 100 loops, best of 3: 2.08 ms per loop

And with Theano:

In [26]: f2 = theano.function([a], a.sum())

In [27]: %timeit f2(baligned)
100 loops, best of 3: 2.52 ms per loop

In [28]: %timeit f2(bpacked)
100 loops, best of 3: 7.43 ms per loop

Again, the unaligned case is significantly slower (as much as 3x here!).

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] aligned / unaligned structured dtype behavior

2013-03-08 Thread Francesc Alted
On 3/7/13 7:26 PM, Frédéric Bastien wrote:
> Hi,
>
> It is normal that unaligned access are slower. The hardware have been
> optimized for aligned access. So this is a user choice space vs speed.
> We can't go around that.

Well, my benchmarks apparently say that numexpr can get better 
performance when tackling computations on unaligned arrays (30% 
faster).  This puzzled me a bit yesterday, but after thinking a bit 
about what was happening, the explanation is clear to me now.

The aligned and unaligned arrays were not contiguous, as they had a gap 
between elements (a consequence of the layout of structured arrays): 8 
bytes for the aligned case and 1 byte for the packed one.  The hardware 
of modern machines fetches a complete cache line (64 bytes typically) 
whenever an element is accessed and that means that, even though we are 
only making use of one field in the computations, both fields are 
brought into cache.  That means that, for the aligned case, 16 MB (16 
bytes * 1 million elements) are transmitted to the cache, while the 
unaligned case only has to transmit 9 MB (9 bytes * 1 million).  Of 
course, transmitting 16 MB is quite a bit more work than just 9 MB.

Now, the elements land in cache aligned for the aligned case and 
unaligned for the packed case, and as you say, unaligned access in cache 
is pretty slow for the CPU, and this is the reason why NumPy can take up 
to 4x more time to perform the computation.  So why is numexpr 
performing much better for the packed case?  Well, it turns out that 
numexpr has machinery to detect that an array is unaligned, and does an 
internal copy of every block that is brought to the cache to be 
computed.  This block size is between 1024 elements (8 KB for double 
precision) and 4096 elements when linked with VML support, and that 
means that a copy normally happens at L1 or L2 cache speed, which is 
much faster than a memory-to-memory copy. After the copy numexpr can 
perform operations with aligned data at full CPU speed.  The paradox is 
that, by doing more copies, you may end up performing faster computations.  
This is the joy of programming with the memory hierarchy in mind.

This is to say that there is more to the equation than just whether an 
array is aligned or not.  You must take into account how (and how much!) 
data travels from storage to the CPU before making assumptions about the 
performance of your programs.
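
For illustration only, here is a rough pure-NumPy sketch of that blocked 
strategy (the real numexpr machinery is written in C and is more elaborate; 
the function name and block size are made up for the example):

import numpy as np

def blocked_square(col, block_size=4096):
    # col is typically an unaligned, strided view such as packed_arr['b']
    out = np.empty(len(col), dtype=col.dtype)
    buf = np.empty(block_size, dtype=col.dtype)   # small, aligned scratch buffer
    for start in range(0, len(col), block_size):
        stop = min(start + block_size, len(col))
        n = stop - start
        buf[:n] = col[start:stop]      # unaligned -> aligned copy, done at L1/L2 speed
        out[start:stop] = buf[:n]**2   # compute on aligned data at full CPU speed
    return out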

>   We can only minimize the cost of unaligned
> access in some cases, but not all and those optimization depend of the
> CPU. But newer CPU have lowered in cost of unaligned access.
>
> I'm surprised that Theano worked with the unaligned input. I added
> some check to make this raise an error, as we do not support that!
> Francesc, can you check if Theano give the good result? It is possible
> that someone (maybe me), just copy the input to an aligned ndarray
> when we receive an not aligned one. That could explain why it worked,
> but my memory tell me that we raise an error.

It seems to work for me:

In [10]: f = theano.function([a], a**2)

In [11]: f(baligned)
Out[11]: array([ 1.,  1.,  1., ...,  1.,  1.,  1.])

In [12]: f(bpacked)
Out[12]: array([ 1.,  1.,  1., ...,  1.,  1.,  1.])

In [13]: f2 = theano.function([a], a.sum())

In [14]: f2(baligned)
Out[14]: array(100.0)

In [15]: f2(bpacked)
Out[15]: array(100.0)


>
> As you saw in the number, this is a bad example for Theano as the
> function compiled is too fast . Their is more Theano overhead then
> computation time in that example. We have reduced recently the
> overhead, but we can do more to lower it.

Yeah.  I was mainly curious about how different packages handle 
unaligned arrays.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: Numexpr 2.0 released

2011-11-27 Thread Francesc Alted

 Announcing Numexpr 2.0


Numexpr is a fast numerical expression evaluator for NumPy.  With it,
expressions that operate on arrays (like "3*a+4*b") are accelerated
and use less memory than doing the same calculation in Python.

It comes with multi-threaded capabilities, as well as support for Intel's
VML library, which allows for squeezing the last drop of performance
out of your multi-core processors.
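
For instance, a minimal usage sketch (the array names are just
illustrative):

import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# the expression is compiled and evaluated blockwise, using multiple threads
c = ne.evaluate("3*a + 4*b")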

What's new
==

This version comes with support for the new iterator in NumPy
(introduced in NumPy 1.6), allowing for improved performance in
practically all scenarios (the exception being very small arrays),
and most especially for operations involving broadcasting,
Fortran-ordered arrays or non-native byte orderings.

The carefully crafted mix of the new NumPy iterator and direct access
to data buffers turned out to be so powerful and flexible that the
internal virtual machine has been completely revamped around this
combination.

The drawback is that you will need NumPy >= 1.6 to run numexpr 2.0.
However, NumPy 1.6 was released more than 6 months ago, so we
think this is a good time for taking advantage of it.  Many thanks to
Mark Wiebe for such an important contribution!

For some benchmarks on the new virtual machine, see:

http://code.google.com/p/numexpr/wiki/NewVM

Also, Gaëtan de Menten contributed important bug fixes, code cleanup
as well as speed enhancements.  Francesc Alted contributed some fixes,
and added compatibility code with existing applications (PyTables)
too.

In case you want to know more in detail what has changed in this
version, see:

http://code.google.com/p/numexpr/wiki/ReleaseNotes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=

The project is hosted at Google code in:

http://code.google.com/p/numexpr/

You can get the packages from PyPI as well:

http://pypi.python.org/pypi/numexpr

Share your experience
=

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.


Enjoy!

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 2.0 released

2012-01-08 Thread Francesc Alted
Hi srean,

Sorry for being late in answering; the last few weeks have been really
crazy for me.  See my comments below.

2011/12/13 srean :
> This is great news, I hope this gets included in the epd distribution soon.
>
> I had mailed a few questions about numexpr sometime ago. I am still
> curious about those. I have included the relevant parts below. In
> addition, I have another question. There was a numexpr branch that
> allows a "out=blah" parameer to build the output in place, has that
> been merged or its functionality incorporated ?

Yes, the `out` parameter is fully supported in the 2.0 series, as well as
the new `order` and `casting` ones.  These are fully documented in the
docstrings of the forthcoming 2.0.1, as well as in the new User's Guide
wiki page at:

http://code.google.com/p/numexpr/wiki/UsersGuide

Thanks for pointing this out!
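
For instance, a minimal sketch of the `out` usage (array names are
illustrative; the result is written into a preallocated buffer):

import numpy as np
import numexpr as ne

a = np.arange(1e6)
b = np.arange(1e6)
c = np.empty_like(a)

# the result goes directly into `c`, with no extra output array allocated
ne.evaluate("a + 2*b", out=c)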

> This goes without saying, but, thanks for numexpr.
>
> --  from old mail --
>
> What I find somewhat encumbering is that there is no single piece of
> document that lists all the operators and functions that numexpr can
> parse. For a new user this will be very useful There is a list in the
> wiki page entitled "overview" but it seems incomplete (for instance it
> does not describe the reduction operations available). I do not know
> enough to know how incomplete it is.

The reduction functions are just `sum()` and `prod()` and are fully
documented in the new User's Guide.
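
For instance, a minimal reduction sketch (array names are illustrative;
reductions must be the outermost operation in the expression):

import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# sum() and prod() are the only reductions available
total = ne.evaluate("sum(a*b)")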

>
> Is there any plan to implement the reduction like enhancements that
> ufuncs provide: namely reduce_at, accumulate, reduce ? It is entirely
> possible that they are already in there but I could not figure out how
> to use them. If they aren't it would be great to have them.

No, these are not implemented, but we will gladly accept contributions ;)

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: Numexpr 2.0.1 released

2012-01-08 Thread Francesc Alted
==
 Announcing Numexpr 2.0.1
==

Numexpr is a fast numerical expression evaluator for NumPy.  With it,
expressions that operate on arrays (like "3*a+4*b") are accelerated
and use less memory than doing the same calculation in Python.

It comes with multi-threaded capabilities, as well as support for Intel's
VML library, which allows for squeezing the last drop of performance
out of your multi-core processors.

What's new
==

In this release, better docstrings for `evaluate` and the reduction
methods (`sum`, `prod`) are in place.  Also, compatibility with Python
2.5 has been restored (2.4 is definitely not supported anymore).

In case you want to know more in detail what has changed in this
version, see:

http://code.google.com/p/numexpr/wiki/ReleaseNotes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=

The project is hosted at Google code in:

http://code.google.com/p/numexpr/

You can get the packages from PyPI as well:

http://pypi.python.org/pypi/numexpr

Share your experience
=

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.


Enjoy!

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: Numexpr 2.0.1 released

2012-01-08 Thread Francesc Alted
Python3 is not on my radar yet.  Perhaps others might be interested in
doing the port.

Francesc

2012/1/8 Nadav Horesh :
> What about python3 support?
>
>  Thanks
>
>    Nadav.
>
> 
> From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] 
> On Behalf Of Francesc Alted [fal...@gmail.com]
> Sent: 08 January 2012 12:49
> To: Discussion of Numerical Python; numexpr
> Subject: [Numpy-discussion] ANN: Numexpr 2.0.1 released
>
> ==
>  Announcing Numexpr 2.0.1
> ==
>
> Numexpr is a fast numerical expression evaluator for NumPy.  With it,
> expressions that operate on arrays (like "3*a+4*b") are accelerated
> and use less memory than doing the same calculation in Python.
>
> It wears multi-threaded capabilities, as well as support for Intel's
> VML library, which allows for squeezing the last drop of performance
> out of your multi-core processors.
>
> What's new
> ==
>
> In this release, better docstrings for `evaluate` and reduction
> methods (`sum`, `prod`) is in place.  Also, compatibility with Python
> 2.5 has been restored (2.4 is definitely not supported anymore).
>
> In case you want to know more in detail what has changed in this
> version, see:
>
> http://code.google.com/p/numexpr/wiki/ReleaseNotes
>
> or have a look at RELEASE_NOTES.txt in the tarball.
>
> Where I can find Numexpr?
> =
>
> The project is hosted at Google code in:
>
> http://code.google.com/p/numexpr/
>
> You can get the packages from PyPI as well:
>
> http://pypi.python.org/pypi/numexpr
>
> Share your experience
> =
>
> Let us know of any bugs, suggestions, gripes, kudos, etc. you may
> have.
>
>
> Enjoy!
>
> --
> Francesc Alted
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion



-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] simple manipulations of numpy arrays

2012-02-10 Thread Francesc Alted
On Feb 10, 2012, at 3:29 PM, Brad Reisfeld wrote:

> Hi,
> 
> I am relatively new to numpy and am seeking some advice on an
> appropriate way to do the following simple task.
> 
> The idea is to build a class that will allow a user to easily remove
> and keep columns and rows in a 2D numpy array.


Apart from the good suggestions already made, you may also find the carray 
package useful, available at:

https://github.com/FrancescAlted/carry

It implements a ctable object that has several capabilities:

* Allows addition and removal of columns very efficiently
* Allows enlarging and shrinking the ctable
* Supports compression (via the fast Blosc compressor)
* If numexpr is installed, you can seamlessly operate on columns efficiently
* You can efficiently select rows using complex conditions (needs numexpr too)

You can have a quick look at how this works in the second part of the tutorial:

https://github.com/FrancescAlted/carray/blob/master/doc/tutorial.rst
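
As a rough sketch of that column juggling (method names follow the tutorial 
above; check the docs of your carray version, as the exact signatures may 
differ):

import numpy as np
import carray as ca

N = 1000000
t = ca.ctable((np.arange(N), np.linspace(0, 1, N)), names=['a', 'b'])

t.addcol(np.random.rand(N), name='c')   # add a column efficiently
t.delcol('a')                           # and remove one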

Hope it helps,

-- Francesc Alted



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] simple manipulations of numpy arrays

2012-02-10 Thread Francesc Alted
On Feb 10, 2012, at 4:50 PM, Francesc Alted wrote:

> https://github.com/FrancescAlted/carry

Hmm, this should be:

https://github.com/FrancescAlted/carray

Blame my (too) smart spell corrector.

-- Francesc Alted



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Commit rights to NumPy for Francesc Alted

2012-02-12 Thread Francesc Alted
On Feb 12, 2012, at 12:07 AM, Ralf Gommers wrote:
> On Sat, Feb 11, 2012 at 11:06 PM, Fernando Perez  wrote:
> On Sat, Feb 11, 2012 at 11:11 AM, Travis Oliphant  wrote:
> > I propose to give Francesc Alted commit rights to the NumPy project.
> 
> +1.

Thanks for the kind invitation.  While it is true that in the last year I've 
been quite detached from NumPy business, right now, and thanks to Travis O., NumPy 
has become a high priority project for me.  I just hope that I can provide some 
interesting value on the path toward the forthcoming NumPy 2.0.

Cheers!

-- Francesc Alted



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Index Array Performance

2012-02-14 Thread Francesc Alted
On Feb 14, 2012, at 1:50 AM, Wes McKinney wrote:
[clip]
> But:
> 
> In [40]: timeit hist[i, j]
> 1 loops, best of 3: 32 us per loop
> 
> So that's roughly 7-8x slower than a simple Cython method, so I
> sincerely hope it could be brought down to the sub 10 microsecond
> level with a little bit of work.

I vaguely remember this has shown up before.  My hunch is that indexing in 
NumPy is so powerful that it has to check for a lot of different kinds of 
indices (integers, tuples, lists, arrays…), and that it is all these checks 
that take the time.  Your Cython wrapper just assumed that the indices were 
integers, so this is probably the reason why it is that much faster.

This is not to say that indexing in NumPy could not be accelerated, but it 
won't be trivial, IMO.
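
Purely as an illustration of how much the generic machinery costs, one can 
flatten the indices by hand and index the flat array instead (whether this 
actually helps depends on the NumPy version; the names follow the example 
above):

import numpy as np

hist = np.zeros((1000, 1000))
i = np.random.randint(0, 1000, size=10000)
j = np.random.randint(0, 1000, size=10000)

vals = hist[i, j]                            # generic fancy indexing

flat = np.ravel_multi_index((i, j), hist.shape)
vals2 = hist.ravel()[flat]                   # single integer-array lookup on the flat view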

-- Francesc Alted



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Change in scalar upcasting rules for 1.6.x?

2012-02-14 Thread Francesc Alted
On Feb 14, 2012, at 1:47 PM, Charles R Harris wrote:

> About the behavior in question, I would frame this as a specific case with 
> argument for and against like so:
> 
> The Current Behavior
> 
> In [1]: array([127], int8) + 127
> Out[1]: array([-2], dtype=int8)
> 
> In [2]: array([127], int8) + 128
> Out[2]: array([255], dtype=int16)
> 

Good point.

> Arguments for Old Behavior
> 
> Predictable, explicit output type. This is a good thing, in that no one wants 
> their 8GB int8 array turning into a 16GB int16 array.

Exactly, IIRC this is the main reason why the old behavior was chosen.

> Backward compatibility.
> 
> Arguments for New Behavior
> 
> Fewer overflow problems. But no cure.
> 
> 
> Put that way I think you can make a solid argument for a tweak to restore old 
> behavior. Overflow can be a problem, but partial cures are not going to solve 
> it. I think we do need a way to deal with overflow. Maybe in two ways. 1) 
> saturated operations, i.e., 127 + 128 -> 127. This might be good for images. 
> 2) raise an error. We could make specific ufuncs for these behaviors.

Hmm, I'm thinking that it would be nice if NumPy could actually support both 
behaviors.  I just wonder whether that should be implemented as a property of 
each array or as a global attribute for the whole NumPy package.  While the 
latter would be easier to implement (with a per-array property, what do you do 
when arrays with different behaviors are operated together?), the former would 
give more flexibility.  I know, this will introduce more complexity in the code 
base, but anyway, I think that would be a nice thing to support for NumPy 2.0.
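
For what it is worth, saturated behavior can already be emulated in user 
code; a minimal sketch (illustrative only, not a proposal for the actual API):

import numpy as np

a = np.array([127, 100, -120], dtype=np.int8)
b = np.array([127, 50, -50], dtype=np.int8)

# upcast, add, clip to the int8 range, and cast back: a "saturated" add
sat = np.clip(a.astype(np.int16) + b.astype(np.int16), -128, 127).astype(np.int8)
# -> array([ 127,  127, -128], dtype=int8)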

Just a thought,

-- Francesc Alted



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] David M. Cooke?

2012-02-15 Thread Francesc Alted
Hi,

I know this is a bit unusual, but in the last few years I have completely lost 
track of David M. Cooke.  Most of the veterans on this list will remember him 
as being a great contributor to NumPy during the years 2004 to 2007.  He was 
the creator of numexpr (back in 2006) too.  I wonder if somebody knows anything 
about him.  If so, please tell me.

Thanks!

-- Francesc Alted



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy governance update

2012-02-16 Thread Francesc Alted
On Feb 16, 2012, at 12:15 PM, Jason Grout wrote:

> On 2/15/12 6:27 PM, Dag Sverre Seljebotn wrote:
>> But in the very end, when agreement can't
>> be reached by other means, the developers are the one making the calls.
>> (This is simply a consequence that they are the only ones who can
>> credibly threaten to fork the project.)
> 
> Interesting point.  I hope I'm not pitching a log onto the fire here, 
> but in numpy's case, there are very many capable developers on other 
> projects who depend on numpy who could credibly threaten a fork if they 
> felt numpy was drastically going wrong.

Jason, that there are capable developers out there who are able to fork NumPy (or 
any other project you can think of) is a given.  The point Dag was making is 
that this threat is more likely to come from *inside* the community.

And you pointed out an important aspect too by saying "if they felt numpy was 
drastically going wrong".  I get the impression that some people are very 
frightened that something really bad will happen, well before it happens.  
While I agree that this is *possible*, I'd also advocate giving Travis the 
benefit of the doubt.  I'm convinced he (and Continuum as a whole) is making 
things happen that will benefit the entire NumPy community; but in case 
something goes really wrong and catastrophic, it is always a relief to know 
that things can be reverted in the pure open source tradition (by either doing 
a fork, creating a new foundation, or even better, proposing a new way to do 
things).  What does not sound reasonable to me is to allow fear to block 
Continuum's efforts to make a better NumPy.  I think it is better to relax a 
bit, see how things are going, and then judge by looking at the *results*.

My two cents,

Disclaimer: As my e-mail address makes clear, I'm a Continuum guy.

-- Francesc Alted



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy governance update

2012-02-17 Thread Francesc Alted
On Feb 17, 2012, at 5:20 AM, John Hunter wrote:

> And he has proven his ability to lead when *almost everyone* was
> against him.  At the height of the Numeric/numarray split, and I was
> deeply involved in this as the mpl author because we had a "numerix"
> compatibility layer to allow users to use one or the other, Travis
> proposed writing numpy to solve both camp's problems.  I really can't
> remember a single individual who supported him.  What I remember is
> the cacophony of voices who though this was a bad idea, because of the
> "third fork" problem.  But Travis forged ahead, on his own, wrote
> numpy, and re-united the Numeric and numarray camps.


Thanks John for including this piece of NumPy history in this context.  My 
voice was part of the "cacophony" about the "third fork" problem back in 2005.  
I was pretty darn uncomfortable at the prospect of adding support for a 
third numerical package in PyTables.  However, Travis started to work with all 
engines firing, and in a few months he had built something that was far 
better than Numeric and numarray together.  The rest is history.  But I 
remember this period (2005) as one of the most dramatic examples of how the 
capacity and dedication of a single individual can shape the world.

-- Francesc Alted



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

2013-03-13 Thread Francesc Alted
On 3/13/13 2:45 PM, Andrea Cimatoribus wrote:
> Hi everybody, I hope this has not been discussed before, I couldn't find a 
> solution elsewhere.
> I need to read some binary data, and I am using numpy.fromfile to do this. 
> Since the files are huge, and would make me run out of memory, I need to read 
> data skipping some records (I am reading data recorded at high frequency, so 
> basically I want to read subsampling).
[clip]

You can do a fid.seek(offset) prior to np.fromfile() and then it will 
read from that offset.  See the docstring for `file.seek()` on how to use it.
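
A minimal sketch (the filename, dtype and counts are placeholders):

import numpy as np

dtype = np.dtype('f8')        # whatever your record type is
skip_records = 1000000        # how many records to jump over
read_records = 10000          # how many records to read after that

with open('data.bin', 'rb') as fid:
    fid.seek(skip_records * dtype.itemsize)               # jump over unwanted data
    block = np.fromfile(fid, dtype=dtype, count=read_records)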

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

2013-03-13 Thread Francesc Alted
On 3/13/13 3:53 PM, Francesc Alted wrote:
> On 3/13/13 2:45 PM, Andrea Cimatoribus wrote:
>> Hi everybody, I hope this has not been discussed before, I couldn't 
>> find a solution elsewhere.
>> I need to read some binary data, and I am using numpy.fromfile to do 
>> this. Since the files are huge, and would make me run out of memory, 
>> I need to read data skipping some records (I am reading data recorded 
>> at high frequency, so basically I want to read subsampling).
> [clip]
>
> You can do a fid.seek(offset) prior to np.fromfile() and the it will 
> read from offset.  See the docstrings for `file.seek()` on how to use it.
>

Oops, you were already using file.seek().  Please disregard.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] timezones and datetime64

2013-04-04 Thread Francesc Alted
On 4/4/13 1:52 AM, Chris Barker - NOAA Federal wrote:
> Thanks all for taking an interest. I need to think a bot more about
> the options before commenting more, but:
>
> while we're at it:
>
> It seems very odd to me that datetime64 supports different units
> (right down to  attosecond) but not different epochs. How can it
> possible be useful to use nanoseconds, etc, but only right around
> 1970? For that matter, why all the units at all? I can see the need
> for nanosecond resolution, but not without changing the epoch -- so if
> the epoch is fixed, why bother with different units?


When Ivan and I were discussing that, I remember us deciding that such 
small units would be useful mainly for the timedelta datatype, which 
is a relative, not an absolute time.  We did not want to fall short for 
very precise time measurements, and this is why we decided to go with 
attoseconds.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] timezones and datetime64

2013-04-04 Thread Francesc Alted
On 4/4/13 1:54 PM, Nathaniel Smith wrote:
> On Thu, Apr 4, 2013 at 12:52 AM, Chris Barker - NOAA Federal
>  wrote:
>> Thanks all for taking an interest. I need to think a bot more about
>> the options before commenting more, but:
>>
>> while we're at it:
>>
>> It seems very odd to me that datetime64 supports different units
>> (right down to  attosecond) but not different epochs. How can it
>> possible be useful to use nanoseconds, etc, but only right around
>> 1970? For that matter, why all the units at all? I can see the need
>> for nanosecond resolution, but not without changing the epoch -- so if
>> the epoch is fixed, why bother with different units? Using days (for
>> instance) rather than seconds doesn't save memory, as we're always
>> using 64 bits. It can't be common to need more than 2.9e12 years (OK,
>> that's not quite as old as the universe, so some cosmologists may need
>> it...)
> Another reason why it might be interesting to support different epochs
> is that many timeseries (e.g., the ones I work with) aren't linked to
> absolute time, but are instead "milliseconds since we turned on the
> recording equipment". You can reasonably represent these as timedeltas
> of course, but it'd be even more elegant to be able to be able to
> represent them as absolute times against an opaque epoch. In
> particular, when you have multiple recording tracks, only those which
> were recorded against the same epoch are actually commensurable --
> trying to do
>recording1_times[10] - recording2_times[10]
> is meaningless and should be an error.

I remember discussing this in some depth 5 years ago on this 
list, when we asked people about the convenience of including a 
user-defined 'epoch'.  We were calling it 'origin'. But apparently it 
was decided that this was not needed because timestamps+timedelta would 
be enough.  The NEP still reflects this discussion:

https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst#why-the-origin-metadata-disappeared

This is just a historical note; it is not that we can't change that again.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] timezones and datetime64

2013-04-04 Thread Francesc Alted
On 4/4/13 7:01 PM, Chris Barker - NOAA Federal wrote:
> Francesc Alted wrote:
>> When Ivan and me were discussing that, I remember us deciding that such
>> a small units would be useful mainly for the timedelta datatype, which
>> is a relative, not absolute time.  We did not want to make short for
>> very precise time measurements, and this is why we decided to go with
>> attoseconds.
> I thought about that -- but if you have timedelta without datetime,
> you really just have an integer -- we haven't bought anything.

Well, it is not just an integer.  It is an integer with a time scale:

In []: np.array(1, dtype='timedelta64[us]') + np.array(1, 
dtype='timedelta64[ns]')
Out[]: numpy.timedelta64(1001,'ns')

That makes a difference.  This can be especially important for creating 
user-defined time origins:

In []: np.array(int(1.5e9), dtype='datetime64[s]') + np.array(1, 
dtype='timedelta64[ns]')
Out[]: numpy.datetime64('2017-07-14T04:40:00.1+0200')

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] timezones and datetime64

2013-04-04 Thread Francesc Alted
On 4/4/13 8:56 PM, Chris Barker - NOAA Federal wrote:
> On Thu, Apr 4, 2013 at 10:54 AM, Francesc Alted  wrote:
>
>> That makes a difference.  This can be specially important for creating
>> user-defined time origins:
>>
>> In []: np.array(int(1.5e9), dtype='datetime64[s]') + np.array(1,
>> dtype='timedelta64[ns]')
>> Out[]: numpy.datetime64('2017-07-14T04:40:00.1+0200')
> but that's worthless if you try it higher-resolution:
>
> In [40]: np.array(int(1.5e9), dtype='datetime64[s]')
> Out[40]: array(datetime.datetime(2017, 7, 14, 2, 40), dtype='datetime64[s]')
>
> # Start at 2017
>
> # add a picosecond:
> In [41]: np.array(int(1.5e9), dtype='datetime64[s]') + np.array(1,
> dtype='timedelta64[ps]')
> Out[41]: numpy.datetime64('1970-03-08T22:55:30.029526319105-0800')
>
> # get 1970???

This is clearly a bug.  Could you file a ticket please?

Also, using attoseconds gives a weird behavior:

In []: np.array(int(1.5e9), dtype='datetime64[s]') + np.array(1, 
dtype='timedelta64[as]')
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 np.array(int(1.5e9), dtype='datetime64[s]') + np.array(1, 
dtype='timedelta64[as]')

OverflowError: Integer overflow getting a common metadata divisor for 
NumPy datetime metadata [s] and [as]

I would expect the attosecond to be happily ignored and nothing to be 
added.

>
> And even with nanoseconds, given the leap-second issues, etc, you
> really wouldn't want to do this anyway -- rather, keep your epoch
> close by.
>
> Now that I think about it -- being able to set your epoch could lessen
> the impact of leap-seconds for second-resolution as well.

Probably this is the way to go, yes.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: numexpr 2.1 RC1

2013-04-14 Thread Francesc Alted


 Announcing Numexpr 2.1RC1


Numexpr is a fast numerical expression evaluator for NumPy.  With it,
expressions that operate on arrays (like "3*a+4*b") are accelerated
and use less memory than doing the same calculation in Python.

It comes with multi-threaded capabilities, as well as support for Intel's
VML library, which allows for squeezing the last drop of performance
out of your multi-core processors.

What's new
==

This version adds compatibility with Python 3.  Many thanks to 
Antonio Valentino for his excellent work on this.  I apologize for taking 
so long to release his contributions.


In case you want to know more in detail what has changed in this
version, see:

http://code.google.com/p/numexpr/wiki/ReleaseNotes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=

The project is hosted at Google code in:

http://code.google.com/p/numexpr/

This is release candidate 1, so it will not be available in the PyPI 
repository.  I'll post it there when the final version is released.


Share your experience
=

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.


Enjoy!

--
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Profiling (was GSoC : Performance parity between numpy arrays and Python scalars)

2013-05-02 Thread Francesc Alted
On 5/2/13 3:58 PM, Nathaniel Smith wrote:
> callgrind has the *fabulous* kcachegrind front-end, but it only
> measures memory access performance on a simulated machine, which is
> very useful sometimes (if you're trying to optimize cache locality),
> but there's no guarantee that the bottlenecks on its simulated machine
> are the same as the bottlenecks on your real machine.

Agreed, there is no guarantee, but my experience is that kcachegrind 
normally gives you a pretty decent view of cache misses and hence it can 
make pretty good predictions about how these affect your computations.  I 
have used this feature extensively for optimizing parts of the Blosc 
compressor, and I could not be happier (to the point that, if it were 
not for Valgrind, I could not have figured out many interesting memory 
access optimizations).

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: python-blosc 1.1 RC1 available for testing

2013-05-17 Thread Francesc Alted

Announcing python-blosc 1.1 RC1


What is it?
===

python-blosc (http://blosc.pydata.org) is a Python wrapper for the
Blosc compression library.

Blosc (http://blosc.org) is a high performance compressor optimized for
binary data.  It has been designed to transmit data to the processor
cache faster than the traditional, non-compressed, direct memory fetch
approach via a memcpy() call.  Whether this is achieved or not
depends on the data compressibility, the number of cores in the system,
and other factors.  See a series of benchmarks conducted for many
different systems: http://blosc.org/trac/wiki/SyntheticBenchmarks.

Blosc works well for compressing numerical arrays that contain data
with relatively low entropy, like sparse data, time series, grids with
regularly-spaced values, etc.
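
A minimal round-trip sketch (treat it as an illustration; exact keyword
names can vary a bit between python-blosc versions):

import numpy as np
import blosc

a = np.linspace(0, 1, 10*1000*1000)       # low-entropy data compresses very well

packed = blosc.compress(a.tostring(), typesize=a.itemsize)    # compressed bytes
a2 = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)   # and back

assert np.array_equal(a, a2)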

There is also a handy command-line tool for Blosc called Bloscpack
(https://github.com/esc/bloscpack) that allows you to compress large
binary datafiles on disk.  Although the format for Bloscpack has not
stabilized yet, it allows you to effectively use Blosc from your
favorite shell.


What is new?


- Added new `compress_ptr` and `decompress_ptr` functions that allow
   compressing and decompressing from/to a data pointer.  These are low-level
   calls and the user must make sure that the pointer data area is safe.

- Since Blosc (the C library) already supports being installed as a
   standalone library (via cmake), it is also possible to link
   python-blosc against a system Blosc library.

- The Python calls to Blosc are now thread-safe (another consequence of
   recent Blosc library supporting this at C level).

- Many checks on types and ranges of values have been added.  Most of
   the calls will now complain when passed the wrong values.

- Docstrings are much improved. Also, Sphinx-based docs are available
   now.

Many thanks to Valentin Hänel for his impressive work for this release.

For more info, you can see the release notes in:

https://github.com/FrancescAlted/python-blosc/wiki/Release-notes

More docs and examples are available in the documentation site:

http://blosc.pydata.org


Installing
==

python-blosc is in PyPI repository, so installing it is easy:

$ pip install -U blosc  # yes, you should omit the python- prefix


Download sources


The sources are managed through github services at:

http://github.com/FrancescAlted/python-blosc


Documentation
=

There is Sphinx-based documentation site at:

http://blosc.pydata.org/


Mailing list


There is an official mailing list for Blosc at:

bl...@googlegroups.com
http://groups.google.es/group/blosc


Licenses


Both Blosc and its Python wrapper are distributed using the MIT license.
See:

https://github.com/FrancescAlted/python-blosc/blob/master/LICENSES

for more details.

--
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: python-blosc 1.1 (final) released

2013-05-24 Thread Francesc Alted
===
Announcing python-blosc 1.1
===

What is it?
===

python-blosc (http://blosc.pydata.org/) is a Python wrapper for the
Blosc compression library.

Blosc (http://blosc.org) is a high performance compressor optimized for
binary data.  It has been designed to transmit data to the processor
cache faster than the traditional, non-compressed, direct memory fetch
approach via a memcpy() call.  Whether this is achieved or not
depends on the data compressibility, the number of cores in the system,
and other factors.  See a series of benchmarks conducted for many
different systems: http://blosc.org/trac/wiki/SyntheticBenchmarks.

Blosc works well for compressing numerical arrays that contain data
with relatively low entropy, like sparse data, time series, grids with
regularly-spaced values, etc.

There is also a handy command-line tool for Blosc called Bloscpack
(https://github.com/esc/bloscpack) that allows you to compress large
binary datafiles on disk.  Although the format for Bloscpack has not
stabilized yet, it allows you to effectively use Blosc from your
favorite shell.


What is new?


- Added new `compress_ptr` and `decompress_ptr` functions that allow
   compressing and decompressing from/to a data pointer, avoiding an
   intermediate copy for maximum speed.  Be careful, as these are low-level
   calls, and the user must make sure that the pointer data area is
   safe.

- Since Blosc (the C library) already supports being installed as a
   standalone library (via cmake), it is also possible to link
   python-blosc against a system Blosc library.

- The Python calls to Blosc are now thread-safe (another consequence of
   recent Blosc library supporting this at C level).

- Many checks on types and ranges of values have been added.  Most of
   the calls will now complain when passed the wrong values.

- Docstrings are much improved. Also, Sphinx-based docs are available
   now.

Many thanks to Valentin Hänel for his impressive work for this release.

For more info, you can see the release notes in:

https://github.com/FrancescAlted/python-blosc/wiki/Release-notes

More docs and examples are available in the documentation site:

http://blosc.pydata.org


Installing
==

python-blosc is in PyPI repository, so installing it is easy:

$ pip install -U blosc  # yes, you should omit the python- prefix


Download sources


The sources are managed through github services at:

http://github.com/FrancescAlted/python-blosc


Documentation
=

There is Sphinx-based documentation site at:

http://blosc.pydata.org/


Mailing list


There is an official mailing list for Blosc at:

bl...@googlegroups.com
http://groups.google.es/group/blosc


Licenses


Both Blosc and its Python wrapper are distributed using the MIT license.
See:

https://github.com/FrancescAlted/python-blosc/blob/master/LICENSES

for more details.

Enjoy!

--
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] RAM problem during code execution - Numpya arrays

2013-08-23 Thread Francesc Alted
  S_nuevo=sum(X_nuevo)
>
> std_dev_nuevo_exp_u = np.sqrt(S_nuevo/(tipos_nuevo-1.))
>
>
> componente_longitudinal=Longitud*np.abs(np.cos(array_angle))
> comp_y=np.append(comp_y, sum(componente_longitudinal)/N)
>
> componente_transversal=Longitud*np.abs(np.sin(array_angle))
> comp_x=np.append(comp_x, sum(componente_transversal)/N)
>
> std_dev_size_medio_intuitivo=np.append(std_dev_size_medio_intuitivo, 
> std_dev_exp_u)
>
> std_dev_size_medio_nuevo=np.append(std_dev_size_medio_nuevo, 
> std_dev_nuevo_exp_u)
>
> size_medio_intuitivo=np.append(size_medio_intuitivo, 
> S_medio_intuitivo_exp_u)
>
> size_medio_nuevo=np.append(size_medio_nuevo, S_medio_nuevo_exp_u)
>
>
> percolation_probability=sum(PERCOLACION)/numero_experimentos
>
> prob_perc=np.append(prob_perc, percolation_probability)
>
> S_int = np.append (S_int, sum(size_medio_intuitivo)/numero_experimentos)
>
> S_medio=np.append (S_medio, sum(size_medio_nuevo)/numero_experimentos)
>
> desviacion_standard = np.append (desviacion_standard, 
> sum(std_dev_size_medio_intuitivo)/numero_experimentos)
>
> desviacion_standard_nuevo=np.append (desviacion_standard_nuevo, 
> sum(std_dev_size_medio_nuevo)/numero_experimentos)
>
> tiempos=np.append(tiempos, time.clock()-empieza)
>
> componente_y=np.append(componente_y, sum(comp_y)/numero_experimentos)
> componente_x=np.append(componente_x, sum(comp_x)/numero_experimentos)
>
> anisotropia_macroscopica_porcentual=100*(1-(componente_y/componente_x))
>
> I tryed with gc and gc.collect() and 'del'command for deleting arrays
> after his use and nothing work!
>
> What am I doing wrong? Why the memory becomes full while running (starts
> with 10% of RAM used and in 1-2hour is totally full used)?
>
> Please help me, I'm totally stuck!
> Thanks a lot!
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>


-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] [ANN] numexpr 2.2 released

2013-08-31 Thread Francesc Alted
==
 Announcing Numexpr 2.2
==

Numexpr is a fast numerical expression evaluator for NumPy.  With it,
expressions that operate on arrays (like "3*a+4*b") are accelerated
and use less memory than doing the same calculation in Python.

It comes with multi-threaded capabilities, as well as support for Intel's
VML library (included in Intel MKL), which allows an extremely fast
evaluation of transcendental functions (sin, cos, tan, exp, log...)
while squeezing the last drop of performance out of your multi-core
processors.

Its only dependency is NumPy (MKL is optional), so it works well as an
easy-to-deploy, easy-to-use computational kernel for projects that
don't want to adopt other solutions that require heavier
dependencies.

What's new
==

This release is mainly meant to fix a problem with the license of the
numexpr/win32/pthread.{c,h} files emulating pthreads on Windows. After
permission from the original authors was granted, these files adopt
the MIT license and can be redistributed without problems.  See issue
#109 for details
(https://code.google.com/p/numexpr/issues/detail?id=110).

Another important improvement is the new algorithm to decide the initial
number of threads to be used.  This was necessary because, by default,
numexpr was using a number of threads equal to the detected number of
cores, and this can be just too much for modern systems where this
number can be too high (and counterproductive for performance in many
cases).  Now, the 'NUMEXPR_NUM_THREADS' environment variable is
honored, and in case it is not present, a maximum number of *8*
threads is set up initially.  The new algorithm is fully described in
the Users Guide, in the note of the 'General routines' section:
https://code.google.com/p/numexpr/wiki/UsersGuide#General_routines.
Closes #110.
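
As a quick sketch, the number of threads can be controlled either via the
environment or at runtime:

import os
os.environ['NUMEXPR_NUM_THREADS'] = '4'   # honored when numexpr is imported

import numexpr as ne
ne.set_num_threads(4)                     # or change it later at runtime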

In case you want to know more in detail what has changed in this
version, see:

http://code.google.com/p/numexpr/wiki/ReleaseNotes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=

The project is hosted at Google code in:

http://code.google.com/p/numexpr/

You can get the packages from PyPI as well:

http://pypi.python.org/pypi/numexpr

Share your experience
=

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.


Enjoy data!

-- 
Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] -ffast-math

2013-12-03 Thread Francesc Alted
On 12/2/13, 12:14 AM, Dan Goodman wrote:
> Dan Goodman  thesamovar.net> writes:
> ...
>> I got around 5x slower. Using numexpr 'dumbly' (i.e. just putting the
>> expression in directly) was slower than the function above, but doing a
>> hybrid between the two approaches worked well:
>>
>> def timefunc_numexpr_smart():
>>  _sin_term = sin(2.0*freq*pi*t)
>>  _exp_term = exp(-dt/tau)
>>  _a_term = (_sin_term-_sin_term*_exp_term)
>>  _const_term = -b*_exp_term + b
>>  v[:] = numexpr.evaluate('a*_a_term+v*_exp_term+_const_term')
>>  #numexpr.evaluate('a*_a_term+v*_exp_term+_const_term', out=v)
>>
>> This was about 3.5x slower than weave. If I used the commented out final
>> line then it was only 1.5x slower than weave, but it also gives wrong
>> results. I reported this as a bug in numexpr a long time ago but I guess it
>> hasn't been fixed yet (or maybe I didn't upgrade my version recently).
> I just upgraded numexpr to 2.2 where they did fix this bug, and now the
> 'smart' numexpr version runs exactly as fast as weave (so I guess there were
> some performance enhancements in numexpr as well).

Err no, there have not been performance improvements in numexpr since 
2.0 (that I am aware of).  Maybe you are running on a multi-core machine 
now and you are seeing a better speedup because of this?  Also, your 
expressions are made of transcendental functions, so linking numexpr 
with MKL could accelerate computations a good deal too.

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Catching out-of-memory error before it happens

2014-01-24 Thread Francesc Alted

Yeah, numexpr is pretty cool for avoiding temporaries in an easy way:

https://github.com/pydata/numexpr

Francesc

On 24/01/14 16:30, Nathaniel Smith wrote:


There is no reliable way to predict how much memory an arbitrary numpy 
operation will need, no. However, in most cases the main memory cost 
will be simply the need to store the input and output arrays; for 
large arrays, all other allocations should be negligible.


The most effective way to avoid running out of memory, therefore, is 
to avoid creating temporary arrays, by using only in-place operations.


E.g., if a and b each require N bytes of ram, then memory requirements 
(roughly).


c = a + b: 3N
c = a + 2*b: 4N
a += b: 2N
np.add(a, b, out=a): 2N
b *= 2; a += b: 2N

Note that simply loading a and b requires 2N memory, so the latter 
code samples are near-optimal.


Of course some calculations do require the use of temporary storage 
space...


-n

On 24 Jan 2014 15:19, "Dinesh Vadhia"  wrote:


I want to write a general exception handler to warn if too much
data is being loaded for the ram size in a machine for a
successful numpy array operation to take place.  For example, the
program multiplies two floating point arrays A and B which are
populated with loadtext.  While the data is being loaded, want to
continuously check that the data volume doesn't pass a threshold
that will cause on out-of-memory error during the A*B operation.
The known variables are the amount of memory available, data type
(floats in this case) and the numpy array operation to be
performed. It seems this requires knowledge of the internal memory
requirements of each numpy operation.  For sake of simplicity, can
ignore other memory needs of program.  Is this possible?

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion



--
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: numexpr 2.3 (final) released

2014-01-25 Thread Francesc Alted
==
  Announcing Numexpr 2.3
==

Numexpr is a fast numerical expression evaluator for NumPy.  With it,
expressions that operate on arrays (like "3*a+4*b") are accelerated
and use less memory than doing the same calculation in Python.

It comes with multi-threaded capabilities, as well as support for Intel's
MKL (Math Kernel Library), which allows an extremely fast evaluation
of transcendental functions (sin, cos, tan, exp, log...) while
squeezing the last drop of performance out of your multi-core
processors.  Look here for some benchmarks of numexpr using MKL:

https://github.com/pydata/numexpr/wiki/NumexprMKL

Its only dependency is NumPy (MKL is optional), so it works well as an
easy-to-deploy, easy-to-use computational engine for projects that
don't want to adopt other solutions requiring heavier dependencies.
Numexpr is already being used in a series of packages (PyTables, pandas,
BLZ...) to help do computations faster.


What's new
==

The repository has been migrated to https://github.com/pydata/numexpr.
All new tickets and PRs should be directed there.

Also, a `conj()` function for computing the conjugate of complex arrays 
has been added.
Thanks to David Menéndez.  See PR #125.

Finally, we fixed a DeprecationWarning derived from using ``oa_ndim ==
0`` and ``op_axes == NULL`` with `NpyIter_AdvancedNew()` and
NumPy 1.8.  Thanks to Mark Wiebe for advice on how to fix this
properly.

Many thanks to Christoph Gohlke and Ilan Schnell for their help during
the testing of this release in all kinds of possible combinations of
platforms and MKL.

In case you want to know more in detail what has changed in this
version, see:

https://github.com/pydata/numexpr/wiki/Release-Notes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=

The project is hosted at GitHub in:

https://github.com/pydata/numexpr

You can get the packages from PyPI as well (but not for RC releases):

http://pypi.python.org/pypi/numexpr

Share your experience
=

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.


Enjoy data!

-- Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: python-blosc 1.2.0 released

2014-01-25 Thread Francesc Alted
The sources are managed through github services at:

http://github.com/ContinuumIO/python-blosc


Documentation
=

There is Sphinx-based documentation site at:

http://blosc.pydata.org/


Mailing list


There is an official mailing list for Blosc at:

bl...@googlegroups.com
http://groups.google.es/group/blosc


Licenses


Both Blosc and its Python wrapper are distributed using the MIT license.
See:

https://github.com/ContinuumIO/python-blosc/blob/master/LICENSES

for more details.

--
Francesc Alted
Continuum Analytics, Inc.


-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: BLZ 0.6.1 has been released

2014-01-25 Thread Francesc Alted
Announcing BLZ 0.6 series
=

What it is
--

BLZ is a chunked container for numerical data.  Chunking allows for
efficient enlarging/shrinking of the data container.  In addition, it can
also be compressed for reducing memory/disk needs.  The compression
process is carried out internally by Blosc, a high-performance
compressor that is optimized for binary data.

The main objects in BLZ are `barray` and `btable`.  `barray` is meant
for storing multidimensional homogeneous datasets efficiently.
`barray` objects provide the foundations for building `btable`
objects, where each column is made of a single `barray`.  Facilities
are provided for iterating, filtering and querying `btables` in an
efficient way.  You can find more info about `barray` and `btable` in
the tutorial:

http://blz.pydata.org/blz-manual/tutorial.html

BLZ can use numexpr internally so as to accelerate many vector and
query operations (although it can use pure NumPy for doing so too)
either from memory or from disk.  In the future, it is planned to use
Numba as the computational kernel and to provide better Blaze
(http://blaze.pydata.org) integration.


What's new
--

BLZ has been branched off from the Blaze project
(http://blaze.pydata.org).  BLZ was meant as a persistent format and
library for I/O in Blaze.  BLZ in Blaze is based on previous carray
0.5 and this is why this new version is labeled 0.6.

BLZ supports completely transparent storage on-disk in addition to
memory.  That means that *everything* that can be done with the
in-memory container can be done using the disk as well.

The advantage of a disk-based container is that the addressable space
is much larger than just your available memory.  Also, as BLZ is based
on a chunked and compressed data layout based on the super-fast Blosc
compression library, the data access speed is very good.

The format chosen for the persistence layer is based on the
'bloscpack' library and described in the "Persistent format for BLZ"
chapter of the user manual ('docs/source/persistence-format.rst').
More about Bloscpack here: https://github.com/esc/bloscpack

You may want to know more about BLZ in this blog entry:
http://continuum.io/blog/blz-format

In this version, support for Blosc 1.3 has been added, meaning that a
new `cname` parameter has been added to the `bparams` class, so that
you can select your preferred compressor from 'blosclz', 'lz4',
'lz4hc', 'snappy' and 'zlib'.
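
For instance, picking LZ4 would look something like this (a sketch;
apart from `cname`, the keyword names here are assumptions):

import numpy as np
import blz

bp = blz.bparams(clevel=5, cname='lz4')              # compression level and codec
b = blz.barray(np.random.rand(1000000), bparams=bp)
print(b.nbytes, b.cbytes)                            # uncompressed vs. compressed size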

Also, many bugs have been fixed, providing a much smoother experience.

CAVEAT: The BLZ/bloscpack format is still evolving, so don't rely on
forward compatibility of the format, at least until 1.0, when the
internal format will be declared frozen.


Resources
---------

Visit the main BLZ site repository at:
http://github.com/ContinuumIO/blz

Read the online docs at:
http://blz.pydata.org/blz-manual/index.html

Home of Blosc compressor:
http://www.blosc.org

Users' mailing list:
blaze-...@continuum.io



Enjoy!

Francesc Alted
Continuum Analytics, Inc.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ANN: numexpr 2.3 (final) released

2014-01-27 Thread Francesc Alted
Not really.  numexpr is mostly about element-wise operations in dense 
matrices.  You should look to another package for that.

Francesc

On 1/27/14, 10:18 AM, Dinesh Vadhia wrote:
> Francesc: Does numexpr support scipy sparse matrices?
>   
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion


-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] argsort speed

2014-02-17 Thread Francesc Alted
On 2/17/14, 1:08 AM, josef.p...@gmail.com wrote:
> On Sun, Feb 16, 2014 at 6:12 PM, Daπid  wrote:
>> On 16 February 2014 23:43,  wrote:
>>> What's the fastest argsort for a 1d array with around 28 Million
>>> elements, roughly uniformly distributed, random order?
>>
>> On numpy latest version:
>>
>> for kind in ['quicksort', 'mergesort', 'heapsort']:
>>  print kind
>>  %timeit np.sort(data, kind=kind)
>>  %timeit np.argsort(data, kind=kind)
>>
>>
>> quicksort
>> 1 loops, best of 3: 3.55 s per loop
>> 1 loops, best of 3: 10.3 s per loop
>> mergesort
>> 1 loops, best of 3: 4.84 s per loop
>> 1 loops, best of 3: 9.49 s per loop
>> heapsort
>> 1 loops, best of 3: 12.1 s per loop
>> 1 loops, best of 3: 39.3 s per loop
>>
>>
>> It looks like quicksort is quicker at sorting, but mergesort is marginally faster
>> at sorting args. The difference is slim, but upon repetition, it remains
>> significant.
>>
>> Why is that? Probably part of the reason is what Eelco said, and part is
>> that in sort, comparisons are done by accessing the array elements directly, but
>> in argsort you have to index the array, introducing some overhead.
> Thanks, both.
>
> I also gain a second with mergesort.
>
> matlab would be nicer in my case, it returns both.
> I still need to use the argsort to index into the array to also get
> the sorted array.

Many years ago I needed something similar, so I made some functions for
sorting and argsorting in one single shot.  Maybe you want to reuse
them.  Here is an example of the C implementation:

https://github.com/PyTables/PyTables/blob/develop/src/idx-opt.c#L619

and here is the Cython wrapper for all of them:

https://github.com/PyTables/PyTables/blob/develop/tables/indexesextension.pyx#L129
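
For completeness, the pure-NumPy workaround that Josef describes (one
argsort followed by a gather) looks like this:

import numpy as np

data = np.random.rand(28 * 10**6)            # ~28 million elements, random order
idx = np.argsort(data, kind='mergesort')     # mergesort was marginally faster here
sorted_data = data[idx]                      # or np.take(data, idx)
# Both the permutation (idx) and the sorted array are now available, at the
# cost of one extra indexing pass compared to a fused sort+argsort.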

Francesc

>
> Josef
>
>
>> I seem unable to find the code for ndarray.sort, so I can't check. I have
>> tried to grep it trying all possible combinations of "def ndarray",
>> "self.sort", etc. Where is it?
>>
>>
>> /David.
>>
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion


-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: numexpr 2.3.1 released

2014-02-18 Thread Francesc Alted
==========================
  Announcing Numexpr 2.3.1
==========================

Numexpr is a fast numerical expression evaluator for NumPy.  With it,
expressions that operate on arrays (like "3*a+4*b") are accelerated
and use less memory than doing the same calculation in Python.

It sports multi-threaded capabilities, as well as support for Intel's
MKL (Math Kernel Library), which allows an extremely fast evaluation
of transcendental functions (sin, cos, tan, exp, log...) while
squeezing the last drop of performance out of your multi-core
processors.  Look here for some benchmarks of numexpr using MKL:

https://github.com/pydata/numexpr/wiki/NumexprMKL

Its only dependency is NumPy (MKL is optional), so it works well as an
easy-to-deploy, easy-to-use computational engine for projects that
don't want to adopt other solutions requiring heavier dependencies.

What's new
==========

* Added support for shift-left (<<) and shift-right (>>) binary operators
   (see the short example after this list).  See PR #131. Thanks to fish2000!

* Removed the rpath flag for the GCC linker, because it is probably
   not necessary and it makes clang choke.
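
A quick sketch of the new shift operators in use (integer operands):

import numpy as np
import numexpr as ne

a = np.arange(10, dtype=np.int64)
print(ne.evaluate("a << 2"))   # same result as a << 2 in plain NumPy (i.e. a * 4)
print(ne.evaluate("a >> 1"))   # same as a >> 1 (i.e. a // 2)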

In case you want to know in more detail what has changed in this
version, see:

https://github.com/pydata/numexpr/wiki/Release-Notes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=========================

The project is hosted at GitHub in:

https://github.com/pydata/numexpr

You can get the packages from PyPI as well (but not for RC releases):

http://pypi.python.org/pypi/numexpr

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.


Enjoy data!

-- Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] last call for fixes for numpy 1.8.1rc1

2014-02-28 Thread Francesc Alted
Hi Julian,

Any chance that NPY_MAXARGS could be increased to something more than 
the current value of 32?  There is a discussion about this in:

https://github.com/numpy/numpy/pull/226

but I think that, as Charles was suggesting, just increasing NPY_MAXARGS 
to something more reasonable (say 256) should be enough for a long while.

This issue quite severely limits the number of operands in numexpr
expressions, and hence affects other projects that depend on it, like
PyTables or pandas.  See for example this bug report:

https://github.com/PyTables/PyTables/issues/286
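
Just to make the limitation concrete, the kind of expression that
currently gets rejected looks like this (a sketch; the exact exception
numexpr raises may differ):

import numpy as np
import numexpr as ne

names = ['a%d' % i for i in range(40)]                # more operands than NPY_MAXARGS (32)
operands = dict((name, np.ones(1000)) for name in names)
try:
    ne.evaluate(' + '.join(names), local_dict=operands)
except Exception as e:                                # numexpr refuses too many inputs
    print('rejected:', e)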

Thanks,
Francesc

On 2/27/14, 9:05 PM, Julian Taylor wrote:
> hi,
>
> We want to start preparing the release candidate for the bugfix release
> 1.8.1rc1 this weekend, I'll start preparing the changelog tomorrow.
>
> So if you want a certain issue fixed please scream now or better create
> a pull request/patch on the maintenance/1.8.x branch.
> Please only consider bugfixes, no enhancements (unless they are really
> really simple), new features or invasive changes.
>
> I just finished my list of issues I want backported to numpy 1.8
> (gh-4390, gh-4388). Please check if its already included in these PRs.
> I'm probably still going to add gh-4284 after some though tomorrow.
>
> Cheers,
> Julian
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion


-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] last call for fixes for numpy 1.8.1rc1

2014-02-28 Thread Francesc Alted
Well, what numexpr is using is basically NpyIter_AdvancedNew:

https://github.com/pydata/numexpr/blob/master/numexpr/interpreter.cpp#L1178

and nothing else.  If NPY_MAXARGS could be increased just for that,
without breaking the ABI, then fine.  If not, we will have to wait until
1.9, I am afraid.

On the other hand, increasing the temporary arrays in nditer from 32kb 
to 128kb is a bit worrying, but probably we should do some benchmarks 
and see how much performance would be compromised (if any).

Francesc

On 2/28/14, 1:09 PM, Julian Taylor wrote:
> hm increasing it for PyArrayMapIterObject would break the public ABI.
> While nobody should be using this part of the ABI its not appropriate
> for a bugfix release.
> Note that as it currently stands in numpy 1.9.dev we will break this ABI
> for the indexing improvements.
>
> Though for nditer and some other functions we could change it if thats
> enough.
> It would bump some temporary arrays of nditer from 32kb to 128kb, I
> think that would still be fine, but getting to the point where we should
> move them onto the heap.
>
> On 28.02.2014 12:41, Francesc Alted wrote:
>> Hi Julian,
>>
>> Any chance that NPY_MAXARGS could be increased to something more than
>> the current value of 32?  There is a discussion about this in:
>>
>> https://github.com/numpy/numpy/pull/226
>>
>> but I think that, as Charles was suggesting, just increasing NPY_MAXARGS
>> to something more reasonable (say 256) should be enough for a long while.
>>
>> This issue quite severely limits the number of operands in numexpr
>> expressions, and hence affects other projects that depend on it, like
>> PyTables or pandas.  See for example this bug report:
>>
>> https://github.com/PyTables/PyTables/issues/286
>>
>> Thanks,
>> Francesc
>>
>> On 2/27/14, 9:05 PM, Julian Taylor wrote:
>>> hi,
>>>
>>> We want to start preparing the release candidate for the bugfix release
>>> 1.8.1rc1 this weekend, I'll start preparing the changelog tomorrow.
>>>
>>> So if you want a certain issue fixed please scream now or better create
>>> a pull request/patch on the maintenance/1.8.x branch.
>>> Please only consider bugfixes, no enhancements (unless they are really
>>> really simple), new features or invasive changes.
>>>
>>> I just finished my list of issues I want backported to numpy 1.8
>>> (gh-4390, gh-4388). Please check if its already included in these PRs.
>>> I'm probably still going to add gh-4284 after some though tomorrow.
>>>
>>> Cheers,
>>> Julian
>>> ___
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion


-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] last call for fixes for numpy 1.8.1rc1

2014-02-28 Thread Francesc Alted
On 2/28/14, 3:00 PM, Charles R Harris wrote:
>
>
>
> On Fri, Feb 28, 2014 at 5:52 AM, Julian Taylor
> <jtaylor.deb...@googlemail.com> wrote:
>
> performance should not be impacted as long as we stay on the stack, it
> just increases offset of a stack pointer a bit more.
> E.g. nditer and einsum use temporary stack arrays of this type for its
> initialization:
> op_axes_arrays[NPY_MAXARGS][NPY_MAXDIMS]; // both 32 currently
> The resulting nditer structure is then in heap space and dependent on
> the real amount of arguments it got.
> So I'm more worried about running out of stack space, though the limit
> is usually 8mb so taking 128kb for a short while should be ok.
>
> On 28.02.2014 13:32, Francesc Alted wrote:
> > Well, what numexpr is using is basically NpyIter_AdvancedNew:
> >
> >
> 
> https://github.com/pydata/numexpr/blob/master/numexpr/interpreter.cpp#L1178
> >
> > and nothing else.  If NPY_MAXARGS could be increased just for
> that, and
> > without ABI breaking, then fine.  If not, we should have to wait
> until
> > 1.9 I am afraid.
> >
> > On the other hand, increasing the temporary arrays in nditer
> from 32kb
> > to 128kb is a bit worrying, but probably we should do some
> benchmarks
> > and see how much performance would be compromised (if any).
> >
> > Francesc
> >
> > On 2/28/14, 1:09 PM, Julian Taylor wrote:
> >> hm increasing it for PyArrayMapIterObject would break the
> public ABI.
> >> While nobody should be using this part of the ABI its not
> appropriate
> >> for a bugfix release.
> >> Note that as it currently stands in numpy 1.9.dev we will break
> this ABI
> >> for the indexing improvements.
> >>
> >> Though for nditer and some other functions we could change it
> if thats
> >> enough.
> >> It would bump some temporary arrays of nditer from 32kb to 128kb, I
> >> think that would still be fine, but getting to the point where
> we should
> >> move them onto the heap.
>
>
> These sort of changes can have subtle side effects and need lots of 
> testing in a release cycle. Bugfix release cycles are kept short by 
> restricting changes to those that look simple and safe.

Agreed.  I have just opened a ticket for having this in mind for NumPy 1.9:

https://github.com/numpy/numpy/issues/4398

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: numexpr 2.4 RC1

2014-04-06 Thread Francesc Alted
============================
  Announcing Numexpr 2.4 RC1
============================

Numexpr is a fast numerical expression evaluator for NumPy.  With it,
expressions that operate on arrays (like "3*a+4*b") are accelerated
and use less memory than doing the same calculation in Python.

It sports multi-threaded capabilities, as well as support for Intel's
MKL (Math Kernel Library), which allows an extremely fast evaluation
of transcendental functions (sin, cos, tan, exp, log...) while
squeezing the last drop of performance out of your multi-core
processors.  Look here for some benchmarks of numexpr using MKL:

https://github.com/pydata/numexpr/wiki/NumexprMKL

Its only dependency is NumPy (MKL is optional), so it works well as an
easy-to-deploy, easy-to-use computational engine for projects that
don't want to adopt other solutions requiring heavier dependencies.

What's new
==========

A new `contains()` function has been added for detecting substrings in
strings.  Thanks to Marcin Krol.

Also, there is a new version of setup.py that allows better management
of the NumPy dependency during pip installs.  Thanks to Aleks Bunin.

This is the first release candidate before 2.4 final would be out,
so please give it a go and report back any problems you may have.

In case you want to know in more detail what has changed in this
version, see:

https://github.com/pydata/numexpr/wiki/Release-Notes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=========================

The project is hosted at GitHub in:

https://github.com/pydata/numexpr

You can get the packages from PyPI as well (but not for RC releases):

http://pypi.python.org/pypi/numexpr

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.


Enjoy data!

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: numexpr 2.4 RC2

2014-04-07 Thread Francesc Alted
============================
  Announcing Numexpr 2.4 RC2
============================

Numexpr is a fast numerical expression evaluator for NumPy.  With it,
expressions that operate on arrays (like "3*a+4*b") are accelerated
and use less memory than doing the same calculation in Python.

It sports multi-threaded capabilities, as well as support for Intel's
MKL (Math Kernel Library), which allows an extremely fast evaluation
of transcendental functions (sin, cos, tan, exp, log...) while
squeezing the last drop of performance out of your multi-core
processors.  Look here for some benchmarks of numexpr using MKL:

https://github.com/pydata/numexpr/wiki/NumexprMKL

Its only dependency is NumPy (MKL is optional), so it works well as an
easy-to-deploy, easy-to-use computational engine for projects that
don't want to adopt other solutions requiring heavier dependencies.

What's new
==========

A new `contains()` function has been added for detecting substrings in
strings.  Only plain strings (bytes) are supported for now (see ticket
#142).  Thanks to Marcin Krol.

Also, there is a new version of setup.py that allows better management
of the NumPy dependency during pip installs.  Thanks to Aleks Bunin.

Windows related bugs have been addressed and (hopefully) squashed.
Thanks to Christoph Gohlke.

In case you want to know in more detail what has changed in this
version, see:

https://github.com/pydata/numexpr/wiki/Release-Notes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=========================

The project is hosted at GitHub in:

https://github.com/pydata/numexpr

You can get the packages from PyPI as well (but not for RC releases):

http://pypi.python.org/pypi/numexpr

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.


Enjoy data!

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PEP 465 has been accepted / volunteers needed

2014-04-10 Thread Francesc Alted
On 4/9/14, 10:46 PM, Chris Barker wrote:
> On Tue, Apr 8, 2014 at 11:14 AM, Nathaniel Smith <n...@pobox.com> wrote:
>
> Thank you! Though I suspect that the most important part of my
> contribution may have just been my high tolerance for writing emails
> ;-).
>
>
> no -- it's your high tolerance for _reading_ emails...
>
> Far too many of us have a high tolerance for writing them!

Ha ha, very true!

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: numexpr 2.4 is out

2014-04-13 Thread Francesc Alted
========================
  Announcing Numexpr 2.4
========================

Numexpr is a fast numerical expression evaluator for NumPy.  With it,
expressions that operate on arrays (like "3*a+4*b") are accelerated
and use less memory than doing the same calculation in Python.

It sports multi-threaded capabilities, as well as support for Intel's
MKL (Math Kernel Library), which allows an extremely fast evaluation
of transcendental functions (sin, cos, tan, exp, log...) while
squeezing the last drop of performance out of your multi-core
processors.  Look here for some benchmarks of numexpr using MKL:

https://github.com/pydata/numexpr/wiki/NumexprMKL

Its only dependency is NumPy (MKL is optional), so it works well as an
easy-to-deploy, easy-to-use computational engine for projects that
don't want to adopt other solutions requiring heavier dependencies.

What's new
==========

A new `contains()` function has been added for detecting substrings in
strings.  Only plain strings (bytes) are supported for now (see ticket
#142).  Thanks to Marcin Krol.

You can have a glimpse on how `contains()` works in this notebook:

http://nbviewer.ipython.org/gist/FrancescAlted/10595974

where it can be seen that this can make substring searches more
than 10x faster than with regular Python.

You can find the source for the notebook here:

https://github.com/FrancescAlted/ngrams
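
A minimal sketch of the kind of call shown there (variable names are
made up, and remember that only bytes strings work for now):

import numpy as np
import numexpr as ne

lines = np.array([b'the quick brown fox', b'jumps over', b'the lazy dog'])
needle = b'the'
mask = ne.evaluate("contains(lines, needle)")   # boolean array, one entry per line
print(lines[mask])                              # only the lines containing b'the'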

Also, there is a new version of setup.py that allows better management
of the NumPy dependency during pip installs.  Thanks to Aleks Bunin.

Windows related bugs have been addressed and (hopefully) squashed.
Thanks to Christoph Gohlke.

In case you want to know in more detail what has changed in this
version, see:

https://github.com/pydata/numexpr/wiki/Release-Notes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=========================

The project is hosted at GitHub in:

https://github.com/pydata/numexpr

You can get the packages from PyPI as well (but not for RC releases):

http://pypi.python.org/pypi/numexpr

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.


Enjoy data!

-- 
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

2014-04-17 Thread Francesc Alted
Uh, 15x slower for unaligned access is quite a lot.  But Intel (and AMD) 
architectures are much more tolerant in this aspect (and improving).
For example, with a Xeon(R) CPU E5-2670 (2 years old) I get:


In [1]: import numpy as np

In [2]: shape = (1, 1)

In [3]: x_aligned = np.zeros(shape, 
dtype=[('x',np.float64),('y',np.int64)])['x']


In [4]: x_unaligned = np.zeros(shape, 
dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']


In [5]: %timeit res = x_aligned ** 2
1 loops, best of 3: 289 ms per loop

In [6]: %timeit res = x_unaligned ** 2
1 loops, best of 3: 664 ms per loop

so the added cost in this case is just a bit more than 2x.  But you can 
also alleviate this overhead if you do a copy that fits in cache prior 
to doing the computations.  numexpr does this:


https://github.com/pydata/numexpr/blob/master/numexpr/interp_body.cpp#L203

and the results are pretty good:

In [8]: import numexpr as ne

In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
10 loops, best of 3: 133 ms per loop

In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
10 loops, best of 3: 134 ms per loop

i.e. there is not a significant difference between aligned and unaligned 
access to data.


I wonder if the same technique could be applied to NumPy.
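
In pure NumPy the idea would look something like the sketch below (1-D
arrays assumed; this only illustrates the strategy, it is not what
numexpr actually does internally in C):

import numpy as np

def square_blocked(x, block=16 * 1024):
    """Square a 1-D array by copying cache-sized blocks into a freshly
    allocated (hence well-aligned) scratch buffer before computing."""
    out = np.empty(x.shape, dtype=x.dtype)
    scratch = np.empty(block, dtype=x.dtype)
    for i in range(0, x.size, block):
        n = min(block, x.size - i)
        scratch[:n] = x[i:i + n]           # the only pass touching unaligned data
        out[i:i + n] = scratch[:n] ** 2    # compute on aligned, cache-resident data
    return out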

Francesc


On 17/04/14 16:26, Aron Ahmadia wrote:

Hmnn, I wasn't being clear :)

The default malloc on BlueGene/Q only returns 8 byte alignment, but 
the SIMD units need 32-byte alignment for loads, stores, and 
operations or performance suffers.  On the /P the required alignment 
was 16-bytes, but malloc only gave you 8, and trying to perform 
vectorized loads/stores generated alignment exceptions on unaligned 
memory.


See https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q and 
https://computing.llnl.gov/tutorials/bgp/BGP-usage.Walkup.pdf (slides 
14 for overview, 15 for the effective performance difference between 
the unaligned/aligned code) for some notes on this.


A




On Thu, Apr 17, 2014 at 10:18 AM, Nathaniel Smith <n...@pobox.com> wrote:


On 17 Apr 2014 15:09, "Aron Ahmadia" <a...@ahmadia.net> wrote:
>
> > On the one hand it would be nice to actually know whether
posix_memalign is important, before making api decisions on this
basis.
>
> FWIW: On the lightweight IBM cores that the extremely popular
BlueGene machines were based on, accessing unaligned memory raised
system faults.  The default behavior of these machines was to
terminate the program if more than 1000 such errors occurred on a
given process, and an environment variable allowed you to
terminate the program if *any* unaligned memory access occurred.
 This is because unaligned memory accesses were 15x (or more)
slower than aligned memory access.
>
> The newer /Q chips seem to be a little more forgiving of this,
but I think one can in general expect allocated memory alignment
to be an important performance technique for future high
performance computing architectures.

Right, this much is true on lots of architectures, and so malloc
is careful to always return values with sufficient alignment (e.g.
8 bytes) to make sure that any standard operation can succeed.

The question here is whether it will be important to have *even
more* alignment than malloc gives us by default. A 16 or 32 byte
wide SIMD instruction might prefer that data have 16 or 32 byte
alignment, even if normal memory access for the types being
operated on only requires 4 or 8 byte alignment.

-n


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion




___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion



--
Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

2014-04-17 Thread Francesc Alted
On 17/04/14 19:28, Julian Taylor wrote:
> On 17.04.2014 18:06, Francesc Alted wrote:
>
>> In [4]: x_unaligned = np.zeros(shape,
>> dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
> on arrays of this size you won't see alignment issues you are dominated
> by memory bandwidth. If at all you will only see it if the data fits
> into the cache.
> Its also about unaligned to simd vectors not unaligned to basic types.
> But it doesn't matter anymore on modern x86 cpus. I guess for array data
> cache line splits should also not be a big concern.

Yes, that was my point: on x86 CPUs this is not such a big
problem.  But still, a factor of 2 is significant, even for CPU-intensive
tasks.  For example, computing sin() is affected similarly (sin() is 
using SIMD, right?):

In [6]: shape = (1, 1)

In [7]: x_aligned = np.zeros(shape, 
dtype=[('x',np.float64),('y',np.int64)])['x']

In [8]: x_unaligned = np.zeros(shape, 
dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']

In [9]: %timeit res = np.sin(x_aligned)
1 loops, best of 3: 654 ms per loop

In [10]: %timeit res = np.sin(x_unaligned)
1 loops, best of 3: 1.08 s per loop

and again, numexpr can deal with that pretty well (using 8 physical 
cores here):

In [6]: %timeit res = ne.evaluate('sin(x_aligned)')
10 loops, best of 3: 149 ms per loop

In [7]: %timeit res = ne.evaluate('sin(x_unaligned)')
10 loops, best of 3: 151 ms per loop


> Aligned allocators are not the only allocator which might be useful in
> numpy. Modern CPUs also support larger pages than 4K (huge pages up to
> 1GB in size) which reduces TLB cache misses. Memory of this type
> typically needs to be allocated with special mmap flags, though newer
> kernel versions can now also provide this memory to transparent
> anonymous pages (normal non-file mmaps).

That's interesting.  In which scenarios do you think that could improve 
performance?

>> In [8]: import numexpr as ne
>>
>> In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
>> 10 loops, best of 3: 133 ms per loop
>>
>> In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
>> 10 loops, best of 3: 134 ms per loop
>>
>> i.e. there is not a significant difference between aligned and unaligned
>> access to data.
>>
>> I wonder if the same technique could be applied to NumPy.
>
> you already can do so with relatively simple means:
> http://nbviewer.ipython.org/gist/anonymous/10942132
>
> If you change the blocking function to get a function as input and use
> inplace operations numpy can even beat numexpr (though I used the
> numexpr Ubuntu package which might not be compiled optimally)
> This type of transformation can probably be applied on the AST quite easily.

That's smart.  Yeah, I don't see a reason why numexpr would be 
performing badly on Ubuntu.  But I am not getting your performance for 
blocked_thread on my AMI linux vbox:

http://nbviewer.ipython.org/gist/anonymous/11000524

oh well, threads are always tricky beasts.  By the way, apparently the 
optimal block size for my machine is something like 1 MB, not 128 KB, 
although the difference is not big:

http://nbviewer.ipython.org/gist/anonymous/11002751

(thanks to Stefan Van der Walt for the script).

-- Francesc Alted
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

