[Numpy-discussion] Polynomial Improvements

2021-04-20 Thread Robert
Hi, 

I am new to contributing to open source projects and not sure where to begin
(maybe emailing this list?). In any case, the main improvement I would
like to work on would be adding multivariate polynomials/differentials to
numpy. I would love any insight into the current status and intentions of
numpy's polynomial implementations. 

1) Why do we have np.polynomial and np.lib.polynomial? Which one is older?
Is there a desire to see these merged?
2) Why does np.polynomial.Polynomial have a domain and window? They appear
to have no implementation (except for the Chebyshev interpolation) as far as I
can tell. What is their intended use? Why is there no variable
representation implemented, like there is for np.poly1d, and why is the class
not installed into numpy directly with @set_module('numpy')?
3) How many improvements are allowed to be done all at once? Is there an
expectation of waiting before improving implementations or code? 

Thinking further down the road, how much is too much to add to numpy? I
would like to see multivariate polynomials implemented, but after that it would
be nice to have a collection of known polynomials like standard, Pochhammer,
cyclotomic, Hermite, Laguerre, Legendre, Chebyshev, and Jacobi. After that it
would be of interest to extend .diff() to really be a differential operator
(implemented as sequences of polynomials) and to allow differential
constructions. Then things like this would be possible:

x = np.poly1d([1,0])
D = np.diff1d([1,0])
assert Hermite(78) == (2*x - D)(Hermite(77))

Or if series were implemented then we would see things like: 

assert Hermite(189) == (-1)**189 * Exp(x**2)*(D**189)(Exp(-x**2))
assert Hermite(189) == Exp(-D**2)(x**189)(2*x)

4) At what point does numpy polynomial material belong in scipy, and
vice versa? Is numpy considered the backend to scipy, or vice versa? At times
it seems like things are implemented twice, in both numpy and scipy; in
general, should we not implement things in numpy that already exist in scipy? 

Thanks, 
Robert





Re: [Numpy-discussion] Polynomial Improvements

2021-04-20 Thread Robert
I had not looked into sympy that closely, thinking it was mostly a symbolic
package. However, there appear to be functions that convert back to numpy
expressions so that np.ndarrays and such can work. There also appear to be
extensive polynomial classes already defined. 

Thanks for pointing me in the right direction. 





[Numpy-discussion] Re: Arbitrarily large random integers

2023-08-19 Thread Robert Kern
On Sat, Aug 19, 2023 at 10:49 AM Kevin Sheppard 
wrote:

> The easiest way to do this would be to write a pure python implementation
> using Python ints of a masked integer sampler.  This way you could draw
> unsigned integers and then treat this as a bit pool.  You would then take
> the number of bits needed for your integer, transform these to be a Python
> int, and finally apply the mask.
>

Indeed, that's how `random.Random` does it. I've commented on the issue
with an implementation that subclasses `random.Random` to use numpy PRNGs
as the source of bits for maximum compatibility with `Random`. The given
use case motivating this feature request is networkx, which manually wraps
numpy PRNGs in a class that incompletely mimics the `Random` interface. A
true subclass eliminates all of the remaining inconsistencies between the
two. I'm inclined to leave it at that and not extend the `Generator`
interface.

https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258
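
For concreteness, a minimal sketch of that idea (this is not the exact
implementation posted on the issue; it just assembles a bit pool from a NumPy
`Generator`):

import random
import numpy as np

class NumpyRandom(random.Random):
    """random.Random that draws its bits from a NumPy BitGenerator."""

    def __init__(self, seed=None):
        self._rng = np.random.default_rng(seed)
        super().__init__()

    def getrandbits(self, k):
        # Draw enough whole bytes, then shift off the excess bits.
        nbytes = (k + 7) // 8
        val = int.from_bytes(self._rng.bytes(nbytes), "big")
        return val >> (nbytes * 8 - k)

    def random(self):
        return self._rng.random()

    def seed(self, *args, **kwargs):
        # State management is delegated to the BitGenerator.
        pass

With that, `NumpyRandom().randrange(10**30)` uses the NumPy stream underneath
while `random.Random` supplies the arbitrary-sized integer logic.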

-- 
Robert Kern


[Numpy-discussion] Re: Inquiry about the difference between numpy.trunc() and numpy.fix()

2023-08-23 Thread Robert Kern
On Wed, Aug 23, 2023 at 10:06 PM Bao Guo  wrote:

> I am writing to seek clarification on the specific differences between the
> numpy.trunc() and numpy.fix() functions.
>
> I have come across opinions stating that these two functions exist for
> backward compatibility and that they serve different functional
> requirements. However, I am puzzled as to why these functions could not be
> merged or one of them upgraded to encompass both functionalities. Despite
> my efforts in researching and contemplating on this matter, I have not come
> across a convincing explanation.
>
> Could anyone kindly shed some light on this matter? I am seeking a deeper
> understanding of the design choices made for these functions and the
> rationale behind their coexistence.
>
> Thank you in advance for your time and assistance.


`fix()` was added to numpy back in the super-early days when it was
emerging from Numeric, its earlier predecessor. It was named after the
equivalent MATLAB function.

At some point thereafter, we settled on a policy for inclusion of
mathematical ufuncs into numpy. We got tired of arguing whether a special
function was "important enough" to be in numpy instead of scipy.special.
Eventually, we settled on a bright line that if a math function was in the
C99 standard, we'd go ahead and allow a ufunc for it in numpy. `trunc()` is
the name of that equivalent function in the C99 standard. It is a ufunc
instead of a plain function like `fix()` and has a (strict, I believe)
superset of functionality. You can ignore `fix()`, more or less. I'm not
sure if it's on the list for deprecation/removal in numpy 2.0, but it
certainly could be.
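
A quick illustration (output shown as I'd expect it from a current numpy; the
difference is in the machinery, not the rounding):

>>> import numpy as np
>>> x = np.array([-1.7, -0.2, 0.2, 1.7])
>>> np.trunc(x)
array([-1., -0.,  0.,  1.])
>>> np.fix(x)
array([-1., -0.,  0.,  1.])
>>> isinstance(np.trunc, np.ufunc), isinstance(np.fix, np.ufunc)
(True, False)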

-- 
Robert Kern


[Numpy-discussion] Re: Find location of slice in its base

2023-08-31 Thread Robert Kern
On Thu, Aug 31, 2023 at 3:25 PM Dom Grigonis  wrote:

> Hi everyone,
>
> I am working with shared arrays and their slices. And trying to subclass
> ndarray in such way so that they remap to memory on unpickling. I am using
> package SharedArray, which doesn’t micro-manage memory locations, but
> rather stores the whole map via shm using unique name. One issue I ran into
> is that when I pickle & unpickle slice of memory mapped array, I need to
> store the spec of the subset so that I can retrieve the same portion.
>
> The issue I am working on has been briefly discussed on stack:
> https://stackoverflow.com/questions/12421770/find-location-of-slice-in-numpy-array.
> Although I am not sure if it is the exact same one.
>
> So, in short, is there a way to determine location of a slice in original
> array. Ideally, it would be indexing which works for any non-copy
> slice/subset possible.
>

If you chase the `.base` chain all the way down, you get to a
`SharedArray.map_owner` object with its `.addr` attribute giving the memory
address that's assigned to it in the current process. This will be the same
address that's in the `'data'` key of that first `ndarray` returned from
`SharedArray.create()` that you are making your views from. Your view
`ndarray` may be a view of views, with slices, transposes, reshapes,
and arbitrary `stride_tricks` manipulations in between. If you store the
difference between the view `ndarray`'s `'data'` address and the
`map_owner.addr` address, I think that's all you need to recreate the array
in a different process which gets assigned a different memory address for
the same `shm` name. Just add that offset to the new `map_owner.addr` and
restore the rest of the information in `__array_interface__`, and I think
you should be good to go. I don't think you'll need to infer or represent
the precise path of Python-level operations (slices, transposes, reshapes,
etc.) by which it got to that point.
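
A rough sketch of that bookkeeping (untested; it assumes the
`SharedArray.map_owner.addr` attribute described above, and the rebuilt array
does not by itself keep the mapping alive, so hold a reference to the attached
array too):

import numpy as np

def view_spec(view):
    # Chase .base down to the SharedArray.map_owner and record the offset of
    # this view's buffer relative to the mapping's base address.
    base = view
    while isinstance(base, np.ndarray):
        base = base.base
    ai = view.__array_interface__
    return {
        "offset": ai["data"][0] - base.addr,
        "shape": ai["shape"],
        "strides": ai["strides"],   # None for contiguous views
        "typestr": ai["typestr"],
    }

class _FromSpec:
    # Tiny shim exposing __array_interface__ so np.asarray() rebuilds the view.
    def __init__(self, spec, new_base_addr):
        self.__array_interface__ = {
            "version": 3,
            "data": (new_base_addr + spec["offset"], False),
            "shape": spec["shape"],
            "strides": spec["strides"],
            "typestr": spec["typestr"],
        }

# In the unpickling process, after attaching the same shm name:
#   rebuilt = np.asarray(_FromSpec(spec, new_map_owner.addr))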

-- 
Robert Kern


[Numpy-discussion] ANN: NumExpr 2.8.6 Released

2023-09-12 Thread Robert McLeod
Hi everyone,

NumExpr 2.8.6 is a release to deal with issues related to downstream
`pandas`, where the sanitization blacklist was hitting private variables used
in their `evaluate()`. In addition, the sanitization was triggering on
scientific notation.

For those who do not wish to have sanitization on by default, it can be
changed
by setting an environment variable, `NUMEXPR_SANITIZE=0`.

If you use `pandas` in your packages, it is advisable to pin

`numexpr >= 2.8.6`

in your requirements.

Project documentation is available at:

http://numexpr.readthedocs.io/

Changes from 2.8.5 to 2.8.6
---

* The sanitization can be turned off by default by setting an environment
variable,

`set NUMEXPR_SANITIZE=0`

* Improved behavior of the blacklist to avoid triggering on private
variables
  and scientific notation numbers.


What's Numexpr?
---

Numexpr is a fast numerical expression evaluator for NumPy.  With it,
expressions that operate on arrays (like "3*a+4*b") are accelerated
and use less memory than doing the same calculation in Python.
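
For example, a minimal use looks like:

import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
# The whole expression is compiled and evaluated in one multi-threaded pass,
# without materializing the intermediate 3*a and 4*b arrays.
c = ne.evaluate("3*a + 4*b")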

It has multi-threaded capabilities, as well as support for Intel's
MKL (Math Kernel Library), which allows an extremely fast evaluation
of transcendental functions (sin, cos, tan, exp, log...) while
squeezing the last drop of performance out of your multi-core
processors.  Look here for some benchmarks of numexpr using MKL:

https://github.com/pydata/numexpr/wiki/NumexprMKL

Its only dependency is NumPy (MKL is optional), so it works well as an
easy-to-deploy, easy-to-use, computational engine for projects that
don't want to adopt other solutions requiring more heavy dependencies.

Where can I find Numexpr?
-

The project is hosted at GitHub in:

https://github.com/pydata/numexpr

You can get the packages from PyPI as well (but not for RC releases):

http://pypi.python.org/pypi/numexpr

Documentation is hosted at:

http://numexpr.readthedocs.io/en/latest/

Share your experience
-

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.

Enjoy data!

-- 
Robert McLeod
robbmcl...@gmail.com
robert.mcl...@hitachi-hightech.com


[Numpy-discussion] Re: Adding NumpyUnpickler to Numpy 1.26 and future Numpy 2.0

2023-10-11 Thread Robert Kern
On Wed, Oct 11, 2023 at 11:50 PM Aaron Meurer  wrote:

> Is there a way to make pickle not depend on the specific submodule
> that a class is defined in?


No. `Unpickler` somehow has to locate the class/reconstruction function. It
will have to use the name given by the `__reduce_ex__` during pickling.
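
For example (module names as I'd expect them from NumPy 2.x; older releases
record `numpy.core.multiarray` instead):

import pickle
import pickletools
import numpy as np

p = pickle.dumps(np.arange(3))
# The disassembly shows the fully qualified name recorded for the
# reconstruction function, e.g. 'numpy._core.multiarray' / '_reconstruct',
# which is exactly what the unpickler must be able to import later.
pickletools.dis(p)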


> Wouldn't this happen again if you ever
> decided to rename _core?
>

Yes.


> The underscores in numpy._core._reconstruct don't actually do anything
> here in terms of making the interface not public, and if anything, are
> really misleading.
>

There's private and there's private. There are two broad things that could
mean:

1. We don't want users playing around with it, importing it directly in
their code and using it.
2. Marking that we won't mess around with it without going through the
deprecation policy.

Usually, these two go together (hiding it from users directly calling it
means that we get to mess around with it freely), but when we get to
metaprogramming tasks like pickling, they become a little decoupled.
Because the "user" that we have to think about isn't a person reading
documentation and browsing APIs, but a file that someone created years ago
and some infrastructure that interprets that file. I think it's good that
`_reconstruct` is hidden from human users and is distinct from almost all
of what constitutes "numpy's public API", but that doesn't mean that we
can't have a tiny third class of functions where we do preserve some
guarantees. It's only occasionally confusing to us core devs, not users
(who can rely on the "don't touch underscored names" rule just fine).

I'm also curious about this more generally. We tend to think of the
> fully qualified name of a class as being an implementation detail for
> many libraries and only the top-level lib.Name should be used, but
> many things in the language (including this) break this.
>

Yes, and it's why these things attempt to be scoped in ways that limit this
problem. I.e. if you use pickling, you're told to use it only for transient
data with the same versions of libraries on both ends of the pipe, but the
reality is that it's too useful to avoid in creating files with arbitrarily
long lives. Not their fault; they warned us!

-- 
Robert Kern


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Robert Kern
On Fri, Nov 17, 2023 at 1:54 PM Stefan van der Walt via NumPy-Discussion <
numpy-discussion@python.org> wrote:

> Hi all,
>
> I am trying to sample k N-dimensional vectors from a uniform distribution
> without replacement.
> It seems like this should be straightforward, but I can't seem to pin it
> down.
>
> Specifically, I am trying to get random indices in a d0 x d1 x d2 x ... x
> dN-1 array.
>
> I thought about sneaking in a structured dtype into `rng.integers`, but of
> course that doesn't work.
>
> If we had a string sampler, I could sample k unique words (consisting of
> digits), and convert them to indices.
>
> I could over-sample and filter out the non-unique indices. Or iteratively
> draw blocks of samples until I've built up my k unique indices.
>
> The most straightforward solution would be to flatten indices, and to
> sample from those. The integers get large quickly, though. The rng.integers
> docstring suggests that it can handle object arrays for very large integers:
>
> > When using broadcasting with uint64 dtypes, the maximum value (2**64)
> > cannot be represented as a standard integer type.
> > The high array (or low if high is None) must have object dtype, e.g.,
> array([2**64]).
>
> But, that doesn't work:
>
> In [35]: rng.integers(np.array([2**64], dtype=object))
> ValueError: high is out of bounds for int64
>
> Is there an elegant way to handle this problem?
>

The default dtype for the result of `integers()` is the signed `int64`. If
you want to sample from the range `[0, 2**64)`, you need to specify
`dtype=np.uint64`. The text you are reading is saying that if you want to
specify exactly `2**64` as the exclusive upper bound, you won't be able
to do it with a `np.uint64` array/scalar because it's one above the bound
for that dtype, so you'll have to use a plain Python `int` object or
`dtype=object` array in order to represent `2**64`. It is not saying that
you can draw arbitrary-sized integers.

>>> rng.integers(2**64, dtype=np.uint64)
11569248114014186612

If the arrays you are drawing indices for are real in-memory arrays for
present-day 64-bit computers, this should be adequate. If it's a notional
array that is larger, then you'll need actual arbitrary-sized integer
sampling. The builtin `random.randrange()` will do arbitrary-sized integers
and is quite reasonable for this task. If you want it to use our
BitGenerators underneath for clean PRNG state management, this is quite
doable with a simple subclass of `random.Random`:
https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258
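
For the arbitrary-sized case, the unflattening is easy enough to do by hand (a
sketch; the shape is just an example):

import random

shape = (10**7,) * 10              # a notional array with 10**70 cells
n_cells = 1
for d in shape:
    n_cells *= d
flat = random.randrange(n_cells)   # arbitrary-precision Python int
idx = []
for d in reversed(shape):
    flat, r = divmod(flat, d)
    idx.append(r)
idx = tuple(reversed(idx))         # a C-order N-d index into the notional array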

-- 
Robert Kern


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Robert Kern
On Fri, Nov 17, 2023 at 4:15 PM Aaron Meurer  wrote:

> On Fri, Nov 17, 2023 at 12:10 PM Robert Kern 
> wrote:
> >
> > If the arrays you are drawing indices for are real in-memory arrays for
> present-day 64-bit computers, this should be adequate. If it's a notional
> array that is larger, then you'll need actual arbitrary-sized integer
> sampling. The builtin `random.randrange()` will do arbitrary-sized integers
> and is quite reasonable for this task. If you want it to use our
> BitGenerators underneath for clean PRNG state management, this is quite
> doable with a simple subclass of `random.Random`:
> https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258
>
> Wouldn't it be better to just use random.randint to generate the
> vectors directly at that point? If the number of possibilities is more
> than 2**64, birthday odds of generating the same vector twice are on
> the order of 1 in 2**32. And you can always do a unique() rejection
> check if you really want to be careful.
>

Yes, I jumped to correcting the misreading of the docstring rather than
solving the root problem. Almost certainly, the most straightforward
strategy is to use `rng.integers(0, array_shape)` repeatedly, storing
tuples of the resulting indices in a set and rejecting duplicates. You'd
have to do the same thing if one used `rng.integers(0,
np.prod(array_shape))` or `random.randint(0, np.prod(array_shape))` in the
flattened case.

`rng.integers()` doesn't sample without replacement in any case.
`rng.choice()` does. It also will not support unbounded integers; it uses a
signed `int64` to hold the population size, so it is bounded to that. That
is _likely_ to be a fine upper bound on the size of the array (if it is a
real array in a 64-bit address space). So one could use
`rng.choice(np.prod(array_shape), replace=False, size=n_samples)` and
unflatten these integers to the index tuples if that suits the array size.
It's just doing the same set lookups internally, just somewhat more
efficiently.
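
Concretely, for an in-memory-sized array (the shape here is just for
illustration):

import numpy as np

rng = np.random.default_rng()
shape = (100, 200, 300)
# distinct flat indices, then unflatten them into index vectors
flat = rng.choice(np.prod(shape), size=10, replace=False)
idx = np.unravel_index(flat, shape)   # tuple of arrays, one per axis
vectors = np.stack(idx, axis=-1)      # shape (10, 3), all rows unique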

-- 
Robert Kern


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Robert Kern
On Fri, Nov 17, 2023 at 5:34 PM Stefan van der Walt 
wrote:

> On Fri, Nov 17, 2023, at 14:28, Stefan van der Walt wrote:
>
> Attached is a script that implements this solution.
>
>
> And the version with set duplicates checking.
>

If you're going to do the set-checking yourself, then you don't need the
unbounded integer support. This is somewhat slower, probably due to the
greedy tuple conversions, but eliminates all of the complexity of
subclassing `Random`:

import numpy as np

def sample_indices(shape, size, rng=None):
    rng = np.random.default_rng(rng)
    ashape = np.array(shape)
    seen = set()
    while len(seen) < size:
        # Draw one index tuple at a time, relying on the set to drop duplicates.
        idx = tuple(rng.integers(0, ashape))
        seen.add(idx)
    return list(seen)

Unfortunately, subclassing from `Random` still doesn't get you the ability
to sample without replacement for arbitrary-sized populations using
`Random`'s own methods for that. `Random.sample(population, k)` requires
`population` to be a `Sequence`, and that restricts you to index-sized
integers (`ssize_t`) (you can try to fake one, but `len()` will balk if it
gets a too-large integer from `__len__`). But `Random.sample` is only just
doing set-checking anyways, so no loss.

-- 
Robert Kern


[Numpy-discussion] Re: How to sample unique vectors

2023-11-17 Thread Robert Kern
On Fri, Nov 17, 2023 at 7:11 PM Aaron Meurer  wrote:

> rng.integers() (or np.random.randint) lets you specify lists for low
> and high. So you can just use rng.integers((0,)*len(dims), dims).
>
> Although I'm not seeing how to use this to generate a bunch of vectors
> at once. I would have thought something like size=(10, dims) would let
> you generate 10 vectors of length dims but it doesn't seem to work.
>

`size=(k, len(dims))`

import numpy as np

def sample_indices(shape, size, rng=None):
    rng = np.random.default_rng(rng)
    ashape = np.array(shape)
    seen = set()
    while len(seen) < size:
        # Draw only as many candidates as are still needed, in one vectorized call.
        dsize = size - len(seen)
        seen.update(map(tuple, rng.integers(0, ashape, size=(dsize, len(shape)))))
    return list(seen)

That optimistic optimization makes this the fastest solution.

-- 
Robert Kern


[Numpy-discussion] Re: pickling dtype values

2023-11-20 Thread Robert Kern
On Mon, Nov 20, 2023 at 10:08 PM Sebastien Binet  wrote:

> hi there,
>
> I have written a Go package[1] that can read/write simple arrays in the
> numpy file format [2].
> when I wrote it, it was for simple interoperability use cases, but now
> people would like to be able to read back ragged-arrays[3].
>
> unless I am mistaken, this means I need to interpret pieces of pickled
> data (`ndarray`, `multiarray` and `dtype`).
>
> so I am trying to understand how to unpickle `dtype` values that have been
> pickled:
>
> ```python
> import numpy as np
> import pickle
> import pickletools as pt
>
> pt.dis(pickle.dumps(np.dtype("int32"), protocol=4), annotate=True)
> ```
>
> gives:
> ```
>     0: \x80 PROTO      4                Protocol version indicator.
>     2: \x95 FRAME      55               Indicate the beginning of a new frame.
>    11: \x8c SHORT_BINUNICODE 'numpy'    Push a Python Unicode string object.
>    18: \x94 MEMOIZE    (as 0)           Store the stack top into the memo.  The stack is not popped.
>    19: \x8c SHORT_BINUNICODE 'dtype'    Push a Python Unicode string object.
>    26: \x94 MEMOIZE    (as 1)           Store the stack top into the memo.  The stack is not popped.
>    27: \x93 STACK_GLOBAL                Push a global object (module.attr) on the stack.
>    28: \x94 MEMOIZE    (as 2)           Store the stack top into the memo.  The stack is not popped.
>    29: \x8c SHORT_BINUNICODE 'i4'       Push a Python Unicode string object.
>    33: \x94 MEMOIZE    (as 3)           Store the stack top into the memo.  The stack is not popped.
>    34: \x89 NEWFALSE                    Push False onto the stack.
>    35: \x88 NEWTRUE                     Push True onto the stack.
>    36: \x87 TUPLE3                      Build a three-tuple out of the top three items on the stack.
>    37: \x94 MEMOIZE    (as 4)           Store the stack top into the memo.  The stack is not popped.
>    38: R    REDUCE                      Push an object built from a callable and an argument tuple.
>    39: \x94 MEMOIZE    (as 5)           Store the stack top into the memo.  The stack is not popped.
>    40: (    MARK                        Push markobject onto the stack.
>    41: K        BININT1    3            Push a one-byte unsigned integer.
>    43: \x8c     SHORT_BINUNICODE '<'    Push a Python Unicode string object.
>    46: \x94     MEMOIZE    (as 6)       Store the stack top into the memo.  The stack is not popped.
>    47: N        NONE                    Push None on the stack.
>    48: N        NONE                    Push None on the stack.
>    49: N        NONE                    Push None on the stack.
>    50: J        BININT     -1           Push a four-byte signed integer.
>    55: J        BININT     -1           Push a four-byte signed integer.
>    60: K        BININT1    0            Push a one-byte unsigned integer.
>    62: t        TUPLE      (MARK at 40) Build a tuple out of the topmost stack slice, after markobject.
>    63: \x94 MEMOIZE    (as 7)           Store the stack top into the memo.  The stack is not popped.
>    64: b    BUILD                       Finish building an object, via __setstate__ or dict update.
>    65: .    STOP                        Stop the unpickling machine.
> highest protocol among opcodes = 4
> ```
>
> I have tried to find the usual `__reduce__` and `__setstate__` methods to
> understand what the various arguments are, but to no avail.
>

First, be sure to read the generic `object.__reduce__` docs:

https://docs.python.org/3.11/library/pickle.html#object.__reduce__

Here is the C source for `np.dtype.__reduce__()`:

https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2623-L2750

And `np.dtype.__setstate__()`:

https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2787-L3151

> so, in:
> ```python
> >>> np.dtype("int32").__reduce__()[1]
> ('i4', False, True)
>

These are arguments to the `np.dtype` constructor and are documented in
`np.dtype.__doc__`. The `False, True` arguments are hardcoded and always
those values.


> >>> np.dtype("int32").__reduce__()[2]
> (3, '<', None, None, None, -1, -1, 0)
>

These are arguments to pass to `np.dtype.__setstate__()` after the object
has been created.

0. `3` is the version number of the state; `3` is typical for simple
dtypes; datetimes and others with metadata will bump this to `4` and use a
9-element tuple instead of this 8-element tuple.
1. `'<'` is the endianness flag.
2. If there are subarrays
<https://numpy.org/doc/stable/reference/arrays.dtypes.html#index-7> (e.g.
`np.dtype((np.int32, (2,2)))`), that info here.
3. If there are f

[Numpy-discussion] Re: savetxt() has fmt='%.18e' as default, but fmt='%.16e' is always sufficient

2023-11-25 Thread Robert Kern
On Sat, Nov 25, 2023 at 11:18 AM Jeppe Dakin 
wrote:

> Double-precision numbers need at most 17 significant decimal digits to be
> serialised losslessly. Yet, savetxt() uses 19 by default, meaning that most
> files produced with savetxt() take up about 9% more disk space than they
> need to, without any benefit. I have described the problem in more detail on
> Stackoverflow:
>
> https://stackoverflow.com/questions/77535380/minimum-number-of-digits-for-exact-double-precision-and-the-fmt-18e-of-numpy
>
> Is there any reason behind the default choice of savetxt(...,
> fmt='%.18e')? If not, why not reduce it to savetxt(..., fmt='%.16e')?
>

A long time ago when `savetxt()` was written, we did not use the reliable
Dragon4 string representation algorithm that guarantees that floating point
numbers are written out with the minimum number of decimal digits needed to
reproduce the number. We may even have relied on the platform's floating
point to string conversion routines, which were of variable quality. The
extra digits accounted for that unreliability.

It probably could be changed now, but I'd want more aggressive testing of
the assertion of correctness (`random()`, as used in that StackOverflow
demonstration, does *not* exercise a lot of the important edge cases in the
floating point format). But if your true concern is that 9% of disk space,
you probably don't want to be using `savetxt()` in any case.
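
As a flavor of the kind of check I mean, something like this (a handful of
edge cases, still far from exhaustive):

import numpy as np

vals = np.array([np.pi, 1/3, 5e-324, np.nextafter(1.0, 2.0), 1e308,
                 np.finfo(np.float64).max])
for v in vals:
    s = "%.16e" % v                  # 17 significant digits
    assert np.float64(s) == v, (s, v)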

-- 
Robert Kern


[Numpy-discussion] Re: Change definition of complex sign (and use it in copysign)

2024-01-04 Thread Robert Kern
On Wed, Jan 3, 2024 at 4:09 PM Aaron Meurer  wrote:

> sign(z) = z/|z| is a fairly standard definition. See
> https://oeis.org/wiki/Sign_function and
> https://en.wikipedia.org/wiki/Sign_function. It's also implemented
> this way in MATLAB and Mathematica (see
> https://www.mathworks.com/help/symbolic/sign.html and
> https://reference.wolfram.com/language/ref/Sign.html). The function
> z/|z| is useful because it represents a normalization of z as a vector
> in the complex plane onto the unit circle.
>
> With that being said, I'm not so sure about the suggestion about
> extending copysign(x1, x2) as |x1|*sign(x2). I generally think of
> copysign as a function to manipulate the floating-point representation
> of a number. It literally copies the sign *bit* from x2 into x1. It's
> useful because of things like -0.0, which is otherwise difficult to
> work with since it compares equal to 0.0. I would find it surprising
> for copysign to do a numeric calculation on complex numbers. Also,
> your suggested definition would be wrong for 0.0 and -0.0, since
> sign(0) is 0, and this is precisely where copysign matters.
>

Agreed on all points.
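
A small illustration of that last point (behavior as I understand current
numpy; `np.sign(-0.0)` is zero, not -1):

import numpy as np

# copysign copies the sign *bit*, so it distinguishes -0.0 from 0.0 ...
print(np.copysign(5.0, -0.0))     # -5.0
# ... whereas |x1| * sign(x2) cannot, since sign(+/-0.0) is just zero:
print(abs(5.0) * np.sign(-0.0))   # 0.0, not -5.0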

-- 
Robert Kern


[Numpy-discussion] Re: Feature request: Extension of the np.argsort Function - Returning Positional Information for Data

2024-01-16 Thread Robert Kern
On Tue, Jan 16, 2024 at 11:05 PM hao chen 
wrote:

> When dealing with lists that contain duplicate data, np.argsort fails to
> return index values that correspond to the actual sorting positions of the
> data, as it does when handling arrays without duplicates.
>
> Dear Author:
>
> When I use the np.argsort function on an array without duplicate data, the
> returned index values correspond to the sorting positions of the respective
> data.😀
>
> x = [1, 2, 5, 4]
> rank = np.argsort(x)
> print(rank)
> # [0 1 3 2]
>

That is not what `argsort` is intended or documented to do. It returns an
array of indices _into `x`_ such that if you took the values from `x` in
that order, you would get a sorted array. That is, if `x` were sorted into
the array `sorted_x`, then `x[rank[i]] == sorted_x[i]` for all `i in
range(len(x))`. The indices in `rank` are positions in `x`, not positions
in `sorted_x`. They happen to correspond in this case, but that's a
coincidence that's somewhat common in these small examples. But consider
`[20, 30, 10, 40]`:

>>> x = np.array([20, 30, 10, 40])
>>> ix = np.argsort(x)
>>> def position(x):
...     sorted_x = np.array(x)
...     sorted_x.sort()
...     return np.searchsorted(sorted_x, x)
...
>>> ip = position(x)
>>> ix
array([2, 0, 1, 3])
>>> ip
array([1, 2, 0, 3])

But also notice:

>>> np.argsort(np.argsort(x))
array([1, 2, 0, 3])

This double-argsort is what you seem to be looking for, though it depends
on what you want from the handling of duplicates (do you return the first
index into the sorted array with the same value as in my `position()`
implementation, or do you return the index that particular item was
actually sorted to).

Either way, we probably aren't going to add this as its own function. Both
options are straightforward combinations of existing primitives.

-- 
Robert Kern


[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Robert Kern
On Thu, Feb 15, 2024 at 10:21 AM  wrote:

> Hello Numpy community,
>
> I'm proposing the introduction of a `pipe` method for NumPy arrays to
> enhance their usability and expressiveness.
>

Adding a prominent method like this to `np.ndarray` is something that we
will probably not take up ourselves unless it is adopted by the Array API
standard <https://data-apis.org/array-api/latest/>. It's possible that you
might get some interest there since the Array API deliberately strips out
the number of methods that we already have (e.g. `.mean()`, `.sum()`, etc.)
in favor of functions. A general way to add some kind of fluency cheaply in
an Array API-agnostic fashion might be helpful to people trying to make
their numpy-only code that uses our current set of methods in this way a
bit easier. But you'll have to make the proposal to them, I think, to get
started.

-- 
Robert Kern


[Numpy-discussion] Re: Converting array to string

2024-02-25 Thread Robert Kern
On Sat, Feb 24, 2024 at 7:17 PM Dom Grigonis  wrote:

> Hello,
>
> I am seeking a bit of help.
>
> I need a fast way to transfer numpy arrays in json format.
>
> Thus, converting array to list is not an option and I need something
> similar to:
>
> a = np.ones(1000)
> %timeit a.tobytes()
> 17.6 ms
>
> This is quick, fast and portable. In other words I am very happy with
> this, but...
>
> Json does not support bytes.
>
> Any method of subsequent conversion from bytes to string is number of
> times slower than the actual serialisation.
>
> So my question is: Is there any way to serialise directly to string?
>
> I remember there used to be 2 methods: tobytes and tostring. However, I
> see that tostring is deprecated and its functionality is equivalent to
> `tobytes`. Is it completely gone? Or is there a chance I can still find a
> performant version of converting to and back from `str` type of non-human
> readable form?
>

The old `tostring` was actually the same as `tobytes`. In Python 2, the
`str` type was what `bytes` is now, a string of octets. In Python 3, `str`
became a string of Unicode characters (what you want) and the `bytes` type
was introduced for the string of octets, so `tostring` was merely _renamed_
to `tobytes` to match. `tostring` never returned a string of Unicode
characters suitable for inclusion in JSON.

AFAICT, there is not likely to be a much more efficient way to convert from
an array to a reasonable JSONable encoding (e.g. base64). The time you are
seeing is the time it takes to encode that amount of data, period. That
said, if you want to use a quite inefficient hex encoding, `a.data.hex()`
is somewhat faster than the base64 encoding, but it's less than ideal in
terms of space usage.
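
For reference, the base64 route I have in mind looks like this (not claiming
it meets your speed target):

import base64
import json
import numpy as np

a = np.ones(1000)
s = base64.b64encode(a.tobytes()).decode("ascii")   # JSON-safe string
payload = json.dumps({"dtype": str(a.dtype), "shape": a.shape, "data": s})

d = json.loads(payload)
b = np.frombuffer(base64.b64decode(d["data"]),
                  dtype=d["dtype"]).reshape(d["shape"])
assert np.array_equal(a, b)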

-- 
Robert Kern


[Numpy-discussion] Re: Converting array to string

2024-02-25 Thread Robert Kern
On Sun, Feb 25, 2024 at 1:52 PM Dom Grigonis  wrote:

> Thank you for your answer,
>
> Yeah, I was unsure if it ever existed in the first place.
>
> Space is less of an issue and something like `a.data.hex()` would be fine
> as long as its speed was on par with `a.tobytes()`. However, it is 10x
> slower on my machine.
>
> This e-mail is pretty much a final check (after a fair bit of research and
> attempts) that it can not be done so I can eliminate this possibility as
> feasible and concentrate on other options.
>

I think that mostly runs the gamut for pure JSON. Consider looking at
BJData, but it's "JSON-based" and not quite pure JSON.

https://neurojson.org/

-- 
Robert Kern


[Numpy-discussion] Re: Polyfit error in displacement

2024-03-26 Thread Robert Kern
On Tue, Mar 26, 2024 at 3:39 PM Luca Bertolotti 
wrote:

> Thanks for your reply. Yes, a cubic spline seems more appropriate, but how
> can I get what they call SC?
>

I don't think any of us has enough context to know what "SC" is. It's not a
standard term that I'm aware of.

-- 
Robert Kern


[Numpy-discussion] Re: dtype=object arrays not treated like Python list?

2024-03-29 Thread Robert Kern
On Fri, Mar 29, 2024 at 9:11 AM Steven G. Johnson  wrote:

> Should a dtype=object array be treated more like Python lists for type
> detection/coercion reasons?   Currently, they are treated quite differently:
>
> >>> import numpy as np
> >>> np.isfinite([1,2,3])
> array([ True,  True,  True])
> >>> np.isfinite(np.asarray([1,2,3], dtype=object))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: ufunc 'isfinite' not supported for the input types, and the
> inputs could not be safely coerced to any supported types according to the
> casting rule ''safe''
>
> The reason I ask is that we ran into something similar when trying to pass
> wrappers around Julia arrays to Python.  A Julia `Any[]` array is much like
> a Python list or a Numpy `object` array, and exposed in Python as a subtype
> of MutableSequence, but we found that it was treated by NumPy as more like
> a Numpy `object` array than a Python list (
> https://github.com/JuliaPy/PythonCall.jl/issues/486).
>
> Would it be desirable to treat a 1d Numpy `object` array more like a
> Python `list`?   Or is there a way for externally defined types to opt-in
> to the `list` behavior?  (I couldn't figure out in the numpy source code
> where `list` is being special-cased?)
>

`list` isn't special-cased, per se. Most numpy functions work on `ndarray`
objects and accept "array-like" objects like `list`s by coercing them to
`ndarray` objects ASAP using `np.asarray()`. `asarray` will leave existing
`ndarray` instances alone. When you pass `[1,2,3]` to a numpy function,
`np.asarray([1,2,3])` takes that list and infers the dtype and shape from
its contents using just the `Sequence` API. Since they are `int` objects of
reasonable size, the result is `np.array([1, 2, 3], dtype=np.int64)`.  The
`isfinite` ufunc has a loop defined for `dtype=np.int64`.

`np.asarray()` will also check for special methods and properties like
`__array__()` and `__array_interface__` to allow objects to customize how
they should be interpreted as `ndarray` objects, in particular, allowing
memory to be shared efficiently as pointers, if the layouts are compatible,
but also to control the dtype of the final array. PythonCall.jl implements
`__array__()` and `__array_interface__` for its array objects for these
purposes, and if I am not mistaken, explicitly makes `Any[]` convert to a
`dtype=object` `ndarray`
<https://github.com/JuliaPy/PythonCall.jl/blob/f586f2494432f2e5366ff1e2876b8aa532630b54/src/JlWrap/objectarray.jl#L61-L69>
(the
`typestr = "|O"` is equivalent to `dtype=object` in that context). It's
*possible* that you'd really rather have this `PyObjectArray` *not*
implement those interfaces and just let the `np.asarray()` inference work
its magic through the `Sequence` API, but that is expensive. You could also
try doing the type inference yourself and implement that in the
`PyObjectArray.__array__()` implementation and avoid implementing
`__array_interface__` for that object. Then `np.asarray()` will just
delegate to `PyObjectArray.__array__()`.
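
A toy illustration of that last route (a hypothetical wrapper, not
PythonCall.jl's actual classes):

import numpy as np

class SequenceWrapper:
    """Wraps a Python sequence; lets numpy infer the dtype via __array__()."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        return self._items[i]

    def __array__(self, dtype=None, copy=None):
        # Defer to numpy's usual inference over the contents instead of
        # advertising dtype=object through __array_interface__.
        return np.array(self._items, dtype=dtype)

print(np.isfinite(np.asarray(SequenceWrapper([1, 2, 3]))))   # [ True  True  True]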

-- 
Robert Kern


[Numpy-discussion] Build NumPy with Debugging Symbols with Meson

2024-05-15 Thread Robert McLeod
Hi everyone,

Is there a gist or similar guide anywhere on the steps required to build
NumPy with debugging symbols on the Windows platform using the new Meson
build system? It seems a bit difficult because NumPy seems to expect a
`conda`-like file structure whereas if you are cloning `cpython` directly
it is different. In particular `cpython` puts all the Windows files under
the `PC` and `PCBuild` subdirectories. Also, meson doesn't seem to have a
means to call `python_d.exe` directly, which may be causing problems. I've
ended up building both `python.exe` and `python_d.exe`, but ran into a
problem finding Python in `meson.build`.

This describes my efforts to date:

# Build CPython with Debugging Symbols on Windows

```shell
git clone https://github.com/python/cpython.git
cd cpython
git switch 3.11
PCbuild\get_externals.bat
# Build the debug version `python_d.exe`
PCbuild\build.bat -e -d
# Meson calls `python.exe` and doesn't seem to have an argument to change this.
# We could either build it, or create a symlink
PCbuild\build.bat
# ln -s PCbuild/amd64/python_d.exe PCbuild/amd64/python.exe
```

The built Python will then be located in
`/cpython/PCBuild/amd64/python_d.exe`.

```batch
set PREFIX=C:/Users/Robert/dev/cpython
set PATH=%PREFIX%;%PREFIX%/PCBuild/amd64;%PREFIX%/Scripts;%PATH%
```

Next we have to install pip:
https://docs.python.org/3/library/ensurepip.html,
meson, and cython.

```shell
python_d -m ensurepip
python_d -mpip install meson meson-python ninja cython
```

NOTE: produces `pip3.exe`, not `pip`.

# Build NumPy with debug Python

```shell
git clone https://github.com/numpy/numpy.git
cd numpy
git switch maintenance/1.26.x
git submodule update --init
```

https://mesonbuild.com/meson-python/how-to-guides/debug-builds.html

Next try and build with meson/ninja:

```shell
mkdir build-dbg
cd build-dbg
meson .. setup --buildtype=debug --includedir=C:/Users/Robert/dev/cpython/PC --libdir=C:/Users/Robert/dev/cpython/PCbuild/amd64
ninja
ninja install
```

`meson .. setup <...>` fails and complains that,

'''
Run-time dependency python found: NO (tried sysconfig)

  ..\meson.build:41:12: ERROR: Python dependency not found
'''

which corresponds to:

```meson.build
py = import('python').find_installation(pure: false)
py_dep = py.dependency()
```

I also tried editing `pyproject.toml` to add the section:

```toml
[tool.meson-python.args]
setup = ['-Dbuildtype=debug',
 '-Dincludedir=C:/Users/Robert/dev/cpython/PC',
 '-Dlibdir=C:/Users/Robert/dev/cpython/PCbuild/amd64']
```

and then build NumPy with pip using the debug python build:

```shell
python_d -mpip install . --no-build-isolation -Cbuild-dir=build-dbg
```

But it fails in the same fashion. Searching for existing issues, I found this
one, but in this case the problem is more likely something related to the
debug Python build:

https://github.com/mesonbuild/meson/issues/12547

The installed Meson version is 1.4.0. Any advice would be appreciated. I'm
happy to write a gist if I can get it working.

-- 
Robert McLeod
robbmcl...@gmail.com
robert.mcl...@hitachi-hightech.com


[Numpy-discussion] Re: Adding bfill() to numpy.

2024-05-20 Thread Robert Kern
On Mon, May 20, 2024 at 1:17 PM george trojan 
wrote:

> xarray has both bfill and ffill for DataArrays. The implementation uses
> bottleneck: https://github.com/pydata/bottleneck.
>

bottleneck would probably be a good home for these functions (though I
don't speak for its maintainers). If they get picked up by a bunch of other
array implementations, then you can make a proposal to have them added to
the Array API standard. numpy probably won't add them unless that
happens first.
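
For reference, a pure-NumPy 1-D sketch of the operation in question (NaN as
the missing marker; I believe bottleneck's push() is the optimized version of
this idea):

import numpy as np

def ffill(a):
    # Carry the index of the last non-NaN value forward, then gather.
    idx = np.where(np.isnan(a), 0, np.arange(len(a)))
    np.maximum.accumulate(idx, out=idx)
    return a[idx]

x = np.array([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])
print(ffill(x))               # [nan  1.  1.  1.  4.  4.]
print(ffill(x[::-1])[::-1])   # bfill by reversing: [ 1.  1.  4.  4.  4. nan]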

-- 
Robert Kern


[Numpy-discussion] Re: ENH: Add more default arguments to various functions

2024-06-04 Thread Robert Kern
On Tue, Jun 4, 2024 at 3:12 PM Rakshit Singh 
wrote:

> I am unable to wrap my head around it.
>
> Why would we make a zero array which is empty. It can be done by a normal
> np.array
>

`shape=()` would not be empty, but a scalar. `np.zeros(())` is the same as
`np.array(0.0)`, so it is meaningful, at least. An empty array would be one
where one of the axis dimensions is 0, which would make the difference
between `zeros()`, `ones()`, and `empty()` meaningless if they all shared a
default like that.

Nonetheless, I don't think we want to make that a default. It's rarely
desired, so it can be rarely requested in an explicit manner.

For `np.heaviside()`, a default value was intentionally left unspecified
because of the multiple conventions that people want. In the face of
ambiguity, refuse the temptation to guess.

I think we're comfortable with these choices at this time.

-- 
Robert Kern


[Numpy-discussion] Re: Applying a function on local environments

2024-07-26 Thread Robert Kern
On Fri, Jul 26, 2024 at 8:50 AM  wrote:

> Dear all,
>
> my goal would be to apply some function on a local environment of size K x
> K x K where K is bigger than 1 and odd. For example, if K = 3, it would be
> nice to apply some function "func" on such an environment around each
> n-dimensional position within the n-dimensional array. So, for K = 3 and
> the position (1,1,1) if a 3D array, one would collect the array values at
> the positions X = [(0,0,0),(0,0,1),(0,0,2),(0,1,0),...(2,2,2)] (K**3 = 27
> in total in this example) and yield those values to "func". The result
> value at position (1,1,1) in the output array would be y = func(X). The
> same would apply for all entries excluding the padding area (or according
> to some padding policy).
>

scipy.ndimage.generic_filter()
<https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.generic_filter.html>
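
A usage sketch for the K = 3 case (the function receives the 27 neighborhood
values as a flat 1-D array; np.ptp here is just a stand-in for "func"):

import numpy as np
from scipy import ndimage

a = np.random.default_rng(0).random((5, 5, 5))
out = ndimage.generic_filter(a, np.ptp, size=3, mode="constant", cval=0.0)
print(out.shape)   # (5, 5, 5): one value of func per position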

-- 
Robert Kern


[Numpy-discussion] Re: Original seed of a BitGenerator

2024-08-02 Thread Robert Kern
On Fri, Aug 2, 2024 at 1:28 AM Andrew Nelson  wrote:

> When using the new `Generator`s for stochastic optimisation I sometimes
> find myself possessing a great solution, but am wondering what path the
> random number generation took to get to that point.
>
> I know that I can get the current state of the BitGenerators. However,
> what I'd like to do is query the BitGenerator to figure out how the
> BitGenerator was setup in the first place.
>
> i.e. either:
>
> - the seed/SeedSequence that was used to construct the BitGenerator
>

>>> rng = np.random.default_rng()
>>> rng.bit_generator.seed_seq
SeedSequence(
entropy=186013007116029215180532390504704448637,
)

In some older versions of numpy, the attribute was semi-private as
_seed_seq, if you're still using one of those.

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

> Do you have comments on how to go forward, in particular in regards to
> new dtype vs modify np.unicode?

Can we restate the use cases explicitly? I feel like we ended up with the
current sub-optimal situation because we never really laid out the use
cases. We just felt like we needed bytestring and unicode dtypes, more out
of completionism than anything, and we made a bunch of assumptions just to
get each one done. I think there may be broad agreement that many of those
assumptions are "wrong", but it would be good to reference that against
concretely-stated use cases.

FWIW, if I need to work with in-memory arrays of strings in Python code,
I'm going to use dtype=object a la pandas. It has almost no arbitrary
constraints, and I can rely on Python's unicode facilities freely. There
may be some cases where it's a little less memory-efficient (e.g.
representing a column of enumerated single-character values like 'M'/'F'),
but that's never prevented me from doing anything (compare to the
uniform-length restrictions, which *have* prevented me from doing things).

So what's left? Being able to memory-map to files that have string data
conveniently laid out according to numpy assumptions (e.g. FITS). Being
able to work with C/C++/Fortran APIs that have arrays of strings laid out
according to numpy assumptions (e.g. HDF5). I think it would behoove us to
canvass the needs of these formats and APIs before making any more
assumptions.

For example, to my understanding, FITS files more or less follow numpy
assumptions for its string columns (i.e. uniform-length). But it enforces
7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
singular motivating use case for the trailing-NULL behavior of np.string.

I don't know of a format off-hand that works with numpy uniform-length
strings and Unicode as well. HDF5 (to my recollection) supports arrays of
NULL-terminated, uniform-length ASCII like FITS, but only variable-length
UTF8 strings.

We should look at some of the newer formats and APIs, like Parquet and
Arrow, and also consider the cross-language APIs with Julia and R.

If I had to jump ahead and propose new dtypes, I might suggest this:

* For the most part, treat the string dtypes as temporary communication
formats rather than the preferred in-memory working format, similar to how
we use `float16` to communicate with GPU APIs.

* Acknowledge the use cases of the current NULL-terminated np.string dtype,
but perhaps add a new canonical alias, document it as being for those
specific use cases, and deprecate/de-emphasize the current name.

* Add a dtype for holding uniform-length `bytes` strings. This would be
similar to the current `void` dtype, but work more transparently with the
`bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
like `float64` does with `float`. This would not be NULL-terminated. No
encoding would be implied.

* Maybe add a dtype similar to `object_` that only permits `unicode/str`
(2.x/3.x) strings (and maybe None to represent missing data a la pandas).
This maintains all of the flexibility of using a `dtype=object` array while
allowing code to specialize for working with strings without all kinds of
checking on every item. But most importantly, we can serialize such an
array to bytes without having to use pickle. Utility functions could be
written for en-/decoding to/from the uniform-length bytestring arrays
handling different encodings and things like NULL-termination (also working
with the legacy dtypes and handling structured arrays easily, etc.).

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer  wrote:
>
> On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern 
wrote:
>>
>> I don't know of a format off-hand that works with numpy uniform-length
strings and Unicode as well. HDF5 (to my recollection) supports arrays of
NULL-terminated, uniform-length ASCII like FITS, but only variable-length
UTF8 strings.
>
>
> HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and
variable length versions:
> https://github.com/PyTables/PyTables/issues/499
> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>
> "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
storage, not the number of characters.

Ah, okay, I was interpolating from a quick perusal of the h5py docs, which
of course are also constrained by numpy's current set of dtypes. The
NULL-terminated ASCII works well enough with np.string's semantics.

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald 
wrote:
>
> On Thu, Apr 20, 2017 at 8:55 PM Robert Kern  wrote:

>> For example, to my understanding, FITS files more or less follow numpy
assumptions for its string columns (i.e. uniform-length). But it enforces
7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
singular motivating use case for the trailing-NULL behavior of np.string.
>
> Actually if I understood the spec, FITS header lines are 80 bytes long
and contain ASCII with no NULLs; strings are quoted and trailing spaces are
stripped.

Never mind, then. :-)

>> If I had to jump ahead and propose new dtypes, I might suggest this:
>>
>> * For the most part, treat the string dtypes as temporary communication
formats rather than the preferred in-memory working format, similar to how
we use `float16` to communicate with GPU APIs.
>>
>> * Acknowledge the use cases of the current NULL-terminated np.string
dtype, but perhaps add a new canonical alias, document it as being for
those specific use cases, and deprecate/de-emphasize the current name.
>>
>> * Add a dtype for holding uniform-length `bytes` strings. This would be
similar to the current `void` dtype, but work more transparently with the
`bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
like `float64` does with `float`. This would not be NULL-terminated. No
encoding would be implied.
>
> How would this differ from a numpy array of bytes with one more
dimension?

The scalar in the implementation being the scalar in the use case,
immutability of the scalar, and directly working with b'' strings in and out
(and thus working with the Python codecs easily).

>> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
(2.x/3.x) strings (and maybe None to represent missing data a la pandas).
This maintains all of the flexibility of using a `dtype=object` array while
allowing code to specialize for working with strings without all kinds of
checking on every item. But most importantly, we can serialize such an
array to bytes without having to use pickle. Utility functions could be
written for en-/decoding to/from the uniform-length bytestring arrays
handling different encodings and things like NULL-termination (also working
with the legacy dtypes and handling structured arrays easily, etc.).
>
> I think there may also be a niche for fixed-byte-size null-terminated
strings of uniform encoding, that do decoding and encoding automatically.
The encoding would naturally be attached to the dtype, and they would
handle too-long strings by either truncating to a valid encoding or simply
raising an exception. As with the current fixed-length strings, they'd
mostly be for communication with other code, so the necessity depends on
whether such other codes exist at all. Databases, perhaps? Custom hunks of
C that don't want to deal with variable-length packing of data? Actually
this last seems plausible - if I want to pass a great wodge of data,
including Unicode strings, to a C program, writing out a numpy array seems
maybe the easiest.

HDF5 seems to support this, but only for ASCII and UTF8, not a large list
of encodings.

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:27 PM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:
>
> On 20.04.2017 20:53, Robert Kern wrote:
> > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor
> > mailto:jtaylor.deb...@googlemail.com>>
> > wrote:
> >
> >> Do you have comments on how to go forward, in particular in regards to
> >> new dtype vs modify np.unicode?
> >
> > Can we restate the use cases explicitly? I feel like we ended up with
> > the current sub-optimal situation because we never really laid out the
> > use cases. We just felt like we needed bytestring and unicode dtypes,
> > more out of completionism than anything, and we made a bunch of
> > assumptions just to get each one done. I think there may be broad
> > agreement that many of those assumptions are "wrong", but it would be
> > good to reference that against concretely-stated use cases.
>
> We ended up in this situation because we did not take the opportunity to
> break compatibility when python3 support was added.

Oh, the root cause I'm thinking of long predates Python 3, or even numpy
1.0. There never was an explicitly fleshed out use case for unicode arrays
other than "Python has unicode strings, so we should have a string dtype
that supports it". Hence the "we only support UCS4" implementation; it's
not like anyone *wants* UCS4 or interoperates with UCS4, but it does
represent all possible Unicode strings. The Python 3 transition merely
exacerbated the problem by making Unicode strings the primary string type
to work with. I don't really want to ameliorate the exacerbation without
addressing the root problem, which is worth solving.

I will put this down as a marker use case: Support HDF5's fixed-width UTF-8
arrays.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:51 PM, Stephan Hoyer  wrote:
>
> On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern 
wrote:
>>
>> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer  wrote:
>> >
>> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern 
wrote:
>> >>
>> >> I don't know of a format off-hand that works with numpy
uniform-length strings and Unicode as well. HDF5 (to my recollection)
supports arrays of NULL-terminated, uniform-length ASCII like FITS, but
only variable-length UTF8 strings.
>> >
>> >
>> > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed
and variable length versions:
>> > https://github.com/PyTables/PyTables/issues/499
>> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>> >
>> > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
storage, not the number of characters.
>>
>> Ah, okay, I was interpolating from a quick perusal of the h5py docs,
which of course are also constrained by numpy's current set of dtypes. The
NULL-terminated ASCII works well enough with np.string's semantics.
>
> Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should
correspond to a string type, not np.string_ (which is really bytes).

"... well enough with np.string's semantics [that h5py actually used it to
pass data in and out; whether that array is fit for purpose beyond that, I
won't comment]." :-)

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 1:16 PM, Phil Hodge  wrote:
>
> On 04/20/2017 03:17 PM, Anne Archibald wrote:
>>
>> Actually if I understood the spec, FITS header lines are 80 bytes long
and contain ASCII with no NULLs; strings are quoted and trailing spaces are
stripped.
>
> FITS BINTABLE extensions can have columns containing strings, and in that
case the values are NULL terminated, except that if the string fills the
field (i.e. there's no room for a NULL), the NULL will not be written.

Ah, that's what I was thinking of, thank you.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker 
wrote:
>
> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>>>
>>> BTW -- maybe we should keep the pathological use-case in mind: really
short strings. I think we are all thinking in terms of longer strings,
maybe a name field, where you might assign 32 bytes or so -- then someone
has an accented character in their name, and then get 30 or 31 characters --
no big deal.
>>
>>
>> I wouldn't call it a pathological use case, it doesn't seem so uncommon
to have large datasets of short strings.
>
> It's pathological for using a variable-length encoding.
>
>> I personally deal with a database of hundreds of billions of 2 to 5
character ASCII strings.  This has been a significant blocker to Python 3
adoption in my world.
>
> I agree -- it is a VERY common case for scientific data sets. But a
one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
Unicode. The wasted space is not that big a deal with short strings...

Unless if you have hundreds of billions of them.

>> BTW, for those new to the list or with a short memory, this topic has
been discussed fairly extensively at least 3 times before.  Hopefully the
*fourth* time will be the charm!
>
> yes, let's hope so!
>
> The big difference now is that Julian seems to be committed to actually
making it happen!
>
> Thanks Julian!
>
> Which brings up a good point -- if you need us to stop the damn
bike-shedding so you can get it done -- say so.
>
> I have strong opinions, but would still rather see any of the ideas on
the table implemented than nothing.

FWIW, I prefer nothing to just adding a special case for latin-1. Solve the
HDF5 problem (i.e. fixed-length UTF-8 strings) or leave it be until someone
else is willing to solve that problem. I don't think we're at the
bikeshedding stage yet; we're still disagreeing about fundamental
requirements.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>
> On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker 
wrote:

>> - round-tripping of binary data (at least with Python's
encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and
re-encoded to get the same bytes back. You may get garbage, but you won't
get an EncodingError.
>
> +1.  The key point is that there is a HUGE amount of legacy science data
in the form of FITS (astronomy-specific binary file format that has been
the primary file format for 20+ years) and HDF5 which uses a character data
type to store data which can be bytes 0-255.  Getting an decoding/encoding
error when trying to deal with these datasets is a non-starter from my
perspective.

That says to me that these are properly represented by `bytes` objects, not
`unicode/str` objects encoding to and decoding from a hardcoded latin-1
encoding.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker 
wrote:
>
> On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer  wrote:
>
>>> In this case, we want something compatible with Python's string (i.e.
full Unicode supporting) and I think should be as transparent as possible.
Python's string has made the decision to present a character oriented API
to users (despite what the manifesto says...).
>>
>>
>> Yes, but NumPy doesn't really implement string operations, so
fortunately this is pretty irrelevant to us -- except for our API for
specifying dtype size.
>
> Exactly -- the character-orientation of python strings means that people
are used to thinking that strings have a length that is the number of
characters in the string. I think there will be a cognitive dissonance if
someone does:
>
> arr[i] = a_string
>
> Which then raises a ValueError, something like:
>
> String too long for a string[12] dtype array.

We have the freedom to make the error message not suck. :-)

> When len(a_string) <= 12
>
> AND that will only  occur if there are non-ascii characters in the
string, and maybe only if there are more than N non-ascii characters. i.e.
it is very likely to be a run-time error that may not have shown up in
tests.
>
> So folks need to do something like:
>
> len(a_string.encode('utf-8')) to see if their string will fit. If not,
they need to truncate it, and THAT is non-obvious how to do, too -- you
don't want to truncate the encoded bytes naively, you could end up with an
invalid bytestring, but you don't know how many characters to truncate,
either.

If this becomes the right strategy for dealing with these problems (and I'm
not sure that it is), we can easily make a utility function that does this
for people.
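
For example, a hypothetical helper (not an existing numpy function) could
look something like this:

def truncate_utf8(s, nbytes):
    """Truncate `s` so that its UTF-8 encoding fits in `nbytes` without
    splitting a multi-byte character."""
    encoded = s.encode('utf-8')
    if len(encoded) <= nbytes:
        return s
    # errors='ignore' drops the trailing partial character, if any.
    return encoded[:nbytes].decode('utf-8', errors='ignore')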

This discussion is why I want to be sure that we have our use cases
actually mapped out. For this kind of in-memory manipulation, I'd use an
object array (a la pandas), then convert to the uniform-width string dtype
when I needed to push this out to a C API, HDF5 file, or whatever actually
requires a string-dtype array. The required width gets computed from the
data after all of the manipulations are done. Doing in-memory assignments
to a fixed-encoding, fixed-width string dtype will always have this kind of
problem. You should only put up with it if you have a requirement to write
to a format that specifies the width and the encoding. That specified
encoding is frequently not latin-1!
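
As a minimal sketch of that workflow (UTF-8 is just an assumed target
encoding here):

import numpy as np

strings = [u"na\u00efve", u"caf\u00e9", u"x"]
obj_arr = np.array(strings, dtype=object)       # flexible in-memory handling
# ... arbitrary string manipulations on obj_arr ...
encoded = [s.encode('utf-8') for s in obj_arr]
width = max(len(b) for b in encoded)            # width computed from the data
fixed = np.array(encoded, dtype='S%d' % width)  # uniform-width array for output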

>> I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>
> utf-8 is NOT a one-byte per char encoding. IF you want to assure that
your data are one-byte per char, then you could use ASCII, and it would be
binary compatible with utf-8, but not sure what the point of that is in
this context.
>
> latin-1 or latin-9 buys you (over ASCII):
>
> - A bunch of accented characters -- sure it only covers the latin
languages, but does cover those much better.
>
> - A handful of other characters, including scientifically useful ones. (a
few greek characters, the degree symbol, etc...)
>
> - round-tripping of binary data (at least with Python's
encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
re-encoded to get the same bytes back. You may get garbage, but you won't
get an EncodingError.

But what if the format I'm working with specifies another encoding? Am I
supposed to encode all of my Unicode strings in the specified encoding,
then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a
really important use case for me.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 11:56 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>
> On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern 
wrote:
>>
>> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>> >
>> > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker 
wrote:
>>
>> >> - round-tripping of binary data (at least with Python's
encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
re-encoded to get the same bytes back. You may get garbage, but you won't
get an EncodingError.
>> >
>> > +1.  The key point is that there is a HUGE amount of legacy science
data in the form of FITS (astronomy-specific binary file format that has
been the primary file format for 20+ years) and HDF5 which uses a character
data type to store data which can be bytes 0-255.  Getting an
decoding/encoding error when trying to deal with these datasets is a
non-starter from my perspective.
>>
>> That says to me that these are properly represented by `bytes` objects,
not `unicode/str` objects encoding to and decoding from a hardcoded latin-1
encoding.
>
> If you could go back 30 years and get every scientist in the world to do
the right thing, then sure.  But we are living in a messy world right now
with messy legacy datasets that have character type data that are *mostly*
ASCII, but not infrequently contain non-ASCII characters.

I am not unfamiliar with this problem. I still work with files that have
fields that are supposed to be in EBCDIC but actually contain text in
ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
encodings. In that experience, I have found that just treating the data as
latin-1 unconditionally is not a pragmatic solution. It's really easy to
implement, and you do get a program that runs without raising an exception
(at the I/O boundary at least), but you don't often get a program that
really runs correctly or treats the data properly.

Can you walk us through the problems that you are having with working with
these columns as arrays of `bytes`?

> So I would beg to actually move forward with a pragmatic solution that
addresses very real and consequential problems that we face instead of
waiting/praying for a perfect solution.

Well, I outlined a solution: work with `bytes` arrays with utilities to
convert to/from the Unicode-aware string dtypes (or `object`).
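
For instance, the decoding direction could be a small helper along these
lines (the function name is hypothetical, not an existing API):

import numpy as np

def decode_bytes_array(arr, encoding, errors='strict'):
    # Convert a fixed-width bytes ('S') array into an object array of str.
    decoded = [b.decode(encoding, errors) for b in arr.ravel()]
    return np.array(decoded, dtype=object).reshape(arr.shape)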

A UTF-8-specific dtype and maybe a string-specialized `object` dtype
address the very real and consequential problems that I face (namely and
respectively, working with HDF5 and in-memory manipulation of string
datasets).

I'm happy to consider a latin-1-specific dtype as a second,
workaround-for-specific-applications-only-you-have-
been-warned-you're-gonna-get-mojibake option. It should not be *the*
Unicode string dtype (i.e. named np.realstring or np.unicode as in the
original proposal).

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
Chris, you've mashed all of my emails together, some of them are in reply
to you, some in reply to others. Unfortunately, this dropped a lot of the
context from each of them, and appears to be creating some
misunderstandings about what each person is advocating.

On Mon, Apr 24, 2017 at 2:00 PM, Chris Barker  wrote:
>
> On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern 
wrote:

>> Solve the HDF5 problem (i.e. fixed-length UTF-8 strings)
>
> I agree-- binary compatibility with utf-8 is a core use case -- though is
it so bad to go through python's encoding/decoding machinery to so it? Do
numpy arrays HAVE to be storing utf-8 natively?

If the point is to have an array that transparently accepts/yields
`unicode/str` scalars while maintaining the in-memory encoding, yes. If
that's not the point, then IMO the status quo is fine, and *no* new dtypes
should be added, just maybe some utility functions to convert between the
bytes-ish arrays and the Unicode-holding arrays (which was one of my
proposals). I am mostly happy to live in a world where I read in data as
bytes-ish arrays, decode into `object` arrays holding `unicode/str`
objects, do my manipulations, then encode the array into a bytes-ish array
to give to the C API or file format.

>> or leave it be until someone else is willing to solve that problem. I
don't think we're at the bikeshedding stage yet; we're still disagreeing
about fundamental requirements.
>
> yeah -- though I've seen projects get stuck in the "sorting out what to
do, so nothing gets done" stage before -- I don't want Julian to get too
frustrated and end up doing nothing.

I understand, but not all tedious discussions that have not yet achieved
consensus are bikeshedding to be cut short. We couldn't really decide what
to do back in the pre-1.0 days, too, so we just did *something*, and that
something is now the very situation that Julian has a problem with.

We have more experience now, especially with the added wrinkles of Python
3; other projects have advanced and matured their Unicode string
array-handling (e.g. pandas and HDF5); now is a great time to have a real
discussion about what we *need* before we make decisions about what we
should *do*.

> So here I'll lay out what I think are the fundamental requirements:
>
> 1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character
oriented interface. i.e. if you do:
>
> arr = np.array(("this", "that",))
>
> you get an array that can store ANY unicode string with 4 or less
characters
>
> and arr[1] will return a native Python string object.
>
> 2) There be some way to store mostly ascii-compatible strings in a single
byte-per-character array -- so not be wasting space for "typical
european-oriented data".
>
> arr = np.array(("this", "that",), dtype=np.single_byte_string)
>
> (name TBD)
>
> and arr[1] would return a python string.
>
> attempting to put in a not-compatible with the encoding string in would
raise an Encoding Error.
>
> I highly recommend that ISO 8859-15 (latin-9 or latin-1) be the
encoding in this case.
>
> 3) There be a dtype that could store strings in null-terminated utf-8
binary format -- for interchange with other systems (netcdf, HDF, others???)
>
> 4) a fixed length bytes dtype -- pretty much what 'S' is now under python
three -- settable from a bytes or bytearray object, and returns a bytes
object.
>  - you could use astype() to convert between bytes and a specified
encoding with no change in binary representation.

You'll need to specify what NULL-terminating behavior you want here.
np.string_ has NULL-termination. np.void (which could be made to work
better with `bytes`) does not. Both have use-cases for text encoding
(shakes fist at UTF-16).

> 2) and 3) could be fully covered by a dtype with a settable encoding that
might as well support all python built-in encodings -- though I think an
alias to the common cases would be good -- latin, utf-8. If so, the length
would have to be specified in bytes.
>
> 1) could be covered with the existing 'U': type - only downside being
some wasted space -- or with a pointer to a python string dtype -- which
would also waste space, though less for long-ish strings, and maybe give us
some better access to the nifty built-in string features.
>
>> > +1.  The key point is that there is a HUGE amount of legacy science
data in the form of FITS (astronomy-specific binary file format that has
been the primary file format for 20+ years) and HDF5 which uses a character
data type to store data which can be bytes 0-255.  Getting an
decoding/encoding error when trying to deal with these datasets is a
non-starter from my perspective.
>
>> 

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>
> On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern 
wrote:
>>
>> I am not unfamiliar with this problem. I still work with files that have
fields that are supposed to be in EBCDIC but actually contain text in
ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
encodings. In that experience, I have found that just treating the data as
latin-1 unconditionally is not a pragmatic solution. It's really easy to
implement, and you do get a program that runs without raising an exception
(at the I/O boundary at least), but you don't often get a program that
really runs correctly or treats the data properly.
>>
>> Can you walk us through the problems that you are having with working
with these columns as arrays of `bytes`?
>
> This is very simple and obvious but I will state for the record.

I appreciate it. What is obvious to you is not obvious to me.

> Reading an HDF5 file with character data currently gives arrays of
`bytes` [1].  In Py3 this cannot be compared to a string literal, and
comparing to (or assigning from) explicit byte strings everywhere in the
code quickly spins out of control.  This generally forces one to convert
the data to `U` type and incur the 4x memory bloat.
>
> In [22]: dat = np.array(['yes', 'no'], dtype='S3')
>
> In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
> Out[23]: False
>
> In [24]: dat == b'yes'  # Right answer but not practical
> Out[24]: array([ True, False], dtype=bool)

I'm curious why you think this is not practical. It seems like a very
practical solution to me.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer  wrote:
>
> On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker 
wrote:
>>>
>>> On the other hand, if this is the use-case, perhaps we really want an
encoding closer to "Python 2" string, i.e, "unknown", to let this be
signaled more explicitly. I would suggest that "text[unknown]" should
support operations like a string if it can be decoded as ASCII, and
otherwise error. But unlike "text[ascii]", it will let you store arbitrary
bytes.
>>
>> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it
really is ascii, then it's perfect. If it really is latin-*, then you get
some extra useful stuff, and if it's corrupted somehow, you still get the
ascii text correct, and the rest won't  barf and can be passed on through.
>
> I am totally in agreement with Thomas that "We are living in a messy
world right now with messy legacy datasets that have character type data
that are *mostly* ASCII, but not infrequently contain non-ASCII characters."
>
> My question: What are those non-ASCII characters? How often are they
truly latin-1/9 vs. some other text encoding vs. non-string binary data?

I don't know that we can reasonably make that accounting relevant. Number
of such characters per byte of text? Number of files with such characters
out of all existing files?

What I can say with assurance is that every time I have decided, as a
developer, to write code that just hardcodes latin-1 for such cases, I have
regretted it. While it's just personal anecdote, I think it's at least
measuring the right thing. :-)

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 5:56 PM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>
> On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern 
wrote:
>>
>> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>> >
>> > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern 
wrote:
>> >>
>> >> I am not unfamiliar with this problem. I still work with files that
have fields that are supposed to be in EBCDIC but actually contain text in
ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
encodings. In that experience, I have found that just treating the data as
latin-1 unconditionally is not a pragmatic solution. It's really easy to
implement, and you do get a program that runs without raising an exception
(at the I/O boundary at least), but you don't often get a program that
really runs correctly or treats the data properly.
>> >>
>> >> Can you walk us through the problems that you are having with working
with these columns as arrays of `bytes`?
>> >
>> > This is very simple and obvious but I will state for the record.
>>
>> I appreciate it. What is obvious to you is not obvious to me.
>>
>> > Reading an HDF5 file with character data currently gives arrays of
`bytes` [1].  In Py3 this cannot be compared to a string literal, and
comparing to (or assigning from) explicit byte strings everywhere in the
code quickly spins out of control.  This generally forces one to convert
the data to `U` type and incur the 4x memory bloat.
>> >
>> > In [22]: dat = np.array(['yes', 'no'], dtype='S3')
>> >
>> > In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
>> > Out[23]: False
>> >
>> > In [24]: dat == b'yes'  # Right answer but not practical
>> > Out[24]: array([ True, False], dtype=bool)
>>
>> I'm curious why you think this is not practical. It seems like a very
practical solution to me.
>
> In Py3 most character data will be string, not bytes.  So every time you
want to interact with the bytes array (compare, assign, etc) you need to
explicitly coerce the right hand side operand to be a bytes-compatible
object.  For code that developers write, this might be possible but results
in ugly code.  But for the general science and engineering communities that
use numpy this is completely untenable.

Okay, so the problem isn't with (byte-)string literals, but with variables
being passed around from other sources. Eg.

def func(dat, scalar):
    return dat == scalar

Every one of those functions deepens the abstraction and moves that
unicode-by-default scalar farther away from the bytesish array, so it's
harder to demand that users of those functions be aware that they need to
pass in `bytes` strings. So you need to implement those functions
defensively, which complicates them.
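
For illustration, a defensive version of the `func` above might look like
this (the choice of ASCII here is an arbitrary assumption):

def func(dat, scalar):
    if isinstance(scalar, str):
        # Guess an encoding on behalf of the caller.
        scalar = scalar.encode('ascii')
    return dat == scalar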

> The only practical solution so far is to implement a unicode sandwich and
convert to the 4-byte `U` type at the interface.  That is precisely what we
are trying to eliminate.

What do you think about my ASCII-surrogateescape proposal? Do you think
that would work with your use cases?

In general, I don't think Unicode sandwiches will be eliminated by this or
the latin-1 dtype; the sandwich is usually the right thing to do and the
surrogateescape the wrong thing. But I'm keenly aware of the problems you
get when there just isn't a reliable encoding to use.
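
For reference, a minimal illustration of the ASCII-surrogateescape
round-trip at the plain Python level:

raw = b'caf\xe9'   # bytes in some unknown 8-bit encoding
s = raw.decode('ascii', errors='surrogateescape')  # no exception raised
# High-bit bytes become lone surrogates and encode back to the same bytes.
assert s.encode('ascii', errors='surrogateescape') == raw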

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith  wrote:

> That said, AFAICT what people actually want in most use cases is support
for arrays that can hold variable-length strings, and the only place where
the current approach is *optimal* is when we need mmap compatibility with
legacy formats that use fixed-width-nul-padded fields (at which point it's
super convenient). It's not even possible to *represent* all Python strings
or bytestrings in current numpy unicode or string arrays (Python
strings/bytestrings can have trailing nuls). So if we're talking about
tweaks to the current system it probably makes sense to focus on this use
case specifically.
>
> From context I'm assuming FITS files use fixed-width-nul-padding for
strings? Is that right? I know HDF5 doesn't.

Yes, HDF5 does. Or at least, it is supported in addition to the
variable-length ones.

https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith  wrote:
>
> On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern 
wrote:
> > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith  wrote:
> >
> >> That said, AFAICT what people actually want in most use cases is
support
> >> for arrays that can hold variable-length strings, and the only place
where
> >> the current approach is *optimal* is when we need mmap compatibility
with
> >> legacy formats that use fixed-width-nul-padded fields (at which point
it's
> >> super convenient). It's not even possible to *represent* all Python
strings
> >> or bytestrings in current numpy unicode or string arrays (Python
> >> strings/bytestrings can have trailing nuls). So if we're talking about
> >> tweaks to the current system it probably makes sense to focus on this
use
> >> case specifically.
> >>
> >> From context I'm assuming FITS files use fixed-width-nul-padding for
> >> strings? Is that right? I know HDF5 doesn't.
> >
> > Yes, HDF5 does. Or at least, it is supported in addition to the
> > variable-length ones.
> >
> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>
> Doh, I found that page but it was (and is) meaningless to me, so I
> went by http://docs.h5py.org/en/latest/strings.html, which says the
> options are fixed-width ascii, variable-length ascii, or
> variable-length utf-8 ... I guess it's just talking about what h5py
> currently supports.

It's okay, I made exactly the same mistake earlier in the thread. :-)

> But also, is it important whether strings we're loading/saving to an
> HDF5 file have the same in-memory representation in numpy as they
> would in the file? I *know* [1] no-one is reading HDF5 files using
> np.memmap :-). Is it important for some other reason?

The lack of such a dtype seems to be the reason why neither h5py nor
PyTables supports that kind of HDF5 Dataset. The variable-length Datasets
can take up a lot of disk-space because they can't be compressed (even
accounting for the wasted padding space). I mean, they probably could have
implemented it with objects arrays like h5py does with the variable-length
string Datasets, but they didn't.

https://github.com/PyTables/PyTables/issues/499
https://github.com/h5py/h5py/issues/624

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 9:01 AM, Chris Barker  wrote:

> Anyway, I think I made the mistake of mingling possible solutions in with
the use-cases, so I'm not sure if there is any consensus on the use cases
-- which I think we really do need to nail down first -- as Robert has made
clear.
>
> So I'll try again -- use-case only! we'll keep the possible solutions
separate.
>
> Do we need to write up a NEP for this? it seems we are going a bit in
circles, and we really do want to capture the final decision process.
>
> 1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character
oriented interface. i.e. if you do::

... etc.

These aren't use cases but rather requirements. I'm looking for something
rather more concrete than that.

* HDF5 supports fixed-length and variable-length string arrays encoded in
ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite
the documentation claiming that there are more options). In practice, the
ASCII strings permit high-bit characters, but the encoding is unspecified.
Memory-mapping is rare (but apparently possible). The two major HDF5
bindings are waiting for a fixed-length UTF-8 numpy dtype to support that
HDF5 option. Compression is supported for fixed-length string arrays but
not variable-length string arrays.

* FITS supports fixed-length string arrays that are NULL-padded. The
strings do not have a formal encoding, but in practice, they are typically
mostly ASCII characters with the occasional high-bit character from an
unspecified encoding. Memory-mapping is a common practice. These arrays can
be quite large even if each scalar is reasonably small.

* pandas uses object arrays for flexible in-memory handling of string
columns. Lengths are not fixed, and None is used as a marker for missing
data. String columns must be written to and read from a variety of formats,
including CSV, Excel, and HDF5, some of which are Unicode-aware and work
with `unicode/str` objects instead of `bytes`.

* There are a number of sometimes-poorly-documented,
often-poorly-adhered-to, aging file format "standards" that include string
arrays but do not specify encodings, or such specification is ignored in
practice. This can make the usual "Unicode sandwich" at the I/O boundaries
difficult to perform.

* In Python 3 environments, `unicode/str` objects are rather more common,
and simple operations like equality comparisons no longer work between
`bytes` and `unicode/str`, making it difficult to work with numpy string
arrays that yield `bytes` scalars.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 10:04 AM, Chris Barker 
wrote:
>
> On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI  wrote:
>>
>> 2017-04-25 12:34 GMT-04:00 Chris Barker :
>> > I am totally euro-centric,
>
>> But Shift-JIS is not one-byte; it's two-byte (unless you allow only
>> half-width characters and nothing else). :-)
>
> bad example then -- are there other non-euro-centric one byte per char
encodings worth worrying about? I have no clue :-)

I've run into Windows-1251 in files (seismic and well log data from Russian
wells). Treating them as latin-1 does not make for a happy time. Both
encodings also technically derive from ASCII in the lower half, but most of
the actual language is written with the high-bit characters.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
charlesr.har...@gmail.com> wrote:
>
> On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
peridot.face...@gmail.com> wrote:

>> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
other packages are waiting specifically for it. But specifying this
requires two pieces of information: What is the encoding? and How is the
length specified? I know they're not numpy-compatible, but FITS header
values are space-padded; does that occur elsewhere? Are there other ways
existing data specifies string length within a fixed-size field? There are
some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
PKCS7, etc. - but they are probably too specialized to need? We should make
sure we can support all the ways that actually occur.
>
>
> Agree with the UTF-8 fixed byte length strings, although I would tend
towards null terminated.

Just to clarify some terminology (because it wasn't originally clear to me
until I looked it up in reference to HDF5):

* "NULL-padded" implies that, for a fixed width of N, there can be up to N
non-NULL bytes. Any extra space left over is padded with NULLs, but no
space needs to be reserved for NULLs.

* "NULL-terminated" implies that, for a fixed width of N, there can be up
to N-1 non-NULL bytes. There must always be space reserved for the
terminating NULL.

I'm not really sure if "NULL-padded" also specifies the behavior for
embedded NULLs. It's certainly possible to deal with them: just strip
trailing NULLs and leave any embedded ones alone. But I'm also sure that
there are some implementations somewhere that interpret the requirement as
"stop at the first NULL or the end of the fixed width, whichever comes
first", effectively being NULL-terminated just not requiring the reserved
space.
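
A minimal sketch (plain Python, not a numpy API) of the two conventions for
a fixed width of n bytes:

def encode_null_padded(data, n):
    assert len(data) <= n          # up to n non-NULL bytes
    return data.ljust(n, b'\x00')  # leftover space is filled with NULs

def decode_null_padded(buf):
    return buf.rstrip(b'\x00')     # strip trailing NULs, keep embedded ones

def encode_null_terminated(data, n):
    assert len(data) <= n - 1      # one byte is reserved for the terminator
    return data.ljust(n, b'\x00')

def decode_null_terminated(buf):
    end = buf.find(b'\x00')        # data ends at the first NUL
    return buf if end < 0 else buf[:end]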

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 12:30 PM, Charles R Harris <
charlesr.har...@gmail.com> wrote:
>
> On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern 
wrote:
>>
>> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
charlesr.har...@gmail.com> wrote:
>> >
>> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
peridot.face...@gmail.com> wrote:
>>
>> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
other packages are waiting specifically for it. But specifying this
requires two pieces of information: What is the encoding? and How is the
length specified? I know they're not numpy-compatible, but FITS header
values are space-padded; does that occur elsewhere? Are there other ways
existing data specifies string length within a fixed-size field? There are
some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
PKCS7, etc. - but they are probably too specialized to need? We should make
sure we can support all the ways that actually occur.
>> >
>> > Agree with the UTF-8 fixed byte length strings, although I would tend
towards null terminated.
>>
>> Just to clarify some terminology (because it wasn't originally clear to
me until I looked it up in reference to HDF5):
>>
>> * "NULL-padded" implies that, for a fixed width of N, there can be up to
N non-NULL bytes. Any extra space left over is padded with NULLs, but no
space needs to be reserved for NULLs.
>>
>> * "NULL-terminated" implies that, for a fixed width of N, there can be
up to N-1 non-NULL bytes. There must always be space reserved for the
terminating NULL.
>>
>> I'm not really sure if "NULL-padded" also specifies the behavior for
embedded NULLs. It's certainly possible to deal with them: just strip
trailing NULLs and leave any embedded ones alone. But I'm also sure that
there are some implementations somewhere that interpret the requirement as
"stop at the first NULL or the end of the fixed width, whichever comes
first", effectively being NULL-terminated just not requiring the reserved
space.
>
> Thanks for the clarification. NULL-padded is what I meant.

Okay, however, the biggest use-case we have for UTF-8 arrays (HDF5) is
NULL-terminated.

> I'm wondering how much of the desired functionality we could get by
simply subclassing ndarray in python. I think we mostly want to be able to
view byte strings and convert to unicode if needed.

I'm not sure. Some of these fixed-width string arrays are embedded inside
structured arrays with other dtypes.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal <
chris.bar...@noaa.gov> wrote:

>> Presumably you're getting byte strings (with unknown encoding).
>
> No -- this is for creating and using mostly ascii string data with python
and numpy.
>
> Unknown encoding bytes belong in byte arrays -- they are not text.

You are welcome to try to convince Thomas of that. That is the status quo
for him, but he is finding that difficult to work with.

> I DO recommend Latin-1 As a default encoding ONLY for  "mostly ascii,
with a few extra characters" data. With all the sloppiness over the years,
there are way too many files like that.

That sloppiness that you mention is precisely the "unknown encoding"
problem. Your previous advocacy has also touched on using latin-1 to decode
existing files with unknown encodings as well. If you want to advocate for
using latin-1 only for the creation of new data, maybe stop talking about
existing files? :-)

> Note: the primary use-case I have in mind is working with ascii text in
numpy arrays efficiently-- folks have called for that. All I'm saying is
use Latin-1 instead of ascii -- that buys you some useful extra characters.

For that use case, the alternative in play isn't ASCII, it's UTF-8, which
buys you a whole bunch of useful extra characters. ;-)

There are several use cases being brought forth here. Some involve file
reading, some involve file writing, and some involve in-memory
manipulation. Whatever change we make is going to impinge somehow on all of
the use cases. If all we do is add a latin-1 dtype for people to use to
create new in-memory data, then someone is going to use it to read existing
data in unknown or ambiguous encodings.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris 
wrote:

> The maximum length of an UTF-8 character is 4 bytes, so we could use that
to size arrays by character length. The advantage over UTF-32 is that it is
easily compressible, probably by a factor of 4 in many cases. That doesn't
solve the in memory problem, but does have some advantages on disk as well
as making for easy display. We could compress it ourselves after encoding
by truncation.

The major use case that we have for a UTF-8 array is HDF5, and it specifies
the width in bytes, not Unicode characters.
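
A small illustration of the distinction in plain Python:

s = u"caf\u00e9"
print(len(s))                   # 4 characters (code points)
print(len(s.encode('utf-8')))   # 5 bytes; the accented character needs 2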

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

> Indeed,
> Most of this discussion is irrelevant to numpy.
> Numpy only really deals with the in memory storage of strings. And in
> that it is limited to fixed length strings (in bytes/codepoints).
> How you get your messy strings into numpy arrays is not very relevant to
> the discussion of a smaller representation of strings.
> You couldn't get messy strings into numpy without first sorting it out
> yourself before, you won't be able to afterwards.
> Numpy will offer a set of encodings, the user chooses which one is best
> for the use case and if the user screws it up, it is not numpy's problem.
>
> You currently only have a few ways to even construct string arrays:
> - array construction and loops
> - genfromtxt (which is again just a loop)
> - memory mapping which I seriously doubt anyone actually does for the S
> and U dtype

I fear that you decided that the discussion was irrelevant and thus did not
read it rather than reading it to decide that it was not relevant. Because
several of us have showed that, yes indeed, we do memory-map string arrays.

You can add to this list C APIs, like that of libhdf5, that need to
communicate (Unicode) string arrays.

Look, I know I can be tedious, but *please* go back and read this
discussion. We have concrete use cases outlined. We can give you more
details if you need them. We all feel the pain of the rushed, inadequate
implementation of the U dtype. But each of our pains is a little bit
different; you obviously aren't experiencing the same pains that I am.

> Having a new dtype changes nothing here. You still need to create numpy
> arrays from python strings which are well defined and clean.
> If you put something in that doesn't encode you get an encoding error.
> No oddities like surrogate escapes are needed, numpy arrays are not
> interfaces to operating systems nor does numpy need to _add_ support for
> historical oddities beyond what it already has.
> If you want to represent bytes exactly as they came in don't use a text
> dtype (which includes the S dtype, use i1).

Thomas Aldcroft has demonstrated the problem with this approach. numpy
arrays are often interfaces to files that have tons of historical oddities.

> Concerning variable sized strings, this is simply not going to happen.
> Nobody is going to rewrite numpy to support it, especially not just for
> something as unimportant as strings.
> Best you are going to get (or better already have) is object arrays. It
> makes no sense to discuss it unless someone comes up with an actual
> proposal and the willingness to code it.

No one has suggested such a thing. At most, we've talked about specializing
object arrays.

> What is a relevant discussion is whether we really need a more compact
> but limited representation of text than 4-byte utf32 at all.
> Its usecase is for the most part just for python3 porting and saving
> some memory in some ascii heavy cases, e.g. astronomy.
> It is not that significant anymore as porting to python3 has mostly
> already happened via the ugly byte workaround and memory saving is
> probably not as significant in the context of numpy which is already
> heavy on memory usage.
>
> My initial approach was to not add a new dtype but to make unicode
> parametrizable which would have meant almost no cluttering of numpy's
> internals and keeping the api more or less consistent which would make
> this a relatively simple addition of minor functionality for people that
> want it.
> But adding a completely new partially redundant dtype for this usecase
> may be a too large change to the api. Having two partially redundant
> string types may confuse users more than our current status quo of our
> single string type (U).
>
> Discussing whether we want to support truncated utf8 has some merit as
> it is a decision whether to give the users an even larger gun to shoot
> themselves in the foot with.
> But I'd like to focus first on the 1 byte type to add a symmetric API
> for python2 and python3.
> utf8 can always be added latter should we deem it a good idea.

What is your current proposal? A string dtype parameterized with the
encoding (initially supporting the latin-1 that you desire and maybe adding
utf-8 later)? Or a latin-1-specific dtype such that we will have to add a
second utf-8 dtype at a later date?

If you're not going to support arbitrary encodings right off the bat, I'd
actually suggest implementing UTF-8 and ASCII-surrogateescape first as they
seem to knock off more use cases straight away.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 3:27 AM, Anne Archibald 
wrote:
>
> On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer  wrote:
>>
>> On Tue, Apr 25, 2017 at 9:21 PM Robert Kern 
wrote:
>>>
>>> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <
charlesr.har...@gmail.com> wrote:
>>>
>>> > The maximum length of an UTF-8 character is 4 bytes, so we could use
that to size arrays by character length. The advantage over UTF-32 is that
it is easily compressible, probably by a factor of 4 in many cases. That
doesn't solve the in memory problem, but does have some advantages on disk
as well as making for easy display. We could compress it ourselves after
encoding by truncation.
>>>
>>> The major use case that we have for a UTF-8 array is HDF5, and it
specifies the width in bytes, not Unicode characters.
>>
>> It's not just HDF5. Counting bytes is the Right Way to measure the size
of UTF-8 encoded text:
>> http://utf8everywhere.org/#myths
>>
>> I also firmly believe (though clearly this is not universally agreed
upon) that UTF-8 is the Right Way to encode strings for *non-legacy*
applications. So if we're adding any new string encodings, it needs to be
one of them.
>
> It seems to me that most of the requirements people have expressed in
this thread would be satisfied by:
>
> (1) object arrays of strings. (We have these already; whether a
strings-only specialization would permit useful things like string-oriented
ufuncs is a question for someone who's willing to implement one.)
>
> (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data.
All python encodings should be permitted. An additional function to
truncate encoded data without mangling the encoding would be handy. I think
it makes more sense for this to be NULL-padded than NULL-terminated but it
may be necessary to support both; note that NULL-termination is complicated
for encodings like UCS4. This also includes the legacy UCS4 strings as a
special case.
>
> (3) a dtype for fixed-length byte strings. This doesn't look very
different from an array of dtype u8, but given we have the bytes type,
accessing the data this way makes sense.

The void dtype is already there for this general purpose and mostly works,
with a few niggles. On Python 3, it uses 'int8' ndarrays underneath the
scalars (fortunately, they do not appear to be mutable views). It also
accepts `bytes` strings that are too short (pads with NULs) and too long
(truncates). If it worked more transparently and perhaps rigorously with
`bytes`, then it would be quite suitable.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:
>
> On 26.04.2017 19:08, Robert Kern wrote:
> > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor
> > mailto:jtaylor.deb...@googlemail.com>>
> > wrote:
> >
> >> Indeed,
> >> Most of this discussion is irrelevant to numpy.
> >> Numpy only really deals with the in memory storage of strings. And in
> >> that it is limited to fixed length strings (in bytes/codepoints).
> >> How you get your messy strings into numpy arrays is not very relevant
to
> >> the discussion of a smaller representation of strings.
> >> You couldn't get messy strings into numpy without first sorting it out
> >> yourself before, you won't be able to afterwards.
> >> Numpy will offer a set of encodings, the user chooses which one is best
> >> for the use case and if the user screws it up, it is not numpy's
problem.
> >>
> >> You currently only have a few ways to even construct string arrays:
> >> - array construction and loops
> >> - genfromtxt (which is again just a loop)
> >> - memory mapping which I seriously doubt anyone actually does for the S
> >> and U dtype
> >
> > I fear that you decided that the discussion was irrelevant and thus did
> > not read it rather than reading it to decide that it was not relevant.
> > Because several of us have showed that, yes indeed, we do memory-map
> > string arrays.
> >
> > You can add to this list C APIs, like that of libhdf5, that need to
> > communicate (Unicode) string arrays.
> >
> > Look, I know I can be tedious, but *please* go back and read this
> > discussion. We have concrete use cases outlined. We can give you more
> > details if you need them. We all feel the pain of the rushed, inadequate
> > implementation of the U dtype. But each of our pains is a little bit
> > different; you obviously aren't experiencing the same pains that I am.
>
> I have read every mail and it has been a large waste of time. Everything
> has been said already many times in the last few years.
> Even if you memory map string arrays, of which I have not seen a
> concrete use case in the mails beyond "would be nice to have" without
> any backing in actual code, but I may have missed it.

Yes, we have stated that FITS files with string arrays are currently being
read via memory mapping.

  http://docs.astropy.org/en/stable/io/fits/index.html

You were even pointed to a minor HDF5 implementation that memory maps:


https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683

I'm afraid that I can't share the actual code of the full variety of
proprietary file formats that I've written code for, I can assure you that
I have memory mapped many string arrays in my time, usually embedded as
columns in structured arrays. It is not "nice to have"; it is "have done
many times and needs better support".

> In any case it is still irrelevant. My proposal only _adds_ additional
> cases that can be mmapped. It does not prevent you from doing what you
> have been doing before.

You are the one who keeps worrying about the additional complexity, both in
code and mental capacity of our users, of adding new overlapping dtypes and
solutions, and you're not wrong about that. I think it behooves us to
consider if there are solutions that solve multiple related problems at
once instead of adding new dtypes piecemeal to solve individual problems.

> >> Having a new dtype changes nothing here. You still need to create numpy
> >> arrays from python strings which are well defined and clean.
> >> If you put something in that doesn't encode you get an encoding error.
> >> No oddities like surrogate escapes are needed, numpy arrays are not
> >> interfaces to operating systems nor does numpy need to _add_ support
for
> >> historical oddities beyond what it already has.
> >> If you want to represent bytes exactly as they came in don't use a text
> >> dtype (which includes the S dtype, use i1).
> >
> > Thomas Aldcroft has demonstrated the problem with this approach. numpy
> > arrays are often interfaces to files that have tons of historical
oddities.
>
> This does not matter for numpy, the text dtype is well defined as bytes
> with a specific encoding and null padding.

You cannot dismiss something as "not mattering for *numpy*" just because
your new, *proposed* text dtype doesn't support it.

You seem to have fixed on a course of action and are defining everyone
else's use cases as out-of-scope because your course of action doesn't
support them. That

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg 
wrote:

> I remember talking with a colleague about something like that. And
> basically an annoying thing there was that if you strip the zero bytes
> in a zero padded string, some encodings (UTF16) may need one of the
> zero bytes to work right. (I think she got around it, by weird
> trickery, inverting the endianess or so and thus putting the zero bytes
> first).
> Maybe will ask her if this discussion is interesting to her. Though I
> think it might have been something like "make everything in
> hdf5/something similar work" without any actual use case, I don't know.

I don't think that will be an issue for an encoding-parameterized dtype.
The decoding machinery of that would have access to the full-width buffer
for the item, and the encoding knows what its atomic unit is (e.g. 2 bytes
for UTF-16). It's only if you have to hack around at a higher level with
numpy's S arrays, which return Python byte strings that strip off the
trailing NULL bytes, that you have to worry about such things. Getting a
Python scalar from the numpy S array loses information in such cases.
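
For example, a minimal sketch of that information loss at the scalar
boundary:

import numpy as np

raw = u"AB".encode('utf-16-le')    # b'A\x00B\x00'
arr = np.array([raw], dtype='S4')  # all four bytes are stored in the array
item = arr[0]                      # b'A\x00B': the trailing NUL is stripped
# The scalar now has an odd number of bytes and no longer decodes as UTF-16.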

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith  wrote:
>
> On Apr 26, 2017 12:09 PM, "Robert Kern"  wrote:

>> It's worthwhile enough that both major HDF5 bindings don't support
Unicode arrays, despite user requests for years. The sticking point seems
to be the difference between HDF5's view of a Unicode string array (defined
in size by the bytes of UTF-8 data) and numpy's current view of a Unicode
string array (because of UCS-4, defined by the number of
characters/codepoints/whatever). So there are HDF5 files out there that
none of our HDF5 bindings can read, and it is impossible to write certain
data efficiently.
>
> I would really like to hear more from the authors of these libraries
about what exactly it is they feel they're missing. Is it that they want
numpy to enforce the length limit early, to catch errors when the array is
modified instead of when they go to write it to the file? Is it that they
really want an O(1) way to look at a array and know the maximum number of
bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion
is really annoying and files that need it are rare so they haven't had the
motivation to implement it?

https://github.com/PyTables/PyTables/issues/499
https://github.com/h5py/h5py/issues/379

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 5:02 PM, Chris Barker  wrote:

> But a bunch of folks have brought up that while we're messing around with
string encoding, let's solve another problem:
>
> * Exchanging unicode text at the binary level with other systems that
generally don't use UCS-4.
>
> For THAT -- utf-8 is critical.
>
> But if I understand Julian's proposal -- he wants to create a
parameterized text dtype that you can set the encoding on, and then numpy
will use the encoding (and python's machinery) to encode / decode when
passing to/from python strings.
>
> It seems this would support all our desires:
>
> I'd get a latin-1 encoded type for compact representation of mostly-ascii
data.
>
> Thomas would get latin-1 for binary interchange with mostly-ascii data
>
> The HDF-5 folks would get utf-8 for binary interchange (If we can workout
the null-padding issue)
>
> Even folks that had weird JAVA or Windows-generated UTF-16 data files
could do the binary interchange thing
>
> I'm now lost as to what the hang-up is.

The proposal is for only latin-1 and UTF-32 to be supported at first, and
the eventual support of UTF-8 will be constrained by specification of the
width in terms of characters rather than bytes, which conflicts with the
use cases of UTF-8 that have been brought forth.

  https://mail.python.org/pipermail/numpy-discussion/2017-April/076668.html

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Compare NumPy arrays with threshold and return the differences

2017-05-17 Thread Robert Kern
On Wed, May 17, 2017 at 9:50 AM, Nissim Derdiger 
wrote:

> Hi,
> In my script, I need to compare big NumPy arrays (2D or 3D), and return a
> list of all cells with difference bigger than a defined threshold.
> The compare itself can be done easily done with "allclose" function, like
> that:
> Threshold = 0.1
> if (np.allclose(Arr1, Arr2, Threshold, equal_nan=True)):
> Print('Same')
> But this compare does not return *which* cells are not the same.
>
> The easiest (yet naive) way to know which cells are not the same is to use
> a simple for loops code like this one:
> def CheckWhichCellsAreNotEqualInArrays(Arr1, Arr2, Threshold):
>     if not Arr1.shape == Arr2.shape:
>         return ['Arrays size not the same']
>     Dimensions = Arr1.shape
>     Diff = []
>     for i in range(Dimensions[0]):
>         for j in range(Dimensions[1]):
>             if not np.allclose(Arr1[i][j], Arr2[i][j], Threshold,
>                                equal_nan=True):
>                 Diff.append(',' + str(i) + ',' + str(j) + ',' + str(Arr1[i,j])
>                             + ',' + str(Arr2[i,j]) + ',' + str(Threshold)
>                             + ',Fail\n')
>     return Diff
> (and same for 3D arrays - with 1 more for loop)
> This way is very slow when the Arrays are big and full of none-equal cells.
>
> Is there a fast straight forward way in case they are not the same - to
> get a list of the uneven cells? maybe some built-in function in the NumPy
> itself?
>

Use `close_mask = np.isclose(Arr1, Arr2, Threshold, equal_nan=True)` to
return a boolean mask the same shape as the arrays which is True where the
elements are close and False where they are not. You can invert it to get a
boolean mask which is True where they are "far" with respect to the
threshold: `far_mask = ~close_mask`. Then you can use `i_idx, j_idx =
np.nonzero(far_mask)` to get arrays of the `i` and `j` indices where the
values are far. For example:

for i, j in zip(i_idx, j_idx):
print("{0}, {1}, {2}, {3}, {4}, Fail".format(i, j, Arr1[i, j], Arr2[i,
j], Threshold))

-- 
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Compare NumPy arrays with threshold

2017-05-18 Thread Robert Kern
On Thu, May 18, 2017 at 5:07 AM, Nissim Derdiger 
wrote:
>
> Hi again,
> Thanks for the responses to my question!
> Roberts answer worked very well for me, except for 1 small issue:
>
> This line:
> close_mask = np.isclose(MatA, MatB, Threshold, equal_nan=True)
> returns each difference twice – once j in compare to I and once for I in
compare to j

No, it returns a boolean array the same size as MatA and MatB. It literally
can't contain "each difference twice". Maybe there is something else in
your code that is producing the doubling that you see, possibly in the
printing of the results.

I'm not seeing the behavior that you speak of. Please post your complete
code that produced the doubled output that you see.

import numpy as np

MatA = np.array([[10,20,30],[40,50,60]])
MatB = np.array([[10,30,30],[40,50,160]])
Threshold = 1.0

# Note the `atol=` here. I missed it before.
close_mask = np.isclose(MatA, MatB, atol=Threshold, equal_nan=True)
far_mask = ~close_mask
i_idx, j_idx = np.nonzero(far_mask)
for i, j in zip(i_idx, j_idx):
print("{0}, {1}, {2}, {3}, {4}, Fail".format(i, j, MatA[i, j], MatB[i,
j], Threshold))


I get the following output:

$ python isclose.py
0, 1, 20, 30, 1.0, Fail
1, 2, 60, 160, 1.0, Fail

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: SfePy 2017.2

2017-05-19 Thread Robert Cimrman

I am pleased to announce release 2017.2 of SfePy.

Description
---

SfePy (simple finite elements in Python) is software for solving systems of
coupled partial differential equations by the finite element method or by the
isogeometric analysis (limited support). It is distributed under the new BSD
license.

Home page: http://sfepy.org
Mailing list: https://mail.python.org/mm3/mailman3/lists/sfepy.python.org/
Git (source) repository, issue tracker: https://github.com/sfepy/sfepy

Highlights of this release
--

- simplified and unified implementation of some homogenized coefficients
- support for saving custom structured data to HDF5 files
- new tutorial on preparing meshes using FreeCAD/OpenSCAD and Gmsh

For full release notes see http://docs.sfepy.org/doc/release_notes.html#id1
(rather long and technical).

Cheers,
Robert Cimrman

---

Contributors to this release in alphabetical order:

Robert Cimrman
Jan Heczko
Lubos Kejzlar
Vladimir Lukes
Matyas Novak
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] why a[0][0].__mul__(a[0][0]) where a is np.array, gives 'missing 1 required positional argument'?

2017-06-27 Thread Robert Kern
On Tue, Jun 27, 2017 at 2:54 AM, Larry Evans 
wrote:

> a=
> [[ 12.0]
>  [12.0 <0>]]
> a[0]=
> [ 12.0]
> a[0][0]=
> 
> Traceback (most recent call last):
>   File "shortestPathABC.py", line 123, in 
> a00mul=a[0][0].__mul__(a[0][0])
> TypeError: __mul__() missing 1 required positional argument: 'other'
> Makefile:7: recipe for target 'all' failed
> make: *** [all] Error 1
> --}--cut here--
>
> I don't understand why.  Apparently, a[0][0] is not a ShortestNull
because otherwise, the .__mul__ would have the positional argument,
> 'other' equal to a[0][0].
>
> What am I missing?

> a=np.array([[ShortestNull,ShortestPath(12)],[ShortestPath(
12),ShortestNull()]],dtype=object)

You didn't instantiate ShortestNull but passed the class object instead.
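
For illustration, here is a self-contained sketch of the fix. The class
bodies below are just placeholder stand-ins, since the real definitions
aren't shown in your message:

import numpy as np

class ShortestPath:
    # placeholder stand-in for the class from the question
    def __init__(self, length):
        self.length = length
    def __mul__(self, other):
        return ShortestPath(self.length + getattr(other, "length", 0))

class ShortestNull:
    # placeholder stand-in for the class from the question
    def __mul__(self, other):
        return other

# Instantiate ShortestNull() -- passing the bare class was the problem.
a = np.array([[ShortestNull(), ShortestPath(12)],
              [ShortestPath(12), ShortestNull()]], dtype=object)
a00mul = a[0][0].__mul__(a[0][0])  # the bound method now receives `other`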

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Boolean binary '-' operator

2017-06-27 Thread Robert Kern
On Tue, Jun 27, 2017 at 3:01 PM, Benjamin Root  wrote:
>
> Forgive my ignorance, but what is "Z/2"?

https://groupprops.subwiki.org/wiki/Cyclic_group:Z2
https://en.wikipedia.org/wiki/Cyclic_group

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposed changes to array printing in 1.14

2017-06-30 Thread Robert Kern
On Fri, Jun 30, 2017 at 4:47 PM, Gael Varoquaux <
gael.varoqu...@normalesup.org> wrote:
>
> One problem is that it becomes hard (impossible?) for downstream packages
> such as scikit-learn to doctest under multiple versions of the numpy.
> Past experience has shown that it could be useful.

It's not that hard: wrap the new `set_printoptions(pad=True)` in a `try:`
block to catch the error under old versions.
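
A minimal sketch of what I mean (`pad=True` here is just the option name
being discussed in this thread, not necessarily what a released numpy will
call it):

import numpy as np

try:
    np.set_printoptions(pad=True)
except TypeError:
    # Older numpy that doesn't know the new option: fall back to the
    # default formatting.
    pass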

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposed changes to array printing in 1.14

2017-06-30 Thread Robert Kern
On Fri, Jun 30, 2017 at 7:23 PM, Juan Nunez-Iglesias 
wrote:

> I do have sympathy for Ralf’s argument that "exact repr's are not part of
the NumPy (or Python for that matter) backwards compatibility guarantees”.
But it is such a foundational project in Scientific Python that I think
extreme care is warranted, beyond any official guarantees. (Hence this
thread, yes. Thank you!)

I would also like to make another distinction here: I don't think anyone's
actual *code* has broken because of this change. To my knowledge, it is
only downstream projects' *doctests* that break. This might deserve *some*
care on our part (beyond notification and keeping it out of a 1.x.y bugfix
release), but "extreme care" is just not warranted.

> Anyway, all this is (mostly) moot if the next NumPy ships with this
doctest++ thingy. That would be an enormously valuable contribution to the
whole ecosystem.

I'd recommend just making an independent project on Github and posting it
as its own project to PyPI when you think it's ready. We'll link to it in
our documentation. I don't think that it ought to be part of numpy and
stuck on our release cadence.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-05 Thread Robert Kern
On Wed, Jul 5, 2017 at 5:41 AM,  wrote:
>
> Dear all
>
> I’m sorry if my question is too basic (not fully in relation to Numpy –
while it is to build matrices and to work with Numpy afterward), but I’m
spending a lot of time and effort to find a way to record data from an asci
while, and reassign it into a matrix/array … with unsuccessfully!
>
> The only way I found is to use ‘append()’ instruction involving dynamic
memory allocation. :-(

Are you talking about appending to Python list objects? Or the np.append()
function on numpy arrays?

In my experience, it is usually fine to build a list with the `.append()`
method while reading the file of unknown size and then converting it to an
array afterwards, even for dozens of millions of lines. The list object is
quite smart about reallocating memory so it is not that expensive. You
should generally avoid the np.append() function, though; it is not smart.
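
A minimal sketch of that idiom (the file name here is just a placeholder):

import numpy as np

values = []
with open("data.txt", "r") as f:
    for line in f:
        values.append(float(line))
arr = np.array(values)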

> From my current experience under Scilab (a like Matlab scientific
solver), it is well know:
>
> Step 1 : matrix initialization like ‘np.zeros(n,n)’
> Step 2 : record the data
> and write it in the matrix (step 3)
>
> I’m obviously influenced by my current experience, but I’m interested in
moving to Python and its packages
>
> For huge asci files (involving dozens of millions of lines), my strategy
is to work by ‘blocks’ as :
>
> Find the line index of the beginning and the end of one block (this
implies that the file is read ounce)
> Read the block
> (process repeated on the different other blocks)

Are the blocks intrinsic parts of the file? Or are you just trying to break
up the file into fixed-size chunks?

> I tried different codes such as bellow, but each time Python is telling
me I cannot mix iteration and record method
>
> #
>
> position = []; j=0
> with open(PATH + file_name, "r") as rough_ data:
> for line in rough_ data:
> if my_criteria in line:
> position.append(j) ## huge blocs but limited in number
> j=j+1
>
> i = 0
> blockdata = np.zeros( (size_block), dtype=np.float)
> with open(PATH + file_name, "r") as f:
>  for line in itertools.islice(f,1,size_block):
>  blockdata [i]=float(f.readline() )

For what it's worth, this is the line that is causing the error that you
describe. When you iterate over the file with the `for line in
itertools.islice(f, ...):` loop, you already have the line text. You don't
(and can't) call `f.readline()` to get it again. It would mess up the
iteration if you did and cause you to skip lines.
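
A corrected sketch of that loop, reusing the names from your snippet
(PATH, file_name and size_block are assumed to be defined as in your code):

import itertools
import numpy as np

blockdata = np.zeros(size_block, dtype=float)
with open(PATH + file_name, "r") as f:
    # Use `line` directly; don't call f.readline() inside the loop.
    for i, line in enumerate(itertools.islice(f, 1, size_block)):
        blockdata[i] = float(line)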

By the way, it is useful to help us help you if you copy-paste the exact
code that you are running as well as the full traceback instead of
paraphrasing the error message.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-05 Thread Robert McLeod
While I'm going to bet that the fastest way to build an ndarray from ascii
is with an `io.BytesIO` stream, NumPy does have a function to load from text,
`numpy.loadtxt` that works well enough for most purposes.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html

It's hard to tell from the original post if the ascii is being continuously
generated or not.  If it's being produced in an on-going fashion then a
stream object is definitely the way to go, as the array chunks can be
produced by `numpy.frombuffer()`.

https://docs.python.org/3/library/io.html

https://docs.scipy.org/doc/numpy/reference/generated/numpy.frombuffer.html

Robert


On Wed, Jul 5, 2017 at 3:21 PM, Robert Kern  wrote:

> On Wed, Jul 5, 2017 at 5:41 AM,  wrote:
> >
> > Dear all
> >
> > I’m sorry if my question is too basic (not fully in relation to Numpy –
> while it is to build matrices and to work with Numpy afterward), but I’m
> spending a lot of time and effort to find a way to record data from an asci
> while, and reassign it into a matrix/array … with unsuccessfully!
> >
> > The only way I found is to use ‘append()’ instruction involving dynamic
> memory allocation. :-(
>
> Are you talking about appending to Python list objects? Or the np.append()
> function on numpy arrays?
>
> In my experience, it is usually fine to build a list with the `.append()`
> method while reading the file of unknown size and then converting it to an
> array afterwards, even for dozens of millions of lines. The list object is
> quite smart about reallocating memory so it is not that expensive. You
> should generally avoid the np.append() function, though; it is not smart.
>
> > From my current experience under Scilab (a like Matlab scientific
> solver), it is well know:
> >
> > Step 1 : matrix initialization like ‘np.zeros(n,n)’
> > Step 2 : record the data
> > and write it in the matrix (step 3)
> >
> > I’m obviously influenced by my current experience, but I’m interested in
> moving to Python and its packages
> >
> > For huge asci files (involving dozens of millions of lines), my strategy
> is to work by ‘blocks’ as :
> >
> > Find the line index of the beginning and the end of one block (this
> implies that the file is read ounce)
> > Read the block
> > (process repeated on the different other blocks)
>
> Are the blocks intrinsic parts of the file? Or are you just trying to
> break up the file into fixed-size chunks?
>
> > I tried different codes such as bellow, but each time Python is telling
> me I cannot mix iteration and record method
> >
> > #
> >
> > position = []; j=0
> > with open(PATH + file_name, "r") as rough_ data:
> > for line in rough_ data:
> > if my_criteria in line:
> > position.append(j) ## huge blocs but limited in
> number
> > j=j+1
> >
> > i = 0
> > blockdata = np.zeros( (size_block), dtype=np.float)
> > with open(PATH + file_name, "r") as f:
> >  for line in itertools.islice(f,1,size_block):
> >  blockdata [i]=float(f.readline() )
>
> For what it's worth, this is the line that is causing the error that you
> describe. When you iterate over the file with the `for line in
> itertools.islice(f, ...):` loop, you already have the line text. You don't
> (and can't) call `f.readline()` to get it again. It would mess up the
> iteration if you did and cause you to skip lines.
>
> By the way, it is useful to help us help you if you copy-paste the exact
> code that you are running as well as the full traceback instead of
> paraphrasing the error message.
>
> --
> Robert Kern
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 
Robert McLeod, Ph.D.
robert.mcl...@unibas.ch
robert.mcl...@bsse.ethz.ch 
robbmcl...@gmail.com
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread Robert Kern
On Thu, Jul 6, 2017 at 1:49 AM,  wrote:
>
> Dear All
>
> First of all thanks for the answers and the information’s (I’ll ding into
it) and let me trying to add comments on what I want to :
>
> My asci file mainly contains data (float and int) in a single column
> (it is not always the case but I can easily manage it – as well I saw I
can use ‘spli’ instruction if necessary)
> Comments/texts indicates the beginning of a bloc immediately followed by
the number of sub-blocs
> So I need to read/record all the values in order to build a matrix before
working on it (using Numpy & vectorization)
>
> The columns 2 and 3 have been added for further treatments
> The ‘0’ values will be specifically treated afterward
>
>
> Numpy won’t be a problem I guess (I did some basic tests and I’m quite
confident) on how to proceed, but I’m really blocked on data records … I
trying to find a way to efficiently read and record data in a matrix:
>
> avoiding dynamic memory allocation (here using ‘append’ in python
meaning, not np),

Although you can avoid some list appending in your case (because the blocks
self-describe their length), I would caution you against prematurely
avoiding it. It's often the most natural way to write the code in Python,
so go ahead and write it that way first. Once you get it working correctly,
but it's too slow or memory intensive, then you can puzzle over how to
preallocate the numpy arrays later. But quite often, it's fine. In this
case, the reading and handling of the text data itself is probably the
bottleneck, not appending to the lists. As I said, Python lists are
cleverly implemented to make appending fast. Accumulating numbers in a list
then converting to an array afterwards is a well-accepted numpy idiom.

> dealing with huge asci file: the latest file I get contains more than 60
million of lines
>
> Please find in attachment an extract of the input format
(‘example_of_input’), and the matrix I’m trying to create and manage with
Numpy
>
> Thanks again for your time

Try something like the attached. The function will return a list of blocks.
Each block will itself be a list of numpy arrays, which are the sub-blocks
themselves. I didn't bother adding the first three columns to the
sub-blocks or trying to assemble them all into a uniform-width matrix by
padding with trailing 0s. Since you say that the trailing 0s are going to
be "specially treated afterwards", I suspect that you can more easily work
with the lists of arrays instead. I assume floating-point data rather than
trying to figure out whether int or float from the data. The code can
handle multiple data values on one line (not especially well-tested, but it
ought to work), but it assumes that the number of sub-blocks, index of the
sub-block, and sub-block size are each on their own line. The code gets a
little more complicated if that's not the case.

--
Robert Kern
from __future__ import print_function

import numpy as np


def write_random_file(filename, n_blocks=42, n_elems=60*1000*1000):
    q, r = divmod(n_elems, n_blocks)
    block_lengths = [q] * n_blocks
    block_lengths[-1] += r
    with open(filename, 'w') as f:
        print('##BEGIN', file=f)
        print(n_blocks, file=f)
        for i, block_length in enumerate(block_lengths, 1):
            print(i, file=f)
            print(block_length, file=f)
            block = np.random.randint(0, 1000, size=block_length)
            for x in block:
                print(x, file=f)


def read_blocked_file(filename):
    blocks = []
    with open(filename, 'r') as f:
        # Loop over all blocks.
        while True:
            # Consume lines until the start of the next block.
            # Unfortunately, we can't use `for line in f:` because we need to
            # use `f.readline()` later.
            line = f.readline()
            found_block = True
            while '##BEGIN' not in line:
                line = f.readline()
                if line == '':
                    # We've reached the end of the file.
                    found_block = False
                    break
            if not found_block:
                # We iterated to the end of the file. Break out of the `while`
                # loop.
                break

            # Read the number of sub-blocks.
            # This assumes that it is on a line all by itself.
            n_subblocks = int(f.readline())
            subblocks = []
            for i_subblock in range(1, n_subblocks + 1):
                read_i_subblock = int(f.readline())
                # These ought to match.
                if read_i_subblock != i_subblock:
                    raise RuntimeError("Mismatched sub-block index")
                # Read the size of the sub-block.
                subblock_size = int(f.readline())
                # Allocate an array for the sub-block and fill it; there may
                # be more than one value on a line.
                subblock = np.empty(subblock_size, dtype=float)
                n_read = 0
                while n_read < subblock_size:
                    values = [float(tok) for tok in f.readline().split()]
                    subblock[n_read:n_read + len(values)] = values
                    n_read += len(values)
                subblocks.append(subblock)
            blocks.append(subblocks)
    return blocks

Re: [Numpy-discussion] record data previous to Numpy use

2017-07-06 Thread Robert Kern
On Thu, Jul 6, 2017 at 3:19 AM,  wrote:
>
> Thanks Rober for your effort - I'll have a look on it
>
> ...  the goal is be guide in how to proceed (and to understand), and not
to have a "ready-made solution" ... but I appreciate honnestly :-)

Sometimes it's easier to just write the code than to try to explain in
prose what to do. :-)

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] How to compare an array of arrays elementwise to None in Numpy 1.13 (was easy before)?

2017-07-17 Thread Robert Kern
On Mon, Jul 17, 2017 at 2:13 AM,  wrote:
>
> Dear all
>
> I have object array of arrays, which I compare element-wise to None in
various places:
>
> >>> a =
numpy.array([numpy.arange(5),None,numpy.nan,numpy.arange(6),None],dtype=numpy.object)
> >>> a
> array([array([0, 1, 2, 3, 4]), None, nan, array([0, 1, 2, 3, 4, 5]),
None], dtype=object)
> >>> numpy.equal(a,None)
> FutureWarning: comparison to `None` will result in an elementwise object
comparison in the future.
>
>
> So far, I always ignored the warning, for lack of an idea how to resolve
it.
>
> Now, with Numpy 1.13, I have to resolve the issue, because it fails with:
>
> ValueError: The truth value of an array with more than one element is
ambiguous. Use a.any() or a.all()
>
> It seem that the numpy.equal is applied to each inner array, returning a
Boolean array for each element, which cannot be coerced to a single Boolean.
>
> The expression
>
> >>> numpy.vectorize(operator.is_)(a,None)
>
> gives the desired result, but feels a bit clumsy.

Wrap the clumsiness up in a documented, tested utility function with a
descriptive name and use that function everywhere instead.
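
For example, a minimal sketch of such a utility (the name is arbitrary),
built on the np.vectorize expression you already have:

import operator
import numpy as np

def elementwise_is(arr, obj):
    """Return a boolean mask that is True where the element *is* `obj`."""
    return np.vectorize(operator.is_, otypes=[bool])(arr, obj)

a = np.array([np.arange(5), None, np.nan, np.arange(6), None],
             dtype=object)
elementwise_is(a, None)  # -> array([False,  True, False, False,  True])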

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] How to compare an array of arrays elementwise to None in Numpy 1.13 (was easy before)?

2017-07-17 Thread Robert Kern
On Mon, Jul 17, 2017 at 10:52 AM, Eric Wieser 
wrote:

> Here’s a hack that lets you keep using ==:
>
> class IsCompare:
>     __array_priority__ = 99  # needed to make it work on either side of `==`
>     def __init__(self, val): self._val = val
>     def __eq__(self, other): return other is self._val
>     def __ne__(self, other): return other is not self._val
>
> a == IsCompare(None)  # a is None
> a == np.array(IsCompare(None)) # broadcasted a is None
>
>
Frankly, I'd stick with a well-named utility function. It's much more kind
to those who have to read the code (e.g. you in 6 months). :-)

-- 
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] power function distribution or power-law distribution?

2017-08-24 Thread Robert Kern
On Thu, Aug 24, 2017 at 7:56 AM, Renato Fabbri 
wrote:
>
> On Thu, Aug 24, 2017 at 11:47 AM, Nathan Goldbaum 
wrote:
>>
>> The latest version of numpy is 1.13.
>>
>> In this case, as described in the docs, a power function distribution is
one with a probability desnity function of the form ax^(a-1) for x between
0 and 1.
>
> ok, let's try ourselves to relate the terms.
> Would you agree that the "power function distribution" is a "power-law
distribution"
> in which the domain is restricted to be [0,1]?

I probably wouldn't. The coincidental similarity in functional form (domain
and normalizing constants notwithstanding) obscures the very different
mechanisms each represents.

The ambiguous name of the method `power` instead of `power_function` is my
fault. You have my apologies.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] selecting all 2-d slices out of n-dimensional array

2017-08-29 Thread Robert Kern
On Tue, Aug 29, 2017 at 6:03 PM, Moroney, Catherine M (398E) <
catherine.m.moro...@jpl.nasa.gov> wrote:

> Hello,
>
>
>
> I have an n-dimensional array (say (4,4,2,2)) and I wish to automatically
> extract all the (4,4) slices in it.
>
> i.e.
>
>
>
> a = numpy.arange(0, 64).reshape(4,4,2,2)
>
> slice1 = a[..., 0, 0]
>
> slice2 = a[..., 0, 1]
>
> slice3 = a[..., 1, 0]
>
> slice4 = a[..., 1,1]
>
>
>
> Simple enough example but in my case array “a” will have unknown rank and
> size.  All I know is that it will have more than 2 dimensions, but I don’t
> know ahead of time how many dimensions or what the size of those dimensions
> are.
>
>
>
> What is the best way of tackling this problem without writing a whole
> bunch of if-then cases depending on what the rank and shape of a is?  Is
> there a one-size-fits-all solution?
>

First, reshape the array to (4, 4, -1). The -1 tells the method to choose
whatever's needed to get the size to work out. Then roll the last axis to
the front, and then you have a sequence of the (4, 4) arrays that you
wanted.

E.g. (using (4,4,3,3) as the original shape for clarity)

[~]
|26> a = numpy.arange(0, 4*4*3*3).reshape(4,4,3,3)

[~]
|27> b = a.reshape([4, 4, -1])

[~]
|28> b.shape
(4, 4, 9)

[~]
|29> c = np.rollaxis(b, -1, 0)

[~]
|30> c.shape
(9, 4, 4)

[~]
|31> c[0]
array([[  0,   9,  18,  27],
   [ 36,  45,  54,  63],
   [ 72,  81,  90,  99],
   [108, 117, 126, 135]])

[~]
|32> c[1]
array([[  1,  10,  19,  28],
   [ 37,  46,  55,  64],
   [ 73,  82,  91, 100],
   [109, 118, 127, 136]])

-- 
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Only integer scalar arrays can be converted to a scalar index

2017-09-15 Thread Robert McLeod
On Fri, Sep 15, 2017 at 2:37 PM, Elliot Hallmark 
wrote:

> Nope. Numpy only works on in memory arrays. You can determine your own
> chunking strategy using hdf5, or something like dask can figure that
> strategy out for you. With numpy you might worry about not accidentally
> making duplicates or intermediate arrays, but that's the extent of memory
> optimization you can do in numpy itself.
>

NumPy does have its own memory map variant on ndarray:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html



-- 
Robert McLeod, Ph.D.
robbmcl...@gmail.com
robbmcl...@protonmail.com
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Only integer scalar arrays can be converted to a scalar index

2017-09-15 Thread Robert Kern
On Sat, Sep 16, 2017 at 7:16 AM, Chris Barker - NOAA Federal <
chris.bar...@noaa.gov> wrote:
>
> No thoughts on optimizing memory, but that indexing error probably comes
from np.mean producing float results. An astype call should make that work.

Why? It's not being used as an index. It's being assigned into a float
array.

Rather, it's the slicing inside of `trace_block()` when it's being given
arrays as inputs for `x` and `y`. numpy simply doesn't support that because
in general the result wouldn't have a uniform shape.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: SfePy 2017.3

2017-09-19 Thread Robert Cimrman

I am pleased to announce release 2017.3 of SfePy.

Description
---

SfePy (simple finite elements in Python) is software for solving systems of
coupled partial differential equations by the finite element method or by the
isogeometric analysis (limited support). It is distributed under the new BSD
license.

Home page: http://sfepy.org
Mailing list: https://mail.python.org/mm3/mailman3/lists/sfepy.python.org/
Git (source) repository, issue tracker: https://github.com/sfepy/sfepy

Highlights of this release
--

- support preconditioning in SciPy and PyAMG based linear solvers
- user-defined preconditioners for PETSc linear solvers
- parallel multiscale (macro-micro) homogenization-based computations
- improved tutorial and installation instructions

For full release notes see http://docs.sfepy.org/doc/release_notes.html#id1
(rather long and technical).

Cheers,
Robert Cimrman

---

Contributors to this release in alphabetical order:

Robert Cimrman
Lubos Kejzlar
Vladimir Lukes
Matyas Novak
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] MATLAB to Numpy

2017-10-21 Thread Robert Kern
On Sat, Oct 21, 2017 at 10:45 AM, Andrei Berceanu 
wrote:
>
> Hi,
>
> I am new to Numpy, and would like to start by translating a (badly
written?) piece of MATLAB code.
> What I have come up with so far is this:
>
> px = np.zeros_like(tmp_px); py = np.zeros_like(tmp_py); pz =
np.zeros_like(tmp_pz)
> w = np.zeros_like(tmp_w)
> x = np.zeros_like(tmp_x); y = np.zeros_like(tmp_y); z =
np.zeros_like(tmp_z)
>
> j=-1
> for i in range(tmp_px.size):
> if tmp_px[i] > 2:
> j += 1
> px[j] = tmp_px[i]
> py[j] = tmp_py[i]
> pz[j] = tmp_pz[i]
> w[j] = tmp_w[i]
> x[j] = tmp_x[i]
> y[j] = tmp_y[i]
> z[j] = tmp_z[i]
>
> px=px[:j+1]; py=py[:j+1]; pz=pz[:j+1]
> w=w[:j+1]
> x=x[:j+1]; y=y[:j+1]; z=z[:j+1]
>
> It works, but I'm sure it's probably the most inefficient way of doing
it. What would be a decent rewrite?

Index with a boolean mask.

mask = (tmp_px > 2)
px = tmp_px[mask]
py = tmp_py[mask]
# ... etc.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.vstack vs. np.stack

2017-11-09 Thread Robert Kern
On Thu, Nov 9, 2017 at 1:30 AM, Joe  wrote:
>
> Hello,
>
> I have a question and hope that you can help me.
>
> The doc for vstack mentions that "this function continues to be supported
for backward compatibility, but you should prefer np.concatenate or
np.stack."
>
> Using vstack was convenient because "the arrays must have the same shape
along all but the first axis."
>
> So it was possible to stack an array (3,) and (2, 3) to a (3, 3) array
without using e.g. atleast_2d on the (3,) array.
>
> Is there a possibility to mimic that behavior with np.concatenate or
np.stack?

Quite frankly, I ignore the documentation as I think its recommendation is
wrong in these cases. Vive la vstack!
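
That said, if you do want to spell it with np.concatenate, the
np.atleast_2d trick you alluded to is one way (just a sketch):

import numpy as np

a = np.arange(3)                 # shape (3,)
b = np.arange(6).reshape(2, 3)   # shape (2, 3)
np.concatenate([np.atleast_2d(a), b], axis=0)   # shape (3, 3), same as vstack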

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.vstack vs. np.stack

2017-11-09 Thread Robert Kern
On Thu, Nov 9, 2017 at 1:58 PM, Mark Bakker  wrote:

> Can anybody explain why vstack is going the way of the dodo?
> Why are stack / concatenate better? What is 'bad' about vstack?

As far as I can tell, the discussion happened all on Github, not the
mailing list. See here for references:

https://github.com/numpy/numpy/pull/7253

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.vstack vs. np.stack

2017-11-09 Thread Robert Kern
On Thu, Nov 9, 2017 at 2:49 PM, Allan Haldane 
wrote:
>
> On 11/09/2017 05:39 PM, Robert Kern wrote:
> > On Thu, Nov 9, 2017 at 1:58 PM, Mark Bakker  wrote:
> >
> >> Can anybody explain why vstack is going the way of the dodo?
> >> Why are stack / concatenate better? What is 'bad' about vstack?
> >
> > As far as I can tell, the discussion happened all on Github, not the
> > mailing list. See here for references:
> >
> > https://github.com/numpy/numpy/pull/7253
> >
> > --
> > Robert Kern
>
> yes, and in particular this linked comment/PR:
>
> https://github.com/numpy/numpy/pull/5605#issuecomment-85180204
>
> Maybe we should reword the vstack docstring so that it doesn't imply
> that vstack is going away. It should say something weaker
> like "the functions np.stack, np.concatenate, and np.block are often
> more general/useful/less confusing alternatives".. or better explain
> what the problem is.
>
> If we limit ourselves to 1d,2d and maybe 3d arrays the vstack behavior
> doesn't seem all that confusing to me.

I concur. Highlighting that the functions are only being retained "for
backward compatibility" does seem to imply to people that they are
deprecated and cannot be relied upon to remain. We *do* break backwards
compatibility from time to time.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-10 Thread Robert McLeod
On Wed, Nov 8, 2017 at 2:50 PM, Matthew Brett 
wrote:

> Hi,
>
> On Wed, Nov 8, 2017 at 7:08 PM, Julian Taylor
>  wrote:
> > On 06.11.2017 11:10, Ralf Gommers wrote:
> >>
> >>
> >> On Mon, Nov 6, 2017 at 7:25 AM, Charles R Harris
> >> mailto:charlesr.har...@gmail.com>> wrote:
> >>
> >> Hi All,
> >>
> >> Thought I'd toss this out there. I'm tending towards better sooner
> >> than later in dropping Python 2.7 support as we are starting to run
> >> up against places where we would like to use Python 3 features. That
> >> is particularly true on Windows where the 2.7 compiler is really old
> >> and lacks C99 compatibility.
> >>
> >>
> >> This is probably the most pressing reason to drop 2.7 support. We seem
> >> to be expending a lot of effort lately on this stuff. I was previously
> >> advocating being more conservative than the timeline you now propose,
> >> but this is the pain point that I think gets me over the line.
> >
> >
> > Would dropping python2 support for windows earlier than the other
> > platforms a reasonable approach?
> > I am not a big fan of to dropping python2 support before 2020, but I
> > have no issue with dropping python2 support on windows earlier as it is
> > our largest pain point.
>
> I wonder about this too.  I can imagine there are a reasonable number
> of people using older Linux distributions on which they cannot upgrade
> to a recent Python 3, but is that likely to be true for Windows?
>
> We'd have to make sure we could persuade pypi to give the older
> version for Windows, by default - I don't know if that is possible.
>


Pip repo names and actual module names don't have to be the same.  One
potential work-around would be to make a 'numpylts' repo on PyPi which is
the 1.17 version with support for Python 2.7 and bug-fix releases as
required.  This will still cause regressions but it's a matter of modifying
`requirements.txt` in downstream Python 2.7 packages and not much else.

E.g. in `requirements.txt`:

numpy;python_version>"3.0"
numpylts; python_version<"3.0"

In both cases you still call `import numpy` in the code.

Robert

-- 
Robert McLeod, Ph.D.
robbmcl...@gmail.com
robbmcl...@protonmail.com
www.entropyreduction.al
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support

2017-11-13 Thread Robert McLeod
On Mon, Nov 13, 2017 at 7:47 AM, Matthias Bussonnier <
bussonniermatth...@gmail.com> wrote:

> > For this to be efficient, it should be done soon enough to allow
> downstream projects to adapt their requirements.txt.
> > Release managers: how much more effort would it be to upload current
> numpy to both numpy and numpylts?
>
> I'm not quite sure I see the point. you would ask downstream to change
> `numpy` to `numpylts` instead of  `numpy` to `numpy<2` ?
>
> Also I think then you have the risk of having for example pandas
> saying `numpy<2` and scipy saying `numpylts` and now the pasckages are
> incompatible ?


 The trouble is PyPi doesn't allow multiple branches.  So if you upload
NumPy 2.0 wheels, then you cannot turn around and upload 1.18.X bug-fix
patches.  At least, this is my understanding of PyPi.



-- 
Robert McLeod, Ph.D.
robbmcl...@gmail.com
robbmcl...@protonmail.com
www.entropyreduction.al
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] NEP process update

2017-12-05 Thread Robert Kern
On Wed, Dec 6, 2017 at 9:49 AM, Nathaniel Smith  wrote:

> - NEPs are really part of the development process, not an output for
> end-users -- they're certainly useful to have available as a
> reference, but if we're asking end-users to look at them on a regular
> basis then I think we've messed up and should improve our actual
> documentation :-)

I agree. The usefulness of NEPs (and PEPs) as documentation degrades over
time. The functionality matches the NEP at the time that it is accepted,
but features evolve over time through the normal, non-NEP development
process. Personally, I've frequently driven myself into corners reading a
PEP to learn about a new feature only to "learn" things that did not make
it into the final implementation or were quickly overtaken by later
development.

We should not rely on NEPs to provide documentation for users. The regular
docs should be the authority. To the extent that the NEPs happen to provide
useful documentation for the new feature (and represent a significant
amount of sunk effort to draft that documentation), we should feel free to
copy-paste that into the regular docs and evolve it from there.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What is the pythonic way to write a function that handles arrays and scalars?

2017-12-12 Thread Robert Kern
On Wed, Dec 13, 2017 at 4:50 AM, Joe  wrote:
>
> Hi,
>
> the best example I found was this one:
>
> https://stackoverflow.com/a/29319864/7919597
>
> def func_for_scalars_or_vectors(x):
> x = np.asarray(x)
> scalar_input = False
> if x.ndim == 0:
> x = x[None]  # Makes x 1D
> scalar_input = True
>
> # The magic happens here
>
> if scalar_input:
> return np.squeeze(ret)
> return ret
>
> Is this as good as it gets or do you have other suggestions?

In general structure, yeah. I'd probably use `np.isscalar(x)` for the test
and `x = np.atleast_1d(x)` for the coercion for readability, but otherwise,
that's it.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What is the pythonic way to write a function that handles arrays and scalars?

2017-12-12 Thread Robert Kern
On Wed, Dec 13, 2017 at 5:00 AM, Kirill Balunov 
wrote:
>
> On minor thing that instead of 'ret' there should be 'x'.

No, `x` is the input. The code that actually does the computation (elided
here by the `# The magic happens here` comment) would have assigned to
`ret`.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] What is the pythonic way to write a function that handles arrays and scalars?

2017-12-12 Thread Robert Kern
On Wed, Dec 13, 2017 at 5:52 AM, Eric Wieser 
wrote:

> Using np.isscalar is a bad idea, as it fails for 0d arrays. x.ndim is the
> better option there.
>
> I’d maybe suggest not special-casing 0d arrays though, and using:
>
> def func_for_scalars_or_vectors(x):
> x = np.asanyarray(x) # convert scalars to 0d arrays
>
> # The magic happens here
>
> return ret[()]  # convert 0d arrays to scalars
>
Good call. I didn't think that the empty tuple was valid for anything but
indexing into 0d arrays, but of course, following the rules of indexing, it
works as required for other arrays too.
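
Putting it together, a small self-contained sketch of the pattern (the
function and its body are just for illustration):

import numpy as np

def square_plus_one(x):
    x = np.asanyarray(x)   # scalars become 0d arrays
    ret = x * x + 1        # "the magic happens here"
    return ret[()]         # 0d arrays come back out as scalars

square_plus_one(3)          # -> 10, a scalar
square_plus_one([1, 2, 3])  # -> array([ 2,  5, 10])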

-- 
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: SfePy 2017.4

2017-12-29 Thread Robert Cimrman

I am pleased to announce release 2017.4 of SfePy.

Description
---

SfePy (simple finite elements in Python) is software for solving systems of
coupled partial differential equations by the finite element method or by the
isogeometric analysis (limited support). It is distributed under the new BSD
license.

Home page: http://sfepy.org
Mailing list: https://mail.python.org/mm3/mailman3/lists/sfepy.python.org/
Git (source) repository, issue tracker: https://github.com/sfepy/sfepy

Highlights of this release
--

- basic support for penalty-based contacts
- support for user-defined contexts in all solvers and preconditioners
- new example: dispersion analysis of heterogeneous periodic materials

For full release notes see http://docs.sfepy.org/doc/release_notes.html#id1
(rather long and technical).

Cheers,
Robert Cimrman

---

Contributors to this release in alphabetical order:

Robert Cimrman
Jan Heczko
Lubos Kejzlar
Jan Kopacka
Vladimir Lukes
Matyas Novak
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Direct GPU support on NumPy

2018-01-02 Thread Robert Kern
On Tue, Jan 2, 2018 at 1:21 PM, Yasunori Endo  wrote:
>
> Hi all
>
> Numba looks so nice library to try.
> Thanks for the information.
>
>> This suggests a new, higher-level data model which supports replicating
data into different memory spaces (e.g. host and GPU). Then users (or some
higher layer in the software stack) can dispatch operations to suitable
implementations to minimize data movement.
>>
>> Given NumPy's current raw-pointer C API this seems difficult to
implement, though, as it is very hard to track memory aliases.
>
> I understood modifying numpy.ndarray for GPU is technically difficult.
>
> So my next primitive question is why NumPy doesn't offer
> ndarray like interface (e.g. numpy.gpuarray)?
> I wonder why everybody making *separate* library, making user confused.

Because there is no settled way to do this. All of those separate library
implementations are trying different approaches. We are learning from each
of their attempts. They can each move at their own pace rather than being
tied down to numpy's slow rate of development and strict backwards
compatibility requirements. They can try new things and aren't limited to
their first mistakes. The user may well be confused by all of the different
options currently available. I don't think that's avoidable: there are lots
of meaningful options. Picking just one to stick into numpy is a disservice
to the community that needs the other options.

> Is there any policy that NumPy refuse standard GPU implementation?

Not officially, but I'm pretty sure that there is no appetite among the
developers for incorporating proper GPU support into numpy (by which I mean
that a user would build numpy with certain settings then make use of the
GPU using just numpy APIs). numpy is a mature project with a relatively
small development team. Much of that effort is spent more on maintenance
than new development.

What there is appetite for is to listen to the needs of the GPU-using
libraries and making sure that numpy's C and Python APIs are flexible
enough to do what the GPU libraries need. This ties into the work that's
being done to make ndarray subclasses better and formalizing the notions of
an "array-like" interface that things like pandas Series, etc. can
implement and play well with the rest of numpy.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Understanding np.min behaviour with nan

2018-01-10 Thread Robert Kern
On Wed, Jan 10, 2018 at 3:03 PM, Andrew Nelson  wrote:
>
> Further to my last message, why is the warning only raised once?
>
> ```
> >>> import numpy as np
> >>> np.min([1, np.nan])
>
/Users/andrew/miniconda3/envs/dev3/lib/python3.6/site-packages/numpy/core/_methods.py:29:
RuntimeWarning: invalid value encountered in reduce
>   return umr_minimum(a, axis, None, out, keepdims)
> nan
> >>> np.min([1, np.nan])
> nan
> >>>
> ```

This is default behavior for warnings.

https://docs.python.org/3/library/warnings.html#the-warnings-filter

It also explains the previous results. The same warning would have been
issued from the same place in each of the variations you tried. Since the
warnings mechanism had already seen that RuntimeWarning with the same
message from the same code location, they were not printed.
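
If you want to see the warning on every occurrence, you can loosen the
filter yourself, e.g. (a sketch):

import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter("always", RuntimeWarning)
    np.min([1, np.nan])  # warns
    np.min([1, np.nan])  # warns again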

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] custom Welch method for power spectral density

2018-01-12 Thread Robert Kern
On Fri, Jan 12, 2018 at 1:12 PM, Seb  wrote:
>
> Hello,
>
> I'm trying to compute a power spectral density of a signal, using the
> Welch method, in the broad sense; i.e. splitting the signal into
> segments for deriving smoother spectra.  This is well implemented in
> scipy.signal.welch.  However, I'd like to use exponentially increasing
> (power 2) segment length to dampen increasing variance in spectra at
> higher frequencies.  Before hacking the scipy.signal.spectral module for
> this, I'd appreciate any tips on available packages/modules that allow
> for this kind of binning scheme, or other suggestions.

Not entirely sure about this kind of binning scheme per se, but you may
want to look at multitaper spectral estimation methods. The Welch method
can be viewed as a poor-man's multitaper. Multitaper methods give you
better control over the resolution/variance tradeoff that may help with
your problem. Googling for "python multitaper" gives you several options; I
haven't used any of them in anger, so I don't have a single recommendation
for you. The nitime documentation provides more information about
multitaper methods that may be useful to you:

http://nipy.org/nitime/examples/multi_taper_spectral_estimation.html

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Moving NumPy's PRNG Forward

2018-01-19 Thread Robert Kern
f the simpler distribution methods to the list that
*is* guaranteed to remain stream-compatible, probably `randint()` and
`bytes()` and maybe a couple of others. I would appreciate input on the
matter.

The current API should remain available and working, but not necessarily
with the same algorithms. That is, for code like the following:

  prng = np.random.RandomState(12345)
  x = prng.normal(10.0, 3.0, size=100)

`x` is still guaranteed to be 100 normal variates with the appropriate mean
and standard deviation, but they might be computed by the ziggurat method
from PCG-generated bytes (or whatever new default core PRNG we have).

As an alternative, we may also want to leave `np.random.RandomState`
entirely fixed in place as deprecated legacy code that is never updated.
This would allow current unit tests that depend on the stream-compatibility
that we previously promised to still pass until they decide to update.
Development would move to a different class hierarchy with new names.

I am personally not at all interested in preserving any stream
compatibility for the `numpy.random.*` aliases or letting the user swap out
the core PRNG for the global PRNG that underlies them. `np.random.seed()`
should be discouraged (if not outright deprecated) in favor of explicitly
passing around instances.

In any case, we have a lot of different options to discuss if we decide to
relax our stream-compatibility policy. At the moment, I'm not pushing for
any particular changes to the code, just the policy in order to enable a
more wide-ranging field of options that we have been able to work with so
far.

Thanks.

-- 
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Behavior of rint?

2018-01-19 Thread Robert Kern
On Fri, Jan 19, 2018 at 11:56 PM, Mads Ipsen  wrote:
>
> I am confused . Shouldn't rint round to nearest integer.
http://en.cppreference.com/w/cpp/numeric/math/rint

It does. Matthew was asking specifically about its behavior when it is
rounding numbers ending in .5, not the general case. Since that's the only
real difference between rounding routines, we often recognize that from the
context and speak in a somewhat elliptical way (i.e. just "round-to-even"
instead of "round to the nearest integer and rounding numbers ending in .5
to the nearest even integer").

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Moving NumPy's PRNG Forward

2018-01-19 Thread Robert Kern
On Sat, Jan 20, 2018 at 2:27 AM,  wrote:

> I'm not sure I fully understand
> Is the proposal to drop stream-backward compatibility completely for the
future or just a one time change?

For all future.

> > No version-selection API would be required as you select the version by
installing the desired version of numpy.
>
> That's not useful if we want to have unit tests that run in the same way
across numpy versions.
>
> There are many unit tests that rely on fixed streams and have hard coded
results that rely on specific numbers (up to floating point, numerical
noise).
> Giving up stream compatibility would essentially kill using np.random for
these unit tests.

This is a use case that I am sensitive to. However, it should be noted that
relying on the exact stream for unit tests makes you vulnerable to platform
differences. That's something that we've never guaranteed (because we
can't). That said, there are some of the simpler distributions that are
more robust to such things, and those are fairly typical in unit tests. As
I mentioned, I am open to a small set of methods that we do guarantee
stream-compatibility for. I think that unit tests are the obvious use case
that should determine what that set is. Unit tests rarely need
`noncentral_chisquare()`, for instance.

I'd also be willing to make the API a little clunkier in order to maintain
the stable set of methods. For example, two methods that are common in unit
testing are `normal()` and `choice()`, but those have been the target of
the most attempted innovation. I'd be willing to leave them alone while
providing other methods that do the same thing but are allowed to innovate.

> Similar, reproducibility from another user, e.g. in notebooks, would
break without stream compatibility across numpy versions.

That is the reproducible-research use case that I discussed already. I
argued that the stability that our policy actually provides is rather more
muted than what it seems on its face.

> One possibility is to keep  the current stream-compatible np.random
version and maintain it in future for those usecases, and add a new
"high-performance" version with the new features.

That is one of the alternatives I raised.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Moving NumPy's PRNG Forward

2018-01-19 Thread Robert Kern
On Sat, Jan 20, 2018 at 2:57 AM, Stephan Hoyer  wrote:
>
> On Fri, Jan 19, 2018 at 6:57 AM Robert Kern  wrote:
>>
>> As an alternative, we may also want to leave `np.random.RandomState`
entirely fixed in place as deprecated legacy code that is never updated.
This would allow current unit tests that depend on the stream-compatibility
that we previously promised to still pass until they decide to update.
Development would move to a different class hierarchy with new names.
>
> I like this alternative, but I would hesitate to call it "deprecated".
Users who care about exact reproducibility across NumPy versions (e.g., for
testing) are probably less concerned about performance, and could continue
to use it.

I would be careful about that because quite a few of the methods are not
stable across platforms, even on the same numpy version. If you want to
declare that some part of the np.random API is stable for such purposes, we
need to curate a subset of the methods anyways. As a one-off thing, this
alternative proposes to declare that all of `np.random.RandomState` is
stable across versions, but we can't guarantee that all of it is
unconditionally stable for exact reproducibility. We can make a guarantee
for a smaller subset of methods, though. To your point, though, if we
freeze the current `RandomState`, we can make that guarantee for a larger
subset of the methods than we would for the new API. So I guess I talked
myself around to your view, but I would be a bit more cautious in how we
advertise the stability of the frozen `RandomState` API.

> New random number generator classes could implement their own guarantees
about compatibility across their methods.
>
>> I am personally not at all interested in preserving any stream
compatibility for the `numpy.random.*` aliases or letting the user swap out
the core PRNG for the global PRNG that underlies them. `np.random.seed()`
should be discouraged (if not outright deprecated) in favor of explicitly
passing around instances.
>
> I agree that np.random.seed() should be discouraged, but it feels very
late in NumPy's development to remove it.
>
> If we do alter the random number streams for numpy.random.*, it seems
that we should probably issue a warning (at least for a several major
versions) whenever numpy.random.seed() is called. This could get pretty
noisy. I guess that's all the more incentive to switch to random state
objects.

True. I like that.

The reason I think that it might be worth an exception is that it has been
a moral hazard. People aren't just writing correct but improvable code
(relying on `np.random.*` methods but seeding exactly once at the start of
their single-threaded simulation); they've also been writing incorrect and
easily-broken code. For example:

np.random.seed(seed)
np.random.shuffle(x_train)
np.random.seed(seed)
np.random.shuffle(labels_train)
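
For contrast, a sketch of the same intent with an explicit instance
(variable names taken from the snippet above):

import numpy as np

prng = np.random.RandomState(seed)
perm = prng.permutation(len(x_train))
x_train = x_train[perm]
labels_train = labels_train[perm]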

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Moving NumPy's PRNG Forward

2018-01-20 Thread Robert Kern
On Sat, Jan 20, 2018 at 7:34 AM, Robert Kern  wrote:
>
> On Sat, Jan 20, 2018 at 2:27 AM,  wrote:
>
> > I'm not sure I fully understand
> > Is the proposal to drop stream-backward compatibility completely for
the future or just a one time change?
>
> For all future.

To color this a little, while we'll have a burst of activity for the first
round to fix all of the mistakes I baked in early, I do not expect the pace
of change after that to be very large. While we are going to relax the
strictness of the policy, we should carefully weigh the benefit of a change
against the pain of changing the stream, and explore ways to implement the
improvement that would retain the stream. We should take more care making
such changes in `np.random` than in other parts of numpy. When we get to
drafting the NEP, I'd be happy to include language to this effect.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Moving NumPy's PRNG Forward

2018-01-26 Thread Robert Kern
128(seed))
  x = prng.normal(mean, std)

If someone wants to write a WELL512 core PRNG, they can just implement that
object and pass it to `Distributions()`. The `Distributions` code doesn't
need to be copied, nor do we need to muck around with `__new__` tricks in
Cython.

Why do this instead of distribution functions? Well, I have a damn lot of
code that is expecting an object with at least the broad outlines of the
`RandomState` interface. I'm not going to update that code to use functions
instead. And if I'm not, no one is. There isn't a good transition path
since the PRNG object needs to thread through all of the code, whether it's
my code that I'm writing greenfield or library code that I don't control.
That said, people writing new distributions outside of numpy.random would
be writing functions, not trying to add to the `Distributions` class, but
that's fine.

It also allows the possibility for the `Distributions` class to be stateful
if we want to do things like caching the next Box-Muller variate and not
force that onto the core PRNG state like I currently do. Though I'd rather
just drop Box-Muller, and that's not a common pattern outside of
Box-Muller. But it's a possibility.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Moving NumPy's PRNG Forward

2018-01-29 Thread Robert Kern
On Tue, Jan 30, 2018 at 5:39 AM, Pierre de Buyl <
pierre.deb...@chem.kuleuven.be> wrote:
>
> Hello,
>
> On Sat, Jan 27, 2018 at 09:28:54AM +0900, Robert Kern wrote:
> > On Sat, Jan 27, 2018 at 1:14 AM, Kevin Sheppard
> >  wrote:
> > >
> > > In terms of what is needed, I think that the underlying PRNG should
> > be swappable.  The will provide a simple mechanism to allow certain
> > types of advancement while easily providing backward compat.  In the
> > current design this is very hard and requires compiling many nearly
> > identical copies of RandomState. In pseudocode something like
> > >
> > > standard_normal(prng)
> > >
> > > where prng is a basic class that retains the PRNG state and has a
> > small set of core random number generators that belong to the
> > underlying PRNG -- probably something like int32, int64, double, and
> > possibly int53. I am not advocating explicitly passing the PRNG as an
> > argument, but having generators which can take any suitable PRNG would
> > add a lot of flexibility in terms of taking advantage of improvements
> > in the underlying PRNGs (see, e.g., xoroshiro128/xorshift1024).  The
> > "small" core PRNG would have responsibility over state and streams.
> > The remainder of the module would transform the underlying PRNG into
> > the required distributions.
> > (edit: after writing the following verbiage, I realize it can be summed
> > up with more respect to your suggestion: yes, we should do this design,
> > but we don't need to and shouldn't give up on a class with distribution
> > methods.)
> > Once the core PRNG C API is in place, I don't think we necessarily need
> > to move away from a class structure per se, though it becomes an
> > option.
>
> (Sorry for cutting so much, I have a short question)
>
> My typical use case for the C API of NumPy's random features is that I
start
> coding in pure Python and then switch to Cython. I have at least twice in
the
> past resorted to including "randomkit.h" and using that directly. My last work
> actually implements a Python/Cython interface for rngs, see
> http://threefry.readthedocs.io/using_from_cython.html
>
> The goal is to use exactly the same backend in Python and Cython, with a
cimport
> and a few cdefs the only changes needed for a first port to Cython.
>
> Is this type of feature in discussion or in project for the future of
> numpy.random?

Sort of, but not really. For sure, once we've made the decisions that let
us move forward to a new design, we'll have Cython implementations that can
be used natively from Cython as well as Python without code changes. *But*
it's not going to be an automatic speedup like your threefry library
allows. You designed that API such that each of the methods returns a
single scalar, so all you need to do is declare your functions `cpdef` and
provide a `.pxd`. Our methods return either a scalar or an array depending
on the arguments, so the methods will be declared to return `object`, and
you will pay the overhead costs for checking the arguments and such. We're
not going to change that Python API; we're only considering dropping
stream-compatibility, not source-compatibility.

I would like to make sure that we do expose a C/Cython API to the
distribution functions (i.e. ones that only draw a single variate and return
a scalar), but it's not likely to look exactly like the Python API. There
might be clever tricks that we can do to minimize the amount of changes
that one needs to do, though, if you are only drawing single variates at a
time (e.g. an agent-based simulation) and you want to make it go faster by
moving to Cython. For example, maybe we collect all of the single-variate
C-implemented methods into a single object sitting as an attribute on the
`Distributions` object.

cdef class DistributionsCAPI:
    cdef double normal(self, double loc, double scale)
    cdef double uniform(self, double low, double high)

cdef class Distributions:
    cdef DistributionsCAPI capi
    cpdef object normal(self, loc, scale, size=None):
        if size is None and np.isscalar(loc) and np.isscalar(scale):
            return self.capi.normal(loc, scale)
        else:
            ...  # make an array

prng = Distributions(...)
# From Python:
x = prng.normal(mean, std)
# From Cython:
cdef double x = prng.capi.normal(mean, std)

But we need to make some higher-level decisions first before we can get
down to this level of design. Please do jump in and remind us of this use
case once we do get down to actual work on the new API design. Thanks!

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Extending C with Python

2018-01-30 Thread Robert Kern
On Wed, Jan 31, 2018 at 3:25 PM, Jialin Liu  wrote:

> Hello,
> I'm extending C with Python (which is the opposite of what people usually
> do, extending Python with C). I'm currently stuck on passing a C array to
> the Python layer; could anyone please advise?
>
> I have a C buffer in my C code and want to pass it to a python function.
> In the C code, I have:
>
> npy_intp  dims [2];
>> dims[0] = 10;
>> dims[1] = 20;
>> import_array();
>> npy_intp m=2;
>> PyObject * py_dims = PyArray_SimpleNewFromData(1, &m, NPY_INT16 ,(void
>> *)dims ); // I also tried NPY_INT
>> PyObject_CallMethod(pInstance, method_name, "O", py_dims);
>
>
> In the Python code, I want to just print that array:
>
> def f(self, dims):
>
>print ("np array:%d,%d"%(dims[0],dims[1]))
>
>
>
> But it only prints the first number correctly, i.e., dims[0]. The second
> number is always 0.
>

The correct typecode would be NPY_INTP, matching the npy_intp element type of
`dims`. With a smaller typecode like NPY_INT16, both reported values are read
out of the bytes of the first npy_intp entry, which is why the second one
prints as 0.

-- 
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] documentation, sound and music [was: doc? music through mathematical relations between LPCM samples and musical elements/characteristics]

2018-01-31 Thread Robert Kern
On Thu, Feb 1, 2018 at 7:38 AM, Renato Fabbri 
wrote:
>
> Dear Scipy-ers,
>
> If you think I should split the message so that
> things get more clear...
>
> But the things are:
> 1) How to document numpy-heavy projects?

There is nothing particularly special about numpy-heavy projects with
respect to documentation. The tools used for numpy and scipy's
documentation may (or may not) be useful to you. For example, you may want
to embed matplotlib plots into your documentation.

  https://pypi.python.org/pypi/numpydoc

But if not, don't worry about it. Sphinx is all most numpy-heavy projects
need, like most Python projects. Hosting the documentation on either
readthedocs or Github is fine. Do whatever's convenient for you. Either is
just as convenient for us.

> 2) How to better make these contributions available
> to the numpy/scipy community?

Having it up on Github and PyPI is all you need to do. Participate in
relevant mailing list conversations. While we don't forbid release
announcement emails on either of the lists, I wouldn't do much more than
announce the initial release and maybe a really major update (i.e. not for
every bugfix release).

> Directions will be greatly appreciated.
> I suspect that this info is all gathered somewhere
> I did not find.

Sorry this isn't gathered anywhere, but truly, the answer is "there is not
much to it". You're doing everything right. :-)

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Prototype CorePRNG implementation feedback (Neal Becker)

2018-02-22 Thread Robert Kern
On Feb 22, 2018 16:30, "Kevin Sheppard"  wrote:


> What is the syntax to construct an initialized generator?
> RandomGenerator(Xoroshiro128(my_seed))?
>
>
Not 100% certain on this.  There was talk in the earlier thread that seed
should be killed,


No, just the np.random.seed() function alias for the hidden global PRNG.
Core PRNGs should be initializable with a seed number like this, and
instances of these objects should also have a seed() method to reset the
state.

I think all of the core PRNGs MUST be seedable by being given a uint32
integer, at minimum. They MAY be seedable by other inputs (uint64, Python
long, array of ints) depending on the algorithm.

although I wasn't clear what method would be
preferable to get easily reproducible random numbers.

Another option would be

RandomGenerator(Xoroshiro128, seed=my_seed)


I recommend the first option. With things like settable streams, the
diversity of initialization options between algorithms is better implemented
by a diversity of initializers than by trying to force everything into one
complicated initializer.

While I think the first option is the best way to implement things, we can
make some syntactic sugar to make things nicer. For example,
`Xoroshiro128(my_seed).generator` where `.generator` is implemented as a
property that returns `RandomGenerator(self)`.
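
Roughly like this (a sketch only; the seeding body is a placeholder, and
`RandomGenerator` is the wrapper class from the prototype discussion):

class Xoroshiro128:
    def __init__(self, seed=None):
        self.seed(seed)

    def seed(self, seed=None):
        # (re)derive the 128-bit internal state from `seed`
        self._state = ...

    @property
    def generator(self):
        # sugar so users can write Xoroshiro128(my_seed).generator
        return RandomGenerator(self)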
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: SfePy 2018.1

2018-03-06 Thread Robert Cimrman

I am pleased to announce release 2018.1 of SfePy.

Description
---

SfePy (simple finite elements in Python) is software for solving systems of
coupled partial differential equations by the finite element method or by
isogeometric analysis (limited support). It is distributed under the new BSD
license.

Home page: http://sfepy.org
Mailing list: https://mail.python.org/mm3/mailman3/lists/sfepy.python.org/
Git (source) repository, issue tracker: https://github.com/sfepy/sfepy

Highlights of this release
--

- major update of time-stepping solvers and solver handling
- Newmark and Bathe elastodynamics solvers
- interface to MUMPS linear solver
- new examples:
  - iron plate impact problem (elastodynamics)
  - incompressible Mooney-Rivlin material model (hyperelasticity) as a script

For full release notes see http://docs.sfepy.org/doc/release_notes.html#id1
(rather long and technical).

Cheers,
Robert Cimrman

---

Contributors to this release in alphabetical order:

Robert Cimrman
Jan Heczko
Jan Kopacka
Vladimir Lukes

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy.random.randn

2018-03-06 Thread Robert Kern
On Tue, Mar 6, 2018 at 1:39 AM, Marko Asplund 
wrote:
>
> I've some neural network code in NumPy that I'd like to compare with a
Scala based implementation.
> My problem is currently random initialization of the neural net
parameters.
> I'd like to be able to get the same results from both implementations
when using the same random seed.
>
> One approach I've thought of would be to use the NumPy random generator
also with the Scala implementation, but unfortunately the linear algebra
library I'm using doesn't provide an equivalent for this.
>
> Could someone give pointers to implementing numpy.random.randn?
> Or alternatively, is there an equivalent random generator for Scala or
Java?

I would just recommend using one of the codebases to initialize the
network, save the network out to disk, and load up the initialized network
in each of the different codebases for training. That way you are sure that
they are both starting from the same exact network parameters.

Even if you do rewrite a precisely equivalent np.random.randn() for
Scala/Java, you ought to write the code to serialize the initialized
network anyways so that you can test that the two initialization routines
are equivalent. But if you're going to do that, you might as well take my
recommended approach.
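
Concretely, that could look something like this on the NumPy side (the file
name, layer sizes, and plain-text format are just illustrative choices; text
is also easy to read back from Scala/Java):

import numpy as np

n_h, n_x = 4, 3                        # example layer sizes
W1 = np.random.randn(n_h, n_x) * 0.01  # initialize once
np.savetxt("W1_init.txt", W1)          # serialize the initialized weights

# ... and in every comparison run, load the identical initialization back:
W1 = np.loadtxt("W1_init.txt")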

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy.random.randn

2018-03-07 Thread Robert Kern
On Wed, Mar 7, 2018 at 1:10 PM, Marko Asplund 
wrote:
>
> However, the results look very different when using random initialization.
> With respect to the exact cost this is of course expected, but what I find
> troublesome is that after N training iterations the cost starts approaching
> zero with the NumPy code (most of the time), whereas with the Scala-based
> implementation the cost fails to converge (most of the time).
>
> With NumPy I'm simply using the following random initialization code:
>
> np.random.randn(n_h, n_x) * 0.01
>
> I'm trying to emulate the same behaviour in my Scala code by sampling
from a
> Gaussian distribution with mean = 0 and std dev = 1.

`np.random.randn(n_h, n_x) * 0.01` gives a Gaussian distribution of mean=0
and stdev=0.01, not stdev=1.
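
A throwaway check makes the scaling easy to confirm (the array size here is
arbitrary):

import numpy as np

w = np.random.randn(1000, 1000) * 0.01
print(w.mean(), w.std())   # roughly 0.0 and 0.01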

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy.random.randn

2018-03-09 Thread Robert Kern
On Thu, Mar 8, 2018 at 12:44 PM, Marko Asplund 
wrote:
>
> On Wed, 7 Mar 2018 13:14:36, Robert Kern wrote:
>
> > > With NumPy I'm simply using the following random initialization code:
> > >
> > > np.random.randn(n_h, n_x) * 0.01
> > >
> > > I'm trying to emulate the same behaviour in my Scala code by  sampling
> > from a
> > > Gaussian distribution with mean = 0 and std dev = 1.
>
> > `np.random.randn(n_h, n_x) * 0.01`  gives a Gaussian distribution of
mean=0
> > and stdev=0.01
>
> Sorry for being a bit inaccurate.
> My Scala code actually mirrors the NumPy-based random initialization, so I
> sample from a Gaussian with mean = 0 and std dev = 1, then multiply by 0.01.

Have you verified this? I.e. save out the Scala-initialized network and
load it up with numpy to check the mean and std dev? How about if you run
the numpy NN training with the Scala-initialized network? Does that also
diverge?
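
For that first check, something this small would settle it (the file name is
hypothetical; it is whatever the Scala code writes out):

import numpy as np

w = np.loadtxt("scala_init_W1.txt")  # weights exported by the Scala code
print(w.shape, w.mean(), w.std())    # expect roughly mean 0.0 and std dev 0.01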

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] ENH: softmax

2018-03-14 Thread Robert Kern
On Wed, Mar 14, 2018 at 5:22 PM, Warren Weckesser <
warren.weckes...@gmail.com> wrote:
>
> On Wed, Mar 14, 2018 at 4:05 AM, Kulick, Johannes 
wrote:
>>
>> Hi,
>>
>> I regularly need the softmax function (
https://en.wikipedia.org/wiki/Softmax_function) for my code. I have a quite
efficient pure python implementation (credits to Nolan Conaway). I think it
would be a valuable enhancement of the ndarray class. But since it is kind
of a specialty function I wanted to ask you if you would consider it to be
part of the numpy core (alongside ndarray.max and ndarray.argmax) or rather
in scipy (e.g. scipy.stats also seems an appropriate place).
>
> Johannes,
>
> If the numpy devs aren't interested in adding it to numpy, I'm pretty
sure we can get it in scipy.  I've had adding it (or at least proposing
that it be added) to scipy on my to-do list for quite a while now.

+1 for scipy.special.
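
For reference, a minimal numerically stable version of the function under
discussion (a sketch of the general idea, not the implementation credited to
Nolan Conaway) looks something like:

import numpy as np

def softmax(x, axis=None):
    # subtract the max before exponentiating to avoid overflow
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

print(softmax(np.array([1.0, 2.0, 3.0])))   # entries sum to 1.0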

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

