[Numpy-discussion] Polynomial Improvements
Hi,

I am new to contributing to open source projects and not sure where to begin (maybe emailing this list?). In any case, the main improvement I would like to work on would be adding multivariate polynomials/differentials to numpy. I would love any insight into the current status and intentions of the numpy polynomial implementations.

1) Why do we have np.polynomial and np.lib.polynomial? Which one is older? Is there a desire to see these merged?

2) Why does np.polynomial.Polynomial have a domain and window? They appear to have no implementation (except the Chebyshev interpolation) as far as I can tell. What is their intended use? Why is there no variable representation implemented like in np.poly1d, and why is it not installed into numpy directly with @set_module('numpy')?

3) How many improvements are allowed to be done all at once? Is there an expectation of waiting before improving implementations or code? Thinking further down the road, how much is too much to add to numpy? I would like to see multivariate polynomials implemented, but after that it would be nice to have a collection of known polynomials like standard, Pochhammer, cyclotomic, Hermite, Laguerre, Legendre, Chebyshev, Jacobi. After that it would be of interest to extend .diff() to really be a differential operator (implemented as sequences of polynomials) and to allow differential constructions. Then things like this would be possible:

    x = np.poly1d([1, 0])
    D = np.diff1d([1, 0])
    assert Hermite(78) == (2*x - D)(Hermite(77))

Or if series were implemented then we would see things like:

    assert Hermite(189) == (-1)**189 * Exp(x**2) * (D**189)(Exp(-x**2))
    assert Hermite(189) == Exp(-D**2)(x**189)(2*x)

4) At what point does numpy polynomial material belong in scipy and vice versa? Is numpy considered the backend to scipy or vice versa? At times it seems like things are implemented twice in numpy and scipy, but in general should one not implement things in numpy that already exist in scipy?

Thanks,
Robert
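P.S. For reference, the univariate version of the first identity can already be checked with the existing `np.polynomial.Hermite` class. A rough sketch using the current `basis()`/`deriv()` API rather than the operator syntax proposed above:

```python
import numpy as np
from numpy.polynomial import Hermite  # physicists' Hermite polynomials

n = 77
Hn = Hermite.basis(n)
two_x = Hermite([0, 1])          # H_1(x) = 2x, so this series is exactly 2x
lhs = Hermite.basis(n + 1)
rhs = two_x * Hn - Hn.deriv()    # H_{n+1} = 2x*H_n - H_n'  (using H_n' = 2n*H_{n-1})
assert np.allclose(lhs.coef, rhs.coef)
```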
Re: [Numpy-discussion] Polynomial Improvements
I had not looked into sympy that closely, thinking it was mostly a symbolic package. However, there appear to be functions that convert back to numpy expressions so that np.ndarray's and such can work. There also appear to be extensive polynomial classes already defined. Thanks for pointing me in the right direction.
[Numpy-discussion] Re: Arbitrarily large random integers
On Sat, Aug 19, 2023 at 10:49 AM Kevin Sheppard wrote:

> The easiest way to do this would be to write a pure Python implementation
> using Python ints of a masked integer sampler. This way you could draw
> unsigned integers and then treat this as a bit pool. You would then take
> the number of bits needed for your integer, transform these to be a Python
> int, and finally apply the mask.

Indeed, that's how `random.Random` does it. I've commented on the issue with an implementation that subclasses `random.Random` to use numpy PRNGs as the source of bits for maximum compatibility with `Random`. The given use case motivating this feature request is networkx, which manually wraps numpy PRNGs in a class that incompletely mimics the `Random` interface. A true subclass eliminates all of the remaining inconsistencies between the two. I'm inclined to leave it at that and not extend the `Generator` interface.

https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258

-- Robert Kern
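For the archives, here is a minimal sketch of what such a subclass can look like. It is not the implementation from the linked comment, just an illustration of the override points that `random.Random` documents for subclasses that supply their own source of randomness:

```python
import random
import numpy as np

class NumpyBackedRandom(random.Random):
    """random.Random that draws its bits from a numpy Generator (sketch only)."""

    def __init__(self, seed=None):
        self._rng = np.random.default_rng(seed)
        super().__init__()

    def random(self):
        # Used by the float-returning methods.
        return self._rng.random()

    def getrandbits(self, k):
        # Used by randrange()/randint(); builds an arbitrarily large Python int
        # from 32-bit draws and shifts it down to exactly k bits.
        n_words = (k + 31) // 32
        words = self._rng.integers(0, 2**32, size=n_words, dtype=np.uint64)
        result = 0
        for w in words:
            result = (result << 32) | int(w)
        return result >> (n_words * 32 - k)

    def seed(self, *args, **kwargs):
        # State handling (seed/getstate/setstate) is elided in this sketch.
        pass
```

With that, methods like `randrange()` work with arbitrarily large bounds while the underlying bit stream comes from the numpy BitGenerator.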
[Numpy-discussion] Re: Inquiry about the difference between numpy.trunc() and numpy.fix()
On Wed, Aug 23, 2023 at 10:06 PM Bao Guo wrote:

> I am writing to seek clarification on the specific differences between the
> numpy.trunc() and numpy.fix() functions.
>
> I have come across opinions stating that these two functions exist for
> backward compatibility and that they serve different functional
> requirements. However, I am puzzled as to why these functions could not be
> merged or one of them upgraded to encompass both functionalities. Despite
> my efforts in researching and contemplating on this matter, I have not come
> across a convincing explanation.
>
> Could anyone kindly shed some light on this matter? I am seeking a deeper
> understanding of the design choices made for these functions and the
> rationale behind their coexistence.
>
> Thank you in advance for your time and assistance.

`fix()` was added to numpy back in the super-early days when it was emerging from Numeric, its earlier predecessor. It was named after the equivalent MATLAB function.

At some point thereafter, we settled on a policy for inclusion of mathematical ufuncs into numpy. We got tired of arguing whether a special function was "important enough" to be in numpy instead of scipy.special. Eventually, we settled on a bright line that if a math function was in the C99 standard, we'd go ahead and allow a ufunc for it in numpy. `trunc()` is the name of that equivalent function in the C99 standard. It is a ufunc instead of a plain function like `fix()` and has a (strict, I believe) superset of functionality.

You can ignore `fix()`, more or less. I'm not sure if it's on the list for deprecation/removal in numpy 2.0, but it certainly could be.

-- Robert Kern
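A quick comparison on arbitrarily chosen values, showing that the two agree on ordinary inputs (both round toward zero):

```python
import numpy as np

x = np.array([-2.7, -0.5, 0.5, 2.7])
print(np.trunc(x))  # [-2. -0.  0.  2.]  C99-style ufunc, rounds toward zero
print(np.fix(x))    # [-2. -0.  0.  2.]  legacy MATLAB-style name, same values here
```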
[Numpy-discussion] Re: Find location of slice in it's base
On Thu, Aug 31, 2023 at 3:25 PM Dom Grigonis wrote:

> Hi everyone,
>
> I am working with shared arrays and their slices. And trying to subclass
> ndarray in such a way that they remap to memory on unpickling. I am using
> the package SharedArray, which doesn't micro-manage memory locations, but
> rather stores the whole map via shm using a unique name. One issue I ran
> into is that when I pickle & unpickle a slice of a memory-mapped array, I
> need to store the spec of the subset so that I can retrieve the same portion.
>
> The issue I am working on has been briefly discussed on stack:
> https://stackoverflow.com/questions/12421770/find-location-of-slice-in-numpy-array.
> Although I am not sure if it is the exact same one.
>
> So, in short, is there a way to determine the location of a slice in the
> original array? Ideally, it would be indexing which works for any non-copy
> slice/subset possible.

If you chase the `.base` chain all the way down, you get to a `SharedArray.map_owner` object with its `.addr` attribute giving the memory address that's assigned to it in the current process. This will be the same address that's in the `'data'` key of that first `ndarray` returned from `SharedArray.create()` that you are making your views from. Your view `ndarray` (which may be a view of views, with slices, transposes, reshapes, and arbitrary `stride_tricks` manipulations in between) will have its own `'data'` address. If you store the difference between the view `ndarray`'s `'data'` address and the `map_owner.addr` address, I think that's all you need to recreate the array in a different process which gets assigned a different memory address for the same `shm` name. Just add that offset to the `map_owner.addr` and restore the rest of the information in `__array_interface__`, and I think you should be good to go. I don't think you'll need to infer or represent the precise path of Python-level operations (slices, transposes, reshapes, etc.) by which it got to that point.

-- Robert Kern
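An untested sketch of that bookkeeping, assuming the `map_owner.addr` attribute and the `__array_interface__` details described above (the actual SharedArray API may differ in the particulars):

```python
import numpy as np

def view_spec(view):
    # Walk the .base chain down to the map_owner-like object at the bottom.
    base = view
    while getattr(base, 'base', None) is not None:
        base = base.base
    data_addr = view.__array_interface__['data'][0]
    return dict(offset=data_addr - base.addr,   # position of the view within the mapping
                shape=view.shape,
                strides=view.strides,
                typestr=view.dtype.str)

def restore_view(spec, map_owner):
    # Rebuild the view in a process where the same shm name mapped to a new address.
    iface = dict(version=3,
                 shape=spec['shape'],
                 strides=spec['strides'],
                 typestr=spec['typestr'],
                 data=(map_owner.addr + spec['offset'], False))
    holder = type('_holder', (), {'__array_interface__': iface})()
    return np.asarray(holder)
```

The caller has to keep `map_owner` alive for as long as the restored array is in use, since the interface holder does not hold a reference to the mapping itself.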
[Numpy-discussion] ANN: NumExpr 2.8.6 Released
Hi everyone,

NumExpr 2.8.6 is a release to deal with issues related to downstream `pandas`, where the sanitization blacklist was hitting private variables used in their evaluate. In addition, the sanitization was hitting on scientific notation. For those who do not wish to have sanitization on by default, it can be changed by setting an environment variable, `NUMEXPR_SANITIZE=0`.

If you use `pandas` in your packages it is advisable you pin `numexpr >= 2.8.6` in your requirements.

Project documentation is available at: http://numexpr.readthedocs.io/

Changes from 2.8.5 to 2.8.6
---------------------------

* The sanitization can be turned off by default by setting an environment variable, `NUMEXPR_SANITIZE=0`.
* Improved behavior of the blacklist to avoid triggering on private variables and scientific notation numbers.

What's Numexpr?
---------------

Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It has multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL

Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring heavier dependencies.

Where can I find Numexpr?
-------------------------

The project is hosted at GitHub: https://github.com/pydata/numexpr

You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr

Documentation is hosted at: http://numexpr.readthedocs.io/en/latest/

Share your experience
---------------------

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy data!

-- Robert McLeod
robbmcl...@gmail.com
robert.mcl...@hitachi-hightech.com
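A minimal usage sketch (the array names are arbitrary; set the environment variable before importing numexpr if you want to opt out of the sanitization):

```python
import os
os.environ["NUMEXPR_SANITIZE"] = "0"   # opt out of the new sanitization, if desired

import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
result = ne.evaluate("3*a + 4*b")      # the kind of expression NumExpr accelerates
```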
[Numpy-discussion] Re: Adding NumpyUnpickler to Numpy 1.26 and future Numpy 2.0
On Wed, Oct 11, 2023 at 11:50 PM Aaron Meurer wrote:

> Is there a way to make pickle not depend on the specific submodule
> that a class is defined in?

No. `Unpickler` somehow has to locate the class/reconstruction function. It will have to use the name given by the `__reduce_ex__` during pickling.

> Wouldn't this happen again if you ever decided to rename _core.

Yes.

> The underscores in numpy._core._reconstruct don't actually do anything
> here in terms of making the interface not public, and if anything, are
> really misleading.

There's private and there's private. There are two broad things that could mean:

1. We don't want users playing around with it, importing it directly in their code and using it.
2. Marking that we won't mess around with it without going through the deprecation policy.

Usually, these two go together (hiding it from users directly calling it means that we get to mess around with it freely), but when we get to metaprogramming tasks like pickling, they become a little decoupled. Because the "user" that we have to think about isn't a person reading documentation and browsing APIs, but a file that someone created years ago and some infrastructure that interprets that file. I think it's good that `_reconstruct` is hidden from human users and is distinct from almost all of what constitutes "numpy's public API", but that doesn't mean that we can't have a tiny third class of functions where we do preserve some guarantees. It's only occasionally confusing to us core devs, not users (who can rely on the "don't touch underscored names" rule just fine).

> I'm also curious about this more generally. We tend to think of the
> fully qualified name of a class as being an implementation detail for
> many libraries and only the top-level lib.Name should be used, but
> many things in the language (including this) break this.

Yes, and it's why these things attempt to be scoped in ways that limit this problem. I.e. if you use pickling, you're told to use it only for transient data with the same versions of libraries on both ends of the pipe, but the reality is that it's too useful to avoid in creating files with arbitrarily long lives. Not their fault; they warned us!

-- Robert Kern
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 1:54 PM Stefan van der Walt via NumPy-Discussion < numpy-discussion@python.org> wrote: > Hi all, > > I am trying to sample k N-dimensional vectors from a uniform distribution > without replacement. > It seems like this should be straightforward, but I can't seem to pin it > down. > > Specifically, I am trying to get random indices in an d0 x d1 x d2.. x > dN-1 array. > > I thought about sneaking in a structured dtype into `rng.integers`, but of > course that doesn't work. > > If we had a string sampler, I could sample k unique words (consisting of > digits), and convert them to indices. > > I could over-sample and filter out the non-unique indices. Or iteratively > draw blocks of samples until I've built up my k unique indices. > > The most straightforward solution would be to flatten indices, and to > sample from those. The integers get large quickly, though. The rng.integers > docstring suggests that it can handle object arrays for very large integers: > > > When using broadcasting with uint64 dtypes, the maximum value (2**64) > > cannot be represented as a standard integer type. > > The high array (or low if high is None) must have object dtype, e.g., > array([2**64]). > > But, that doesn't work: > > In [35]: rng.integers(np.array([2**64], dtype=object)) > ValueError: high is out of bounds for int64 > > Is there an elegant way to handle this problem? > The default dtype for the result of `integers()` is the signed `int64`. If you want to sample from the range `[0, 2**64)`, you need to specify `dtype=np.uint64`. The text you are reading is saying that if you want to specify exactly `2**64` as the exclusive upper bound that you won't be able to do it with a `np.uint64` array/scalar because it's one above the bound for that dtype, so you'll have to use a plain Python `int` object or `dtype=object` array in order to represent `2**64`. It is not saying that you can draw arbitrary-sized integers. >>> rng.integers(2**64, dtype=np.uint64) 11569248114014186612 If the arrays you are drawing indices for are real in-memory arrays for present-day 64-bit computers, this should be adequate. If it's a notional array that is larger, then you'll need actual arbitrary-sized integer sampling. The builtin `random.randrange()` will do arbitrary-sized integers and is quite reasonable for this task. If you want it to use our BitGenerators underneath for clean PRNG state management, this is quite doable with a simple subclass of `random.Random`: https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 4:15 PM Aaron Meurer wrote:

> On Fri, Nov 17, 2023 at 12:10 PM Robert Kern wrote:
> >
> > If the arrays you are drawing indices for are real in-memory arrays for
> > present-day 64-bit computers, this should be adequate. If it's a notional
> > array that is larger, then you'll need actual arbitrary-sized integer
> > sampling. The builtin `random.randrange()` will do arbitrary-sized integers
> > and is quite reasonable for this task. If you want it to use our
> > BitGenerators underneath for clean PRNG state management, this is quite
> > doable with a simple subclass of `random.Random`:
> > https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258
>
> Wouldn't it be better to just use random.randint to generate the
> vectors directly at that point? If the number of possibilities is more
> than 2**64, birthday odds of generating the same vector twice are on
> the order of 1 in 2**32. And you can always do a unique() rejection
> check if you really want to be careful.

Yes, I jumped to correcting the misreading of the docstring rather than solving the root problem. Almost certainly, the most straightforward strategy is to use `rng.integers(0, array_shape)` repeatedly, storing tuples of the resulting indices in a set and rejecting duplicates. You'd have to do the same thing if one used `rng.integers(0, np.prod(array_shape))` or `random.randint(0, np.prod(array_shape))` in the flattened case.

`rng.integers()` doesn't sample without replacement in any case. `rng.choice()` does. It also will not support unbounded integers; it uses a signed `int64` to hold the population size, so it is bounded to that. That is _likely_ to be a fine upper bound on the size of the array (if it is a real array in a 64-bit address space). So one could use `rng.choice(np.prod(array_shape), replace=False, size=n_samples)` and unflatten these integers to the index tuples if that suits the array size. It's just doing the same set lookups internally, just somewhat more efficiently.

-- Robert Kern
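A sketch of that last option, valid as long as the flattened size fits in a signed int64 as noted above:

```python
import numpy as np

def sample_unique_indices(shape, n_samples, rng=None):
    rng = np.random.default_rng(rng)
    flat = rng.choice(np.prod(shape), size=n_samples, replace=False)
    # Unflatten back to index tuples, one row per sample.
    return np.stack(np.unravel_index(flat, shape), axis=-1)

idx = sample_unique_indices((100, 200, 300), 10)   # 10 distinct (i, j, k) triples
```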
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 5:34 PM Stefan van der Walt wrote:

> On Fri, Nov 17, 2023, at 14:28, Stefan van der Walt wrote:
> > Attached is a script that implements this solution.
>
> And the version with set duplicates checking.

If you're going to do the set-checking yourself, then you don't need the unbounded integer support. This is somewhat slower, probably due to the greedy tuple conversions, but eliminates all of the complexity of subclassing `Random`:

    def sample_indices(shape, size, rng=None):
        rng = np.random.default_rng(rng)
        ashape = np.array(shape)
        seen = set()
        while len(seen) < size:
            idx = tuple(rng.integers(0, ashape))
            seen.add(idx)
        return list(seen)

Unfortunately, subclassing from `Random` still doesn't get you the ability to sample without replacement for arbitrary-sized populations using `Random`'s own methods for that. `Random.sample(population, k)` requires `population` to be a `Sequence`, and that restricts you to index-sized integers (`ssize_t`) (you can try to fake one, but `len()` will balk if it gets a too-large integer from `__len__`). But `Random.sample` is only just doing set-checking anyways, so no loss.

-- Robert Kern
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 7:11 PM Aaron Meurer wrote:

> rng.integers() (or np.random.randint) lets you specify lists for low
> and high. So you can just use rng.integers((0,)*len(dims), dims).
>
> Although I'm not seeing how to use this to generate a bunch of vectors
> at once. I would have thought something like size=(10, dims) would let
> you generate 10 vectors of length dims but it doesn't seem to work.

`size=(k, len(dims))`

    def sample_indices(shape, size, rng=None):
        rng = np.random.default_rng(rng)
        ashape = np.array(shape)
        seen = set()
        while len(seen) < size:
            dsize = size - len(seen)
            seen.update(map(tuple, rng.integers(0, ashape, size=(dsize, len(shape)))))
        return list(seen)

That optimistic optimization makes this the fastest solution.

-- Robert Kern
[Numpy-discussion] Re: pickling dtype values
On Mon, Nov 20, 2023 at 10:08 PM Sebastien Binet wrote:

> hi there,
>
> I have written a Go package[1] that can read/write simple arrays in the
> numpy file format [2].
> when I wrote it, it was for simple interoperability use cases, but now
> people would like to be able to read back ragged-arrays[3].
>
> unless I am mistaken, this means I need to interpret pieces of pickled
> data (`ndarray`, `multiarray` and `dtype`).
>
> so I am trying to understand how to unpickle `dtype` values that have been
> pickled:
>
> ```python
> import numpy as np
> import pickle
> import pickletools as pt
>
> pt.dis(pickle.dumps(np.dtype("int32"), protocol=4), annotate=True)
> ```
>
> gives:
>
> ```
>     0: \x80 PROTO      4               Protocol version indicator.
>     2: \x95 FRAME      55              Indicate the beginning of a new frame.
>    11: \x8c SHORT_BINUNICODE 'numpy'   Push a Python Unicode string object.
>    18: \x94 MEMOIZE    (as 0)          Store the stack top into the memo. The stack is not popped.
>    19: \x8c SHORT_BINUNICODE 'dtype'   Push a Python Unicode string object.
>    26: \x94 MEMOIZE    (as 1)          Store the stack top into the memo. The stack is not popped.
>    27: \x93 STACK_GLOBAL               Push a global object (module.attr) on the stack.
>    28: \x94 MEMOIZE    (as 2)          Store the stack top into the memo. The stack is not popped.
>    29: \x8c SHORT_BINUNICODE 'i4'      Push a Python Unicode string object.
>    33: \x94 MEMOIZE    (as 3)          Store the stack top into the memo. The stack is not popped.
>    34: \x89 NEWFALSE                   Push False onto the stack.
>    35: \x88 NEWTRUE                    Push True onto the stack.
>    36: \x87 TUPLE3                     Build a three-tuple out of the top three items on the stack.
>    37: \x94 MEMOIZE    (as 4)          Store the stack top into the memo. The stack is not popped.
>    38: R    REDUCE                     Push an object built from a callable and an argument tuple.
>    39: \x94 MEMOIZE    (as 5)          Store the stack top into the memo. The stack is not popped.
>    40: (    MARK                       Push markobject onto the stack.
>    41: K        BININT1    3           Push a one-byte unsigned integer.
>    43: \x8c     SHORT_BINUNICODE '<'   Push a Python Unicode string object.
>    46: \x94     MEMOIZE    (as 6)      Store the stack top into the memo. The stack is not popped.
>    47: N        NONE                   Push None on the stack.
>    48: N        NONE                   Push None on the stack.
>    49: N        NONE                   Push None on the stack.
>    50: J        BININT     -1          Push a four-byte signed integer.
>    55: J        BININT     -1          Push a four-byte signed integer.
>    60: K        BININT1    0           Push a one-byte unsigned integer.
>    62: t        TUPLE      (MARK at 40) Build a tuple out of the topmost stack slice, after markobject.
>    63: \x94 MEMOIZE    (as 7)          Store the stack top into the memo. The stack is not popped.
>    64: b    BUILD                      Finish building an object, via __setstate__ or dict update.
>    65: .    STOP                       Stop the unpickling machine.
> highest protocol among opcodes = 4
> ```
>
> I have tried to find the usual `__reduce__` and `__setstate__` methods to
> understand what are the various arguments, to no avail.

First, be sure to read the generic `object.__reduce__` docs:

https://docs.python.org/3.11/library/pickle.html#object.__reduce__

Here is the C source for `np.dtype.__reduce__()`:

https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2623-L2750

And `np.dtype.__setstate__()`:

https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2787-L3151

so, in:

> ```python
> >>> np.dtype("int32").__reduce__()[1]
> ('i4', False, True)
> ```

These are arguments to the `np.dtype` constructor and are documented in `np.dtype.__doc__`. The `False, True` arguments are hardcoded and always those values.
> ```python
> >>> np.dtype("int32").__reduce__()[2]
> (3, '<', None, None, None, -1, -1, 0)
> ```

These are arguments to pass to `np.dtype.__setstate__()` after the object has been created.

0. `3` is the version number of the state; `3` is typical for simple dtypes; datetimes and others with metadata will bump this to `4` and use a 9-element tuple instead of this 8-element tuple.
1. `'<'` is the endianness flag.
2. If there are subarrays <https://numpy.org/doc/stable/reference/arrays.dtypes.html#index-7> (e.g. `np.dtype((np.int32, (2,2)))`), that info here.
3. If there are f
[Numpy-discussion] Re: savetxt() has fmt='%.18e' as default, but fmt='%.16e' is always sufficient
On Sat, Nov 25, 2023 at 11:18 AM Jeppe Dakin wrote:

> Double-precision numbers need at most 17 significant decimal digits to be
> serialised losslessly. Yet, savetxt() uses 19 by default, meaning that most
> files produced with savetxt() take up about 9% more disk space than they
> need to, without any benefit. I have described the problem in more detail on
> Stackoverflow:
> https://stackoverflow.com/questions/77535380/minimum-number-of-digits-for-exact-double-precision-and-the-fmt-18e-of-numpy
>
> Is there any reason behind the default choice of savetxt(..., fmt='%.18e')?
> If not, why not reduce it to savetxt(..., fmt='%.16e')?

A long time ago when `savetxt()` was written, we did not use the reliable Dragon4 string representation algorithm that guarantees that floating point numbers are written out with the minimum number of decimal digits needed to reproduce the number. We may even have relied on the platform's floating point to string conversion routines, which were of variable quality. The extra digits accounted for that unreliability.

It probably could be changed now, but I'd want more aggressive testing of the assertion of correctness (`random()`, as used in that StackOverflow demonstration, does *not* exercise a lot of the important edge cases in the floating point format). But if your true concern is that 9% of disk space, you probably don't want to be using `savetxt()` in any case.

-- Robert Kern
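For anyone who wants to experiment, here is a quick (and far from exhaustive) round-trip check with a few values that `random()` would never produce:

```python
import numpy as np

# '%.16e' prints 17 significant digits: 1 before the decimal point, 16 after.
samples = [np.pi, 1e-308, np.nextafter(0.0, 1.0),        # smallest subnormal
           np.nextafter(1.0, 2.0), np.finfo(np.float64).max]
for x in samples:
    assert float(f"{x:.16e}") == x
```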
[Numpy-discussion] Re: Change definition of complex sign (and use it in copysign)
On Wed, Jan 3, 2024 at 4:09 PM Aaron Meurer wrote:

> sign(z) = z/|z| is a fairly standard definition. See
> https://oeis.org/wiki/Sign_function and
> https://en.wikipedia.org/wiki/Sign_function. It's also implemented
> this way in MATLAB and Mathematica (see
> https://www.mathworks.com/help/symbolic/sign.html and
> https://reference.wolfram.com/language/ref/Sign.html). The function
> z/|z| is useful because it represents a normalization of z as a vector
> in the complex plane onto the unit circle.
>
> With that being said, I'm not so sure about the suggestion about
> extending copysign(x1, x2) as |x1|*sign(x2). I generally think of
> copysign as a function to manipulate the floating-point representation
> of a number. It literally copies the sign *bit* from x2 into x1. It's
> useful because of things like -0.0, which is otherwise difficult to
> work with since it compares equal to 0.0. I would find it surprising
> for copysign to do a numeric calculation on complex numbers. Also,
> your suggested definition would be wrong for 0.0 and -0.0, since
> sign(0) is 0, and this is precisely where copysign matters.

Agreed on all points.

-- Robert Kern
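For concreteness, a small sketch of the z/|z| convention under discussion, with sign(0) defined as 0:

```python
import numpy as np

def complex_sign(z):
    """sign(z) = z/|z|, with sign(0) == 0 (a sketch of the proposed convention)."""
    z = np.asarray(z, dtype=complex)
    out = np.zeros_like(z)
    nonzero = z != 0
    out[nonzero] = z[nonzero] / np.abs(z[nonzero])
    return out

print(complex_sign([3 + 4j, -2j, 0]))   # approximately [0.6+0.8j, -1j, 0]
```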
[Numpy-discussion] Re: Feature request: Extension of the np.argsort Function - Returning Positional Information for Data
On Tue, Jan 16, 2024 at 11:05 PM hao chen wrote:

> When dealing with lists that contain duplicate data, np.argsort fails to
> return index values that correspond to the actual sorting positions of the
> data, as it does when handling arrays without duplicates.
>
> Dear Author:
>
> When I use the np.argsort function on an array without duplicate data, the
> returned index values correspond to the sorting positions of the respective
> data.😀
>
>     x = [1, 2, 5, 4]
>     rank = np.argsort(x)
>     print(rank)
>     # [0 1 3 2]

That is not what `argsort` is intended or documented to do. It returns an array of indices _into `x`_ such that if you took the values from `x` in that order, you would get a sorted array. That is, if `x` were sorted into the array `sorted_x`, then `x[rank[i]] == sorted_x[i]` for all `i in range(len(x))`. The indices in `rank` are positions in `x`, not positions in `sorted_x`. They happen to correspond in this case, but that's a coincidence that's somewhat common in these small examples. But consider `[20, 30, 10, 40]`:

    >>> x = np.array([20, 30, 10, 40])
    >>> ix = np.argsort(x)
    >>> def position(x):
    ...     sorted_x = np.array(x)
    ...     sorted_x.sort()
    ...     return np.searchsorted(sorted_x, x)
    ...
    >>> ip = position(x)
    >>> ix
    array([2, 0, 1, 3])
    >>> ip
    array([1, 2, 0, 3])

But also notice:

    >>> np.argsort(np.argsort(x))
    array([1, 2, 0, 3])

This double-argsort is what you seem to be looking for, though it depends on what you want from the handling of duplicates (do you return the first index into the sorted array with the same value as in my `position()` implementation, or do you return the index that particular item was actually sorted to). Either way, we probably aren't going to add this as its own function. Both options are straightforward combinations of existing primitives.

-- Robert Kern
[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays
On Thu, Feb 15, 2024 at 10:21 AM wrote: > Hello Numpy community, > > I'm proposing the introduction of a `pipe` method for NumPy arrays to > enhance their usability and expressiveness. > Adding a prominent method like this to `np.ndarray` is something that we will probably not take up ourselves unless it is adopted by the Array API standard <https://data-apis.org/array-api/latest/>. It's possible that you might get some interest there since the Array API deliberately strips out the number of methods that we already have (e.g. `.mean()`, `.sum()`, etc.) in favor of functions. A general way to add some kind of fluency cheaply in an Array API-agnostic fashion might be helpful to people trying to make their numpy-only code that uses our current set of methods in this way a bit easier. But you'll have to make the proposal to them, I think, to get started. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Converting array to string
On Sat, Feb 24, 2024 at 7:17 PM Dom Grigonis wrote:

> Hello,
>
> I am seeking a bit of help.
>
> I need a fast way to transfer numpy arrays in json format.
>
> Thus, converting array to list is not an option and I need something
> similar to:
>
>     a = np.ones(1000)
>     %timeit a.tobytes()
>     17.6 ms
>
> This is quick, fast and portable. In other words I am very happy with
> this, but...
>
> Json does not support bytes.
>
> Any method of subsequent conversion from bytes to string is a number of
> times slower than the actual serialisation.
>
> So my question is: Is there any way to serialise directly to string?
>
> I remember there used to be 2 methods: tobytes and tostring. However, I
> see that tostring is deprecated and its functionality is equivalent to
> `tobytes`. Is it completely gone? Or is there a chance I can still find a
> performant version of converting to and back from `str` type of non-human
> readable form?

The old `tostring` was actually the same as `tobytes`. In Python 2, the `str` type was what `bytes` is now, a string of octets. In Python 3, `str` became a string of Unicode characters (what you want) and the `bytes` type was introduced for the string of octets, so `tostring` was merely _renamed_ to `tobytes` to match. `tostring` never returned a string of Unicode characters suitable for inclusion in JSON.

AFAICT, there is not likely to be a much more efficient way to convert from an array to a reasonable JSONable encoding (e.g. base64). The time you are seeing is the time it takes to encode that amount of data, period.

That said, if you want to use a quite inefficient hex encoding, `a.data.hex()` is somewhat faster than the base64 encoding, but it's less than ideal in terms of space usage.

-- Robert Kern
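For completeness, the usual route is to wrap the raw bytes in base64, which is valid JSON text. This does not avoid the encoding cost discussed above; it is just the standard-library pattern:

```python
import base64
import json
import numpy as np

a = np.ones(1000)

# Encode: raw bytes -> base64 text that can live inside a JSON document.
payload = json.dumps({
    "dtype": str(a.dtype),
    "shape": a.shape,
    "data": base64.b64encode(a.tobytes()).decode("ascii"),
})

# Decode back to an array.
obj = json.loads(payload)
b = np.frombuffer(base64.b64decode(obj["data"]), dtype=obj["dtype"]).reshape(obj["shape"])
assert np.array_equal(a, b)
```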
[Numpy-discussion] Re: Converting array to string
On Sun, Feb 25, 2024 at 1:52 PM Dom Grigonis wrote: > Thank you for your answer, > > Yeah, I was unsure if it ever existed in the first place. > > Space is less of an issue and something like `a.data.hex()` would be fine > as long as its speed was on par with `a.tobytes()`. However, it is 10x > slower on my machine. > > This e-mail is pretty much a final check (after a fair bit of research and > attempts) that it can not be done so I can eliminate this possibility as > feasible and concentrate on other options. > I think that mostly runs the gamut for pure JSON. Consider looking at BJData, but it's "JSON-based" and not quite pure JSON. https://neurojson.org/ -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Polyfit error in displacement
On Tue, Mar 26, 2024 at 3:39 PM Luca Bertolotti wrote: > Thanks for your reply yes it seems more appropriate a cubic spline but how > can i get what they call SC > I don't think any of us has enough context to know what "SC" is. It's not a standard term that I'm aware of. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: dtype=object arrays not treated like Python list?
On Fri, Mar 29, 2024 at 9:11 AM Steven G. Johnson wrote: > Should a dtype=object array be treated more like Python lists for type > detection/coercion reasons? Currently, they are treated quite differently: > > >>> import numpy as np > >>> np.isfinite([1,2,3]) > array([ True, True, True]) > >>> np.isfinite(np.asarray([1,2,3], dtype=object)) > Traceback (most recent call last): > File "", line 1, in > TypeError: ufunc 'isfinite' not supported for the input types, and the > inputs could not be safely coerced to any supported types according to the > casting rule ''safe'' > > The reason I ask is that we ran into something similar when trying to pass > wrappers around Julia arrays to Python. A Julia `Any[]` array is much like > a Python list or a Numpy `object` array, and exposed in Python as a subtype > of MutableSequence, but we found that it was treated by NumPy as more like > a Numpy `object` array than a Python list ( > https://github.com/JuliaPy/PythonCall.jl/issues/486). > > Would it be desirable to treat a 1d Numpy `object` array more like a > Python `list`? Or is there a way for externally defined types to opt-in > to the `list` behavior? (I couldn't figure out in the numpy source code > where `list` is being special-cased?) > `list` isn't special-cased, per se. Most numpy functions work on `ndarray` objects and accept "array-like" objects like `list`s by coercing them to `ndarray` objects ASAP using `np.asarray()`. `asarray` will leave existing `ndarray` instances alone. When you pass `[1,2,3]` to a numpy function, `np.asarray([1,2,3])` takes that list and infers the dtype and shape from its contents using just the `Sequence` API. Since they are `int` objects of reasonable size, the result is `np.array([1, 2, 3], dtype=np.int64)`. The `isfinite` ufunc has a loop defined for `dtype=np.int64`. `np.asarray()` will also check for special methods and properties like `__array__()` and `__array_interface__` to allow objects to customize how they should be interpreted as `ndarray` objects, in particular, allowing memory to be shared efficiently as pointers, if the layouts are compatible, but also to control the dtype of the final array. PythonCall.jl implements `__array__()` and `__array_interface__` for its array objects for these purposes, and if I am not mistaken, explicitly makes `Any[]` convert to a `dtype=object` `ndarray` <https://github.com/JuliaPy/PythonCall.jl/blob/f586f2494432f2e5366ff1e2876b8aa532630b54/src/JlWrap/objectarray.jl#L61-L69> (the `typestr = "|O"` is equivalent to `dtype=object` in that context). It's *possible* that you'd really rather have this `PyObjectArray` *not* implement those interfaces and just let the `np.asarray()` inference work its magic through the `Sequence` API, but that is expensive. You could also try doing the type inference yourself and implement that in the `PyObjectArray.__array__()` implementation and avoid implementing `__array_interface__` for that object. Then `np.asarray()` will just delegate to `PyObjectArray.__array__()`. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Build NumPy with Debugging Symbols with Meson
Hi everyone,

Is there a gist or similar guide anywhere on the steps required to build NumPy with debugging symbols on the Windows platform using the new Meson build system? It seems a bit difficult because NumPy seems to expect a `conda`-like file structure whereas if you are cloning `cpython` directly it is different. In particular `cpython` puts all the Windows files under the `PC` and `PCBuild` subdirectories. Also meson doesn't seem to have a means to call `python_d.exe` directly which may be causing problems. I've ended up building both `python.exe` and `python_d.exe` but run into some problem finding Python in `meson.build`. This describes my efforts to date:

# Build CPython with Debugging Symbols on Windows

```shell
git clone https://github.com/python/cpython.git
cd cpython
git switch 3.11
PCbuild\get_externals.bat
# Build the debug version `python_d.exe`
PCbuild\build.bat -e -d
# Meson calls `python.exe` and doesn't seem to have an argument to change this.
# We could either build it, or create a symlink
PCbuild\build.bat
# ln -s PCbuild/amd64/python_d.exe PCbuild/amd64/python.exe
```

The built Python will then be located in `/cpython/PCBuild/amd64/python_d.exe`.

```batch
set PREFIX=C:/Users/Robert/dev/cpython
set PATH=%PREFIX%;%PREFIX%/PCBuild/amd64;%PREFIX%/Scripts;%PATH%
```

Next we have to install pip (https://docs.python.org/3/library/ensurepip.html), meson, and cython.

```shell
python_d -m ensurepip
python_d -mpip install meson meson-python ninja cython
```

NOTE: produces `pip3.exe`, not `pip`.

# Build NumPy with debug Python

```shell
git clone https://github.com/numpy/numpy.git
cd numpy
git switch maintenance/1.26.x
git submodule update --init
```

https://mesonbuild.com/meson-python/how-to-guides/debug-builds.html

Next try and build with meson/ninja:

```shell
mkdir build-dbg
cd build-dbg
meson .. setup --buildtype=debug --includedir=C:/Users/Robert/dev/cpython/PC --libdir=C:/Users/Robert/dev/cpython/PCbuild/amd64
ninja
ninja install
```

`meson .. setup <...>` fails and complains that,

'''
Run-time dependency python found: NO (tried sysconfig)
..\meson.build:41:12: ERROR: Python dependency not found
'''

which corresponds to:

```meson.build
py = import('python').find_installation(pure: false)
py_dep = py.dependency()
```

I tried also editing `pyproject.toml` to add the section:

```toml
[tool.meson-python.args]
setup = ['-Dbuildtype=debug', '-Dincludedir=C:/Users/Robert/dev/cpython/PC', '-Dlibdir=C:/Users/Robert/dev/cpython/PCbuild/amd64']
```

and then build NumPy with pip using the debug python build:

```shell
python_d -mpip install . --no-build-isolation -Cbuild-dir=build-dbg
```

But it fails in the same fashion. Searching for issues I find this one, but it's likely that in this case something related to the debug Python build is the problem:

https://github.com/mesonbuild/meson/issues/12547

Meson installed version is 1.4.0. Any advice would be appreciated. I'm happy to write a gist if I can get it working.

-- Robert McLeod
robbmcl...@gmail.com
robert.mcl...@hitachi-hightech.com
[Numpy-discussion] Re: Adding bfill() to numpy.
On Mon, May 20, 2024 at 1:17 PM george trojan wrote:

> xarray has both bfill and ffill for DataArrays. The implementation uses
> bottleneck: https://github.com/pydata/bottleneck.

bottleneck would probably be a good home for these functions (though I don't speak for its maintainers). If it gets picked up by a bunch of other array implementations, then you can make a proposal to have them added to the Array API standard. numpy probably won't add them unless that happens first.

-- Robert Kern
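For reference, a forward/backward fill for a 1-D float array can be written with existing primitives. This is a common idiom, not bottleneck's implementation:

```python
import numpy as np

def ffill_1d(a):
    """Propagate the last non-NaN value forward (leading NaNs stay NaN)."""
    a = np.asarray(a, dtype=float)
    idx = np.where(~np.isnan(a), np.arange(a.size), 0)
    np.maximum.accumulate(idx, out=idx)
    return a[idx]

def bfill_1d(a):
    """Backward fill is a forward fill on the reversed array."""
    return ffill_1d(np.asarray(a)[::-1])[::-1]

print(ffill_1d([np.nan, 1.0, np.nan, np.nan, 2.0]))  # [nan  1.  1.  1.  2.]
```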
[Numpy-discussion] Re: ENH: Add more default arguments to various functions
On Tue, Jun 4, 2024 at 3:12 PM Rakshit Singh wrote: > I am unable to wrap my head around it. > > Why would we make a zero array which is empty. It can be done by a normal > np.array > `shape=()` would not be empty, but a scalar. `np.zeros(())` is the same as `np.array(0.0)`, so it is meaningful, at least. An empty array would be one where one of the axis dimensions is 0, which would make the difference between `zeros()`, `ones()`, and `empty()` meaningless if they all shared a default like that. Nonetheless, I don't think we want to make that a default. It's rarely desired, so it can be rarely requested in an explicit manner. For `np.heaviside()`, a default value was intentionally left unspecified because of the multiple conventions that people want. In the face of ambiguity, refuse the temptation to guess. I think we're comfortable with these choices at this time. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Applying a function on local environments
On Fri, Jul 26, 2024 at 8:50 AM wrote:

> Dear all,
>
> my goal would be to apply some function on a local environment of size K x
> K x K where K is bigger than 1 and odd. For example, if K = 3, it would be
> nice to apply some function "func" on such an environment around each
> n-dimensional position within the n-dimensional array. So, for K = 3 and
> the position (1,1,1) of a 3D array, one would collect the array values at
> the positions X = [(0,0,0),(0,0,1),(0,0,2),(0,1,0),...(2,2,2)] (K**3 = 27
> in total in this example) and yield those values to "func". The result
> value at position (1,1,1) in the output array would be y = func(X). The
> same would apply for all entries excluding the padding area (or according
> to some padding policy).

scipy.ndimage.generic_filter() <https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.generic_filter.html>

-- Robert Kern
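A usage sketch for the K = 3 case described above (the median is just a placeholder for "func"):

```python
import numpy as np
from scipy import ndimage

a = np.random.rand(20, 20, 20)
K = 3

# `function` receives the flattened K x K x K neighborhood around each position;
# `mode` selects the padding policy at the edges.
out = ndimage.generic_filter(a, function=np.median, size=K, mode='reflect')
```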
[Numpy-discussion] Re: Original seed of a BitGenerator
On Fri, Aug 2, 2024 at 1:28 AM Andrew Nelson wrote:

> When using the new `Generator`s for stochastic optimisation I sometimes
> find myself possessing a great solution, but am wondering what path the
> random number generation took to get to that point.
>
> I know that I can get the current state of the BitGenerators. However,
> what I'd like to do is query the BitGenerator to figure out how the
> BitGenerator was setup in the first place.
>
> i.e. either:
>
> - the seed/SeedSequence that was used to construct the BitGenerator

    >>> rng = np.random.default_rng()
    >>> rng.bit_generator.seed_seq
    SeedSequence(
        entropy=186013007116029215180532390504704448637,
    )

In some older versions of numpy, the attribute was semi-private as `_seed_seq`, if you're still using one of those.

-- Robert Kern
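A small sketch of how the recorded entropy can be used to replay the same stream later:

```python
import numpy as np

rng = np.random.default_rng()
entropy = rng.bit_generator.seed_seq.entropy

# Later (or in another process): rebuild an identical generator from that entropy.
rng_replay = np.random.default_rng(np.random.SeedSequence(entropy))
assert rng.random() == rng_replay.random()   # same stream from the start
```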
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > Do you have comments on how to go forward, in particular in regards to > new dtype vs modify np.unicode? Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation because we never really laid out the use cases. We just felt like we needed bytestring and unicode dtypes, more out of completionism than anything, and we made a bunch of assumptions just to get each one done. I think there may be broad agreement that many of those assumptions are "wrong", but it would be good to reference that against concretely-stated use cases. FWIW, if I need to work with in-memory arrays of strings in Python code, I'm going to use dtype=object a la pandas. It has almost no arbitrary constraints, and I can rely on Python's unicode facilities freely. There may be some cases where it's a little less memory-efficient (e.g. representing a column of enumerated single-character values like 'M'/'F'), but that's never prevented me from doing anything (compare to the uniform-length restrictions, which *have* prevented me from doing things). So what's left? Being able to memory-map to files that have string data conveniently laid out according to numpy assumptions (e.g. FITS). Being able to work with C/C++/Fortran APIs that have arrays of strings laid out according to numpy assumptions (e.g. HDF5). I think it would behoove us to canvass the needs of these formats and APIs before making any more assumptions. For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string. I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings. We should look at some of the newer formats and APIs, like Parquet and Arrow, and also consider the cross-language APIs with Julia and R. If I had to jump ahead and propose new dtypes, I might suggest this: * For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs. * Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name. * Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied. * Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. 
Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.). -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: >> >> I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings. > > > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and variable length versions: > https://github.com/PyTables/PyTables/issues/499 > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html > > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters. Ah, okay, I was interpolating from a quick perusal of the h5py docs, which of course are also constrained by numpy's current set of dtypes. The NULL-terminated ASCII works well enough with np.string's semantics. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald wrote: > > On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: >> For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the singular motivating use case for the trailing-NULL behavior of np.string. > > Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. Never mind, then. :-) >> If I had to jump ahead and propose new dtypes, I might suggest this: >> >> * For the most part, treat the string dtypes as temporary communication formats rather than the preferred in-memory working format, similar to how we use `float16` to communicate with GPU APIs. >> >> * Acknowledge the use cases of the current NULL-terminated np.string dtype, but perhaps add a new canonical alias, document it as being for those specific use cases, and deprecate/de-emphasize the current name. >> >> * Add a dtype for holding uniform-length `bytes` strings. This would be similar to the current `void` dtype, but work more transparently with the `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes` like `float64` does with `float`. This would not be NULL-terminated. No encoding would be implied. > > How would this differ from a numpy array of bytes with one more dimension? The scalar in the implementation being the scalar in the use case, immutability of the scalar, directly working with b'' strings in and out (and thus work with the Python codecs easily). >> * Maybe add a dtype similar to `object_` that only permits `unicode/str` (2.x/3.x) strings (and maybe None to represent missing data a la pandas). This maintains all of the flexibility of using a `dtype=object` array while allowing code to specialize for working with strings without all kinds of checking on every item. But most importantly, we can serialize such an array to bytes without having to use pickle. Utility functions could be written for en-/decoding to/from the uniform-length bytestring arrays handling different encodings and things like NULL-termination (also working with the legacy dtypes and handling structured arrays easily, etc.). > > I think there may also be a niche for fixed-byte-size null-terminated strings of uniform encoding, that do decoding and encoding automatically. The encoding would naturally be attached to the dtype, and they would handle too-long strings by either truncating to a valid encoding or simply raising an exception. As with the current fixed-length strings, they'd mostly be for communication with other code, so the necessity depends on whether such other codes exist at all. Databases, perhaps? Custom hunks of C that don't want to deal with variable-length packing of data? Actually this last seems plausible - if I want to pass a great wodge of data, including Unicode strings, to a C program, writing out a numpy array seems maybe the easiest. HDF5 seems to support this, but only for ASCII and UTF8, not a large list of encodings. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Thu, Apr 20, 2017 at 12:27 PM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > > On 20.04.2017 20:53, Robert Kern wrote: > > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor > > mailto:jtaylor.deb...@googlemail.com>> > > wrote: > > > >> Do you have comments on how to go forward, in particular in regards to > >> new dtype vs modify np.unicode? > > > > Can we restate the use cases explicitly? I feel like we ended up with > > the current sub-optimal situation because we never really laid out the > > use cases. We just felt like we needed bytestring and unicode dtypes, > > more out of completionism than anything, and we made a bunch of > > assumptions just to get each one done. I think there may be broad > > agreement that many of those assumptions are "wrong", but it would be > > good to reference that against concretely-stated use cases. > > We ended up in this situation because we did not take the opportunity to > break compatibility when python3 support was added. Oh, the root cause I'm thinking of long predates Python 3, or even numpy 1.0. There never was an explicitly fleshed out use case for unicode arrays other than "Python has unicode strings, so we should have a string dtype that supports it". Hence the "we only support UCS4" implementation; it's not like anyone *wants* UCS4 or interoperates with UCS4, but it does represent all possible Unicode strings. The Python 3 transition merely exacerbated the problem by making Unicode strings the primary string type to work with. I don't really want to ameliorate the exacerbation without addressing the root problem, which is worth solving. I will put this down as a marker use case: Support HDF5's fixed-width UTF-8 arrays. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Thu, Apr 20, 2017 at 12:51 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: >> >> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: >> > >> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: >> >> >> >> I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length ASCII like FITS, but only variable-length UTF8 strings. >> > >> > >> > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and variable length versions: >> > https://github.com/PyTables/PyTables/issues/499 >> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html >> > >> > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for storage, not the number of characters. >> >> Ah, okay, I was interpolating from a quick perusal of the h5py docs, which of course are also constrained by numpy's current set of dtypes. The NULL-terminated ASCII works well enough with np.string's semantics. > > Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should correspond to a string type, not np.string_ (which is really bytes). "... well enough with np.string's semantics [that h5py actually used it to pass data in and out; whether that array is fit for purpose beyond that, I won't comment]." :-) -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Thu, Apr 20, 2017 at 1:16 PM, Phil Hodge wrote: > > On 04/20/2017 03:17 PM, Anne Archibald wrote: >> >> Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. > > FITS BINTABLE extensions can have columns containing strings, and in that case the values are NULL terminated, except that if the string fills the field (i.e. there's no room for a NULL), the NULL will not be written. Ah, that's what I was thinking of, thank you. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker wrote: > > On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: >>> >>> BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings, maybe a name field, where you might assign 32 bytes or so -- then someone has an accented character in their name, and then get 30 or 31 characters -- no big deal. >> >> >> I wouldn't call it a pathological use case, it doesn't seem so uncommon to have large datasets of short strings. > > It's pathological for using a variable-length encoding. > >> I personally deal with a database of hundreds of billions of 2 to 5 character ASCII strings. This has been a significant blocker to Python 3 adoption in my world. > > I agree -- it is a VERY common case for scientific data sets. But a one-byte-per-char encoding would handle it nicely, or UCS-4 if you want Unicode. The wasted space is not that big a deal with short strings... Unless you have hundreds of billions of them. >> BTW, for those new to the list or with a short memory, this topic has been discussed fairly extensively at least 3 times before. Hopefully the *fourth* time will be the charm! > > yes, let's hope so! > > The big difference now is that Julian seems to be committed to actually making it happen! > > Thanks Julian! > > Which brings up a good point -- if you need us to stop the damn bike-shedding so you can get it done -- say so. > > I have strong opinions, but would still rather see any of the ideas on the table implemented than nothing. FWIW, I prefer nothing to just adding a special case for latin-1. Solve the HDF5 problem (i.e. fixed-length UTF-8 strings) or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: >> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. > > +1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting an decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective. That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote: > > On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > >>> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...). >> >> >> Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size. > > Exactly -- the character-orientation of python strings means that people are used to thinking that strings have a length that is the number of characters in the string. I think there will a cognitive dissonance if someone does: > > arr[i] = a_string > > Which then raises a ValueError, something like: > > String too long for a string[12] dytype array. We have the freedom to make the error message not suck. :-) > When len(a_string) <= 12 > > AND that will only occur if there are non-ascii characters in the string, and maybe only if there are more than N non-ascii characters. i.e. it is very likely to be a run-time error that may not have shown up in tests. > > So folks need to do something like: > > len(a_string.encode('utf-8')) to see if their string will fit. If not, they need to truncate it, and THAT is non-obvious how to do, too -- you don't want to truncate the encodes bytes naively, you could end up with an invalid bytestring. but you don't know how many characters to truncate, either. If this becomes the right strategy for dealing with these problems (and I'm not sure that it is), we can easily make a utility function that does this for people. This discussion is why I want to be sure that we have our use cases actually mapped out. For this kind of in-memory manipulation, I'd use an object array (a la pandas), then convert to the uniform-width string dtype when I needed to push this out to a C API, HDF5 file, or whatever actually requires a string-dtype array. The required width gets computed from the data after all of the manipulations are done. Doing in-memory assignments to a fixed-encoding, fixed-width string dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1! >> I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data. > > utf-8 is NOT a one-byte per char encoding. IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context. > > latin-1 or latin-9 buys you (over ASCII): > > - A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better. > > - A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...) > > - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. But what if the format I'm working with specifies another encoding? 
Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
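For reference, a minimal sketch (Python 3, hypothetical helper, not something numpy currently provides) of the kind of truncation utility mentioned above -- encode to UTF-8 and, if the result does not fit the field, back up to a character boundary instead of cutting a multi-byte sequence in half:

    def truncate_utf8(s, max_bytes):
        # Encode first; if the encoded form fits, nothing to do.
        data = s.encode('utf-8')
        if len(data) <= max_bytes:
            return data
        # Step back over UTF-8 continuation bytes (0b10xxxxxx) so the cut
        # falls on a character boundary and the result stays valid UTF-8.
        cut = max_bytes
        while cut > 0 and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        return data[:cut]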
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Mon, Apr 24, 2017 at 11:56 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern wrote: >> >> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: >> > >> > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: >> >> >> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decodes as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get an EncodingError. >> > >> > +1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting an decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective. >> >> That says to me that these are properly represented by `bytes` objects, not `unicode/str` objects encoding to and decoding from a hardcoded latin-1 encoding. > > If you could go back 30 years and get every scientist in the world to do the right thing, then sure. But we are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters. I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly. Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`? > So I would beg to actually move forward with a pragmatic solution that addresses very real and consequential problems that we face instead of waiting/praying for a perfect solution. Well, I outlined a solution: work with `bytes` arrays with utilities to convert to/from the Unicode-aware string dtypes (or `object`). A UTF-8-specific dtype and maybe a string-specialized `object` dtype address the very real and consequential problems that I face (namely and respectively, working with HDF5 and in-memory manipulation of string datasets). I'm happy to consider a latin-1-specific dtype as a second, workaround-for-specific-applications-only-you-have- been-warned-you're-gonna-get-mojibake option. It should not be *the* Unicode string dtype (i.e. named np.realstring or np.unicode as in the original proposal). -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
Chris, you've mashed all of my emails together, some of them are in reply to you, some in reply to others. Unfortunately, this dropped a lot of the context from each of them, and appears to be creating some misunderstandings about what each person is advocating. On Mon, Apr 24, 2017 at 2:00 PM, Chris Barker wrote: > > On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern wrote: >> Solve the HDF5 problem (i.e. fixed-length UTF-8 strings) > > I agree-- binary compatibility with utf-8 is a core use case -- though is it so bad to go through python's encoding/decoding machinery to so it? Do numpy arrays HAVE to be storing utf-8 natively? If the point is to have an array that transparently accepts/yields `unicode/str` scalars while maintaining the in-memory encoding, yes. If that's not the point, then IMO the status quo is fine, and *no* new dtypes should be added, just maybe some utility functions to convert between the bytes-ish arrays and the Unicode-holding arrays (which was one of my proposals). I am mostly happy to live in a world where I read in data as bytes-ish arrays, decode into `object` arrays holding `unicode/str` objects, do my manipulations, then encode the array into a bytes-ish array to give to the C API or file format. >> or leave it be until someone else is willing to solve that problem. I don't think we're at the bikeshedding stage yet; we're still disagreeing about fundamental requirements. > > yeah -- though I've seen projects get stuck in the sorting out what to do, so nothing gets done stage before -- I don't want Julian to get too frustrated and end up doing nothing. I understand, but not all tedious discussions that have not yet achieved consensus are bikeshedding to be cut short. We couldn't really decide what to do back in the pre-1.0 days, too, so we just did *something*, and that something is now the very situation that Julian has a problem with. We have more experience now, especially with the added wrinkles of Python 3; other projects have advanced and matured their Unicode string array-handling (e.g. pandas and HDF5); now is a great time to have a real discussion about what we *need* before we make decisions about what we should *do*. > So here I'll lay out what I think are the fundamental requirements: > > 1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do: > > arr = np.array(("this", "that",)) > > you get an array that can store ANY unicode string with 4 or less characters > > and arr[1] will return a native Python string object. > > 2) There be some way to store mostly ascii-compatible strings in a single byte-per-character array -- so not be wasting space for "typical european-oriented data". > > arr = np.array(("this", "that",), dtype=np.single_byte_string) > > (name TBD) > > and arr[1] would return a python string. > > attempting to put in a not-compatible with the encoding string in would raise an Encoding Error. > > I highly recommend that (SO 8859-15 ( latin-9 or latin-1) be the encoding in this case. > > 3) There be a dtype that could store strings in null-terminated utf-8 binary format -- for interchange with other systems (netcdf, HDF, others???) > > 4) a fixed length bytes dtype -- pretty much what 'S' is now under python three -- settable from a bytes or bytearray object, and returns a bytes object. > - you could use astype() to convert between bytes and a specified encoding with no change in binary representation. 
You'll need to specify what NULL-terminating behavior you want here. np.string_ has NULL-termination. np.void (which could be made to work better with `bytes`) does not. Both have use-cases for text encoding (shakes fist at UTF-16). > 2) and 3) could be fully covered by a dtype with a settable encoding that might as well support all python built-in encodings -- though I think an alias to the common cases would be good -- latin, utf-8. If so, the length would have to be specified in bytes. > > 1) could be covered with the existing 'U': type - only downside being some wasted space -- or with a pointer to a python string dtype -- which would also waste space, though less for long-ish strings, and maybe give us some better access to the nifty built-in string features. > >> > +1. The key point is that there is a HUGE amount of legacy science data in the form of FITS (astronomy-specific binary file format that has been the primary file format for 20+ years) and HDF5 which uses a character data type to store data which can be bytes 0-255. Getting an decoding/encoding error when trying to deal with these datasets is a non-starter from my perspective. > >>
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: >> >> I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly. >> >> Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`? > > This is very simple and obvious but I will state for the record. I appreciate it. What is obvious to you is not obvious to me. > Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat. > > In [22]: dat = np.array(['yes', 'no'], dtype='S3') > > In [23]: dat == 'yes' # FAIL (but works just fine in Py2) > Out[23]: False > > In [24]: dat == b'yes' # Right answer but not practical > Out[24]: array([ True, False], dtype=bool) I'm curious why you think this is not practical. It seems like a very practical solution to me. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer wrote: > > On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker wrote: >>> >>> On the other hand, if this is the use-case, perhaps we really want an encoding closer to "Python 2" string, i.e, "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes. >> >> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really is ascii, then it's perfect. If it really is latin-*, then you get some extra useful stuff, and if it's corrupted somehow, you still get the ascii text correct, and the rest won't barf and can be passed on through. > > I am totally in agreement with Thomas that "We are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters." > > My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data? I don't know that we can reasonably make that accounting relevant. Number of such characters per byte of text? Number of files with such characters out of all existing files? What I can say with assurance is that every time I have decided, as a developer, to write code that just hardcodes latin-1 for such cases, I have regretted it. While it's just personal anecdote, I think it's at least measuring the right thing. :-) -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Mon, Apr 24, 2017 at 5:56 PM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern wrote: >> >> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: >> > >> > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: >> >> >> >> I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit encodings. In that experience, I have found that just treating the data as latin-1 unconditionally is not a pragmatic solution. It's really easy to implement, and you do get a program that runs without raising an exception (at the I/O boundary at least), but you don't often get a program that really runs correctly or treats the data properly. >> >> >> >> Can you walk us through the problems that you are having with working with these columns as arrays of `bytes`? >> > >> > This is very simple and obvious but I will state for the record. >> >> I appreciate it. What is obvious to you is not obvious to me. >> >> > Reading an HDF5 file with character data currently gives arrays of `bytes` [1]. In Py3 this cannot be compared to a string literal, and comparing to (or assigning from) explicit byte strings everywhere in the code quickly spins out of control. This generally forces one to convert the data to `U` type and incur the 4x memory bloat. >> > >> > In [22]: dat = np.array(['yes', 'no'], dtype='S3') >> > >> > In [23]: dat == 'yes' # FAIL (but works just fine in Py2) >> > Out[23]: False >> > >> > In [24]: dat == b'yes' # Right answer but not practical >> > Out[24]: array([ True, False], dtype=bool) >> >> I'm curious why you think this is not practical. It seems like a very practical solution to me. > > In Py3 most character data will be string, not bytes. So every time you want to interact with the bytes array (compare, assign, etc) you need to explicitly coerce the right hand side operand to be a bytes-compatible object. For code that developers write, this might be possible but results in ugly code. But for the general science and engineering communities that use numpy this is completely untenable. Okay, so the problem isn't with (byte-)string literals, but with variables being passed around from other sources. Eg. def func(dat, scalar): return dat == scalar Every one of those functions deepens the abstraction and moves that unicode-by-default scalar farther away from the bytesish array, so it's harder to demand that users of those functions be aware that they need to pass in `bytes` strings. So you need to implement those functions defensively, which complicates them. > The only practical solution so far is to implement a unicode sandwich and convert to the 4-byte `U` type at the interface. That is precisely what we are trying to eliminate. What do you think about my ASCII-surrogateescape proposal? Do you think that would work with your use cases? In general, I don't think Unicode sandwiches will be eliminated by this or the latin-1 dtype; the sandwich is usually the right thing to do and the surrogateescape the wrong thing. But I'm keenly aware of the problems you get when there just isn't a reliable encoding to use. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
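For concreteness, a minimal sketch of the "unicode sandwich" pattern referred to above (assuming Python 3, ASCII-clean data, and an HDF5-like source that hands back fixed-width bytes arrays; the conversion to 'U' is exactly the 4x memory bloat being complained about):

    import numpy as np

    raw = np.array([b'yes', b'no'], dtype='S3')   # bytes at the I/O boundary
    text = raw.astype('U3')                       # decode once into unicode (4x memory)
    mask = (text == 'yes')                        # natural Python 3 comparisons now work
    out = text.astype('S3')                       # encode again when writing back out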
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically. > > From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't. Yes, HDF5 does. Or at least, it is supported in addition to the variable-length ones. https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith wrote: > > On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern wrote: > > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > > > >> That said, AFAICT what people actually want in most use cases is support > >> for arrays that can hold variable-length strings, and the only place where > >> the current approach is *optimal* is when we need mmap compatibility with > >> legacy formats that use fixed-width-nul-padded fields (at which point it's > >> super convenient). It's not even possible to *represent* all Python strings > >> or bytestrings in current numpy unicode or string arrays (Python > >> strings/bytestrings can have trailing nuls). So if we're talking about > >> tweaks to the current system it probably makes sense to focus on this use > >> case specifically. > >> > >> From context I'm assuming FITS files use fixed-width-nul-padding for > >> strings? Is that right? I know HDF5 doesn't. > > > > Yes, HDF5 does. Or at least, it is supported in addition to the > > variable-length ones. > > > > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html > > Doh, I found that page but it was (and is) meaningless to me, so I > went by http://docs.h5py.org/en/latest/strings.html, which says the > options are fixed-width ascii, variable-length ascii, or > variable-length utf-8 ... I guess it's just talking about what h5py > currently supports. It's okay, I made exactly the same mistake earlier in the thread. :-) > But also, is it important whether strings we're loading/saving to an > HDF5 file have the same in-memory representation in numpy as they > would in the file? I *know* [1] no-one is reading HDF5 files using > np.memmap :-). Is it important for some other reason? The lack of such a dtype seems to be the reason why neither h5py nor PyTables supports that kind of HDF5 Dataset. The variable-length Datasets can take up a lot of disk-space because they can't be compressed (even accounting for the wasted padding space). I mean, they probably could have implemented it with objects arrays like h5py does with the variable-length string Datasets, but they didn't. https://github.com/PyTables/PyTables/issues/499 https://github.com/h5py/h5py/issues/624 -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Tue, Apr 25, 2017 at 9:01 AM, Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear. > > So I'll try again -- use-case only! we'll keep the possible solutions separate. > > Do we need to write up a NEP for this? it seems we are going a bit in circles, and we really do want to capture the final decision process. > > 1) The default behaviour for numpy arrays of strings is compatible with Python3's string model: i.e. fully unicode supporting, and with a character oriented interface. i.e. if you do:: ... etc. These aren't use cases but rather requirements. I'm looking for something rather more concrete than that. * HDF5 supports fixed-length and variable-length string arrays encoded in ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite the documentation claiming that there are more options). In practice, the ASCII strings permit high-bit characters, but the encoding is unspecified. Memory-mapping is rare (but apparently possible). The two major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to support that HDF5 option. Compression is supported for fixed-length string arrays but not variable-length string arrays. * FITS supports fixed-length string arrays that are NULL-padded. The strings do not have a formal encoding, but in practice, they are typically mostly ASCII characters with the occasional high-bit character from an unspecific encoding. Memory-mapping is a common practice. These arrays can be quite large even if each scalar is reasonably small. * pandas uses object arrays for flexible in-memory handling of string columns. Lengths are not fixed, and None is used as a marker for missing data. String columns must be written to and read from a variety of formats, including CSV, Excel, and HDF5, some of which are Unicode-aware and work with `unicode/str` objects instead of `bytes`. * There are a number of sometimes-poorly-documented, often-poorly-adhered-to, aging file format "standards" that include string arrays but do not specify encodings, or such specification is ignored in practice. This can make the usual "Unicode sandwich" at the I/O boundaries difficult to perform. * In Python 3 environments, `unicode/str` objects are rather more common, and simple operations like equality comparisons no longer work between `bytes` and `unicode/str`, making it difficult to work with numpy string arrays that yield `bytes` scalars. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Tue, Apr 25, 2017 at 10:04 AM, Chris Barker wrote: > > On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI wrote: >> >> 2017-04-25 12:34 GMT-04:00 Chris Barker : >> > I am totally euro-centric, > >> But Shift-JIS is not one-byte; it's two-byte (unless you allow only >> half-width characters and nothing else). :-) > > bad example then -- are there other non-euro-centric one byte per char encodings worth worrying about? I have no clue :-) I've run into Windows-1251 in files (seismic and well log data from Russian wells). Treating them as latin-1 does not make for a happy time. Both encodings also technically derive from ASCII in the lower half, but most of the actual language is written with the high-bit characters. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
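To make the mojibake concrete, a small illustration (the example word is mine, not from the datasets in question): Windows-1251 bytes decode as latin-1 without error, but the result is garbage rather than the original Cyrillic text.

    raw = 'скважина'.encode('windows-1251')   # Russian for "borehole/well"
    raw.decode('windows-1251')                # 'скважина' -- correct round-trip
    raw.decode('latin-1')                     # 'ñêâàæèíà' -- no error, but mojibake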
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.har...@gmail.com> wrote: > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.face...@gmail.com> wrote: >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur. > > > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated. Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5): * "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs. * "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL. I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
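A quick illustration of the NULL-padded convention using the existing 'S' dtype, which behaves as described above: a width-5 field can hold all 5 non-NULL bytes, shorter values are padded with NULLs, and the trailing NULLs are stripped again on retrieval.

    import numpy as np

    a = np.array([b'abcde', b'ab'], dtype='S5')
    a[0]          # b'abcde' -- full width used, no space reserved for a terminator
    a[1]          # b'ab'    -- stored as b'ab\x00\x00\x00'; trailing NULLs stripped
    a.tobytes()   # b'abcdeab\x00\x00\x00' -- the raw, padded buffer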
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Tue, Apr 25, 2017 at 12:30 PM, Charles R Harris < charlesr.har...@gmail.com> wrote: > > On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern wrote: >> >> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.har...@gmail.com> wrote: >> > >> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.face...@gmail.com> wrote: >> >> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur. >> > >> > Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated. >> >> Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5): >> >> * "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs. >> >> * "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL. >> >> I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space. > > Thanks for the clarification. NULL-padded is what I meant. Okay, however, the biggest use-case we have for UTF-8 arrays (HDF5) is NULL-terminated. > I'm wondering how much of the desired functionality we could get by simply subclassing ndarray in python. I think we mostly want to be able to view byte strings and convert to unicode if needed. I'm not sure. Some of these fixed-width string arrays are embedded inside structured arrays with other dtypes. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
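For instance, a sketch of the kind of record layout meant here (field names and values are made up): the fixed-width string column sits alongside numeric fields in one structured dtype, so it cannot simply be handled by a standalone ndarray subclass.

    import numpy as np

    dt = np.dtype([('well', 'S8'), ('depth', '<f8'), ('count', '<i4')])
    rec = np.zeros(3, dtype=dt)
    rec['well'] = [b'alpha', b'beta', b'gamma']
    rec['depth'] = [102.5, 215.0, 88.75]
    # A binary file laid out with this dtype could be memory-mapped directly,
    # e.g. np.memmap('wells.dat', dtype=dt, mode='r')  (hypothetical file name).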
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < chris.bar...@noaa.gov> wrote: >> Presumably you're getting byte strings (with unknown encoding). > > No -- this is for creating and using mostly ascii string data with python and numpy. > > Unknown encoding bytes belong in byte arrays -- they are not text. You are welcome to try to convince Thomas of that. That is the status quo for him, but he is finding that difficult to work with. > I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that. That sloppiness that you mention is precisely the "unknown encoding" problem. Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings as well. If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-) > Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently -- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters. For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-) There are several use cases being brought forth here. Some involve file reading, some involve file writing, and some involve in-memory manipulation. Whatever change we make is going to impinge somehow on all of the use cases. If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris wrote: > The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation. The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
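A quick illustration of the bytes-versus-characters distinction (the example string is arbitrary):

    s = 'Ångström'
    len(s)                   # 8 characters
    len(s.encode('utf-8'))   # 10 bytes -- the width a fixed-size UTF-8 HDF5 field counts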
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > Indeed, > Most of this discussion is irrelevant to numpy. > Numpy only really deals with the in memory storage of strings. And in > that it is limited to fixed length strings (in bytes/codepoints). > How you get your messy strings into numpy arrays is not very relevant to > the discussion of a smaller representation of strings. > You couldn't get messy strings into numpy without first sorting it out > yourself before, you won't be able to afterwards. > Numpy will offer a set of encodings, the user chooses which one is best > for the use case and if the user screws it up, it is not numpy's problem. > > You currently only have a few ways to even construct string arrays: > - array construction and loops > - genfromtxt (which is again just a loop) > - memory mapping which I seriously doubt anyone actually does for the S > and U dtype I fear that you decided that the discussion was irrelevant and thus did not read it rather than reading it to decide that it was not relevant. Because several of us have showed that, yes indeed, we do memory-map string arrays. You can add to this list C APIs, like that of libhdf5, that need to communicate (Unicode) string arrays. Look, I know I can be tedious, but *please* go back and read this discussion. We have concrete use cases outlined. We can give you more details if you need them. We all feel the pain of the rushed, inadequate implementation of the U dtype. But each of our pains is a little bit different; you obviously aren't experiencing the same pains that I am. > Having a new dtype changes nothing here. You still need to create numpy > arrays from python strings which are well defined and clean. > If you put something in that doesn't encode you get an encoding error. > No oddities like surrogate escapes are needed, numpy arrays are not > interfaces to operating systems nor does numpy need to _add_ support for > historical oddities beyond what it already has. > If you want to represent bytes exactly as they came in don't use a text > dtype (which includes the S dtype, use i1). Thomas Aldcroft has demonstrated the problem with this approach. numpy arrays are often interfaces to files that have tons of historical oddities. > Concerning variable sized strings, this is simply not going to happen. > Nobody is going to rewrite numpy to support it, especially not just for > something as unimportant as strings. > Best you are going to get (or better already have) is object arrays. It > makes no sense to discuss it unless someone comes up with an actual > proposal and the willingness to code it. No one has suggested such a thing. At most, we've talked about specializing object arrays. > What is a relevant discussion is whether we really need a more compact > but limited representation of text than 4-byte utf32 at all. > Its usecase is for the most part just for python3 porting and saving > some memory in some ascii heavy cases, e.g. astronomy. > It is not that significant anymore as porting to python3 has mostly > already happened via the ugly byte workaround and memory saving is > probably not as significant in the context of numpy which is already > heavy on memory usage. > > My initial approach was to not add a new dtype but to make unicode > parametrizable which would have meant almost no cluttering of numpys > internals and keeping the api more or less consistent which would make > this a relatively simple addition of minor functionality for people that > want it. 
> But adding a completely new partially redundant dtype for this usecase > may be too large a change to the api. Having two partially redundant > string types may confuse users more than our current status quo of our > single string type (U). > > Discussing whether we want to support truncated utf8 has some merit as > it is a decision whether to give the users an even larger gun to shoot > themselves in the foot with. > But I'd like to focus first on the 1 byte type to add a symmetric API > for python2 and python3. > utf8 can always be added later should we deem it a good idea. What is your current proposal? A string dtype parameterized with the encoding (initially supporting the latin-1 that you desire and maybe adding utf-8 later)? Or a latin-1-specific dtype such that we will have to add a second utf-8 dtype at a later date? If you're not going to support arbitrary encodings right off the bat, I'd actually suggest implementing UTF-8 and ASCII-surrogateescape first as they seem to knock off more use cases straight away. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Wed, Apr 26, 2017 at 3:27 AM, Anne Archibald wrote: > > On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer wrote: >> >> On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: >>> >>> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < charlesr.har...@gmail.com> wrote: >>> >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation. >>> >>> The major use case that we have for a UTF-8 array is HDF5, and it specifies the width in bytes, not Unicode characters. >> >> It's not just HDF5. Counting bytes is the Right Way to measure the size of UTF-8 encoded text: >> http://utf8everywhere.org/#myths >> >> I also firmly believe (though clearly this is not universally agreed upon) that UTF-8 is the Right Way to encode strings for *non-legacy* applications. So if we're adding any new string encodings, it needs to be one of them. > > It seems to me that most of the requirements people have expressed in this thread would be satisfied by: > > (1) object arrays of strings. (We have these already; whether a strings-only specialization would permit useful things like string-oriented ufuncs is a question for someone who's willing to implement one.) > > (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All python encodings should be permitted. An additional function to truncate encoded data without mangling the encoding would be handy. I think it makes more sense for this to be NULL-padded than NULL-terminated but it may be necessary to support both; note that NULL-termination is complicated for encodings like UCS4. This also includes the legacy UCS4 strings as a special case. > > (3) a dtype for fixed-length byte strings. This doesn't look very different from an array of dtype u8, but given we have the bytes type, accessing the data this way makes sense. The void dtype is already there for this general purpose and mostly works, with a few niggles. On Python 3, it uses 'int8' ndarrays underneath the scalars (fortunately, they do not appear to be mutable views). It also accepts `bytes` strings that are too short (pads with NULs) and too long (truncates). If it worked more transparently and perhaps rigorously with `bytes`, then it would be quite suitable. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > > On 26.04.2017 19:08, Robert Kern wrote: > > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor > > mailto:jtaylor.deb...@googlemail.com>> > > wrote: > > > >> Indeed, > >> Most of this discussion is irrelevant to numpy. > >> Numpy only really deals with the in memory storage of strings. And in > >> that it is limited to fixed length strings (in bytes/codepoints). > >> How you get your messy strings into numpy arrays is not very relevant to > >> the discussion of a smaller representation of strings. > >> You couldn't get messy strings into numpy without first sorting it out > >> yourself before, you won't be able to afterwards. > >> Numpy will offer a set of encodings, the user chooses which one is best > >> for the use case and if the user screws it up, it is not numpy's problem. > >> > >> You currently only have a few ways to even construct string arrays: > >> - array construction and loops > >> - genfromtxt (which is again just a loop) > >> - memory mapping which I seriously doubt anyone actually does for the S > >> and U dtype > > > > I fear that you decided that the discussion was irrelevant and thus did > > not read it rather than reading it to decide that it was not relevant. > > Because several of us have showed that, yes indeed, we do memory-map > > string arrays. > > > > You can add to this list C APIs, like that of libhdf5, that need to > > communicate (Unicode) string arrays. > > > > Look, I know I can be tedious, but *please* go back and read this > > discussion. We have concrete use cases outlined. We can give you more > > details if you need them. We all feel the pain of the rushed, inadequate > > implementation of the U dtype. But each of our pains is a little bit > > different; you obviously aren't experiencing the same pains that I am. > > I have read every mail and it has been a large waste of time, Everything > has been said already many times in the last few years. > Even if you memory map string arrays, of which I have not seen a > concrete use case in the mails beyond "would be nice to have" without > any backing in actual code, but I may have missed it. Yes, we have stated that FITS files with string arrays are currently being read via memory mapping. http://docs.astropy.org/en/stable/io/fits/index.html You were even pointed to a minor HDF5 implementation that memory maps: https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683 I'm afraid that I can't share the actual code of the full variety of proprietary file formats that I've written code for, I can assure you that I have memory mapped many string arrays in my time, usually embedded as columns in structured arrays. It is not "nice to have"; it is "have done many times and needs better support". > In any case it is still irrelevant. My proposal only _adds_ additional > cases that can be mmapped. It does not prevent you from doing what you > have been doing before. You are the one who keeps worrying about the additional complexity, both in code and mental capacity of our users, of adding new overlapping dtypes and solutions, and you're not wrong about that. I think it behooves us to consider if there are solutions that solve multiple related problems at once instead of adding new dtypes piecemeal to solve individual problems. > >> Having a new dtype changes nothing here. You still need to create numpy > >> arrays from python strings which are well defined and clean. 
> >> If you put something in that doesn't encode you get an encoding error. > >> No oddities like surrogate escapes are needed, numpy arrays are not > >> interfaces to operating systems nor does numpy need to _add_ support for > >> historical oddities beyond what it already has. > >> If you want to represent bytes exactly as they came in don't use a text > >> dtype (which includes the S dtype, use i1). > > > > Thomas Aldcroft has demonstrated the problem with this approach. numpy > > arrays are often interfaces to files that have tons of historical oddities. > > This does not matter for numpy, the text dtype is well defined as bytes > with a specific encoding and null padding. You cannot dismiss something as "not mattering for *numpy*" just because your new, *proposed* text dtype doesn't support it. You seem to have fixed on a course of action and are defining everyone else's use cases as out-of-scope because your course of action doesn't support them. That
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote: > I remember talking with a colleague about something like that. And > basically an annoying thing there was that if you strip the zero bytes > in a zero padded string, some encodings (UTF16) may need one of the > zero bytes to work right. (I think she got around it, by weird > trickery, inverting the endianness or so and thus putting the zero bytes > first). > Maybe I will ask her if this discussion is interesting to her. Though I > think it might have been something like "make everything in > hdf5/something similar work" without any actual use case, I don't know. I don't think that will be an issue for an encoding-parameterized dtype. The decoding machinery of that would have access to the full-width buffer for the item, and the encoding knows what its atomic unit is (e.g. 2 bytes for UTF-16). It's only if you have to hack around at a higher level with numpy's S arrays, which return Python byte strings that strip off the trailing NULL bytes, that you have to worry about such things. Getting a Python scalar from the numpy S array loses information in such cases. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
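A small demonstration of the UTF-16 hazard described above, at the plain Python level: the last code unit of an ASCII character in UTF-16-LE ends in a zero byte, so stripping trailing NULLs truncates the final character.

    raw = 'abc'.encode('utf-16-le')   # b'a\x00b\x00c\x00'
    stripped = raw.rstrip(b'\x00')    # b'a\x00b\x00c' -- half of the last code unit gone
    stripped.decode('utf-16-le')      # raises UnicodeDecodeError: truncated data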
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > > On Apr 26, 2017 12:09 PM, "Robert Kern" wrote: >> It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently. > > I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. Is it that they want numpy to enforce the length limit early, to catch errors when the array is modified instead of when they go to write it to the file? Is it that they really want an O(1) way to look at a array and know the maximum number of bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is really annoying and files that need it are rare so they haven't had the motivation to implement it? https://github.com/PyTables/PyTables/issues/499 https://github.com/h5py/h5py/issues/379 -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] proposal: smaller representation of string arrays
On Wed, Apr 26, 2017 at 5:02 PM, Chris Barker wrote: > But a bunch of folks have brought up that while we're messing around with string encoding, let's solve another problem: > > * Exchanging unicode text at the binary level with other systems that generally don't use UCS-4. > > For THAT -- utf-8 is critical. > > But if I understand Julian's proposal -- he wants to create a parameterized text dtype that you can set the encoding on, and then numpy will use the encoding (and python's machinery) to encode / decode when passing to/from python strings. > > It seems this would support all our desires: > > I'd get a latin-1 encoded type for compact representation of mostly-ascii data. > > Thomas would get latin-1 for binary interchange with mostly-ascii data > > The HDF-5 folks would get utf-8 for binary interchange (If we can workout the null-padding issue) > > Even folks that had weird JAVA or Windows-generated UTF-16 data files could do the binary interchange thing > > I'm now lost as to what the hang-up is. The proposal is for only latin-1 and UTF-32 to be supported at first, and the eventual support of UTF-8 will be constrained by specification of the width in terms of characters rather than bytes, which conflicts with the use cases of UTF-8 that have been brought forth. https://mail.python.org/pipermail/numpy-discussion/2017-April/076668.html -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Compare NumPy arrays with threshold and return the differences
On Wed, May 17, 2017 at 9:50 AM, Nissim Derdiger wrote: > Hi, > In my script, I need to compare big NumPy arrays (2D or 3D), and return a > list of all cells with difference bigger than a defined threshold. > The compare itself can be done easily done with "allclose" function, like > that: > Threshold = 0.1 > if (np.allclose(Arr1, Arr2, Threshold, equal_nan=True)): > Print('Same') > But this compare does not return *which* cells are not the same. > > The easiest (yet naive) way to know which cells are not the same is to use > a simple for loops code like this one: > def CheckWhichCellsAreNotEqualInArrays(Arr1,Arr2,Threshold): >if not Arr1.shape == Arr2.shape: >return ['Arrays size not the same'] >Dimensions = Arr1.shape >Diff = [] >for i in range(Dimensions [0]): >for j in range(Dimensions [1]): >if not np.allclose(Arr1[i][j], Arr2[i][j], Threshold, > equal_nan=True): >Diff.append(',' + str(i) + ',' + str(j) + ',' + > str(Arr1[i,j]) + ',' >+ str(Arr2[i,j]) + ',' + str(Threshold) + ',Fail\n') >return Diff > (and same for 3D arrays - with 1 more for loop) > This way is very slow when the Arrays are big and full of none-equal cells. > > Is there a fast straight forward way in case they are not the same - to > get a list of the uneven cells? maybe some built-in function in the NumPy > itself? > Use `close_mask = np.isclose(Arr1, Arr2, Threshold, equal_nan=True)` to return a boolean mask the same shape as the arrays which is True where the elements are close and False where they are not. You can invert it to get a boolean mask which is True where they are "far" with respect to the threshold: `far_mask = ~close_mask`. Then you can use `i_idx, j_idx = np.nonzero(far_mask)` to get arrays of the `i` and `j` indices where the values are far. For example: for i, j in zip(i_idx, j_idx): print("{0}, {1}, {2}, {3}, {4}, Fail".format(i, j, Arr1[i, j], Arr2[i, j], Threshold)) -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Compare NumPy arrays with threshold
On Thu, May 18, 2017 at 5:07 AM, Nissim Derdiger wrote: > > Hi again, > Thanks for the responses to my question! > Roberts answer worked very well for me, except for 1 small issue: > > This line: > close_mask = np.isclose(MatA, MatB, Threshold, equal_nan=True) > returns each difference twice – once j in compare to I and once for I in compare to j No, it returns a boolean array the same size as MatA and MatB. It literally can't contain "each difference twice". Maybe there is something else in your code that is producing the doubling that you see, possibly in the printing of the results. I'm not seeing the behavior that you speak of. Please post your complete code that produced the doubled output that you see. import numpy as np MatA = np.array([[10,20,30],[40,50,60]]) MatB = np.array([[10,30,30],[40,50,160]]) Threshold = 1.0 # Note the `atol=` here. I missed it before. close_mask = np.isclose(MatA, MatB, atol=Threshold, equal_nan=True) far_mask = ~close_mask i_idx, j_idx = np.nonzero(far_mask) for i, j in zip(i_idx, j_idx): print("{0}, {1}, {2}, {3}, {4}, Fail".format(i, j, MatA[i, j], MatB[i, j], Threshold)) I get the following output: $ python isclose.py 0, 1, 20, 30, 1.0, Fail 1, 2, 60, 160, 1.0, Fail -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] ANN: SfePy 2017.2
I am pleased to announce release 2017.2 of SfePy. Description --- SfePy (simple finite elements in Python) is a software for solving systems of coupled partial differential equations by the finite element method or by the isogeometric analysis (limited support). It is distributed under the new BSD license. Home page: http://sfepy.org Mailing list: https://mail.python.org/mm3/mailman3/lists/sfepy.python.org/ Git (source) repository, issue tracker: https://github.com/sfepy/sfepy Highlights of this release -- - simplified and unified implementation of some homogenized coefficients - support for saving custom structured data to HDF5 files - new tutorial on preparing meshes using FreeCAD/OpenSCAD and Gmsh For full release notes see http://docs.sfepy.org/doc/release_notes.html#id1 (rather long and technical). Cheers, Robert Cimrman --- Contributors to this release in alphabetical order: Robert Cimrman Jan Heczko Lubos Kejzlar Vladimir Lukes Matyas Novak ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] why a[0][0].__mul__(a[0][0]) where a is np.array, gives 'missing 1 required positional argument'?
On Tue, Jun 27, 2017 at 2:54 AM, Larry Evans wrote: > a= > [[ 12.0] > [12.0 <0>]] > a[0]= > [ 12.0] > a[0][0]= > > Traceback (most recent call last): > File "shortestPathABC.py", line 123, in > a00mul=a[0][0].__mul__(a[0][0]) > TypeError: __mul__() missing 1 required positional argument: 'other' > Makefile:7: recipe for target 'all' failed > make: *** [all] Error 1 > --}--cut here-- > > I don't understand why. Apparently, a[0][0] is not a ShortestNull because otherwise, the .__mul__ would have the positional argument, > 'other' equal to a[0][0]. > > What am I missing? > a=np.array([[ShortestNull,ShortestPath(12)],[ShortestPath( 12),ShortestNull()]],dtype=object) You didn't instantiate ShortestNull but passed the class object instead. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
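A minimal illustration of the class-versus-instance mistake (the class here is made up, standing in for ShortestNull):

    class Cost:
        def __init__(self, value=0):
            self.value = value
        def __mul__(self, other):
            return Cost(self.value * other.value)

    c = Cost(12)
    c.__mul__(c)          # fine: `c` is bound as `self`, the argument as `other`
    # Cost.__mul__(Cost)  # TypeError: __mul__() missing 1 required positional argument: 'other'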
Re: [Numpy-discussion] Boolean binary '-' operator
On Tue, Jun 27, 2017 at 3:01 PM, Benjamin Root wrote: > > Forgive my ignorance, but what is "Z/2"? https://groupprops.subwiki.org/wiki/Cyclic_group:Z2 https://en.wikipedia.org/wiki/Cyclic_group -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
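For boolean arrays, Z/2 arithmetic is just XOR, where addition and subtraction coincide; a quick illustration only to unpack the reference, not a claim about what '-' should do:

    import numpy as np

    a = np.array([True, True, False, False])
    b = np.array([True, False, True, False])
    print(a ^ b)   # [False  True  True False] -- this is both "a + b" and "a - b" in Z/2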
Re: [Numpy-discussion] proposed changes to array printing in 1.14
On Fri, Jun 30, 2017 at 4:47 PM, Gael Varoquaux < gael.varoqu...@normalesup.org> wrote: > > One problem is that it becomes hard (impossible?) for downstream packages > such as scikit-learn to doctest under multiple versions of the numpy. > Past experience has shown that it could be useful. It's not that hard: wrap the new `set_printoptions(pad=True)` in a `try:` block to catch the error under old versions. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
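A sketch of that pattern; `pad=True` is just the option name floated in this thread, so treat the keyword as a placeholder for whatever the final print option is called:

    import numpy as np

    try:
        np.set_printoptions(pad=True)   # hypothetical new-style option from this discussion
    except TypeError:
        # Older numpy releases don't know the keyword; keep their default formatting.
        pass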
Re: [Numpy-discussion] proposed changes to array printing in 1.14
On Fri, Jun 30, 2017 at 7:23 PM, Juan Nunez-Iglesias wrote: > I do have sympathy for Ralf’s argument that "exact repr's are not part of the NumPy (or Python for that matter) backwards compatibility guarantees”. But it is such a foundational project in Scientific Python that I think extreme care is warranted, beyond any official guarantees. (Hence this thread, yes. Thank you!) I would also like to make another distinction here: I don't think anyone's actual *code* has broken because of this change. To my knowledge, it is only downstream projects' *doctests* that break. This might deserve *some* care on our part (beyond notification and keeping it out of a 1.x.y bugfix release), but "extreme care" is just not warranted. > Anyway, all this is (mostly) moot if the next NumPy ships with this doctest++ thingy. That would be an enormously valuable contribution to the whole ecosystem. I'd recommend just making an independent project on Github and posting it as its own project to PyPI when you think it's ready. We'll link to it in our documentation. I don't think that it ought to be part of numpy and stuck on our release cadence. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] record data previous to Numpy use
On Wed, Jul 5, 2017 at 5:41 AM, wrote: > > Dear all > > I’m sorry if my question is too basic (not fully in relation to Numpy – while it is to build matrices and to work with Numpy afterward), but I’m spending a lot of time and effort to find a way to record data from an asci while, and reassign it into a matrix/array … with unsuccessfully! > > The only way I found is to use ‘append()’ instruction involving dynamic memory allocation. :-( Are you talking about appending to Python list objects? Or the np.append() function on numpy arrays? In my experience, it is usually fine to build a list with the `.append()` method while reading the file of unknown size and then converting it to an array afterwards, even for dozens of millions of lines. The list object is quite smart about reallocating memory so it is not that expensive. You should generally avoid the np.append() function, though; it is not smart. > From my current experience under Scilab (a like Matlab scientific solver), it is well know: > > Step 1 : matrix initialization like ‘np.zeros(n,n)’ > Step 2 : record the data > and write it in the matrix (step 3) > > I’m obviously influenced by my current experience, but I’m interested in moving to Python and its packages > > For huge asci files (involving dozens of millions of lines), my strategy is to work by ‘blocks’ as : > > Find the line index of the beginning and the end of one block (this implies that the file is read ounce) > Read the block > (process repeated on the different other blocks) Are the blocks intrinsic parts of the file? Or are you just trying to break up the file into fixed-size chunks? > I tried different codes such as bellow, but each time Python is telling me I cannot mix iteration and record method > > # > > position = []; j=0 > with open(PATH + file_name, "r") as rough_ data: > for line in rough_ data: > if my_criteria in line: > position.append(j) ## huge blocs but limited in number > j=j+1 > > i = 0 > blockdata = np.zeros( (size_block), dtype=np.float) > with open(PATH + file_name, "r") as f: > for line in itertools.islice(f,1,size_block): > blockdata [i]=float(f.readline() ) For what it's worth, this is the line that is causing the error that you describe. When you iterate over the file with the `for line in itertools.islice(f, ...):` loop, you already have the line text. You don't (and can't) call `f.readline()` to get it again. It would mess up the iteration if you did and cause you to skip lines. By the way, it is useful to help us help you if you copy-paste the exact code that you are running as well as the full traceback instead of paraphrasing the error message. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
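A minimal sketch of both points (accumulate in a Python list, and don't call `f.readline()` inside the loop); the file name and one-float-per-line format are assumptions for illustration:

    import numpy as np

    values = []
    with open("data.txt", "r") as f:       # hypothetical single-column ascii file
        for line in f:
            line = line.strip()
            if not line:
                continue
            values.append(float(line))     # list.append is cheap; convert once at the end

    arr = np.array(values, dtype=float)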
Re: [Numpy-discussion] record data previous to Numpy use
While I'm going to bet that the fastest way to build a ndarray from ascii is with a 'io.ByteIO` stream, NumPy does have a function to load from text, `numpy.loadtxt` that works well enough for most purposes. https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html It's hard to tell from the original post if the ascii is being continuously generated or not. If it's being produced in an on-going fashion then a stream object is definitely the way to go, as the array chunks can be produced by `numpy.frombuffer()`. https://docs.python.org/3/library/io.html https://docs.scipy.org/doc/numpy/reference/generated/numpy.frombuffer.html Robert On Wed, Jul 5, 2017 at 3:21 PM, Robert Kern wrote: > On Wed, Jul 5, 2017 at 5:41 AM, wrote: > > > > Dear all > > > > I’m sorry if my question is too basic (not fully in relation to Numpy – > while it is to build matrices and to work with Numpy afterward), but I’m > spending a lot of time and effort to find a way to record data from an asci > while, and reassign it into a matrix/array … with unsuccessfully! > > > > The only way I found is to use ‘append()’ instruction involving dynamic > memory allocation. :-( > > Are you talking about appending to Python list objects? Or the np.append() > function on numpy arrays? > > In my experience, it is usually fine to build a list with the `.append()` > method while reading the file of unknown size and then converting it to an > array afterwards, even for dozens of millions of lines. The list object is > quite smart about reallocating memory so it is not that expensive. You > should generally avoid the np.append() function, though; it is not smart. > > > From my current experience under Scilab (a like Matlab scientific > solver), it is well know: > > > > Step 1 : matrix initialization like ‘np.zeros(n,n)’ > > Step 2 : record the data > > and write it in the matrix (step 3) > > > > I’m obviously influenced by my current experience, but I’m interested in > moving to Python and its packages > > > > For huge asci files (involving dozens of millions of lines), my strategy > is to work by ‘blocks’ as : > > > > Find the line index of the beginning and the end of one block (this > implies that the file is read ounce) > > Read the block > > (process repeated on the different other blocks) > > Are the blocks intrinsic parts of the file? Or are you just trying to > break up the file into fixed-size chunks? > > > I tried different codes such as bellow, but each time Python is telling > me I cannot mix iteration and record method > > > > # > > > > position = []; j=0 > > with open(PATH + file_name, "r") as rough_ data: > > for line in rough_ data: > > if my_criteria in line: > > position.append(j) ## huge blocs but limited in > number > > j=j+1 > > > > i = 0 > > blockdata = np.zeros( (size_block), dtype=np.float) > > with open(PATH + file_name, "r") as f: > > for line in itertools.islice(f,1,size_block): > > blockdata [i]=float(f.readline() ) > > For what it's worth, this is the line that is causing the error that you > describe. When you iterate over the file with the `for line in > itertools.islice(f, ...):` loop, you already have the line text. You don't > (and can't) call `f.readline()` to get it again. It would mess up the > iteration if you did and cause you to skip lines. > > By the way, it is useful to help us help you if you copy-paste the exact > code that you are running as well as the full traceback instead of > paraphrasing the error message. 
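Two small sketches of those suggestions, with made-up data (note the stream class is io.BytesIO): np.loadtxt for text that is already complete, and np.frombuffer for raw binary chunks arriving through a buffer:

    import io
    import numpy as np

    # Finished text: loadtxt accepts a file name or any file-like object.
    text = io.StringIO(u"1.0 2.0 3.0\n4.0 5.0 6.0\n")
    arr = np.loadtxt(text)                                     # shape (2, 3)

    # Binary chunks: frombuffer interprets raw bytes directly, no parsing.
    raw = io.BytesIO(np.arange(6, dtype=np.float64).tobytes())
    chunk = np.frombuffer(raw.getvalue(), dtype=np.float64)    # shape (6,)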
> > -- > Robert Kern > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -- Robert McLeod, Ph.D. robert.mcl...@unibas.ch robert.mcl...@bsse.ethz.ch robbmcl...@gmail.com ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] record data previous to Numpy use
On Thu, Jul 6, 2017 at 1:49 AM, wrote: > > Dear All > > First of all thanks for the answers and the information’s (I’ll ding into it) and let me trying to add comments on what I want to : > > My asci file mainly contains data (float and int) in a single column > (it is not always the case but I can easily manage it – as well I saw I can use ‘spli’ instruction if necessary) > Comments/texts indicates the beginning of a bloc immediately followed by the number of sub-blocs > So I need to read/record all the values in order to build a matrix before working on it (using Numpy & vectorization) > > The columns 2 and 3 have been added for further treatments > The ‘0’ values will be specifically treated afterward > > > Numpy won’t be a problem I guess (I did some basic tests and I’m quite confident) on how to proceed, but I’m really blocked on data records … I trying to find a way to efficiently read and record data in a matrix: > > avoiding dynamic memory allocation (here using ‘append’ in python meaning, not np), Although you can avoid some list appending in your case (because the blocks self-describe their length), I would caution you against prematurely avoiding it. It's often the most natural way to write the code in Python, so go ahead and write it that way first. Once you get it working correctly, but it's too slow or memory intensive, then you can puzzle over how to preallocate the numpy arrays later. But quite often, it's fine. In this case, the reading and handling of the text data itself is probably the bottleneck, not appending to the lists. As I said, Python lists are cleverly implemented to make appending fast. Accumulating numbers in a list then converting to an array afterwards is a well-accepted numpy idiom. > dealing with huge asci file: the latest file I get contains more than 60 million of lines > > Please find in attachment an extract of the input format (‘example_of_input’), and the matrix I’m trying to create and manage with Numpy > > Thanks again for your time Try something like the attached. The function will return a list of blocks. Each block will itself be a list of numpy arrays, which are the sub-blocks themselves. I didn't bother adding the first three columns to the sub-blocks or trying to assemble them all into a uniform-width matrix by padding with trailing 0s. Since you say that the trailing 0s are going to be "specially treated afterwards", I suspect that you can more easily work with the lists of arrays instead. I assume floating-point data rather than trying to figure out whether int or float from the data. The code can handle multiple data values on one line (not especially well-tested, but it ought to work), but it assumes that the number of sub-blocks, index of the sub-block, and sub-block size are each on the own line. The code gets a little more complicated if that's not the case. -- Robert Kern from __future__ import print_function import numpy as np def write_random_file(filename, n_blocks=42, n_elems=60*1000*1000): q, r = divmod(n_elems, n_blocks) block_lengths = [q] * n_blocks block_lengths[-1] += r with open(filename, 'w') as f: print('##BEGIN', file=f) print(n_blocks, file=f) for i, block_length in enumerate(block_lengths, 1): print(i, file=f) print(block_length, file=f) block = np.random.randint(0, 1000, size=block_length) for x in block: print(x, file=f) def read_blocked_file(filename): blocks = [] with open(filename, 'r') as f: # Loop over all blocks. while True: # Consume lines until the start of the next block. 
# Unfortunately, we can't use `for line in f:` because we need to # use `f.readline()` later. line = f.readline() found_block = True while '##BEGIN' not in line: line = f.readline() if line == '': # We've reached the end of the file. found_block = False break if not found_block: # We iterated to the end of the file. Break out of the `while` # loop. break # Read the number of sub-blocks. # This assumes that it is on a line all by itself. n_subblocks = int(f.readline()) subblocks = [] for i_subblock in range(1, n_subblocks + 1): read_i_subblock = int(f.readline()) # These ought to match. if read_i_subblock != i_subblock: raise RuntimeError("Mismatched sub-block index") # Read the size of the sub-block. subblock_size = int(f.readline()) # Allocate an array
Re: [Numpy-discussion] record data previous to Numpy use
On Thu, Jul 6, 2017 at 3:19 AM, wrote: > > Thanks Rober for your effort - I'll have a look on it > > ... the goal is be guide in how to proceed (and to understand), and not to have a "ready-made solution" ... but I appreciate honnestly :-) Sometimes it's easier to just write the code than to try to explain in prose what to do. :-) -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] How to compare an array of arrays elementwise to None in Numpy 1.13 (was easy before)?
On Mon, Jul 17, 2017 at 2:13 AM, wrote: > > Dear all > > I have object array of arrays, which I compare element-wise to None in various places: > > >>> a = numpy.array([numpy.arange(5),None,numpy.nan,numpy.arange(6),None],dtype=numpy.object) > >>> a > array([array([0, 1, 2, 3, 4]), None, nan, array([0, 1, 2, 3, 4, 5]), None], dtype=object) > >>> numpy.equal(a,None) > FutureWarning: comparison to `None` will result in an elementwise object comparison in the future. > > > So far, I always ignored the warning, for lack of an idea how to resolve it. > > Now, with Numpy 1.13, I have to resolve the issue, because it fails with: > > ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() > > It seem that the numpy.equal is applied to each inner array, returning a Boolean array for each element, which cannot be coerced to a single Boolean. > > The expression > > >>> numpy.vectorize(operator.is_)(a,None) > > gives the desired result, but feels a bit clumsy. Wrap the clumsiness up in a documented, tested utility function with a descriptive name and use that function everywhere instead. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
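One possible shape for such a utility, reusing the np.vectorize(operator.is_) approach from the question (the function name is made up):

    import operator
    import numpy as np

    _elementwise_is = np.vectorize(operator.is_, otypes=[bool])

    def isnone(arr):
        """Return a boolean mask that is True where elements of an object array are None."""
        return _elementwise_is(np.asarray(arr, dtype=object), None)

    # For the `a` in the question, isnone(a) -> array([False,  True, False, False,  True])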
Re: [Numpy-discussion] How to compare an array of arrays elementwise to None in Numpy 1.13 (was easy before)?
On Mon, Jul 17, 2017 at 10:52 AM, Eric Wieser wrote: > Here’s a hack that lets you keep using ==: > > class IsCompare: > __array_priority__ = 99 # needed to make it work on either side of > `==` > def __init__(self, val): self._val = val > def __eq__(self, other): return other is self._val > def __neq__(self, other): return other is not self._val > > a == IsCompare(None) # a is None > a == np.array(IsCompare(None)) # broadcasted a is None > > Frankly, I'd stick with a well-named utility function. It's much more kind to those who have to read the code (e.g. you in 6 months). :-) -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] power function distribution or power-law distribution?
On Thu, Aug 24, 2017 at 7:56 AM, Renato Fabbri wrote: > > On Thu, Aug 24, 2017 at 11:47 AM, Nathan Goldbaum wrote: >> >> The latest version of numpy is 1.13. >> >> In this case, as described in the docs, a power function distribution is one with a probability desnity function of the form ax^(a-1) for x between 0 and 1. > > ok, let's try ourselves to relate the terms. > Would you agree that the "power function distribution" is a "power-law distribution" > in which the domain is restricted to be [0,1]? I probably wouldn't. The coincidental similarity in functional form (domain and normalizing constants notwithstanding) obscures the very different mechanisms each represent. The ambiguous name of the method `power` instead of `power_function` is my fault. You have my apologies. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
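For concreteness, numpy exposes both mechanisms under different names; this is only meant to illustrate the distinction being drawn, not to equate the two:

    import numpy as np

    prng = np.random.RandomState(0)
    pf = prng.power(3.0, size=100000)          # "power function": density 3*x**2 on [0, 1]
    pl = 1.0 + prng.pareto(3.0, size=100000)   # classic heavy-tailed power law on [1, inf)
    print(pf.max() <= 1.0, pl.min() >= 1.0)    # True True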
Re: [Numpy-discussion] selecting all 2-d slices out of n-dimensional array
On Tue, Aug 29, 2017 at 6:03 PM, Moroney, Catherine M (398E) < catherine.m.moro...@jpl.nasa.gov> wrote: > Hello, > > > > I have an n-dimensional array (say (4,4,2,2)) and I wish to automatically > extract all the (4,4) slices in it. > > i.e. > > > > a = numpy.arange(0, 64).reshape(4,4,2,2) > > slice1 = a[..., 0, 0] > > slice2 = a[..., 0, 1] > > slice3 = a[..., 1, 0] > > slice4 = a[..., 1,1] > > > > Simple enough example but in my case array “a” will have unknown rank and > size. All I know is that it will have more than 2 dimensions, but I don’t > know ahead of time how many dimensions or what the size of those dimensions > are. > > > > What is the best way of tackling this problem without writing a whole > bunch of if-then cases depending on what the rank and shape of a is? Is > there a one-size-fits-all solution? > First, reshape the array to (4, 4, -1). The -1 tells the method to choose whatever's needed to get the size to work out. Then roll the last axis to the front, and then you have a sequence of the (4, 4) arrays that you wanted. E.g. (using (4,4,3,3) as the original shape for clarity) [~] |26> a = numpy.arange(0, 4*4*3*3).reshape(4,4,3,3) [~] |27> b = a.reshape([4, 4, -1]) [~] |28> b.shape (4, 4, 9) [~] |29> c = np.rollaxis(b, -1, 0) [~] |30> c.shape (9, 4, 4) [~] |31> c[0] array([[ 0, 9, 18, 27], [ 36, 45, 54, 63], [ 72, 81, 90, 99], [108, 117, 126, 135]]) [~] |32> c[1] array([[ 1, 10, 19, 28], [ 37, 46, 55, 64], [ 73, 82, 91, 100], [109, 118, 127, 136]]) -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Only integer scalar arrays can be converted to a scalar index
On Fri, Sep 15, 2017 at 2:37 PM, Elliot Hallmark wrote: > Nope. Numpy only works on in memory arrays. You can determine your own > chunking strategy using hdf5, or something like dask can figure that > strategy out for you. With numpy you might worry about not accidentally > making duplicates or intermediate arrays, but that's the extent of memory > optimization you can do in numpy itself. > NumPy does have its own memory map variant on ndarray: https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html -- Robert McLeod, Ph.D. robbmcl...@gmail.com robbmcl...@protonmail.com ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
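A minimal sketch of using np.memmap (file name, shape, and dtype are all made up):

    import numpy as np

    # Create a 100000 x 128 float32 array backed by a file instead of RAM.
    mm = np.memmap("big_array.dat", dtype=np.float32, mode="w+", shape=(100000, 128))
    mm[:10] = 1.0    # only the pages actually touched are materialized
    mm.flush()       # push the written pages back to disk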
Re: [Numpy-discussion] Only integer scalar arrays can be converted to a scalar index
On Sat, Sep 16, 2017 at 7:16 AM, Chris Barker - NOAA Federal < chris.bar...@noaa.gov> wrote: > > No thoughts on optimizing memory, but that indexing error probably comes from np.mean producing float results. An astype call shoulder that work. Why? It's not being used as an index. It's being assigned into a float array. Rather, it's the slicing inside of `trace_block()` when it's being given arrays as inputs for `x` and `y`. numpy simply doesn't support that because in general the result wouldn't have a uniform shape. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] ANN: SfePy 2017.3
I am pleased to announce release 2017.3 of SfePy. Description --- SfePy (simple finite elements in Python) is a software for solving systems of coupled partial differential equations by the finite element method or by the isogeometric analysis (limited support). It is distributed under the new BSD license. Home page: http://sfepy.org Mailing list: https://mail.python.org/mm3/mailman3/lists/sfepy.python.org/ Git (source) repository, issue tracker: https://github.com/sfepy/sfepy Highlights of this release -- - support preconditioning in SciPy and PyAMG based linear solvers - user-defined preconditioners for PETSc linear solvers - parallel multiscale (macro-micro) homogenization-based computations - improved tutorial and installation instructions For full release notes see http://docs.sfepy.org/doc/release_notes.html#id1 (rather long and technical). Cheers, Robert Cimrman --- Contributors to this release in alphabetical order: Robert Cimrman Lubos Kejzlar Vladimir Lukes Matyas Novak ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] MATLAB to Numpy
On Sat, Oct 21, 2017 at 10:45 AM, Andrei Berceanu wrote: > > Hi, > > I am new to Numpy, and would like to start by translating a (badly written?) piece of MATLAB code. > What I have come up with so far is this: > > px = np.zeros_like(tmp_px); py = np.zeros_like(tmp_py); pz = np.zeros_like(tmp_pz) > w = np.zeros_like(tmp_w) > x = np.zeros_like(tmp_x); y = np.zeros_like(tmp_y); z = np.zeros_like(tmp_z) > > j=-1 > for i in range(tmp_px.size): > if tmp_px[i] > 2: > j += 1 > px[j] = tmp_px[i] > py[j] = tmp_py[i] > pz[j] = tmp_pz[i] > w[j] = tmp_w[i] > x[j] = tmp_x[i] > y[j] = tmp_y[i] > z[j] = tmp_z[i] > > px=px[:j+1]; py=py[:j+1]; pz=pz[:j+1] > w=w[:j+1] > x=x[:j+1]; y=y[:j+1]; z=z[:j+1] > > It works, but I'm sure it's probably the most inefficient way of doing it. What would be a decent rewrite? Index with a boolean mask. mask = (tmp_px > 2) px = tmp_px[mask] py = tmp_py[mask] # ... etc. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] np.vstack vs. np.stack
On Thu, Nov 9, 2017 at 1:30 AM, Joe wrote: > > Hello, > > I have a question and hope that you can help me. > > The doc for vstack mentions that "this function continues to be supported for backward compatibility, but you should prefer np.concatenate or np.stack." > > Using vstack was convenient because "the arrays must have the same shape along all but the first axis." > > So it was possible to stack an array (3,) and (2, 3) to a (3, 3) array without using e.g. atleast_2d on the (3,) array. > > Is there a possibility to mimic that behavior with np.concatenate or np.stack? Quite frankly, I ignore the documentation as I think its recommendation is wrong in these cases. Vive la vstack! -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
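The convenience being defended, in a couple of lines (the arrays are made up):

    import numpy as np

    a = np.arange(3)                 # shape (3,)
    b = np.arange(6).reshape(2, 3)   # shape (2, 3)
    np.vstack([a, b])                # shape (3, 3): the 1-D array is treated as a row
    # The spelled-out equivalent: np.concatenate([np.atleast_2d(a), b], axis=0)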
Re: [Numpy-discussion] np.vstack vs. np.stack
On Thu, Nov 9, 2017 at 1:58 PM, Mark Bakker wrote: > Can anybody explain why vstack is going the way of the dodo? > Why are stack / concatenate better? What is 'bad' about vstack? As far as I can tell, the discussion happened all on Github, not the mailing list. See here for references: https://github.com/numpy/numpy/pull/7253 -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] np.vstack vs. np.stack
On Thu, Nov 9, 2017 at 2:49 PM, Allan Haldane wrote: > > On 11/09/2017 05:39 PM, Robert Kern wrote: > > On Thu, Nov 9, 2017 at 1:58 PM, Mark Bakker wrote: > > > >> Can anybody explain why vstack is going the way of the dodo? > >> Why are stack / concatenate better? What is 'bad' about vstack? > > > > As far as I can tell, the discussion happened all on Github, not the > > mailing list. See here for references: > > > > https://github.com/numpy/numpy/pull/7253 > > > > -- > > Robert Kern > > yes, and in particular this linked comment/PR: > > https://github.com/numpy/numpy/pull/5605#issuecomment-85180204 > > Maybe we should reword the vstack docstring so that it doesn't imply > that vstack is going away. It should say something weaker > like "the functions np.stack, np.concatenate, and np.block are often > more general/useful/less confusing alternatives".. or better explain > what the problem is. > > If we limit ourselves to 1d,2d and maybe 3d arrays the vstack behavior > doesn't seem all that confusing to me. I concur. Highlighting that the functions are only being retained "for backward compatibility" does seem to imply to people that they are deprecated and cannot be relied upon to remain. We *do* break backwards compatibility from time to time. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support
On Wed, Nov 8, 2017 at 2:50 PM, Matthew Brett wrote: > Hi, > > On Wed, Nov 8, 2017 at 7:08 PM, Julian Taylor > wrote: > > On 06.11.2017 11:10, Ralf Gommers wrote: > >> > >> > >> On Mon, Nov 6, 2017 at 7:25 AM, Charles R Harris > >> mailto:charlesr.har...@gmail.com>> wrote: > >> > >> Hi All, > >> > >> Thought I'd toss this out there. I'm tending towards better sooner > >> than later in dropping Python 2.7 support as we are starting to run > >> up against places where we would like to use Python 3 features. That > >> is particularly true on Windows where the 2.7 compiler is really old > >> and lacks C99 compatibility. > >> > >> > >> This is probably the most pressing reason to drop 2.7 support. We seem > >> to be expending a lot of effort lately on this stuff. I was previously > >> advocating being more conservative than the timeline you now propose, > >> but this is the pain point that I think gets me over the line. > > > > > > Would dropping python2 support for windows earlier than the other > > platforms a reasonable approach? > > I am not a big fan of to dropping python2 support before 2020, but I > > have no issue with dropping python2 support on windows earlier as it is > > our largest pain point. > > I wonder about this too. I can imagine there are a reasonable number > of people using older Linux distributions on which they cannot upgrade > to a recent Python 3, but is that likely to be true for Windows? > > We'd have to make sure we could persuade pypi to give the older > version for Windows, by default - I don't know if that is possible. > Pip repo names and actual module names don't have to be the same. One potential work-around would be to make a 'numpylts' repo on PyPi which is the 1.17 version with support for Python 2.7 and bug-fix releases as required. This will still cause regressions but it's a matter of modifying `requirements.txt` in downstream Python 2.7 packages and not much else. E.g. in `requirements.txt`: numpy;python_version>"3.0" numpylts; python_version<"3.0" In both cases you still call `import numpy` in the code. Robert -- Robert McLeod, Ph.D. robbmcl...@gmail.com robbmcl...@protonmail.com www.entropyreduction.al ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposal of timeline for dropping Python 2.7 support
On Mon, Nov 13, 2017 at 7:47 AM, Matthias Bussonnier < bussonniermatth...@gmail.com> wrote: > > For this to be efficient, it should be done soon enough to allow > downstream projects to adapt their requirements.txt. > > Release managers: how much more effort would it be to upload current > numpy to both numpy and numpylts? > > I'm not quite sure I see the point. you would ask downstream to change > `numpy` to `numpylts` instead of `numpy` to `numpy<2` ? > > Also I think then you have the risk of having for example pandas > saying `numpy<2` and scipy saying `numpylts` and now the pasckages are > incompatible ? The trouble is PyPi doesn't allow multiple branches. So if you upload NumPy 2.0 wheels, then you cannot turn around and upload 1.18.X bug-fix patches. At least, this is my understanding of PyPi. -- Robert McLeod, Ph.D. robbmcl...@gmail.com robbmcl...@protonmail.com www.entropyreduction.al ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NEP process update
On Wed, Dec 6, 2017 at 9:49 AM, Nathaniel Smith wrote: > - NEPs are really part of the development process, not an output for > end-users -- they're certainly useful to have available as a > reference, but if we're asking end-users to look at them on a regular > basis then I think we've messed up and should improve our actual > documentation :-) I agree. The usefulness of NEPs (and PEPs) as documentation degrades over time. The functionality matches the NEP at the time that it is accepted, but features evolve over time through the normal, non-NEP development process. Personally, I've frequently driven myself into corners reading a PEP to learn about a new feature only to "learn" things that did not make it into the final implementation or were quickly overtaken by later development. We should not rely on NEPs to provide documentation for users. The regular docs should be the authority. To the extent that the NEPs happen to provide useful documentation for the new feature (and represent a significant amount of sunk effort to draft that documentation), we should feel free to copy-paste that into the regular docs and evolve it from there. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What is the pythonic way to write a function that handles arrays and scalars?
On Wed, Dec 13, 2017 at 4:50 AM, Joe wrote: > > Hi, > > the best example I found was this one: > > https://stackoverflow.com/a/29319864/7919597 > > def func_for_scalars_or_vectors(x): > x = np.asarray(x) > scalar_input = False > if x.ndim == 0: > x = x[None] # Makes x 1D > scalar_input = True > > # The magic happens here > > if scalar_input: > return np.squeeze(ret) > return ret > > Is this as good as it gets or do you have other suggestions? In general structure, yeah. I'd probably use `np.isscalar(x)` for the test and `x = np.atleast_1d(x)` for the coercion for readability, but otherwise, that's it. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What is the pythonic way to write a function that handles arrays and scalars?
On Wed, Dec 13, 2017 at 5:00 AM, Kirill Balunov wrote: > > On minor thing that instead of 'ret' there should be 'x'. No, `x` is the input. The code that actually does the computation (elided here by the `# The magic happens here` comment) would have assigned to `ret`. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] What is the pythonic way to write a function that handles arrays and scalars?
On Wed, Dec 13, 2017 at 5:52 AM, Eric Wieser wrote: > Using np.iscalar is a bad idea, as it fails for 0d arrays. x.ndim is the > better option there. > > I’d maybe suggest not special-casing 0d arrays though, and using: > > def func_for_scalars_or_vectors(x): > x = np.asanyarray(x) # convert scalars to 0d arrays > > # The magic happens here > > return ret[()] # convert 0d arrays to scalars > > Good call. I didn't think that the empty tuple was valid for anything but indexing into 0d arrays, but of course, following the rules of indexing, it works as required for other arrays too. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
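Putting the thread's advice together into one sketch (the doubling stands in for the real computation):

    import numpy as np

    def func_for_scalars_or_vectors(x):
        x = np.asanyarray(x)    # scalars become 0-d arrays
        ret = 2.0 * x           # "the magic happens here"
        return ret[()]          # 0-d arrays become scalars; everything else passes through

    func_for_scalars_or_vectors(3.0)         # 6.0, a scalar
    func_for_scalars_or_vectors([1.0, 2.0])  # array([2., 4.])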
[Numpy-discussion] ANN: SfePy 2017.4
I am pleased to announce release 2017.4 of SfePy. Description --- SfePy (simple finite elements in Python) is a software for solving systems of coupled partial differential equations by the finite element method or by the isogeometric analysis (limited support). It is distributed under the new BSD license. Home page: http://sfepy.org Mailing list: https://mail.python.org/mm3/mailman3/lists/sfepy.python.org/ Git (source) repository, issue tracker: https://github.com/sfepy/sfepy Highlights of this release -- - basic support for penalty-based contacts - support for user-defined contexts in all solvers and preconditioners - new example: dispersion analysis of heterogeneous periodic materials For full release notes see http://docs.sfepy.org/doc/release_notes.html#id1 (rather long and technical). Cheers, Robert Cimrman --- Contributors to this release in alphabetical order: Robert Cimrman Jan Heczko Lubos Kejzlar Jan Kopacka Vladimir Lukes Matyas Novak ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Direct GPU support on NumPy
On Tue, Jan 2, 2018 at 1:21 PM, Yasunori Endo wrote: > > Hi all > > Numba looks so nice library to try. > Thanks for the information. > >> This suggests a new, higher-level data model which supports replicating data into different memory spaces (e.g. host and GPU). Then users (or some higher layer in the software stack) can dispatch operations to suitable implementations to minimize data movement. >> >> Given NumPy's current raw-pointer C API this seems difficult to implement, though, as it is very hard to track memory aliases. > > I understood modifying numpy.ndarray for GPU is technically difficult. > > So my next primitive question is why NumPy doesn't offer > ndarray like interface (e.g. numpy.gpuarray)? > I wonder why everybody making *separate* library, making user confused. Because there is no settled way to do this. All of those separate library implementations are trying different approaches. We are learning from each of their attempts. They can each move at their own pace rather than being tied down to numpy's slow rate of development and strict backwards compatibility requirements. They can try new things and aren't limited to their first mistakes. The user may well be confused by all of the different options currently available. I don't think that's avoidable: there are lots of meaningful options. Picking just one to stick into numpy is a disservice to the community that needs the other options. > Is there any policy that NumPy refuse standard GPU implementation? Not officially, but I'm pretty sure that there is no appetite among the developers for incorporating proper GPU support into numpy (by which I mean that a user would build numpy with certain settings then make use of the GPU using just numpy APIs). numpy is a mature project with a relatively small development team. Much of that effort is spent more on maintenance than new development. What there is appetite for is to listen to the needs of the GPU-using libraries and making sure that numpy's C and Python APIs are flexible enough to do what the GPU libraries need. This ties into the work that's being done to make ndarray subclasses better and formalizing the notions of an "array-like" interface that things like pandas Series, etc. can implement and play well with the rest of numpy. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Understanding np.min behaviour with nan
On Wed, Jan 10, 2018 at 3:03 PM, Andrew Nelson wrote: > > Further to my last message, why is the warning only raised once? > > ``` > >>> import numpy as np > >>> np.min([1, np.nan]) > /Users/andrew/miniconda3/envs/dev3/lib/python3.6/site-packages/numpy/core/_methods.py:29: RuntimeWarning: invalid value encountered in reduce > return umr_minimum(a, axis, None, out, keepdims) > nan > >>> np.min([1, np.nan]) > nan > >>> > ``` This is default behavior for warnings. https://docs.python.org/3/library/warnings.html#the-warnings-filter It also explains the previous results. The same warning would have been issued from the same place in each of the variations you tried. Since the warnings mechanism had already seen that RuntimeWarning with the same message from the same code location, they were not printed. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
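If you want to see the warning on every occurrence, widen the filter; a small demonstration:

    import warnings
    import numpy as np

    with warnings.catch_warnings():
        warnings.simplefilter("always")   # report every occurrence, not just the first
        np.min([1, np.nan])               # RuntimeWarning printed
        np.min([1, np.nan])               # printed again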
Re: [Numpy-discussion] custom Welch method for power spectral density
On Fri, Jan 12, 2018 at 1:12 PM, Seb wrote: > > Hello, > > I'm trying to compute a power spectral density of a signal, using the > Welch method, in the broad sense; i.e. splitting the signal into > segments for deriving smoother spectra. This is well implemented in > scipy.signal.welch. However, I'd like to use exponentially increasing > (power 2) segment length to dampen increasing variance in spectra at > higher frequencies. Before hacking the scipy.signal.spectral module for > this, I'd appreciate any tips on available packages/modules that allow > for this kind of binning scheme, or other suggestions. Not entirely sure about this kind of binning scheme per se, but you may want to look at multitaper spectral estimation methods. The Welch method can be viewed as a poor-man's multitaper. Multitaper methods give you better control over the resolution/variance tradeoff that may help with your problem. Googling for "python multitaper" gives you several options; I haven't used any of them in anger, so I don't have a single recommendation for you. The nitime documentation provides more information about multitaper methods that may be useful to you: http://nipy.org/nitime/examples/multi_taper_spectral_estimation.html -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Moving NumPy's PRNG Forward
f the simpler distribution methods to the list that *is* guaranteed to remain stream-compatible, probably `randint()` and `bytes()` and maybe a couple of others. I would appreciate input on the matter. The current API should remain available and working, but not necessarily with the same algorithms. That is, for code like the following: prng = np.random.RandomState(12345) x = prng.normal(10.0, 3.0, size=100) `x` is still guaranteed to be 100 normal variates with the appropriate mean and standard deviation, but they might be computed by the ziggurat method from PCG-generated bytes (or whatever new default core PRNG we have). As an alternative, we may also want to leave `np.random.RandomState` entirely fixed in place as deprecated legacy code that is never updated. This would allow current unit tests that depend on the stream-compatibility that we previously promised to still pass until they decide to update. Development would move to a different class hierarchy with new names. I am personally not at all interested in preserving any stream compatibility for the `numpy.random.*` aliases or letting the user swap out the core PRNG for the global PRNG that underlies them. `np.random.seed()` should be discouraged (if not outright deprecated) in favor of explicitly passing around instances. In any case, we have a lot of different options to discuss if we decide to relax our stream-compatibility policy. At the moment, I'm not pushing for any particular changes to the code, just the policy in order to enable a more wide-ranging field of options that we have been able to work with so far. Thanks. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Behavior of rint?
On Fri, Jan 19, 2018 at 11:56 PM, Mads Ipsen wrote: > > I am confused . Shouldn't rint round to nearest integer. http://en.cppreference.com/w/cpp/numeric/math/rint It does. Matthew was asking specifically about its behavior when it is rounding numbers ending in .5, not the general case. Since that's the only real difference between rounding routines, we often recognize that from the context and speak in a somewhat elliptical way (i.e. just "round-to-even" instead of "round to the nearest integer and rounding numbers ending in .5 to the nearest even integer"). -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
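The .5 cases in one line:

    import numpy as np

    np.rint([0.5, 1.5, 2.5, 3.5, -0.5])   # array([ 0.,  2.,  2.,  4., -0.]): ties go to the even integer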
Re: [Numpy-discussion] Moving NumPy's PRNG Forward
On Sat, Jan 20, 2018 at 2:27 AM, wrote: > I'm not sure I fully understand > Is the proposal to drop stream-backward compatibility completely for the future or just a one time change? For all future. > > No version-selection API would be required as you select the version by installing the desired version of numpy. > > That's not useful if we want to have unit tests that run in the same way across numpy versions. > > There are many unit tests that rely on fixed streams and have hard coded results that rely on specific numbers (up to floating point, numerical noise). > Giving up stream compatibility would essentially kill using np.random for these unit tests. This is a use case that I am sensitive to. However, it should be noted that relying on the exact stream for unit tests makes you vulnerable to platform differences. That's something that we've never guaranteed (because we can't). That said, there are some of the simpler distributions that are more robust to such things, and those are fairly typical in unit tests. As I mentioned, I am open to a small set of methods that we do guarantee stream-compatibility for. I think that unit tests are the obvious use case that should determine what that set is. Unit tests rarely need `noncentral_chisquare()`, for instance. I'd also be willing to make the API a little clunkier in order to maintain the stable set of methods. For example, two methods that are common in unit testing are `normal()` and `choice()`, but those have been the target of the most attempted innovation. I'd be willing to leave them alone while providing other methods that do the same thing but are allowed to innovate. > Similar, reproducibility from another user, e.g. in notebooks, would break without stream compatibility across numpy versions. That is the reproducible-research use case that I discussed already. I argued that the stability that our policy actually provides is rather more muted than what it seems on its face. > One possibility is to keep the current stream-compatible np.random version and maintain it in future for those usecases, and add a new "high-performance" version with the new features. That is one of the alternatives I raised. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Moving NumPy's PRNG Forward
On Sat, Jan 20, 2018 at 2:57 AM, Stephan Hoyer wrote: > > On Fri, Jan 19, 2018 at 6:57 AM Robert Kern wrote: >> >> As an alternative, we may also want to leave `np.random.RandomState` entirely fixed in place as deprecated legacy code that is never updated. This would allow current unit tests that depend on the stream-compatibility that we previously promised to still pass until they decide to update. Development would move to a different class hierarchy with new names. > > I like this alternative, but I would hesitate to call it "deprecated". Users who care about exact reproducibility across NumPy versions (e.g., for testing) are probably less concerned about performance, and could continue to use it. I would be careful about that because quite a few of the methods are not stable across platforms, even on the same numpy version. If you want to declare that some part of the np.random API is stable for such purposes, we need to curate a subset of the methods anyways. As a one-off thing, this alternative proposes to declare that all of `np.random.RandomState` is stable across versions, but we can't guarantee that all of it is unconditionally stable for exact reproducibility. We can make a guarantee for a smaller subset of methods, though. To your point, though, if we freeze the current `RandomState`, we can make that guarantee for a larger subset of the methods than we would for the new API. So I guess I talked myself around to your view, but I would be a bit more cautious in how we advertise the stability of the frozen `RandomState` API. > New random number generator classes could implement their own guarantees about compatibility across their methods. > >> I am personally not at all interested in preserving any stream compatibility for the `numpy.random.*` aliases or letting the user swap out the core PRNG for the global PRNG that underlies them. `np.random.seed()` should be discouraged (if not outright deprecated) in favor of explicitly passing around instances. > > I agree that np.random.seed() should be discouraged, but it feels very late in NumPy's development to remove it. > > If we do alter the random number streams for numpy.random.*, it seems that we should probably issue a warning (at least for a several major versions) whenever numpy.random.seed() is called. This could get pretty noisy. I guess that's all the more incentive to switch to random state objects. True. I like that. The reason I think that it might be worth an exception is that it has been a moral hazard. People aren't just writing correct but improvable code (relying on `np.random.*` methods but seeding exactly once at the start of their single-threaded simulation) but they've been writing incorrect and easily-broken code. For example: np.random.seed(seed) np.random.shuffle(x_train) np.random.seed(seed) np.random.shuffle(labels_train) -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
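A sturdier version of that pattern draws one permutation from an explicit instance and applies it to both arrays; a sketch with stand-in data (the array names mirror the snippet above, the seed value is arbitrary):

    import numpy as np

    x_train = np.arange(10).reshape(5, 2)        # stand-ins for the real data
    labels_train = np.arange(5)

    prng = np.random.RandomState(12345)
    perm = prng.permutation(len(x_train))        # one shared shuffle order
    x_train = x_train[perm]
    labels_train = labels_train[perm]            # rows stay paired with their labels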
Re: [Numpy-discussion] Moving NumPy's PRNG Forward
On Sat, Jan 20, 2018 at 7:34 AM, Robert Kern wrote: > > On Sat, Jan 20, 2018 at 2:27 AM, wrote: > > > I'm not sure I fully understand > > Is the proposal to drop stream-backward compatibility completely for the future or just a one time change? > > For all future. To color this a little, while we'll have a burst of activity for the first round to fix all of the mistakes I baked in early, I do not expect the pace of change after that to be very large. While we are going to relax the strictness of the policy, we should carefully weigh the benefit of a change against the pain of changing the stream, and explore ways to implement the improvement that would retain the stream. We should take more care making such changes in `np.random` than in other parts of numpy. When we get to drafting the NEP, I'd be happy to include language to this effect. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Moving NumPy's PRNG Forward
128(seed)) x = prng.normal(mean, std) If someone wants to write a WELL512 core PRNG, they can just implement that object and pass it to `Distributions()`. The `Distributions` code doesn't need to be copied, nor do we need to muck around with `__new__` tricks in Cython. Why do this instead of distribution functions? Well, I have a damn lot of code that is expecting an object with at least the broad outlines of the `RandomState` interface. I'm not going to update that code to use functions instead. And if I'm not, no one is. There isn't a good transition path since the PRNG object needs to thread through all of the code, whether it's my code that I'm writing greenfield or library code that I don't control. That said, people writing new distributions outside of numpy.random would be writing functions, not trying to add to the `Distributions` class, but that's fine. It also allows the possibility for the `Distributions` class to be stateful if we want to do things like caching the next Box-Muller variate and not force that onto the core PRNG state like I currently do. Though I'd rather just drop Box-Muller, and that's not a common pattern outside of Box-Muller. But it's a possibility. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
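To make the shape of that design concrete, here is a toy sketch; the class names mirror the discussion, but the bit-generation algorithm and method names are made up for illustration and are not the proposed numpy API:

    import math

    class ToyCorePRNG:
        """A stand-in core PRNG: all it knows is how to produce raw 64-bit integers."""
        def __init__(self, seed):
            self._state = (seed or 1) & 0xFFFFFFFFFFFFFFFF   # nonzero 64-bit state

        def next_uint64(self):
            # xorshift64* step, used here only as a placeholder bit source
            x = self._state
            x ^= x >> 12
            x ^= (x << 25) & 0xFFFFFFFFFFFFFFFF
            x ^= x >> 27
            self._state = x
            return (x * 2685821657736338717) & 0xFFFFFFFFFFFFFFFF

    class Distributions:
        """Turns raw bits from any core PRNG into distributions; swap the core freely."""
        def __init__(self, core_prng):
            self.core = core_prng

        def random_sample(self):
            return self.core.next_uint64() / 2.0**64

        def normal(self, loc=0.0, scale=1.0):
            # Box-Muller for brevity; a real implementation would likely use a ziggurat.
            u1 = self.random_sample() or 1e-16
            u2 = self.random_sample()
            return loc + scale * math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

    prng = Distributions(ToyCorePRNG(12345))
    x = prng.normal(10.0, 3.0)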
Re: [Numpy-discussion] Moving NumPy's PRNG Forward
On Tue, Jan 30, 2018 at 5:39 AM, Pierre de Buyl < pierre.deb...@chem.kuleuven.be> wrote: > > Hello, > > On Sat, Jan 27, 2018 at 09:28:54AM +0900, Robert Kern wrote: > > On Sat, Jan 27, 2018 at 1:14 AM, Kevin Sheppard > > wrote: > > > > > > In terms of what is needed, I think that the underlying PRNG should > > be swappable. The will provide a simple mechanism to allow certain > > types of advancement while easily providing backward compat. In the > > current design this is very hard and requires compiling many nearly > > identical copies of RandomState. In pseudocode something like > > > > > > standard_normal(prng) > > > > > > where prng is a basic class that retains the PRNG state and has a > > small set of core random number generators that belong to the > > underlying PRNG -- probably something like int32, int64, double, and > > possibly int53. I am not advocating explicitly passing the PRNG as an > > argument, but having generators which can take any suitable PRNG would > > add a lot of flexibility in terms of taking advantage of improvements > > in the underlying PRNGs (see, e.g., xoroshiro128/xorshift1024). The > > "small" core PRNG would have responsibility over state and streams. > > The remainder of the module would transform the underlying PRNG into > > the required distributions. > > (edit: after writing the following verbiage, I realize it can be summed > > up with more respect to your suggestion: yes, we should do this design, > > but we don't need to and shouldn't give up on a class with distribution > > methods.) > > Once the core PRNG C API is in place, I don't think we necessarily need > > to move away from a class structure per se, though it becomes an > > option. > > (Sorry for cutting so much, I have a short question) > > My typical use case for the C API of NumPy's random features is that I start > coding in pure Python and then switch to Cython. I have at least twice in the > past resorted to include "randomkit.h" and use that directly. My last work > actually implements a Python/Cython interface for rngs, see > http://threefry.readthedocs.io/using_from_cython.html > > The goal is to use exactly the same backend in Python and Cython, with a cimport > and a few cdefs the only changes needed for a first port to Cython. > > Is this type of feature in discussion or in project for the future of > numpy.random? Sort of, but not really. For sure, once we've made the decisions that let us move forward to a new design, we'll have Cython implementations that can be used natively from Cython as well as Python without code changes. *But* it's not going to be an automatic speedup like your threefry library allows. You designed that API such that each of the methods returns a single scalar, so all you need to do is declare your functions `cpdef` and provide a `.pxd`. Our methods return either a scalar or an array depending on the arguments, so the methods will be declared to return `object`, and you will pay the overhead costs for checking the arguments and such. We're not going to change that Python API; we're only considering dropping stream-compatibility, not source-compatibility. I would like to make sure that we do expose a C/Cython API to the distribution functions (i.e. that only draw a single variate and return a scalar), but it's not likely to look exactly like the Python API. There might be clever tricks that we can do to minimize the amount of changes that one needs to do, though, if you are only drawing single variates at a time (e.g. 
an agent-based simulation) and you want to make it go faster by moving to Cython. For example, maybe we collect all of the single-variate C-implemented methods into a single object sitting as an attribute on the `Distributions` object.

cdef class DistributionsCAPI:
    cdef double normal(double loc, double scale)
    cdef double uniform(double low, double high)

cdef class Distributions:
    cdef DistributionsCAPI capi
    cpdef object normal(loc, scale, size=None):
        if size is None and np.isscalar(loc) and np.isscalar(scale):
            return self.capi.normal(loc, scale)
        else:
            # ... Make an array

prng = Distributions(...)

# From Python:
x = prng.normal(mean, std)

# From Cython:
cdef double x = prng.capi.normal(mean, std)

But we need to make some higher-level decisions first before we can get down to this level of design. Please do jump in and remind us of this use case once we do get down to actual work on the new API design. Thanks! -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Extending C with Python
On Wed, Jan 31, 2018 at 3:25 PM, Jialin Liu wrote: > Hello, > I'm extending C with python (which is opposite way of what people usually > do, extending python with C), I'm currently stuck in passing a C array to > python layer, could anyone plz advise? > > I have a C buffer in my C code and want to pass it to a python function. > In the C code, I have: > > npy_intp dims [2]; >> dims[0] = 10; >> dims[1] = 20; >> import_array(); >> npy_intp m=2; >> PyObject * py_dims = PyArray_SimpleNewFromData(1, &m, NPY_INT16 ,(void >> *)dims ); // I also tried NPY_INT >> PyObject_CallMethod(pInstance, method_name, "O", py_dims); > > > In the Python code, I want to just print that array: > > def f(self, dims): > >print ("np array:%d,%d"%(dims[0],dims[1])) > > > > But it only prints the first number correctly, i.e., dims[0]. The second > number is always 0. > The correct typecode would be NPY_INTP. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] documentation, sound and music [was: doc? music through mathematical relations between LPCM samples and musical elements/characteristics]
On Thu, Feb 1, 2018 at 7:38 AM, Renato Fabbri wrote: > > Dear Scipy-ers, > > If you think I should split the message so that > things get more clear... > > But the things are: > 1) How to document numpy-heavy projects? There is nothing particularly special about numpy-heavy projects with respect to documentation. The tools used for numpy and scipy's documentation may (or may not) be useful to you. For example, you may want to embed matplotlib plots into your documentation. https://pypi.python.org/pypi/numpydoc But if not, don't worry about it. Sphinx is all most numpy-heavy projects need, like most Python projects. Hosting the documentation on either readthedocs or Github is fine. Do whatever's convenient for you. Either is just as convenient for us. > 2) How to better make these contributions available > to the numpy/scipy community? Having it up on Github and PyPI is all you need to do. Participate in relevant mailing list conversations. While we don't forbid release announcement emails on either of the lists, I wouldn't do much more than announce the initial release and maybe a really major update (i.e. not for every bugfix release). > Directions will be greatly appreciated. > I suspect that this info is all gathered somewhere > I did not find. Sorry this isn't gathered anywhere, but truly, the answer is "there is not much to it". You're doing everything right. :-) -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Prototype CorePRNG implementation feedback (Neal Becker)
On Feb 22, 2018 16:30, "Kevin Sheppard" wrote:

> What is the syntax to construct an initialized generator?
> RandomGenerator(Xoroshiro128(my_seed))?
>
> Not 100% certain on this. There was talk in the earlier thread that seed
> should be killed,

No, just the np.random.seed() function alias for the hidden global PRNG. Core PRNGs should be initializable with a seed number like this, and instances of these objects should also have a seed() method to reset the state. I think all of the core PRNGs MUST be seedable by being given a uint32 integer, at minimum. They MAY be seedable by other inputs (uint64, Python long, array of ints) depending on the algorithm.

> although I wasn't clear what method would be preferable to get easily
> reproducible random numbers. Another option would be
> RandomGenerator(Xoroshiro128, seed=my_seed)

I recommend the first option. With things like settable streams, the diversity of initialization options between algorithms is better implemented by a diversity of initializers than by trying to force everything into one complicated initializer.

While I think the first option is the best way to implement things, we can make some syntactic sugar to make things nicer. For example, `Xoroshiro128(my_seed).generator`, where `.generator` is implemented as a property that returns `RandomGenerator(self)`.

___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
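A minimal sketch of that sugar in plain Python, with placeholder class internals (the actual prototype's constructor arguments and state handling would differ):

    class RandomGenerator:
        """Distribution methods layered over a core PRNG (sketch only)."""
        def __init__(self, core_prng):
            self.core_prng = core_prng

    class Xoroshiro128:
        """Stand-in core PRNG; only the seeding/property plumbing is shown."""
        def __init__(self, seed=None):
            self.seed(seed)

        def seed(self, seed=None):
            # A real implementation would derive the 128-bit state from the seed.
            self._seed = seed

        @property
        def generator(self):
            # Syntactic sugar: Xoroshiro128(my_seed).generator
            return RandomGenerator(self)

    prng = Xoroshiro128(12345).generator   # same as RandomGenerator(Xoroshiro128(12345))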
[Numpy-discussion] ANN: SfePy 2018.1
I am pleased to announce release 2018.1 of SfePy.

Description
-----------

SfePy (simple finite elements in Python) is a software for solving systems of coupled partial differential equations by the finite element method or by isogeometric analysis (limited support). It is distributed under the new BSD license.

Home page: http://sfepy.org
Mailing list: https://mail.python.org/mm3/mailman3/lists/sfepy.python.org/
Git (source) repository, issue tracker: https://github.com/sfepy/sfepy

Highlights of this release
--------------------------

- major update of time-stepping solvers and solver handling
- Newmark and Bathe elastodynamics solvers
- interface to the MUMPS linear solver
- new examples:
  - iron plate impact problem (elastodynamics)
  - incompressible Mooney-Rivlin material model (hyperelasticity) as a script

For full release notes see http://docs.sfepy.org/doc/release_notes.html#id1 (rather long and technical).

Cheers,
Robert Cimrman

---

Contributors to this release in alphabetical order:

Robert Cimrman
Jan Heczko
Jan Kopacka
Vladimir Lukes

___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy.random.randn
On Tue, Mar 6, 2018 at 1:39 AM, Marko Asplund wrote:
>
> I have some neural network code in NumPy that I'd like to compare with a
> Scala based implementation. My problem is currently the random
> initialization of the neural net parameters. I'd like to be able to get the
> same results from both implementations when using the same random seed.
>
> One approach I've thought of would be to use the NumPy random generator
> also with the Scala implementation, but unfortunately the linear algebra
> library I'm using doesn't provide an equivalent for this.
>
> Could someone give pointers to implementing numpy.random.randn?
> Or alternatively, is there an equivalent random generator for Scala or Java?

I would just recommend using one of the codebases to initialize the network, save the network out to disk, and load up the initialized network in each of the different codebases for training. That way you are sure that they are both starting from the same exact network parameters.

Even if you do rewrite a precisely equivalent np.random.randn() for Scala/Java, you ought to write the code to serialize the initialized network anyways so that you can test that the two initialization routines are equivalent. But if you're going to do that, you might as well take my recommended approach.

-- Robert Kern
___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
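The handoff could look like the sketch below; the layer sizes and file names are invented for illustration, and CSV is chosen only because it is easy to read from JVM code:

    import numpy as np

    rng = np.random.RandomState(42)     # fixed seed for the one-off initialization
    n_x, n_h = 20, 8                    # made-up layer sizes
    W1 = rng.randn(n_h, n_x) * 0.01     # same initialization as in the thread
    b1 = np.zeros((n_h, 1))

    # Plain-text formats are the least painful to parse from Scala/Java.
    np.savetxt("W1.csv", W1, delimiter=",")
    np.savetxt("b1.csv", b1, delimiter=",")

    # Both the NumPy and the Scala training code then load W1.csv/b1.csv,
    # so they are guaranteed to start from identical parameters.
    W1_check = np.loadtxt("W1.csv", delimiter=",")
    assert np.allclose(W1, W1_check)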
Re: [Numpy-discussion] numpy.random.randn
On Wed, Mar 7, 2018 at 1:10 PM, Marko Asplund wrote:
>
> However, the results look very different when using random initialization.
> With respect to the exact cost this is of course expected, but what I find
> troublesome is that after N training iterations the cost starts approaching
> zero with the NumPy code (most of the time), whereas with the Scala based
> implementation the cost fails to converge (most of the time).
>
> With NumPy I'm simply using the following random initialization code:
>
>     np.random.randn(n_h, n_x) * 0.01
>
> I'm trying to emulate the same behaviour in my Scala code by sampling from a
> Gaussian distribution with mean = 0 and std dev = 1.

`np.random.randn(n_h, n_x) * 0.01` gives a Gaussian distribution of mean=0 and stdev=0.01.

-- Robert Kern
___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
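A quick numerical check of that point (the array sizes here are arbitrary):

    import numpy as np

    np.random.seed(0)
    w = np.random.randn(1000, 1000) * 0.01
    print(w.mean())   # close to 0.0
    print(w.std())    # close to 0.01, not 1.0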
Re: [Numpy-discussion] numpy.random.randn
On Thu, Mar 8, 2018 at 12:44 PM, Marko Asplund wrote:
>
> On Wed, 7 Mar 2018 13:14:36, Robert Kern wrote:
>
> > > With NumPy I'm simply using the following random initialization code:
> > >
> > >     np.random.randn(n_h, n_x) * 0.01
> > >
> > > I'm trying to emulate the same behaviour in my Scala code by sampling
> > > from a Gaussian distribution with mean = 0 and std dev = 1.
> >
> > `np.random.randn(n_h, n_x) * 0.01` gives a Gaussian distribution of
> > mean=0 and stdev=0.01
>
> Sorry for being a bit inaccurate.
> My Scala code actually mirrors the NumPy based random initialization, so I
> sample with a Gaussian of mean = 0 and std dev = 1, then multiply by 0.01.

Have you verified this? I.e., save out the Scala-initialized network and load it up with numpy to check the mean and std dev. How about if you run the numpy NN training with the Scala-initialized network? Does that also diverge?

-- Robert Kern
___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
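The check itself is only a few lines of NumPy; the file name and CSV format below are assumptions about how the Scala side might export its initialized weights:

    import numpy as np

    # Assumes the Scala code has written its initialized weight matrix to CSV.
    W1_scala = np.loadtxt("W1_scala.csv", delimiter=",")
    print(W1_scala.mean())   # should be close to 0.0
    print(W1_scala.std())    # should be close to 0.01 if the scaling matches

    # The same array can then be used to start the NumPy training run in place
    # of np.random.randn(n_h, n_x) * 0.01, to see whether that run also diverges.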
Re: [Numpy-discussion] ENH: softmax
On Wed, Mar 14, 2018 at 5:22 PM, Warren Weckesser <warren.weckes...@gmail.com> wrote:
>
> On Wed, Mar 14, 2018 at 4:05 AM, Kulick, Johannes wrote:
>>
>> Hi,
>>
>> I regularly need the softmax function (https://en.wikipedia.org/wiki/Softmax_function)
>> for my code. I have a quite efficient pure python implementation (credits to
>> Nolan Conaway). I think it would be a valuable enhancement of the ndarray
>> class. But since it is kind of a specialty function I wanted to ask you if you
>> would consider it to be part of the numpy core (alongside ndarray.max and
>> ndarray.argmax) or rather in scipy (e.g. scipy.stats seems also an appropriate
>> place).
>
> Johannes,
>
> If the numpy devs aren't interested in adding it to numpy, I'm pretty sure we
> can get it in scipy. I've had adding it (or at least proposing that it be
> added) to scipy on my to-do list for quite a while now.

+1 for scipy.special.

-- Robert Kern
___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
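For reference, a numerically stable softmax is only a few lines of NumPy; this is a generic sketch, not the implementation discussed in the thread:

    import numpy as np

    def softmax(x, axis=-1):
        """Softmax along `axis`, shifted by the max for numerical stability."""
        x = np.asarray(x, dtype=float)
        shifted = x - np.max(x, axis=axis, keepdims=True)  # avoids overflow in exp
        e = np.exp(shifted)
        return e / np.sum(e, axis=axis, keepdims=True)

    print(softmax([1.0, 2.0, 3.0]))          # sums to 1
    print(softmax(np.ones((2, 3)), axis=1))  # uniform 1/3 along each row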