[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.
I think ultimately the copy is unnecessary. That being said, introducing prepend and append functions concentrates the complexity of the mapping in one place. Trying to avoid the extra copy would probably lead to a more complex implementation of accumulate. How, in your view, would the prepend interface differ from concatenation or stacking?
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Adding NumpyUnpickler to Numpy 1.26 and future Numpy 2.0
Our NumPy arrays are pickled when they are transported over Pipes between processes (using multiprocessing). Just to point out that there are uses of pickling that do not involve files. Would that affect your analysis?
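To make the Pipe case concrete, here is a minimal sketch. It keeps both ends of the pipe in one process purely for brevity (an assumption for illustration; in our setup the two ends live in different processes). `Connection.send` pickles the object and `recv` unpickles it, so no file is involved at any point.

```python
import multiprocessing as mp

import numpy as np

# Both ends stay in this process only to keep the sketch short;
# normally child_end would be handed to another process.
parent_end, child_end = mp.Pipe()

arr = np.arange(5, dtype=np.float64)  # small, so send() does not block
parent_end.send(arr)                  # the array is pickled here
received = child_end.recv()           # and unpickled here

parent_end.close()
child_end.close()
```

The received array is a new object with the same contents, which is what one would expect from a pickle round trip.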
[Numpy-discussion] Re: Adding NumpyUnpickler to Numpy 1.26 and future Numpy 2.0
If needed I can try to construct a minimal example for testing purposes.
[Numpy-discussion] Re: Adding NumpyUnpickler to Numpy 1.26 and future Numpy 2.0
OK. Then we will just wait for 2.x and test then.
[Numpy-discussion] Re: Adding NumpyUnpickler to Numpy 1.26 and future Numpy 2.0
I have one more use case to consider from our ecosystem. We dump numpy arrays into a MongoDB using GridFS for subsequent visualization. Some snippets:

'''Python
with BytesIO() as BIO:
    np.save(BIO, numpy_array)
    serialized_A = BIO.getvalue()
filehandle_id = self.representations_files.put(serialized_A)
'''

and then restore them in the other application:

'''Python
numpy_array = np.load(BytesIO(serialized_A))
'''

For us this is for development work only and I am less concerned about having mixed versions in my database, but in principle that is a scenario. It seems to me that for this to work the reading application needs to be migrated to version 2 and temporarily extended with the NumpyUnpickler before the writing application is migrated. Or they need to be migrated at the same time. Is that correct?
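For testing, the round trip can be reproduced without MongoDB. The sketch below leaves GridFS out entirely (the `serialized` bytes stand in for what would be stored and fetched) and uses a plain float array as an assumed example. For such numeric dtypes `np.save` writes the `.npy` format, a header followed by the raw data, so the load succeeds with `allow_pickle=False`; object arrays are a different matter, since those are pickled.

```python
from io import BytesIO

import numpy as np

numpy_array = np.arange(6, dtype=np.float64).reshape(2, 3)

# Writer side: serialize to bytes in memory (these bytes would go to GridFS).
with BytesIO() as buffer:
    np.save(buffer, numpy_array)
    serialized = buffer.getvalue()

# Reader side: rebuild the array from the stored bytes.
# allow_pickle=False makes the load fail loudly if pickled data is present.
restored = np.load(BytesIO(serialized), allow_pickle=False)
```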
[Numpy-discussion] Re: Switching default order to column-major
My Cython code and my swig-wrapped C++ code assume C-ordering and a contiguous layout, which allows for super fast code. I guess making it agnostic to the ordering would require implementing everything twice and then switching between the two based on what comes in. That is a lot of work for no gain. Rewriting it for F-ordering would also be a pain.
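To show what the C-ordering assumption looks like from the Python side, here is a small sketch. The arrays are made-up examples, not taken from my code. Data that arrives in F-order has to be copied into C-order before it can be handed to code that assumes a C-contiguous layout; data that is already C-contiguous is passed through without a copy.

```python
import numpy as np

c_array = np.arange(6.0).reshape(2, 3)   # C-ordered by default
f_array = np.asfortranarray(c_array)     # same values, column-major layout

# np.ascontiguousarray copies only when the layout is not already C-contiguous.
from_c = np.ascontiguousarray(c_array)
from_f = np.ascontiguousarray(f_array)

copied_c = not np.shares_memory(from_c, c_array)  # False: no copy was needed
copied_f = not np.shares_memory(from_f, f_array)  # True: layout had to change
```

If the default order changed, that copy would be paid on every call into the wrapped code.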
[Numpy-discussion] Re: Enhancement: np.convolve(..., mode="normalized")
I wonder whether you are looking for the solution in the right direction. Is there a theory for the shape of the curve? In that case it might be better to treat the problem as a fitting problem. Other than that, I think option 2 is too ad hoc for scientific work. I would opt for simply not showing the smoothed curve where it is not available. The convol function you specified here is a very narrow Gaussian. Is that the function you actually used?

Note: The code you provided cannot be executed.
[Numpy-discussion] mean_std function returning both mean and std
I created a solution for ENH: Computing std/var and mean at the same time, issue #23741. The solution can be found here: https://github.com/soundappraisal/numpy/tree/stdmean-dev-001

I still need to add tests, and the solution does touch the implementation of var. But before starting a pull request I would like to check whether mean_std is a welcome addition. I was also struggling with the internally needed format of the arrays containing mean and std and the format produced as a result, which makes me uncertain whether the chosen solution is correct for all cases.
[Numpy-discussion] Re: mean_std function returning both mean and std
Steps to make this complete:
- move the resize of the mean array out of _mean_var and into the calling mean_std function (to reduce the impact of the code changes on existing functions)
- establish whether numpy/core/_add_newdocs.py needs to be updated (What is the function of this file?)
- add tests at numpy/core/tests/test_numeric.py
- add tests that establish whether the specified out matrix is returned as output (It is easy to make mistakes and introduce changes which are not in place.)

Should we add mean_var as well?
[Numpy-discussion] Re: mean_std function returning both mean and std
Mean_var, mean_std and tests are now ready. (https://github.com/soundappraisal/numpy/tree/stdmean-dev-001)

Some decisions made during implementation:
- the output shape of the mean follows the output shape of the variance or the standard deviation, so it responds to the keepdims flag in the same way as the variance and the standard deviation.
- the resizing of the mean is placed in _mean_var; the overhead on the old functions std and var is minimal as they set mean_out to None.
- the intermediate mean used cannot be replaced with the mean produced by _mean, as the output of the latter cannot be broadcast to the incoming data.
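The decisions above can be illustrated with a short pure-NumPy sketch. This is not the code in the branch: `mean_std_sketch` is a hypothetical name, it handles real-valued input only, and it ignores the out, dtype and where arguments. It only shows why the intermediate mean is computed with keepdims=True (so that it broadcasts against the input) and how the returned mean is then reshaped to follow the shape of the standard deviation.

```python
import numpy as np

def mean_std_sketch(a, axis=None, ddof=0, keepdims=False):
    """Hypothetical illustration: return (mean, std) from one pass over the deviations."""
    a = np.asarray(a, dtype=float)
    # keepdims=True so the intermediate mean broadcasts against the input.
    mean = a.mean(axis=axis, keepdims=True)
    deviation = a - mean
    # Number of elements that went into each reduced value.
    count = a.size // mean.size
    var = (deviation * deviation).sum(axis=axis, keepdims=keepdims) / (count - ddof)
    std = np.sqrt(var)
    if not keepdims:
        # The returned mean follows the output shape of the std.
        mean = mean.reshape(std.shape)
    return mean, std

data = np.arange(12.0).reshape(3, 4)
mean, std = mean_std_sketch(data, axis=1)
```

With keepdims=True both results keep the reduced axis with length one, so they can be broadcast against the input directly.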
[Numpy-discussion] Re: mean_std function returning both mean and std
I think I left those aspects of the implementation untouched. But having someone more experienced look at it is probably a good idea.
[Numpy-discussion] Re: mean_std function returning both mean and std
Aha, the unnecessary copy mentioned in the paper (https://dbs.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf) is a copy of the input. Here it is about discarding a valuable output (the mean) and then calculating that result separately. Not throwing the mean away saves about 20% computation time. Or, phrased differently, the calculation of the variance spends about 25% of its computation time on calculating the mean.
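The two percentages describe the same measurement. As a worked check, assume (this is a reading of the numbers above, not a new measurement) that a separate call to mean costs the same as the mean computed inside the variance, and write T_var for the time of one variance call:

```latex
\begin{align*}
T_{\text{mean}} &\approx 0.25\, T_{\text{var}} \\
T_{\text{separate}} &= T_{\text{var}} + T_{\text{mean}} \approx 1.25\, T_{\text{var}} \\
T_{\text{combined}} &\approx T_{\text{var}} \\
\text{saving} &= \frac{T_{\text{separate}} - T_{\text{combined}}}{T_{\text{separate}}}
  \approx \frac{0.25}{1.25} = 0.20
\end{align*}
```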
[Numpy-discussion] Re: mean_std function returning both mean and std
I am agnostic to the order of those changes. Also, this is my first attempt to contribute to numpy, so I am not aware of all the ongoing discussions. I'll try to read the issue you just mentioned. But in the code I rewrote, replacing _mean_var with a faster version would benefit var, std, mean_var and mean_std, because they all call _mean_var. The mean function is untouched.
[Numpy-discussion] Re: mean_std function returning both mean and std
I had a closer look at the paper. When I have more brainpower and time I may check the mathematics. The focus is however more on streaming data, which is an application with completely different demands. I think that here we cannot afford to sample the data, which is an option in streaming database systems.
[Numpy-discussion] Re: mean_std function returning both mean and std
I had a look at the C solution. It delegates the summation over one axis from the axis tuple to the C helper, and the remaining axes are then summed from _methods.py. Worst case: if the axis delegated to the helper is very short compared to the other axes, I would expect hardly any speed-up, and savings on memory usage would also be limited. Sticking with this solution, it would be better from the point of view of speed and memory use to delegate the longest axis from the axis tuple to the C code. In my view a solution with which many would be happier (https://github.com/numpy/numpy/pull/13263#issuecomment-1048122467) would probably delegate all the axes to the helper function.
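A sketch of the "delegate the longest axis" idea, in plain NumPy. The names are hypothetical and an ordinary `sum` stands in for the C helper; the only point is how the axis tuple would be split, assuming non-negative axis indices.

```python
import numpy as np

def split_longest_axis(shape, axes):
    """Pick the longest axis from the axis tuple; return it and the remaining axes."""
    longest = max(axes, key=lambda ax: shape[ax])
    remaining = tuple(ax for ax in axes if ax != longest)
    return longest, remaining

a = np.ones((2, 1000, 3))
longest, remaining = split_longest_axis(a.shape, (0, 1))

# Stand-in for the C helper: reduce the longest axis first (kept as length 1).
partial = a.sum(axis=longest, keepdims=True)
# The remaining, shorter axes are then reduced on the much smaller intermediate.
total = partial.sum(axis=remaining + (longest,))
```

Here the intermediate has shape (2, 1, 3) instead of (1, 1000, 3), which is where the savings in memory and in the second pass would come from.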
[Numpy-discussion] Re: mean_std function returning both mean and std
I had a look at what it would take to improve the C solution. However, I find that it is beyond my C programming skills. The gufunc definition seems to be at odds with the current working of the axis keyword for mean, std and var. The latter support computation over multiple axes, whereas the gufunc only seems to support calculation over a single axis.

As the behaviour of std, mean and var is largely inherited from ufuncs, those might offer a better starting point. If the operator used in the ufunc could take a parameter from the outer loop, accessing in this case the mean, then it would be possible to calculate the required intermediate quantities. This should be a possibility, as somewhere the out array is also accessed in the correct manner and we should step through both arrays in the same way.

Instead of:

'''Pseudocode
result = np.full(result_shape, op.identity)  # op = ufunc
loop_outer_axes_result_array:
    loop_over_inner_axes_input_array:
        result[outer_axes] = op(result[outer_axes], InArray[outer_axes + inner_axes])
'''

we would then get:

'''Pseudocode
result = np.full(result_shape, op.identity)  # op = ufunc
loop_outer_axes_result_array:
    loop_over_inner_axes_input_array:
        result[outer_axes] = op(result[outer_axes], InArray[outer_axes + inner_axes], ParameterArray[outer_axes])
'''

Using for op:

'''Pseudocode
op(a,b,c) = a+b-c
'''

with the original data for b and the mean (M_1) for c, you would obtain the Neely correction for the mean. Similarly, using:

'''Pseudocode
op(a,b,c) = a+(b-c)^2
'''

you would obtain the sum of squared errors.
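The pseudocode above can be turned into a runnable sketch. This is plain Python for illustrating the semantics only; it says nothing about how the ufunc machinery would implement it. `reduce_with_parameter` is a hypothetical name, and the sketch assumes that the outer (kept) axes come first and the inner (reduced) axes come last.

```python
import numpy as np

def reduce_with_parameter(op, identity, in_array, parameter, n_outer):
    """Reduce the trailing axes of in_array with op(accumulator, value, parameter)."""
    outer_shape = in_array.shape[:n_outer]
    inner_shape = in_array.shape[n_outer:]
    # Only the result is allocated; the parameter array is read in step with it.
    result = np.full(outer_shape, identity, dtype=float)
    for outer in np.ndindex(*outer_shape):
        for inner in np.ndindex(*inner_shape):
            result[outer] = op(result[outer], in_array[outer + inner], parameter[outer])
    return result

data = np.arange(12.0).reshape(3, 4)
mean = data.mean(axis=1)

# op(a, b, c) = a + b - c: sum of residuals, the correction term for the mean.
correction = reduce_with_parameter(lambda a, b, c: a + b - c, 0.0, data, mean, 1)
# op(a, b, c) = a + (b - c)^2: sum of squared errors.
sse = reduce_with_parameter(lambda a, b, c: a + (b - c) ** 2, 0.0, data, mean, 1)
variance = sse / data.shape[1]
```

Dividing the sum of squared errors by the number of reduced elements gives the variance with ddof=0.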
[Numpy-discussion] Re: mean_std function returning both mean and std
Note: the suggested solution requires no allocation of memory beyond that needed for storing the result.
[Numpy-discussion] Re: mean_std function returning both mean and std
2nd note: I implicitly based this on the reduce function.
[Numpy-discussion] Re: mean_std function returning both mean and std
I have a pull request, but I have been stuck for a day now on how to handle the masked arrays. I made some progress by calling the MaskedArray methods, but in some cases those methods call back the ndarray methods via their super class. The method _mean_var for ndarray needs to resize the produced mean to align the shape of the mean with that of the variance or standard deviation, but if the incoming and therefore the outgoing object is a MaskedArray, that is not allowed. I also sometimes see unpredictable behavior, which gives me the feeling I am looking at pointer problems.

python runtests.py -t numpy/core/tests/test_numeric.py
passes now.

python runtests.py -t numpy/ma/tests/
is failing with weird errors on complex masked arrays, particularly:
test_varstd
test_complex
[Numpy-discussion] Re: mean_std function returning both mean and std
OK, the same two tests fail on main (50984037) as well.
[Numpy-discussion] Re: mean_std function returning both mean and std
Issue #23896 is the cause of these two failing tests. With CFLAGS="NPY_DISABLE_OPTIMIZATION=1" the tests pass.
[Numpy-discussion] Re: mean_std function returning both mean and std
Second attempt after the triage review of last week: ENH: add mean keyword to std and var #24126 (https://github.com/numpy/numpy/pull/24126)
[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.
Ilhan Polat wrote:
> I think all these point to the missing convenient functionality that
> extends arrays. In matlab "[0 arr 10]" nicely extends the array to a new
> one but in NumPy you need to punch quite some code and some courage to
> remember whether it is hstack or vstack or concat or block as the correct
> naming which decreases the "code morale".

Not having a convenient workaround is not the only problem. The workaround is wasteful with memory and involves unnecessary copying of an array. Having a keyword implemented with these concerns in mind might avoid this.
[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.
I was trying to get a feel for how often the workaround occurs. I found three clear examples in SciPy and one unclear case, one case in holoviews, two in numpy, and one from soundappraisal's code base. Next to prepending to the output, I also see prepending to the input as a workaround.

Some examples of workarounds:

scipy (prepending to the output):

scipy/scipy/sparse/construct.py:
'''Python
row_offsets = np.append(0, np.cumsum(brow_lengths))
col_offsets = np.append(0, np.cumsum(bcol_lengths))
'''

scipy/scipy/sparse/dia.py:
'''Python
indptr = np.zeros(num_cols + 1, dtype=idx_dtype)
indptr[1:offset_len+1] = np.cumsum(mask.sum(axis=0))
'''

scipy/scipy/sparse/csgraph/_tools.pyx:
'''Python
indptr = np.zeros(N + 1, dtype=ITYPE)
indptr[1:] = mask.sum(1).cumsum()
'''

Not sure whether this is also an example:

scipy/scipy/stats/_hypotests_pythran.py:
'''Python
# Now fill in the values. We cannot use cumsum, unfortunately.
val = 0.0 if minj == 0 else 1.0
for jj in range(maxj - minj):
    j = jj + minj
    val = (A[jj + minj - lastminj] * i + val * j) / (i + j)
    A[jj] = val
'''

holoviews (prepending to the input):
'''Python
# We add a zero in the beginning for the cumulative sum
points = np.zeros((areas_in_radians.shape[0] + 1))
points[1:] = areas_in_radians
points = points.cumsum()
'''

numpy:

numpy/numpy/lib/_iotools.py (prepending to the input):
'''Python
idx = np.cumsum([0] + list(delimiter))
'''

numpy/numpy/lib/histograms.py (prepending to the output):
'''Python
cw = np.concatenate((zero, sw.cumsum()))
'''

soundappraisal's own code (prepending to the output):
'''Python
def get_cumulativepixelareas(whiteboard):
    whiteboard['cumulativepixelareas'] = \
        np.concatenate((np.array([0, ]), np.cumsum(whiteboard['pixelareas'])))
    return True
'''
[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.
> Whether it's necessary to have other keywords to prepend anything other
> than zero, or append rather than prepend, is a lot less clear. Did you find
> a clear need for those things?

No, I haven't found them. For streaming data there might be use cases for starting with an initial offset, but I expect there might be no need for a returned offset there.

What is notable is that all examples above are 1D. To get the behavior of the API right, the simplest solution is to make the workaround part of the implementation. What I was pondering is whether it is desirable to allocate the memory once and avoid copying the data. What is the price to pay in terms of code complexity and developer time?

Also, if the accumulation would run in place on a copy of the input data, then prepending to the input might be a good option that introduces very little new overhead.
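For the 1D case, allocating once is already possible from Python with the existing out argument of np.cumsum. The sketch below is only meant to show the idea; `cumsum_from_zero` is a hypothetical helper name, not a proposed API, and it handles 1D input only.

```python
import numpy as np

def cumsum_from_zero(values):
    """Hypothetical 1-D helper: cumulative sum with a leading 0, allocated once."""
    values = np.asarray(values)
    if values.ndim != 1:
        raise ValueError("this sketch handles 1-D input only")
    # Ask cumsum which dtype it would produce (small ints are promoted).
    out_dtype = np.cumsum(values[:0]).dtype
    result = np.empty(values.shape[0] + 1, dtype=out_dtype)
    result[0] = 0
    # Accumulate straight into the tail of the result; no second array, no copy.
    np.cumsum(values, out=result[1:])
    return result

lengths = np.array([3, 1, 4, 1, 5])
offsets = cumsum_from_zero(lengths)
```

The result matches the np.concatenate and np.append workarounds, but the cumulative sums are written directly into their final location.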
[Numpy-discussion] Re: Automatic binning for np.histogram
I don't think there is an automatic method for correct binning. The methods mentioned in the pull request and the related issue are all based on the assumption that the underlying distribution is Gaussian. There is absolutely no reason to assume that.

Reasonable expectations for automatic binning:
- it will be wrong most of the time.

Reasonable number of bins for a sample of size n:
- max(10, sqrt(n)), to make sure there is a large number of filled bins, while still providing information about the data values for low numbers.

The documentation could point out that automatic binning should only be used for exploring a single data set, as it is unsuited for comparing two different data sets. Automatically binned data is also unsuited for later use in testing distribution similarity.
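As a sketch of the rule of thumb above: `suggested_bin_count` is a hypothetical helper, and rounding sqrt(n) up to an integer is my assumption, since the rule itself does not say how to round. The exponential sample is made up and only serves as an example of non-Gaussian data.

```python
import math

import numpy as np

def suggested_bin_count(n):
    """Hypothetical helper for the max(10, sqrt(n)) rule; rounds sqrt(n) up."""
    return max(10, math.ceil(math.sqrt(n)))

rng = np.random.default_rng(0)
sample = rng.exponential(size=400)   # deliberately not Gaussian

n_bins = suggested_bin_count(sample.size)
counts, edges = np.histogram(sample, bins=n_bins)
```

For comparing two data sets, the same explicit edges array would have to be passed as bins for both, rather than letting each histogram choose its own.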