[Numpy-discussion] Re: ENH: add functionality NpyAppendArray to numpy.format
My memories reappeared:

3. One could think about allowing variable sized .npy files without header modification at all, e.g. by setting the variable sized shape entry (axis 0) to -1. The length of the array would then be inferred from the file size. However, what I personally dislike about that approach is that, given a .npy file, it would be impossible to determine whether it was actually completed or whether for some reason data got lost, e.g. by an incomplete download. Indeed, the mere length is not as reliable as e.g. a sha256 sum, but still better than nothing. Could this be a thing, or is this maybe the preferable solution after all?

On Sun, Nov 7, 2021 at 6:11 PM Michael Siebert wrote:
> Dear all,
>
> I'd like to add the NpyAppendArray functionality, compare
>
> https://github.com/xor2k/npy-append-array (15 stars so far)
>
> and
>
> https://stackoverflow.com/a/64403144/2755796 (10 upvotes so far)
>
> I have prepared a pull request and want to "test the waters" as suggested
> by the message I received when creating the pull request.
>
> So what is NpyAppendArray about?
>
> I love the .npy file format. It is really great! I cannot appreciate the
> .npy capabilities mentioned in
>
> https://numpy.org/devdocs/reference/generated/numpy.lib.format.html
>
> enough, especially its simplicity. No comparison with the struggles we
> had with HDF5. However, there is one feature NumPy currently does not
> provide: a simple, efficient, easy-to-use and safe option to append to
> .npy files (here the text I've used in the GitHub repository above):
>
> Appending to an array created by np.save might be possible under certain
> circumstances, since the .npy total header byte count is required to be
> evenly divisible by 64. Thus, there might be some spare space to grow the
> shape entry in the array descriptor. However, this is not guaranteed and
> might randomly fail.
> Initialize the array with NpyAppendArray(filename) directly, so the
> header will be created with 64 bytes of spare header space for growth.
> Will this be enough? It allows for up to 10^64 >= 2^212 array entries or
> data bits. Indeed, this is less than the number of atoms in the universe.
> However, fully populating such an array, due to limits imposed by quantum
> mechanics, would require more energy than would be needed to boil the
> oceans, compare
>
> https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans
>
> Therefore, a wide range of use cases should be coverable with this
> approach.
>
> Who could use that?
>
> I developed and use NpyAppendArray to efficiently create .npy arrays
> which are larger than main memory and can be loaded by memory mapping
> later, e.g. for deep learning workflows. Another use case is binary log
> files, which could be created on low-end embedded devices and later be
> processed without parsing, optionally again using memory maps.
>
> What does the code look like?
>
> Here is some demo code of how this would look in practice (taken from
> the test file):
>
> def test_NpyAppendArray(tmpdir):
>     arr1 = np.array([[1, 2], [3, 4]])
>     arr2 = np.array([[1, 2], [3, 4], [5, 6]])
>
>     fname = os.path.join(tmpdir, 'npaa.npy')
>
>     with format.NpyAppendArray(fname) as npaa:
>         npaa.append(arr1)
>         npaa.append(arr2)
>         npaa.append(arr2)
>
>     arr = np.load(fname, mmap_mode="r")
>     arr_ref = np.concatenate([arr1, arr2, arr2])
>
>     assert_array_equal(arr, arr_ref)
>
> Some more aspects:
>
> 1. Appending efficiently only works along axis=0, at least for C order
>    (probably different for Fortran order).
> 2. One could also add the 64 bytes of spare space right in np.save.
>    However, I cannot really judge how much of an issue that would be for
>    the users of np.save, and it is not really necessary, since users who
>    want to append to .npy files would create them with NpyAppendArray
>    anyway.
> 3.
> Probably I have forgotten something here; some time has passed since the
> initial GitHub commit.
>
> So what do you think? Yes/No/Maybe? It would be really nice to get some
> feedback on the mailing list here!
>
> Although this might not be perfectly consistent with the protocol, I've
> created the pull request already, just to force myself to finish this up,
> and I'm prepared to fail if there is no interest in getting
> NpyAppendArray directly into numpy ;)
>
> Best from Berlin, Michael

___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com
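[Editorial aside] The 64-byte header padding discussed in this thread can be inspected with NumPy's public header helpers. A minimal sketch (this is not the NpyAppendArray implementation, just a look at the header that np.save writes):

```python
import io
import numpy as np
from numpy.lib import format as npy_format

# np.save pads the ASCII header so the total header byte count is evenly
# divisible by 64; the shape entry lives inside that header, which is why
# appending along axis 0 can sometimes work by rewriting the shape in place.
buf = io.BytesIO()
np.save(buf, np.array([[1, 2], [3, 4]]))

buf.seek(0)
version = npy_format.read_magic(buf)  # e.g. (1, 0) for small plain arrays
shape, fortran_order, dtype = npy_format.read_array_header_1_0(buf)
print(version, shape, fortran_order, dtype)
```

Everything after the header is raw array data, so for a C-order array new rows are simply appended bytes.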
[Numpy-discussion] Re: Deprecate np.MachAr?
Hi,

On Sat, Nov 6, 2021 at 4:54 PM Ralf Gommers wrote:
>
> On Tue, Oct 26, 2021 at 9:20 PM bas van beek wrote:
>>
>> Hi all,
>>
>> The subject of `MachAr` recently came up in
>> https://github.com/numpy/numpy/pull/20132 and an earlier community
>> meeting, most notably: its questionable role as public (and even
>> private?) API.
>>
>> Ever since the introduction of hard-coded `finfo` parameters back in
>> numpy/numpy#8504 there has been progressively less need for computing
>> machine parameters during runtime, to the point where `MachAr` is
>> effectively unused in the numpy codebase itself[1]. From a user-API
>> point of view, the main difference between `finfo` and `MachAr` these
>> days is that the latter produces the same results roughly 4 orders of
>> magnitude slower than the former…
>>
>> Are there any thoughts about deprecating it?
>
> For the record: we discussed this in the community meeting two weeks
> ago, and everyone seemed fine with or in favor of deprecating MachAr.

I haven't looked at this code for a while - but is there any point in
keeping it somewhere outside the NumPy codebase, for use when we hit an
unusual type, e.g. longdouble, and we want to infer `finfo`? My slight
guess is no, that it's too unreliable in that case, and we'd be better
off annotating the remaining code with suggestions on how to get this
information.

Cheers,

Matthew
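[Editorial aside] For context, the hard-coded parameters mentioned above are what `np.finfo` reports today; a quick sketch using only `finfo` (since `MachAr` is the candidate for deprecation):

```python
import numpy as np

# finfo returns precomputed machine parameters for a floating-point type,
# without the runtime probing loops that MachAr performs.
fi = np.finfo(np.float64)
print(fi.eps)   # machine epsilon, 2**-52 for IEEE-754 double
print(fi.bits)  # 64
print(fi.max)   # largest representable finite value
```

For standard IEEE-754 types these values are known constants, which is why computing them at runtime is no longer needed.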
[Numpy-discussion] Re: Revert the return of a single NaN for `np.unique` with floating point numbers?
On Fri, 2021-11-05 at 18:44 -0500, Juan Nunez-Iglesias wrote:
> I agree with the argument to revert. Consistency among libraries should
> be near the top of the list of priorities, as should performance.
> Whether the new behaviour "makes more sense", meanwhile, is debatable.
> I don't have much of an opinion either way.

But while correctness is debatable, is there any reason someone would
prefer getting multiple (potentially very many) NaNs? I admit it feels
more like a trap to me.

About Performance
-----------------

Note that at least on CPUs there is no reason why there even is an
overhead [1]! (We simply miss an `np.islessgreater` function.)

The overhead is a bit annoying admittedly, although pretty typical for
NumPy, I guess. This one seems to be reducible to around 10% if you use
`a != a` instead of `isnan` (the ufunc incurs a lot of overhead). Which
is on par with 1.19.x, since NumPy got faster ;).

About Reverting
---------------

It feels like this should have a compat release note, which seems like a
very strong flag for "not just a bug fix". But maybe there is a good
reason to break with that?

Cheers,

Sebastian

[1] C99 has `islessgreater`, and both SSE2 and AVX have intrinsics which
allow reformulating the whole thing to have exactly the same speed as
before; these intrinsics effectively calculate things like:

    !(a < b) or !(a != b)

Admittedly, the SSE2 version probably sets FPEs for NaNs (but that is a
problem we already have and is even now not really avoidable). So at
least for all relevant CPUs, the only obstacle is the implementation
detail that we currently don't have `islessgreater`.

(I have not found information on whether the C99 `islessgreater` has
decent performance on CUDA.)

(Not surprisingly, MSVC is arguably buggy and does not compile the C99
`islessgreater` to fast byte-code - although plausibly a link-time
optimizer may salvage that.
But since we have hand-coded SSE2/AVX versions of comparisons, even that
would probably not matter.)

> On Fri, 5 Nov 2021, at 4:08 PM, Ralf Gommers wrote:
> >
> > On Mon, Aug 2, 2021 at 7:49 PM Ralf Gommers wrote:
> > >
> > > On Mon, Aug 2, 2021 at 7:04 PM Sebastian Berg wrote:
> > > > Hi all,
> > > >
> > > > In NumPy 1.21, the output of `np.unique` changed in the presence
> > > > of multiple NaNs. Previously, all NaNs were returned, where we
> > > > now only return one (all NaNs were considered unique):
> > > >
> > > > a = np.array([1, 1, np.nan, np.nan, np.nan])
> > > >
> > > > Before 1.21:
> > > >
> > > > >>> np.unique(a)
> > > > array([ 1., nan, nan, nan])
> > > >
> > > > After 1.21:
> > > >
> > > > array([ 1., nan])
> > > >
> > > > This change was requested in an old issue:
> > > >
> > > > https://github.com/numpy/numpy/issues/2111
> > > >
> > > > And happened here:
> > > >
> > > > https://github.com/numpy/numpy/pull/18070
> > > >
> > > > While it has a release note, I am not sure the change got the
> > > > attention it deserved. This would be especially worrying if it
> > > > is a regression for anyone.
> > >
> > > I think it's now the expected answer, not a regression. `unique`
> > > is not an elementwise function that needs to adhere to IEEE-754,
> > > where nan != nan. I can't remember reviewing this change, but it
> > > makes perfect sense to me.
> >
> > It turns out there's a lot more to this story. The short summary of
> > it is that:
> > - use-case wise, it's still true that considering NaNs equal makes
> >   more sense.
> > - scikit-learn has a custom implementation which removes duplicate
> >   NaNs:
> >   https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_encode.py#L7
> > - all other array/tensor libraries match NumPy's old behavior
> >   (return multiple NaNs)
> > - there's a performance and implementation cost for CuPy et al.
> >   to match NumPy's changed behavior (multiple NaNs is the natural
> >   result, no extra checks needed)
> > - there is also a significant performance cost for NumPy, ~25% for
> >   small/medium-size arrays with no/few NaNs (see the benchmarks in
> >   https://github.com/numpy/numpy/pull/18070) - which is a common
> >   case, and that's not negligible like the PR description claims.
> > - the "single NaN" behavior is easy to get as a utility function on
> >   top of the "multiple NaN" one (like scikit-learn does); the
> >   opposite is much harder
> > - for the array API standard, the final decision was to go with the
> >   "multiple NaN" behavior, so we'll need that in `numpy.array_api`.
> > - more discussion in
> >   https://github.com/data-apis/array-api/issues/249
> >
> > It would be good to make up our minds before 1.22.0. Two choices:
> > 1. We can leave it as it is now and work around the issue in
> >    `numpy.array_api`. It would also require CuPy and others to make
> >    changes, which probably cost performance.
> > 2. We can revert it in 1.22.0, and possibly in 1.21.5 if such a
> >    release will be m
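[Editorial aside] The "utility function on top of the multiple-NaN behavior" mentioned in the summary can be sketched in a few lines (hypothetical helper name; scikit-learn's actual `_encode.py` code differs in detail):

```python
import numpy as np

def unique_single_nan(a):
    """Like np.unique, but collapse any NaNs in the result to a single NaN.

    This layers the "single NaN" behavior on top of whatever np.unique
    returns, so it works whether or not the underlying call already
    deduplicates NaNs (NaNs always sort to the end of the unique result).
    """
    u = np.unique(a)
    nan_mask = np.isnan(u)
    if nan_mask.sum() > 1:
        # Keep all non-NaN values plus exactly one NaN.
        u = np.concatenate([u[~nan_mask], u[nan_mask][:1]])
    return u

print(unique_single_nan(np.array([1.0, 1.0, np.nan, np.nan, np.nan])))
```

Going the other direction, recovering multiple NaNs from a single-NaN result, is not possible, which is one of the arguments above for making "multiple NaNs" the base behavior.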
[Numpy-discussion] Re: Make the pickle default protocol 4.
On Sun, 2021-11-07 at 12:36 -0700, Charles R Harris wrote:
> Hi All,
>
> I'd like to propose making the NumPy default pickle protocol 4, the
> same as the Python 3.8 default. That would have the advantage of
> supporting large pickles. The current default protocol is 2, which was
> last the default in Python 2.7.

This sounds like a good idea to me, to align with the lowest supported
Python version. The only question would be whether we have/need a
workaround to save older pickles? I suppose the workaround is likely to
use `pickle` directly instead of `np.save(z)`?

Cheers,

Sebastian

> Thoughts?
>
> Chuck
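[Editorial aside] The workaround Sebastian suggests - calling `pickle` directly with an explicit protocol rather than relying on whatever default NumPy uses - could look like this (a sketch, not NumPy API):

```python
import io
import pickle
import numpy as np

# Force an old pickle protocol explicitly; protocol 2 is readable by
# Python 2.7, regardless of what default np.save would use internally.
arr = np.arange(5)
buf = io.BytesIO()
pickle.dump(arr, buf, protocol=2)

buf.seek(0)
restored = pickle.load(buf)
assert (restored == arr).all()
```

Because ndarrays implement the pickle protocol themselves, this round-trips the array without going through the .npy machinery at all.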
[Numpy-discussion] Re: Make the pickle default protocol 4.
Hi all,

> On 08.11.2021 at 19:15, Sebastian Berg wrote:
>
> On Sun, 2021-11-07 at 12:36 -0700, Charles R Harris wrote:
>> Hi All,
>>
>> I'd like to propose making the NumPy default pickle protocol 4, the
>> same as the Python 3.8 default. That would have the advantage of
>> supporting large pickles. The current default protocol is 2, which was
>> last the default in Python 2.7.
>
> This sounds like a good idea to me, to align with the lowest supported
> Python version.

Are we aligning with the highest pickle protocol supported by the lowest
supported Python version, or with the default protocol that it pickles
with? As long as we have a workaround, we should go as high as possible
in one go. 🤷🏻♂️

> The only question would be whether we have/need a workaround to save
> older pickles? I suppose the workaround is likely to use `pickle`
> directly instead of `np.save(z)`?

I agree here - we need a workaround. I know companies that are somehow
still on 2.7 and may need to pass data back and forth between old and
new Python versions with janky mechanisms, including pickling. It'd be
sad to see those people lose support, especially if it causes no
maintenance burden.

>
> Cheers,
>
> Sebastian
>
>> Thoughts?
>>
>> Chuck

Best regards,
Hameer Abbasi
[Numpy-discussion] Re: branching NumPy 1.22.x
I would really like to get my array API PRs into the next release. They
are https://github.com/numpy/numpy/pull/20066 and
https://github.com/numpy/numpy/pull/19980. Currently both require a few
updates from me before they can be merged. I will ping on them when they
are ready. I hope to have them in that state either today or tomorrow.

Aaron Meurer

On Sun, Nov 7, 2021 at 12:44 PM Charles R Harris wrote:
>
> Hi All,
>
> I am aiming to branch NumPy 1.22.x next weekend. If there are any PRs
> that you think need to be merged before the branch, please raise the
> issue.
>
> Chuck