[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
On Wed, Jun 29, 2022 at 7:10 AM Miles Cranmer wrote: > Regarding 2., did you have a particular approach in mind? This new lookup > table method is already O(n) scaling (similar to a counting sort), so I > cannot fathom a method that, as you suggest, would get significantly better > performance for integer arrays. The sorting here is "free" in some sense > since you aren't spending additional cycles, the table is initialized > pre-sorted. > I'm not an expert, just remember having seen talks and implementations using hashing that claim to be a lot faster. https://douglasorr.github.io/2019-09-hash-vs-sort/article.html is quite interesting, and shows it's a complex topic. Did you look for literature on this? I'd imagine that that exists. Your PRs don't have a References section in the docstring; if this is a published method then that would be great to add. Cheers, Ralf > I could see in the future that one could create a similar approach for > arbitrary datatypes, and you would use a hash table rather than a > fixed-size array. In this case you would similarly get O(n) performance, > and would produce an **unsorted** output. But in any case a hash table > would probably be much less efficient for integers than an array as > implemented here - the latter which also gives you a sorted output as a > side effect. Personally I have used np.unique for integer data far more > than any other type, and I would _guess_ it is similar in the broader > community (though I have no data or insight to support that)–so I think > having a big speedup like this could be quite nice for users. > > Thanks, > Miles > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: ralf.gomm...@googlemail.com > ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
So, this new method is in fact a hash table as discussed in that blog post. However, because it assumes integer arrays, we can go even further than that blog, and simply use `np.arange(ar_min, ar_max + 1)` is the "hash table". Thus, you don't actually need to use a hashing function at all, you can just use the integer itself as the hash of an integer, and use fast array operations. This is why it happens to return a sorted output, even though it's not technically necessary. The scaling is the same, but the overall constant in front of O(n) is much lower. In other words, returning a sorted array for integers gets you the best performance since you can use ``hash(integer)=integer`` (+ assume that your range of integers is relatively small). But for other datatypes like strings, as discussed on that blog post, ``hash(string)`` might have a large range over your strings, and it would be inefficient to allocate the entire thing. The closest analogy to this comparison would be counting sorts vs radix sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort) uses a hash table, whereas a counting sort (https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array of length equal to the range of integer data. A radix sort has scaling O(n*k), for k the key size, and a counting sort has scaling O(n+k), for k the range of your data. However, a counting sort might have a better constant in front of the scaling, for fixed k, since it avoids having to create the hash table, and gets to use super efficient array operations. This PR is analogous to a counting sort, but a radix sort-like method is described on that blog (and would be useful for generic types). Does this help explain things? Cheers, Miles ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
On Wed, Jun 29, 2022 at 2:59 PM Miles Cranmer wrote: > So, this new method is in fact a hash table as discussed in that blog > post. However, because it assumes integer arrays, we can go even further > than that blog, and simply use `np.arange(ar_min, ar_max + 1)` is the "hash > table". Thus, you don't actually need to use a hashing function at all, you > can just use the integer itself as the hash of an integer, and use fast > array operations. This is why it happens to return a sorted output, even > though it's not technically necessary. The scaling is the same, but the > overall constant in front of O(n) is much lower. > > In other words, returning a sorted array for integers gets you the best > performance since you can use ``hash(integer)=integer`` (+ assume that your > range of integers is relatively small). But for other datatypes like > strings, as discussed on that blog post, ``hash(string)`` might have a > large range over your strings, and it would be inefficient to allocate the > entire thing. > > The closest analogy to this comparison would be counting sorts vs radix > sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort) uses a > hash table, whereas a counting sort ( > https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array of > length equal to the range of integer data. A radix sort has scaling O(n*k), > for k the key size, and a counting sort has scaling O(n+k), for k the range > of your data. However, a counting sort might have a better constant in > front of the scaling, for fixed k, since it avoids having to create the > hash table, and gets to use super efficient array operations. This PR is > analogous to a counting sort, but a radix sort-like method is described on > that blog (and would be useful for generic types). > > Does this help explain things? > That is indeed helpful, thanks Miles! Cheers, Ralf ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
My pleasure! Cheers, Miles ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
On Wed, 2022-06-29 at 12:56 +, Miles Cranmer wrote: > So, this new method is in fact a hash table as discussed in that blog > post. However, because it assumes integer arrays, we can go even > further than that blog, and simply use `np.arange(ar_min, ar_max + > 1)` is the "hash table". Thus, you don't actually need to use a > hashing function at all, you can just use the integer itself as the > hash of an integer, and use fast array operations. This is why it > happens to return a sorted output, even though it's not technically > necessary. The scaling is the same, but the overall constant in front > of O(n) is much lower. Yeah, which is why I was happy with adding `kind=`. There was some support (and no-one against it) and it seemed unlikely that there is a "generic" solution that is can just be a great default in all cases. (If there is, deprecating `kind` is probably not terrible either.) Whether the `kind=` should cover methods that do not sort the output here, I am not sure. Maybe that should be its own kwarg. (Kinds that do not return a sorted result can still sort the result and hope it is much smaller). In general, I am happy with the addition though. The only larger argument against it that I can think of, would be if we promoted another library that does the job much better. Cheers, Sebastian > > In other words, returning a sorted array for integers gets you the > best performance since you can use ``hash(integer)=integer`` (+ > assume that your range of integers is relatively small). But for > other datatypes like strings, as discussed on that blog post, > ``hash(string)`` might have a large range over your strings, and it > would be inefficient to allocate the entire thing. > > The closest analogy to this comparison would be counting sorts vs > radix sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort) > uses a hash table, whereas a counting sort > (https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array > of length equal to the range of integer data. A radix sort has > scaling O(n*k), for k the key size, and a counting sort has scaling > O(n+k), for k the range of your data. However, a counting sort might > have a better constant in front of the scaling, for fixed k, since it > avoids having to create the hash table, and gets to use super > efficient array operations. This PR is analogous to a counting sort, > but a radix sort-like method is described on that blog (and would be > useful for generic types). > > Does this help explain things? > Cheers, > Miles > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: sebast...@sipsolutions.net > signature.asc Description: This is a digitally signed message part ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Changes to `np.loadtxt(..., max_rows=)`
Hi all, these changes came up in https://github.com/numpy/numpy/issues/21852 where a user had the use-case to look up the line number until where they want to read a file. The change is that `max_rows` now: * Represents the number of *rows* in the result * Gives a `UserWarning` when empty lines are skipped as a result. While previously `max_rows` used the number of *lines* (except those skipped initially). The difference is for a file formatted like: 1,2,3 # comment 2,3,4 The work-around to get the old version back is: import itertools lines = itertools.islice(open("file"), 0, max_rows) result = np.loadtxt(lines, ...) (Noted in the release notes and `UserWarning` – although the warning text could be improved.) There three possible "actions" I can think of: 1. We can add `max_lines` to do the `itertools` trick for the user. 2. The change is considered too big, we could revert it. 3. We could revert+deprecate the name for a new one, e.g. `nrows` and `nlines`. As an additional point of reference `pandas.read_csv` has `nrows` matching the new behavior. I do not have a strong opinion. I lean towards the new one, Chuck prefers the old meaning (I think). One reasoning for me was that users may also read too few data right now thinking `max_rows` has the new meaning already (i.e. we fix a bug for them). Cheers, Sebastian On Tue, 2022-02-08 at 08:08 -0600, Sebastian Berg wrote: > Hi all, > > just a brief heads up that: > > https://github.com/numpy/numpy/pull/20580 > > is now merged. This moves `np.loadtxt` to C. Mainly making it much > faster. There are also some other improvements and changes though: > > * It now supports `quotechar='"'` to support Excel dialect CSV. > * Parsing some numbers is stricter (e.g. removed support for `_` > or hex float parsing by default). > * `max_rows` now actually counts rows and not lines. A warning > is given if this makes a difference (blank lines). > * Some exception will change, parsing failures now (almost) always > give an informative `ValueError`. > * `converters=callable` is now valid to provide a single converter > for all columns. > > Please test, and let us know if there is any issue or followup you > would like to see. > > We do have possible followups planned > * Consider deprecating the `encoding="bytes"` default which exists > for Python 2 compatibility. > * Consider renaming `skip_rows` to the more precise `skip_lines`. > > Moving to C unlocks possible further improvement, such as full > `csv.Dialect` support. We do not have this on the roadmap, but such > contributions are possible now. > Similarly, it should be possible to rewrite `genfromtxt` based on > this > work. > > Cheers, > > Sebastian > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: sebast...@sipsolutions.net signature.asc Description: This is a digitally signed message part ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Setting C and linker flags for Windows build
Hi, I am very sorry - I feel I should know this, or be able to work it out, but is there a way of setting the flags to the C compiler and the linker, for the Numpy build, on Windows? I'm trying to set the flags for a build with Windows mingw-w64 - but I believe Numpy is ignoring $env:LDFLAGS, $env:CFLAGS and $env:OPT - and I can't see any way of setting these options from the command line. Am I missing something? Cheers, Matthew ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Setting C and linker flags for Windows build
> > > Hi, > > I am very sorry - I feel I should know this, or be able to work it > out, but is there a way of setting the flags to the C compiler and the > linker, for the Numpy build, on Windows? > > I'm trying to set the flags for a build with Windows mingw-w64 - but I > believe Numpy is ignoring $env:LDFLAGS, $env:CFLAGS and $env:OPT - and > I can't see any way of setting these options from the command line. > Am I missing something? > > Cheers, > > Matthew > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: kevin.k.shepp...@gmail.com I think these are all hard coded and non-changable. This was the case when I got NumPy building with clang-cl. Kevin ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com