date:20220629

[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-29 Thread Ralf Gommers

On Wed, Jun 29, 2022 at 7:10 AM Miles Cranmer 
wrote:

> Regarding 2., did you have a particular approach in mind? This new lookup
> table method is already O(n) scaling (similar to a counting sort), so I
> cannot fathom a method that, as you suggest, would get significantly better
> performance for integer arrays. The sorting here is "free" in some sense
> since you aren't spending additional cycles, the table is initialized
> pre-sorted.
>

I'm not an expert, just remember having seen talks and implementations
using hashing that claim to be a lot faster.
https://douglasorr.github.io/2019-09-hash-vs-sort/article.html  is quite
interesting, and shows it's a complex topic.

Did you look for literature on this? I'd imagine that that exists. Your PRs
don't have a References section in the docstring; if this is a published
method then that would be great to add.

Cheers,
Ralf


> I could see in the future that one could create a similar approach for
> arbitrary datatypes, and you would use a hash table rather than a
> fixed-size array. In this case you would similarly get O(n) performance,
> and would produce an **unsorted** output. But in any case a hash table
> would probably be much less efficient for integers than an array as
> implemented here - the latter which also gives you a sorted output as a
> side effect. Personally I have used np.unique for integer data far more
> than any other type, and I would _guess_ it is similar in the broader
> community (though I have no data or insight to support that)–so I think
> having a big speedup like this could be quite nice for users.
>
> Thanks,
> Miles
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: ralf.gomm...@googlemail.com
>
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-29 Thread Miles Cranmer

So, this new method is in fact a hash table as discussed in that blog post. 
However, because it assumes integer arrays, we can go even further than that 
blog, and simply use `np.arange(ar_min, ar_max + 1)` is the "hash table". Thus, 
you don't actually need to use a hashing function at all, you can just use the 
integer itself as the hash of an integer, and use fast array operations. This 
is why it happens to return a sorted output, even though it's not technically 
necessary. The scaling is the same, but the overall constant in front of O(n) 
is much lower.

In other words, returning a sorted array for integers gets you the best 
performance since you can use ``hash(integer)=integer`` (+ assume that your 
range of integers is relatively small). But for other datatypes like strings, 
as discussed on that blog post, ``hash(string)`` might have a large range over 
your strings, and it would be inefficient to allocate the entire thing.

The closest analogy to this comparison would be counting sorts vs radix sorts. 
A radix sort (https://en.wikipedia.org/wiki/Radix_sort) uses a hash table, 
whereas a counting sort (https://en.wikipedia.org/wiki/Counting_sort) 
pre-allocates an array of length equal to the range of integer data. A radix 
sort has scaling O(n*k), for k the key size, and a counting sort has scaling 
O(n+k), for k the range of your data. However, a counting sort might have a 
better constant in front of the scaling, for fixed k, since it avoids having to 
create the hash table, and gets to use super efficient array operations. This 
PR is analogous to a counting sort, but a radix sort-like method is described 
on that blog (and would be useful for generic types).

Does this help explain things?
Cheers,
Miles
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-29 Thread Ralf Gommers

On Wed, Jun 29, 2022 at 2:59 PM Miles Cranmer 
wrote:

> So, this new method is in fact a hash table as discussed in that blog
> post. However, because it assumes integer arrays, we can go even further
> than that blog, and simply use `np.arange(ar_min, ar_max + 1)` is the "hash
> table". Thus, you don't actually need to use a hashing function at all, you
> can just use the integer itself as the hash of an integer, and use fast
> array operations. This is why it happens to return a sorted output, even
> though it's not technically necessary. The scaling is the same, but the
> overall constant in front of O(n) is much lower.
>
> In other words, returning a sorted array for integers gets you the best
> performance since you can use ``hash(integer)=integer`` (+ assume that your
> range of integers is relatively small). But for other datatypes like
> strings, as discussed on that blog post, ``hash(string)`` might have a
> large range over your strings, and it would be inefficient to allocate the
> entire thing.
>
> The closest analogy to this comparison would be counting sorts vs radix
> sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort) uses a
> hash table, whereas a counting sort (
> https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array of
> length equal to the range of integer data. A radix sort has scaling O(n*k),
> for k the key size, and a counting sort has scaling O(n+k), for k the range
> of your data. However, a counting sort might have a better constant in
> front of the scaling, for fixed k, since it avoids having to create the
> hash table, and gets to use super efficient array operations. This PR is
> analogous to a counting sort, but a radix sort-like method is described on
> that blog (and would be useful for generic types).
>
> Does this help explain things?
>

That is indeed helpful, thanks Miles!

Cheers,
Ralf
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-29 Thread Miles Cranmer

My pleasure!

Cheers,
Miles
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-29 Thread Sebastian Berg

On Wed, 2022-06-29 at 12:56 +, Miles Cranmer wrote:
> So, this new method is in fact a hash table as discussed in that blog
> post. However, because it assumes integer arrays, we can go even
> further than that blog, and simply use `np.arange(ar_min, ar_max +
> 1)` is the "hash table". Thus, you don't actually need to use a
> hashing function at all, you can just use the integer itself as the
> hash of an integer, and use fast array operations. This is why it
> happens to return a sorted output, even though it's not technically
> necessary. The scaling is the same, but the overall constant in front
> of O(n) is much lower.


Yeah, which is why I was happy with adding `kind=`.  There was some
support (and no-one against it) and it seemed unlikely that there is a
"generic" solution that is can just be a great default in all cases.
(If there is, deprecating `kind` is probably not terrible either.)

Whether the `kind=` should cover methods that do not sort the output
here, I am not sure.  Maybe that should be its own kwarg.
(Kinds that do not return a sorted result can still sort the result and
hope it is much smaller).

In general, I am happy with the addition though.  The only larger
argument against it that I can think of, would be if we promoted
another library that does the job much better.

Cheers,

Sebastian


> 
> In other words, returning a sorted array for integers gets you the
> best performance since you can use ``hash(integer)=integer`` (+
> assume that your range of integers is relatively small). But for
> other datatypes like strings, as discussed on that blog post,
> ``hash(string)`` might have a large range over your strings, and it
> would be inefficient to allocate the entire thing.
> 
> The closest analogy to this comparison would be counting sorts vs
> radix sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort)
> uses a hash table, whereas a counting sort
> (https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array
> of length equal to the range of integer data. A radix sort has
> scaling O(n*k), for k the key size, and a counting sort has scaling
> O(n+k), for k the range of your data. However, a counting sort might
> have a better constant in front of the scaling, for fixed k, since it
> avoids having to create the hash table, and gets to use super
> efficient array operations. This PR is analogous to a counting sort,
> but a radix sort-like method is described on that blog (and would be
> useful for generic types).
> 
> Does this help explain things?
> Cheers,
> Miles
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: sebast...@sipsolutions.net
> 



signature.asc
Description: This is a digitally signed message part
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Changes to `np.loadtxt(..., max_rows=)`

2022-06-29 Thread Sebastian Berg

Hi all,

these changes came up in https://github.com/numpy/numpy/issues/21852
where a user had the use-case to look up the line number until where
they want to read a file.

The change is that `max_rows` now:
* Represents the number of *rows* in the result
* Gives a `UserWarning` when empty lines are skipped as a result.

While previously `max_rows` used the number of *lines* (except those
skipped initially).  The difference is for a file formatted like:

   1,2,3
   # comment
   2,3,4

The work-around to get the old version back is:

    import itertools
    lines = itertools.islice(open("file"), 0, max_rows)
    result = np.loadtxt(lines, ...)

(Noted in the release notes and `UserWarning` – although the warning
text could be improved.)

There three possible "actions" I can think of:
1. We can add `max_lines` to do the `itertools` trick for the user.
2. The change is considered too big, we could revert it.
3. We could revert+deprecate the name for a new one, e.g. `nrows`
   and `nlines`.

As an additional point of reference `pandas.read_csv` has `nrows`
matching the new behavior.

I do not have a strong opinion.  I lean towards the new one, Chuck
prefers the old meaning (I think).
One reasoning for me was that users may also read too few data right
now thinking `max_rows` has the new meaning already (i.e. we fix a bug
for them).

Cheers,

Sebastian

On Tue, 2022-02-08 at 08:08 -0600, Sebastian Berg wrote:
> Hi all,
> 
> just a brief heads up that:
> 
>     https://github.com/numpy/numpy/pull/20580
> 
> is now merged.  This moves `np.loadtxt` to C.  Mainly making it much
> faster.  There are also some other improvements and changes though:
> 
> * It now supports `quotechar='"'` to support Excel dialect CSV.
> * Parsing some numbers is stricter (e.g. removed support for `_`
>   or hex float parsing by default).
> * `max_rows` now actually counts rows and not lines.  A warning
>   is given if this makes a difference (blank lines).
> * Some exception will change, parsing failures now (almost) always
>   give an informative `ValueError`.
> * `converters=callable` is now valid to provide a single converter
>   for all columns.
> 
> Please test, and let us know if there is any issue or followup you
> would like to see.  
> 
> We do have possible followups planned
> * Consider deprecating the `encoding="bytes"` default which exists
>   for Python 2 compatibility.
> * Consider renaming `skip_rows` to the more precise `skip_lines`.
> 
> Moving to C unlocks possible further improvement, such as full
> `csv.Dialect` support.  We do not have this on the roadmap, but such
> contributions are possible now.
> Similarly, it should be possible to rewrite `genfromtxt` based on
> this
> work.
> 
> Cheers,
> 
> Sebastian
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: sebast...@sipsolutions.net

signature.asc
Description: This is a digitally signed message part
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Setting C and linker flags for Windows build

2022-06-29 Thread Matthew Brett

Hi,

I am very sorry - I feel I should know this, or be able to work it
out, but is there a way of setting the flags to the C compiler and the
linker, for the Numpy build, on Windows?

I'm trying to set the flags for a build with Windows mingw-w64 - but I
believe Numpy is ignoring $env:LDFLAGS, $env:CFLAGS and $env:OPT - and
I can't see any way of setting these options from the command line.
Am I missing something?

Cheers,

Matthew
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Setting C and linker flags for Windows build

2022-06-29 Thread Kevin Sheppard

>
>
> Hi,
>
> I am very sorry - I feel I should know this, or be able to work it
> out, but is there a way of setting the flags to the C compiler and the
> linker, for the Numpy build, on Windows?
>
> I'm trying to set the flags for a build with Windows mingw-w64 - but I
> believe Numpy is ignoring $env:LDFLAGS, $env:CFLAGS and $env:OPT - and
> I can't see any way of setting these options from the command line.
> Am I missing something?
>
> Cheers,
>
> Matthew
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: kevin.k.shepp...@gmail.com


I think these are all hard coded and non-changable.  This was the case when
I got NumPy building with clang-cl.

Kevin
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

[Numpy-discussion] Re: Changes to `np.loadtxt(..., max_rows=)`

[Numpy-discussion] Setting C and linker flags for Windows build

[Numpy-discussion] Re: Setting C and linker flags for Windows build

8 matches

Site Navigation

Mail list logo

Footer information