[Numpy-discussion] Curious performance different with np.unique on arrays of characters

2023-09-14 Thread saladi
Hello -

In the course of some genomics simulations, I seem to have come across a 
curious (to me at least) performance difference in np.unique that I wanted to 
share. (If this is not the right forum for this, please let me know!)

With a np.array of characters (U1), np.unique seems to be much faster when 
doing np.view as int -> np.unique -> np.view as U1 for arrays of decent size. I 
would not have expected this since np.unique knows what's coming in as S1 and 
could handle the view-stuff internally. I've played with this a number of ways 
(e.g. S1 vs U1; int32 vs int64; return_counts = True vs False; 100, 1000, or 
10k elements) and seem to notice the same pattern. A short illustration below 
with U1, int32, return_counts = False, 10 vs 10k.

I wonder if this is actually intended behavior, i.e. the view-stuff is actually 
a good idea for the user to think about and implement if appropriate for their 
usecase (as it is for me).

Best regards,
Shyam


import numpy as np

charlist_10 = np.array(list('ASDFGHJKLZ'), dtype='U1')
charlist_10k = np.array(list('ASDFGHJKLZ' * 1000), dtype='U1')

def unique_basic(x):
return np.unique(x)

def unique_view(x):
return np.unique(x.view(np.int32)).view(x.dtype)

In [27]: %timeit unique_basic(charlist_10)
2.17 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [28]: %timeit unique_view(charlist_10)
2.53 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [29]: %timeit unique_basic(charlist_10k)
204 µs ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [30]: %timeit unique_view(charlist_10k)
66.7 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [31]: np.__version__
Out[31]: '1.25.2'



--
Shyam Saladi
https://shyam.saladi.org
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Curious performance different with np.unique on arrays of characters

2023-09-14 Thread Nathan
Looking at a py-spy profile of a slightly modified version of the code you
shared, it seems the difference comes down to NumPy's sorting
implementation simply being faster for ints than unicode strings. In
particular, it looks like string_quicksort_ is two
or three times slower than quicksort_ when passed the
same data.

We could probably add a special case in the sorting code to improve
performance for sorting single-character arrays. I have no idea if that
would be complicated or would make the code difficult to deal with. I'll
also note that string sorting is a more general problem than integer
sorting, since a generic string sort can't assume that it is handed
single-character strings.

Note also that U1 arrays are arrays of a single *unicode* character, which
in NumPy is stored as a 4-byte codepoint. If all you care about is ASCII or
Latin-1 characters, an S1 encoding will be a bit faster. On my machine,
using S1 makes unique_basic(charlist_10k) go from 466 us to 400 us.
However, I can also rewrite unique_view in that case to cast to int8, which
takes the runtime for unique_view(charlist_10k) from 172 us to 155 us.


On Thu, Sep 14, 2023 at 8:10 AM  wrote:

> Hello -
>
> In the course of some genomics simulations, I seem to have come across a
> curious (to me at least) performance difference in np.unique that I wanted
> to share. (If this is not the right forum for this, please let me know!)
>
> With a np.array of characters (U1), np.unique seems to be much faster when
> doing np.view as int -> np.unique -> np.view as U1 for arrays of decent
> size. I would not have expected this since np.unique knows what's coming in
> as S1 and could handle the view-stuff internally. I've played with this a
> number of ways (e.g. S1 vs U1; int32 vs int64; return_counts = True vs
> False; 100, 1000, or 10k elements) and seem to notice the same pattern. A
> short illustration below with U1, int32, return_counts = False, 10 vs 10k.
>
> I wonder if this is actually intended behavior, i.e. the view-stuff is
> actually a good idea for the user to think about and implement if
> appropriate for their usecase (as it is for me).
>
> Best regards,
> Shyam
>
>
> import numpy as np
>
> charlist_10 = np.array(list('ASDFGHJKLZ'), dtype='U1')
> charlist_10k = np.array(list('ASDFGHJKLZ' * 1000), dtype='U1')
>
> def unique_basic(x):
> return np.unique(x)
>
> def unique_view(x):
> return np.unique(x.view(np.int32)).view(x.dtype)
>
> In [27]: %timeit unique_basic(charlist_10)
> 2.17 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>
> In [28]: %timeit unique_view(charlist_10)
> 2.53 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>
> In [29]: %timeit unique_basic(charlist_10k)
> 204 µs ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>
> In [30]: %timeit unique_view(charlist_10k)
> 66.7 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
> In [31]: np.__version__
> Out[31]: '1.25.2'
>
>
>
> --
> Shyam Saladi
> https://shyam.saladi.org
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: nathan12...@gmail.com
>
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Curious performance different with np.unique on arrays of characters

2023-09-14 Thread Devulapalli, Raghuveer
What processor you are running this on? np.sort uses AVX-512 accelerated 
sorting for np.int32, so just wondering if you that is the reason for this 
difference.

Raghuveer 

> -Original Message-
> From: sal...@caltech.edu 
> Sent: Wednesday, September 13, 2023 6:14 PM
> To: numpy-discussion@python.org
> Subject: [Numpy-discussion] Curious performance different with np.unique on
> arrays of characters
> 
> Hello -
> 
> In the course of some genomics simulations, I seem to have come across a 
> curious
> (to me at least) performance difference in np.unique that I wanted to share. 
> (If
> this is not the right forum for this, please let me know!)
> 
> With a np.array of characters (U1), np.unique seems to be much faster when
> doing np.view as int -> np.unique -> np.view as U1 for arrays of decent size. 
> I
> would not have expected this since np.unique knows what's coming in as S1 and
> could handle the view-stuff internally. I've played with this a number of 
> ways (e.g.
> S1 vs U1; int32 vs int64; return_counts = True vs False; 100, 1000, or 10k
> elements) and seem to notice the same pattern. A short illustration below with
> U1, int32, return_counts = False, 10 vs 10k.
> 
> I wonder if this is actually intended behavior, i.e. the view-stuff is 
> actually a good
> idea for the user to think about and implement if appropriate for their 
> usecase (as
> it is for me).
> 
> Best regards,
> Shyam
> 
> 
> import numpy as np
> 
> charlist_10 = np.array(list('ASDFGHJKLZ'), dtype='U1') charlist_10k =
> np.array(list('ASDFGHJKLZ' * 1000), dtype='U1')
> 
> def unique_basic(x):
> return np.unique(x)
> 
> def unique_view(x):
> return np.unique(x.view(np.int32)).view(x.dtype)
> 
> In [27]: %timeit unique_basic(charlist_10)
> 2.17 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> 
> In [28]: %timeit unique_view(charlist_10)
> 2.53 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> 
> In [29]: %timeit unique_basic(charlist_10k)
> 204 µs ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
> 
> In [30]: %timeit unique_view(charlist_10k)
> 66.7 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
> 
> In [31]: np.__version__
> Out[31]: '1.25.2'
> 
> 
> 
> --
> Shyam Saladi
> https://shyam.saladi.org
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe
> send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: raghuveer.devulapa...@intel.com
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Curious performance different with np.unique on arrays of characters

2023-09-14 Thread Charles R Harris
On Thu, Sep 14, 2023 at 11:34 AM Devulapalli, Raghuveer <
raghuveer.devulapa...@intel.com> wrote:

> What processor you are running this on? np.sort uses AVX-512 accelerated
> sorting for np.int32, so just wondering if you that is the reason for this
> difference.
>
> Raghuveer


We also have radix sort for stable sorting of int8, int16, which should be
quite fast.

Hmm, I wonder if radix sort could be vectorized?

When we dropped Python 2.7, there were some folks who ended up using a
uint8 array subtype for storing data. All they needed to add was automatic
translation to strings for
certain accesses. This gave them a 4x advantage in storage space.



>
>
> > -Original Message-
> > From: sal...@caltech.edu 
> > Sent: Wednesday, September 13, 2023 6:14 PM
> > To: numpy-discussion@python.org
> > Subject: [Numpy-discussion] Curious performance different with np.unique
> on
> > arrays of characters
> >
> > Hello -
> >
> > In the course of some genomics simulations, I seem to have come across a
> curious
> > (to me at least) performance difference in np.unique that I wanted to
> share. (If
> > this is not the right forum for this, please let me know!)
> >
> > With a np.array of characters (U1), np.unique seems to be much faster
> when
> > doing np.view as int -> np.unique -> np.view as U1 for arrays of decent
> size. I
> > would not have expected this since np.unique knows what's coming in as
> S1 and
> > could handle the view-stuff internally. I've played with this a number
> of ways (e.g.
> > S1 vs U1; int32 vs int64; return_counts = True vs False; 100, 1000, or
> 10k
> > elements) and seem to notice the same pattern. A short illustration
> below with
> > U1, int32, return_counts = False, 10 vs 10k.
> >
> > I wonder if this is actually intended behavior, i.e. the view-stuff is
> actually a good
> > idea for the user to think about and implement if appropriate for their
> usecase (as
> > it is for me).
> >
> > Best regards,
> > Shyam
> >
> >
> > import numpy as np
> >
> > charlist_10 = np.array(list('ASDFGHJKLZ'), dtype='U1') charlist_10k =
> > np.array(list('ASDFGHJKLZ' * 1000), dtype='U1')
> >
> > def unique_basic(x):
> > return np.unique(x)
> >
> > def unique_view(x):
> > return np.unique(x.view(np.int32)).view(x.dtype)
> >
> > In [27]: %timeit unique_basic(charlist_10)
> > 2.17 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops
> each)
> >
> > In [28]: %timeit unique_view(charlist_10)
> > 2.53 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops
> each)
> >
> > In [29]: %timeit unique_basic(charlist_10k)
> > 204 µs ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
> >
> > In [30]: %timeit unique_view(charlist_10k)
> > 66.7 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops
> each)
> >
> > In [31]: np.__version__
> > Out[31]: '1.25.2'
> >
> >
> >
> > --
> > Shyam Saladi
> > https://shyam.saladi.org
> > ___
> > NumPy-Discussion mailing list -- numpy-discussion@python.org To
> unsubscribe
> > send an email to numpy-discussion-le...@python.org
> > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> > Member address: raghuveer.devulapa...@intel.com
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: charlesr.har...@gmail.com
>
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com