[Numpy-discussion] npreadtext: `numpy.loadtxt` in C

2021-09-16 Thread Ross Barnowski
Hi all,

This is to announce [`npreadtext`](https://github.com/BIDS-numpy/npreadtext),
a drop-in replacement for `numpy.loadtxt` written in C for improved
performance. We are now at feature parity with `loadtxt`, and would greatly
appreciate your feedback & testing. We hope eventually to include
`npreadtext` in NumPy itself.

## Installation

`npreadtext` has been tested with NumPy v1.18 and higher and can be
installed using:

```
python -m pip install numpy
python -m pip install git+https://github.com/BIDS-numpy/npreadtext
```

To enable the C-accelerated version of `np.loadtxt`, monkey-patch NumPy:

```python
>>> import numpy as np
>>> from npreadtext import monkeypatch_numpy
```

This replaces `np.loadtxt` with `npreadtext._loadtxt`.

## Feedback

You may leave comments here or file issues on the [project issue tracker](
https://github.com/BIDS-numpy/npreadtext/issues). Please also share text
files that strain or break the reader.
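
One way to look for such files is to run the same text through both readers
and compare the results. The sketch below is only an illustration:
`readers_agree` is a hypothetical helper, not part of `npreadtext`, and
`np.genfromtxt` merely stands in for the candidate reader so that the snippet
runs with NumPy alone.

```python
import io

import numpy as np


def readers_agree(reference, candidate, text, **kwargs):
    """Return True if both readers parse `text` into equal arrays.

    Hypothetical helper for illustration; `reference` and `candidate`
    are any callables with a loadtxt-like signature.
    """
    expected = reference(io.StringIO(text), **kwargs)
    actual = candidate(io.StringIO(text), **kwargs)
    return (
        expected.dtype == actual.dtype
        and expected.shape == actual.shape
        and np.array_equal(expected, actual)
    )


sample = "1,2,3\n4,5,6\n"
# Stand-in: np.genfromtxt plays the role of the candidate reader here.
agree = readers_agree(np.loadtxt, np.genfromtxt, sample, delimiter=",")
print(agree)
```

With `npreadtext` installed, the candidate would be the patched reader, and
any input for which the helper returns `False` is worth reporting.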

## Benchmarks

Preliminary benchmarks show a significant improvement in performance:

```
python runtests.py --bench-compare monkeypatch-npreadtext bench_io

    npreadtext    np.loadtxt  speedup  function
+  7.74±0.04ms     146±0.8ms    18.85  bench_io.LoadtxtCSVStructured.time_loadtxt_csv_struct_dtype
+   9.67±0.1ms     181±0.6ms    18.67  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10)
+     969±10μs    17.9±0.1ms    18.48  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 1)
+      950±7μs   14.6±0.04ms    15.39  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 1)
+  9.65±0.03ms     146±0.2ms    15.13  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10)
+  11.8±0.06ms     141±0.3ms    11.96  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 10)
+   11.9±0.1ms     141±0.3ms    11.88  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 10)
+   12.6±0.1ms     150±0.6ms    11.85  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10)
+  1.18±0.01ms    13.9±0.1ms    11.74  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 1)
+  1.19±0.01ms   13.9±0.09ms    11.68  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 1)
+     1.27±0ms   14.7±0.06ms    11.64  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 1)
+  12.4±0.06ms     140±0.6ms    11.28  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(10)
+  1.22±0.02ms   13.8±0.09ms    11.26  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(1)
+   20.8±0.2μs     194±0.5μs     9.32  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100)
+   20.4±0.2μs     162±0.3μs     7.97  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100)
+     1.04±0ms   8.17±0.08ms     7.84  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv([1, 3, 5, 7])
+      884±2μs   6.79±0.02ms     7.68  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv([1, 3])
+  1.56±0.01ms   12.0±0.05ms     7.68  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 1)
+  16.1±0.05ms     122±0.3ms     7.56  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 10)
+  23.4±0.04μs     163±0.9μs     6.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100)
+  22.6±0.09μs     153±0.2μs     6.76  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100)
+   22.9±0.5μs     154±0.7μs     6.72  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100)
+   22.8±0.5μs     150±0.8μs     6.58  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(100)
+      809±8μs   5.10±0.02ms     6.30  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv(2)
+  7.31±0.01ms   42.0±0.08ms     5.75  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(2)
+      748±2μs   4.11±0.04ms     5.50  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(2000)
+   26.0±0.2μs     131±0.3μs     5.02  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 100)
+   87.3±0.4μs       436±1μs     5.00  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(200)
+  2.09±0.01ms   10.1±0.04ms     4.86  bench_io.LoadtxtReadUint64Integers.time_read_uint64(1)
+     2.09±0ms   10.1±0.04ms     4.83  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(1)
+    215±0.5μs      1.03±0ms     4.82  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(1000)
+    217±0.9μs      1.02±0ms     4.72  bench_io.LoadtxtReadUint64Integers.time_read_uint64(1000)
+    123±0.6μs       580±3μs     4.71  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(550)
+    124±0.8μs       573±4μs     4.63  bench_io.LoadtxtReadUint64Integers.time_read_uint64(550)
+  4.15±0.01ms   14.4±0.05ms     3.46  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 1)
+  5
```

[Numpy-discussion] Re: deprecating float(x) for ndim > 0

2021-09-16 Thread Aaron Meurer
On Thu, Sep 16, 2021 at 12:32 AM Nico Schlömer wrote:
>
> > I was playing with this though and was a little surprised to find
> > NumPy allows things like this:
> >
> > >>> a = np.array([1, 2, 3])
> > >>> a[:] = np.array([[[5, 6, 7]]])
> > >>> a
> > array([5, 6, 7])
>
> Thanks Aaron for this example! I hadn't seen this before, and indeed
> the suggested PR doesn't intercept this case. Perhaps that's something
> we should consider deprecating as well. I'll add a comment to the PR;
> let's take it from there.

Is there some valid use-case for this "reverse broadcasting"? It seems
to me like an error would be better. Clearly a[idx] = b should work if
a[idx] and b broadcast to a[idx].shape, but here the broadcast shape
is not the same as a[idx]. I would expect this to generally be a bug
in user code, but maybe I'm missing why this was implemented in the
first place.

The semantics are a bit odd. Clearly something like a[:] =
np.array([[[5, 6, 7], [6, 7, 8]]]) in the above example can't work,
even though they are also both broadcast compatible. So it only works
if they are broadcast compatible and the broadcasting of the lhs just
adds 1s to its shape (which get implicitly removed in the assignment
since slice assignment can't change an array's shape)?
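
The two cases above can be put side by side in a short sketch. It assumes
NumPy 1.20 or later (for `np.broadcast_shapes`), and the variable names are
only illustrative:

```python
import numpy as np

a = np.array([1, 2, 3])

# Extra leading length-1 axes on the right-hand side are accepted.
b = np.array([[[5, 6, 7]]])              # shape (1, 1, 3)
a[:] = b
print(a)

# Ordinary broadcasting of the two shapes gives (1, 1, 3), not a.shape.
shape_ok = np.broadcast_shapes(a.shape, b.shape)
print(shape_ok)

# A right-hand side with a real extra axis is also broadcast compatible
# with `a`, but the assignment is rejected.
c = np.array([[[5, 6, 7], [6, 7, 8]]])   # shape (1, 2, 3)
shape_bad = np.broadcast_shapes(a.shape, c.shape)
try:
    a[:] = c
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

In both cases the broadcast shape differs from `a.shape`; the assignment only
goes through when the difference consists of leading axes of length 1.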

Aaron Meurer

>
> Cheers,
> Nico
>
> On Thu, Sep 16, 2021 at 2:00 AM Aaron Meurer wrote:
> >
> > Presumably this also changes int(), bool(), and complex() in the same way.
> >
> > The array API standard (and numpy.array_api) only requires float(),
> > bool(), and int() (and soon complex()) for dimension 0 arrays (the
> > standard does not have scalars), in part because of this NumPy issue
> > https://data-apis.org/array-api/latest/API_specification/array_object.html#float-self.
> >
> > On Wed, Sep 15, 2021 at 6:18 AM Nico Schlömer wrote:
> > >
> > > Hi everyone,
> > >
> > > This is seeking input on PR [1] which I've worked on with @eric-wieser
> > > and @seberg. It deprecates
> > > ```
> > > float(x)
> > > ```
> > > if `x` is an array of ndim > 0. (It works with all arrays of size 1
> > > right now.) This aligns the behavior of float() on ndarrays with
> > > float() on lists which already fails today. It also deprecates the
> > > implicit conversion to float in assignment expressions like
> > > ```
> > > a = np.array([1, 2, 3])
> > > a[0] = [5]  # deprecated, should be a[0] = 5
> >
> > This already gives a ValueError in NumPy 1.21.1. Do you mean a[0] =
> > np.array([5]) is deprecated?
> >
> > I was playing with this though and was a little surprised to find
> > NumPy allows things like this:
> >
> > >>> a = np.array([1, 2, 3])
> > >>> a[:] = np.array([[[5, 6, 7]]])
> > >>> a
> > array([5, 6, 7])
> >
> > Array assignment allows some sort of reverse broadcasting? Given this
> > behavior, it seems to me that a[0] = np.array([5]) actually should
> > work. Or is the idea that this entire behavior would be deprecated?
> >
> > Aaron Meurer
> >
> > > ```
> > > In general, the PR makes numpy a tad bit stricter on how it treats
> > > scalars vs. single-item arrays.
> > >
> > > The change also prevents the #1 wrong usage of float(), namely for
> > > > extracting the scalar value from an array. One should rather use
> > > > `x[0]` or `x.item()` for that; `x[0]` keeps the NumPy scalar type
> > > > rather than converting the value to a plain Python float.
> > >
> > > To estimate the impact of the PR, I looked at major numpy dependents
> > > like matplotlib, scipy, pandas etc., and of course numpy itself.
> > > Except scipy, all projects were virtually clean to start with. Scipy
> > > needed some changes for all tests to pass without warning, and all of
> > > the changes were improvements. In particular, the deprecation
> > > motivates users to use actual scalars when scalars are needed, e.g.,
> > > in the case of scipy, as the return value of a goal functional.
> > >
> > > > It'd be great if you could try the branch against your own project and
> > > > let us know (here or in the PR) about any problems that you might
> > > > have.
> > >
> > > Thanks!
> > > Nico
> > >
> > > [1] https://github.com/numpy/numpy/pull/10615
> > > [2] https://github.com/numpy/numpy/issues/10404
> > > ___
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion@python.org
> > > https://mail.python.org/mailman/listinfo/numpy-discussion