[Numpy-discussion] Deprecate Promotion of numbers to strings?

2020-04-30 Thread Sebastian Berg
Hi all,

in https://github.com/numpy/numpy/pull/15925 I propose to deprecate
promotion of strings and numbers. I have to double check whether this
has a large effect on pandas, but it currently seems to me that it will
be reasonable.

This means that `np.promote_types("S", "int8")`, etc. will lead to an
error instead of returning `"S4"`.  For the user, I believe the two
main visible changes are that:

np.array(["string", 0])

will stop creating a string array and return either an `object` array
or give an error (object array would be the default currently).

Another, larger visible change is that code such as:

np.concatenate([np.array(["string"]), np.array([2])])

will result in an error instead of returning a string array. (Users
will have to cast manually here.)
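
As a minimal sketch of what "cast manually" could look like (the variable names here are illustrative and not taken from the PR), the numeric array can be converted to a string dtype before concatenating:

```python
import numpy as np

strings = np.array(["string"])
numbers = np.array([2])

# Cast the numeric array to a string dtype explicitly, then concatenate.
# This does not rely on implicit number-to-string promotion.
result = np.concatenate([strings, numbers.astype(str)])

print(result.tolist())     # ['string', '2']
print(result.dtype.kind)   # U
```

The explicit `astype` makes an intermediate copy of the numeric array, which is the cost of spelling out the conversion.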

The alternative is to return an object array also for the concatenate
example.  I somewhat dislike that because `object` is not homogeneously
typed and we thus lose type information.  This also affects functions
that wish to cast inputs to a common type (ufuncs also do this
sometimes).
A further example of this and discussion is at the end of the mail [1].


So the first question is whether we can form an agreement that an error
is the better choice for `concatenate` and `np.promote_types()`.
I.e. there is no one dtype that can faithfully represent both strings
and integers. (This is currently the case e.g. for datetime64 and
float64.)
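
To see how a given NumPy installation treats these combinations, a small probe along the following lines may help. `try_promote` is a hypothetical helper written for this sketch; it assumes that a refused promotion surfaces as a `TypeError` (or a subclass of it), as it does for `datetime64` and `float64`.

```python
import warnings
import numpy as np

def try_promote(a, b):
    """Return np.promote_types(a, b), or None if NumPy refuses to promote."""
    with warnings.catch_warnings():
        # Assumption: a release in a deprecation period may warn here
        # before it errors; silence that so the probe stays quiet.
        warnings.simplefilter("ignore")
        try:
            return np.promote_types(a, b)
        except TypeError:
            return None

print(try_promote("int8", "int16"))      # int16
print(try_promote("M8[s]", "float64"))   # None
print(try_promote("S", "int8"))          # S4 or None, depending on the version
```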


The second question is what to do for:

np.array(["string", 0])

which currently always returns strings.  Arguably, it must also either
return an `object` array, or raise an error (requiring the user to pick
string or object explicitly, e.g. with `dtype=object`).

The default would be to create a FutureWarning that an `object` array
will be returned for `np.asarray(["string", 0])` in the future.
But if we know already that we prefer an error, it would be better to
give a DeprecationWarning right away. (It just does not seem nice to
change the same thing twice even if the workaround is identical.)
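
Whichever outcome is chosen, passing `dtype` explicitly sidesteps the question. A short sketch, with illustrative variable names:

```python
import numpy as np

mixed = ["string", 0]

as_object = np.array(mixed, dtype=object)  # elements keep their Python types
as_text = np.array(mixed, dtype=str)       # every element is converted to str

print(as_object.tolist())  # ['string', 0]
print(as_text.tolist())    # ['string', '0']
```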

Cheers,

Sebastian


[1]

A second more in-depth point is that code such as:

common_dtype = np.result_type(arr1, arr2)  # or promote_types
arr1 = arr1.astype(common_dtype, copy=False)
arr2 = arr2.astype(common_dtype, copy=False)

will currently use `string` in this case while it would error in the
future. This already fails with other type combinations such as
`datetime64` and `float64` at the moment.
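
One way such code could cope with a failing promotion is to let the caller name the dtype to use in that case. The sketch below is only an illustration: `cast_to_common` and its `fallback` parameter are hypothetical and not part of NumPy or of the PR.

```python
import numpy as np

def cast_to_common(arr1, arr2, fallback=None):
    """Cast both arrays to a shared dtype.

    `fallback` (hypothetical) is the dtype to use when NumPy reports
    that no common dtype exists; without it the error propagates.
    """
    try:
        common_dtype = np.result_type(arr1, arr2)
    except TypeError:
        if fallback is None:
            raise
        common_dtype = np.dtype(fallback)
    return (arr1.astype(common_dtype, copy=False),
            arr2.astype(common_dtype, copy=False))

dates = np.array(["2020-04-30"], dtype="datetime64[D]")
floats = np.array([1.5])

# datetime64 and float64 have no common dtype, so the caller decides.
a, b = cast_to_common(dates, floats, fallback=object)
print(a.dtype, b.dtype)  # object object
```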

The main alternative to this proposal is to return `object` for the
common dtype: since an object array is not homogeneously typed, it
arguably can represent both inputs.  I do not quite like this choice
personally because in the above example, it may be that the next line
is something like:

return arr1 * arr2

in which case, the preferred return may be `str` and not `object`.
We currently never promote to `object` unless one of the arrays is
already an `object` array, and that seems like the right choice to me.
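
The current rule about `object` can be checked directly. A brief sketch with made-up inputs:

```python
import numpy as np

ints = np.array([1, 2])
objs = np.array(["a", None], dtype=object)

# One input is already an object array, so the common dtype is object.
print(np.result_type(ints, objs))              # object

# Two numeric inputs promote to a numeric dtype, not to object.
print(np.result_type(ints, np.array([1.5])))   # float64
```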


___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Deprecate Promotion of numbers to strings?

2020-04-30 Thread Eric Wieser
> Another larger visible change will be code such as:
>
> np.concatenate([np.array(["string"]), np.array([2])])
>
> will result in an error instead of returning a string array. (Users
> will have to cast manually here.)

I wonder if we can lessen the blow by allowing
`np.concatenate([np.array(["string"]), np.array([2])], casting='unsafe',
dtype=str)` or similar in its place.
It seems a little unfortunate that with this change, we lose the ability to
concatenate numbers to strings without making intermediate copies.
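
At the time of this message the call above was only a suggestion. NumPy 1.20 and later do accept `dtype` and `casting` keywords on `np.concatenate`, so, assuming such a version is installed, the idea can be tried roughly as follows:

```python
import numpy as np

strings = np.array(["string"])
numbers = np.array([2])

# Assumes NumPy 1.20 or later, where `dtype` and `casting` are accepted.
# casting="unsafe" permits the integer-to-string conversion.
result = np.concatenate([strings, numbers], dtype=str, casting="unsafe")

print(result.tolist())  # ['string', '2']
```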

Eric





Re: [Numpy-discussion] Deprecate Promotion of numbers to strings?

2020-04-30 Thread Stephan Hoyer
On Thu, Apr 30, 2020 at 10:32 AM Sebastian Berg 
wrote:

> Hi all,
>
> in https://github.com/numpy/numpy/pull/15925 I propose to deprecate
> promotion of strings and numbers. I have to double check whether this
> has a large effect on pandas, but it currently seems to me that it will
> be reasonable.
>

Sebastian -- thanks for driving this forward!

Pandas and Xarray already override these casting rules, so I think this is
generally a good idea:
https://github.com/pydata/xarray/blob/3820fb77256682d909c1e41d962e29bec0edd62d/xarray/core/dtypes.py#L34-L42

Note that Xarray also overrides np.promote_types(np.bytes_, np.unicode_) to
object.
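
The kind of override described here can be sketched as below. This is an illustration of the idea only, not Xarray's or pandas' actual code; `promote_or_object` and the two kind sets are hypothetical names.

```python
import numpy as np

NUMERIC_KINDS = set("biufc")  # bool, signed/unsigned int, float, complex
STRING_KINDS = set("SU")      # bytes and unicode strings

def promote_or_object(t1, t2):
    """Promote two dtypes, but return object where strings would mix
    with numbers, or bytes would mix with unicode."""
    d1, d2 = np.dtype(t1), np.dtype(t2)
    kinds = {d1.kind, d2.kind}
    mixes_text_and_numbers = bool(kinds & STRING_KINDS) and bool(kinds & NUMERIC_KINDS)
    mixes_bytes_and_unicode = kinds == STRING_KINDS
    if mixes_text_and_numbers or mixes_bytes_and_unicode:
        return np.dtype(object)
    return np.promote_types(d1, d2)

print(promote_or_object("int8", "U3"))       # object
print(promote_or_object("S3", "U3"))         # object
print(promote_or_object("int8", "float32"))  # float32
```

Checking `dtype.kind` up front means the result does not depend on how the installed NumPy itself handles string and number promotion.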

> This means that `np.promote_types("S", "int8")`, etc. will lead to an
> error instead of returning `"S4"`.  For the user, I believe the two
> main visible changes are that:
>
> np.array(["string", 0])
>
> will stop creating a string array and return either an `object` array
> or give an error (object array would be the default currently).
>

In the long term, I guess this would error as part of the plan to require
explicitly writing dtype=object to get object arrays?


> Another larger visible change will be code such as:
>
> np.concatenate([np.array(["string"]), np.array([2])])
>
> will result in an error instead of returning a string array. (Users
> will have to cast manually here.)
>

I agree, it is better to raise an error than inadvertently cast to
object dtype. This can make errors appear later in strange ways.

We would need to make this change slowly over several releases, e.g., by
issuing a warning first.




Re: [Numpy-discussion] Deprecate Promotion of numbers to strings?

2020-04-30 Thread Sebastian Berg
On Thu, 2020-04-30 at 18:47 +0100, Eric Wieser wrote:
> > Another larger visible change will be code such as:
> > 
> > np.concatenate([np.array(["string"]), np.array([2])])
> > 
> > will result in an error instead of returning a string array. (Users
> > will have to cast manually here.)
> 
> I wonder if we can lessen the blow by allowing
> `np.concatenate([np.array(["string"]), np.array([2])], casting='unsafe',
> dtype=str)` or similar in its place.
> It seems a little unfortunate that with this change, we lose the ability to
> concatenate numbers to strings without making intermediate copies.
> 

I agree we can do that for concatenate and am happy to just add it.
Adding the dtype argument (maybe for now only force-casting is fine?)
to `np.concatenate` seems like a reasonable extension of concatenate
even without the loss of this potential use-case.

- Sebastian




[Numpy-discussion] Season of Docs technical writer

2020-04-30 Thread Ben Nathanson
I look forward to participating in this year's Season of Docs. Though it's
early, I'm eager to start a conversation; I've posted the webpage
https://bennathanson.com/numpy2020 to share my thoughts on contributing.


Re: [Numpy-discussion] Season of Docs technical writer

2020-04-30 Thread Melissa Mendonça
Hi Ben,

That is great news. Thanks for that! Let's keep our fingers crossed and see
if we can participate in the program this year.

Cheers!

- Melissa
