Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Sebastian Berg
On Mo, 2015-11-09 at 20:36 +0100, Ralf Gommers wrote:
> 
> 
> On Mon, Nov 9, 2015 at 7:42 PM, Benjamin Root 
> wrote:
> My personal rule for flexible inputs like that is that it
> should be encouraged so long as it does not introduce
> ambiguity. Furthermore, allowing a scalar as an input doesn't
> add a cognitive disconnect for the user on how to specify
> multiple columns. Therefore, I'd give this a +1.
> 
> 
> On Mon, Nov 9, 2015 at 4:15 AM, Irvin Probst
>  wrote:
> Hi,
> I've recently seen many students, coming from Matlab,
> struggling against the usecols argument of loadtxt.
> Most of them tried something like:
> loadtxt("foo.bar", usecols=2) or the ones with better
> documentation reading skills tried loadtxt("foo.bar",
> usecols=(2)) but none of them understood they had to
> write usecols=[2] or usecols=(2,).
> 
> Is there a policy in numpy stating that this kind of
> argument must be a sequence?
> 
> 
> There isn't. In many/most cases it's array_like, which means scalar,
> sequence or array.
>  

Agree, I think we have, or should have, two types of things there (well,
three, since we certainly have "must be sequence").
Args such as "axes", which is typically just one, so we allow a scalar, but
which can often be generalized to a sequence. And things that are array-likes
(and broadcasting).

So, if this is an array-like, however, the "correct" result could be
different by broadcasting between `1` and `(1,)` analogous to indexing
the full array with usecols:

usecols=1 result:
array([2, 3, 4, 5])

usecols=(1,) result [1]:
array([[2, 3, 4, 5]])

since a scalar row (so just one row) is read and not a 2D array. I tend
to say it should be an array-like argument and not a generalized
sequence argument, just wanted to note that, since I am not sure what
matlab does.
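
The indexing analogy can be sketched with plain NumPy indexing on an array that is already in memory. This is only an illustration: the `table` data and the `select` helper are made up here, and nothing below calls loadtxt itself.

```python
import numpy as np

# Made-up table whose column 1 holds 2, 3, 4, 5 (the values used above).
table = np.array([[1, 2, 10],
                  [1, 3, 20],
                  [1, 4, 30],
                  [1, 5, 40]])
columns = table.T  # axis 0 now runs over the columns


def select(usecols):
    # Hypothetical helper for the array-like interpretation: the index is
    # converted to an array, so the result has shape
    # np.shape(usecols) + (number_of_rows,).
    return columns[np.asarray(usecols)]


scalar_result = select(1)      # shape (4,)
tuple_result = select((1,))    # shape (1, 4)
nested_result = select([[1]])  # shape (1, 1, 4)
```

Under this reading the number of dimensions of the result follows the number of dimensions of the usecols object.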

- Sebastian


[1] could go further and do `usecols=[[1]]` and get
`array([[[2, 3, 4, 5]]])`

> 
> I think that being able to pass an int or a sequence when a
> single column is needed would make this function a bit
> more user friendly for beginners. I would gladly
> submit a PR if no one disagrees.
> 
> +1
> 
> 
> Ralf
> 
> 
> 
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion





Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Irvin Probst

On 10/11/2015 09:19, Sebastian Berg wrote:

> since a scalar row (so just one row) is read and not a 2D array. I tend
> to say it should be an array-like argument and not a generalized
> sequence argument, just wanted to note that, since I am not sure what
> matlab does.


Hi,
By default Matlab reads everything, silently fails on what can't be 
converted into a float and the user has to guess what was read or not.

Say you have a file like this:

2010-01-01 00:00:00 3.026
2010-01-01 01:00:00 4.049
2010-01-01 02:00:00 4.865


>> M=load('CONCARNEAU_2010.txt');
>> M(1:3,:)

ans =

   1.0e+03 *

    2.0100         0    0.0030
    2.0100    0.0010    0.0040
    2.0100    0.0020    0.0049


I think this is a terrible way of doing it even if newcomers might find 
this handy. There are of course optional arguments (even regexps!) but 
to my knowledge almost no Matlab user even knows these arguments are there.


Anyway, I made a PR here https://github.com/numpy/numpy/pull/6656 with 
usecols as an array-like.
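
For comparison, here is a minimal sketch of reading only the numeric column of the sample above with loadtxt. The three lines are held in an in-memory io.StringIO object rather than in CONCARNEAU_2010.txt, and the tuple form of usecols is used, which works with or without the PR.

```python
import io

import numpy as np

# The three sample lines from above, held in memory instead of on disk.
sample = io.StringIO(
    "2010-01-01 00:00:00 3.026\n"
    "2010-01-01 01:00:00 4.049\n"
    "2010-01-01 02:00:00 4.865\n"
)

# Only column 2 is converted to float; the date and time columns are
# never parsed as numbers because they are not selected.
values = np.loadtxt(sample, usecols=(2,))
```

Here the columns that cannot be converted are skipped because the caller asked for that explicitly, not silently.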


Regards.


Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Sebastian Berg
On Di, 2015-11-10 at 10:24 +0100, Irvin Probst wrote:
> On 10/11/2015 09:19, Sebastian Berg wrote:
> > since a scalar row (so just one row) is read and not a 2D array. I tend
> > to say it should be an array-like argument and not a generalized
> > sequence argument, just wanted to note that, since I am not sure what
> > matlab does.
> 
> Hi,
> By default Matlab reads everything, silently fails on what can't be 
> converted into a float and the user has to guess what was read or not.
> Say you have a file like this:
> 
> 2010-01-01 00:00:00 3.026
> 2010-01-01 01:00:00 4.049
> 2010-01-01 02:00:00 4.865
> 
> 
>  >> M=load('CONCARNEAU_2010.txt');
>  >> M(1:3,:)
> 
> ans =
> 
>    1.0e+03 *
> 
>     2.0100         0    0.0030
>     2.0100    0.0010    0.0040
>     2.0100    0.0020    0.0049
> 
> 
> I think this is a terrible way of doing it even if newcomers might find 
> this handy. There are of course optional arguments (even regexps!) but 
> to my knowledge almost no Matlab user even knows these arguments are there.
> 
> Anyway, I made a PR here https://github.com/numpy/numpy/pull/6656 with 
> usecols as an array-like.
> 

Actually, it is the "sequence special case" type ;). (matlab does not
have this, since matlab always returns 2-D I realized).

As I said, if usecols is like indexing, the result should mimic:

arr = np.loadtxt(f)
arr = arr[usecols]

in which case a 1-D array is returned if you put a scalar into
usecols (and you could even generalize usecols to higher dimensional
array-likes).
The way you implemented it -- which is fine, but I want to stress that
there is a real decision being made here --, you always see it as a
sequence but allow a scalar for convenience (i.e. always return a 2-D
array). It is a `sequence of ints or int` type argument and not an
array-like argument in my opinion.
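
The "sequence of ints or int" reading could be sketched roughly as follows. The helper name `as_column_list` is hypothetical and this is not the code from the PR; it only illustrates treating a scalar as a convenience spelling of a one-element sequence.

```python
import operator


def as_column_list(usecols):
    # Hypothetical helper: accept either a single int or a flat sequence
    # of ints and always hand back a flat list of ints.
    try:
        cols = list(usecols)
    except TypeError:
        # Not iterable, so treat it as one column index.
        cols = [usecols]
    # operator.index rejects anything that is not integer-like,
    # including nested sequences such as [[1]].
    return [operator.index(c) for c in cols]


from_scalar = as_column_list(2)         # [2]
from_tuple = as_column_list((2,))       # [2]
from_list = as_column_list([0, 2])      # [0, 2]
```

With this reading, a scalar and a one-element sequence select the same data in the same shape, and nested inputs are rejected instead of adding dimensions.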

- Sebastian


> Regards.


Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Irvin Probst

On 10/11/2015 14:17, Sebastian Berg wrote:

> Actually, it is the "sequence special case" type ;). (matlab does not
> have this, since matlab always returns 2-D I realized).
>
> As I said, if usecols is like indexing, the result should mimic:
>
> arr = np.loadtxt(f)
> arr = arr[usecols]
>
> in which case a 1-D array is returned if you put in a scalar into
> usecols (and you could even generalize usecols to higher dimensional
> array-likes).
> The way you implemented it -- which is fine, but I want to stress that
> there is a real decision being made here --, you always see it as a
> sequence but allow a scalar for convenience (i.e. always return a 2-D
> array). It is a `sequence of ints or int` type argument and not an
> array-like argument in my opinion.


I think we have two separate problems here:

The first one is whether loadtxt should always return a 2D array or 
whether it should match the shape of the usecols argument. From a CS guy's 
point of view I do understand your concern here. Now from a teacher's point 
of view I know many people expect to get a "matrix" (thank you Matlab...) and 
the "purity" of matching the dimension of the usecols variable will be 
seen by many people [1] as nerdy, useless heaviness no one cares about (no 
offense). So whatever you, seasoned numpy devs from this mailing list, 
decide, I think it should be explained in the docstring with very clear 
wording.


My own opinion on this first problem is that loadtxt() should always 
return a 2D array, no less, no more. If I write np.loadtxt(f)[42] it 
means I want to read the whole file and then I explicitly ask for 
transforming the 2-D array loadtxt() returned into a 1-D array. OTOH if 
I write loadtxt(f, usecols=42) it means I don't want to read the other 
columns and I want only this one, but it does not mean that I want to 
change the returned array from 2-D to 1-D. I know this new behavior 
might break a lot of existing code as usecols=(42,) used to return a 1-D 
array, but usecols=42, also returns a 1-D array so the current 
behavior is not consistent imho.


The second problem is about the wording in the docstring. When I see 
"sequence of int or int" I understand I will have to cast into a 1-D 
python list whatever wicked N-dimensional object I use to store my 
column indexes, or hope list(my_object) will do it fine. On the other 
hand when I read "array-like" the function is telling me I don't have to 
worry about my object: as long as numpy knows how to cast it into an 
array it will be fine.


Anyway I think something like that:

import numpy as np
a=[[[2,],[],[],],[],[],[]]
foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)

should just work and return me a 2-D (or 1-D if you like) array with the 
data I asked for and I don't think "a" here is an int or a sequence of 
int (but it's a good example of why loadtxt() should not match the shape 
of the usecols argument).


To make it short, let the reading function read the data in a consistent 
and predictable way and then let the user explicitly change the data's 
shape into anything he likes.
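
One way to picture "it should just work" for a nested object is to flatten it into a plain list of column indexes before reading. The helper `flatten_columns` below is hypothetical, is not something loadtxt does, and assumes the nested object contains only iterables and integers (no strings).

```python
import operator


def flatten_columns(obj):
    # Hypothetical helper: walk arbitrarily nested iterables and yield
    # every integer found, in order. Empty sub-lists contribute nothing.
    try:
        items = iter(obj)
    except TypeError:
        # Not iterable, so it must be a single integer-like index.
        yield operator.index(obj)
        return
    for item in items:
        yield from flatten_columns(item)


a = [[[2,], [], [],], [], [], []]
wanted = list(flatten_columns(a))  # the only column requested is 2
```

The flat list says unambiguously which columns are wanted, while the nesting of `a` carries no information about the shape of the result.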


Regards.

[1] read non CS people trying to switch to numpy/scipy


Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Benjamin Root
Just pointing out np.loadtxt(..., ndmin=2) will always return a 2D array.
Notice that without that option, the result is effectively squeezed. So if
you don't specify that option, and you load up a CSV file with only one
row, you will get a very differently shaped array than if you load up a CSV
file with two rows.
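
A small sketch of the effect, using in-memory text through io.StringIO instead of real CSV files; the sample contents are made up for illustration.

```python
import io

import numpy as np

one_row = "1 2 3\n"
two_rows = "1 2 3\n4 5 6\n"

# Default behaviour: a single-row file is squeezed down to 1-D.
squeezed_one = np.loadtxt(io.StringIO(one_row))    # shape (3,)
squeezed_two = np.loadtxt(io.StringIO(two_rows))   # shape (2, 3)

# With ndmin=2 both results stay 2-D.
kept_one = np.loadtxt(io.StringIO(one_row), ndmin=2)    # shape (1, 3)
kept_two = np.loadtxt(io.StringIO(two_rows), ndmin=2)   # shape (2, 3)
```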

Ben Root



Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Daπid
On 10 November 2015 at 16:07, Irvin Probst 
wrote:

> I know this new behavior might break a lot of existing code as
> usecols=(42,) used to return a 1-D array, but usecols=42, also
> returns a 1-D array so the current behavior is not consistent imho.


42, is exactly the same as (42,). If you want a tuple of tuples,
you have to do ((42,),), but then it raises: TypeError: list indices must
be integers, not tuple.
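
The point about the trailing comma is plain Python syntax and can be checked without numpy; the variable names below are only for illustration.

```python
bare = 42,            # a trailing comma alone already builds a tuple
wrapped = (42,)       # same tuple, with explicit parentheses
just_int = (42)       # parentheses without a comma: still an int
nested = ((42,),)     # a tuple containing a tuple

same_tuple = bare == wrapped
```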

What numpy cares about is that whatever object you give it is iterable, and
its entries are ints, so usecols={0:'a', 5:'b'} is perfectly valid.

I think loadtxt should be a tool to read text files in the least surprising
fashion, and a text file is a 1 or 2D container, so it shouldn't return any
other shapes. Any fancy stuff one may want to do with the output should be
done with the typical indexing tricks. If I want a single column, I would
first be very surprised if I got a 2D array (I was bitten by this design in
MATLAB many many times). For the rare cases where I do want a "fake" 2D
array, I can make it explicit by expanding it with arr[:, np.newaxis], and
then I know that the shape will be (N, 1) and not (1, N). Thus, usecols
should be int or sequence of ints, and the result 1 or 2D.
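
A brief sketch of that explicit expansion, with made-up two-column data read from memory; the tuple form of usecols is used so the example does not depend on scalar support.

```python
import io

import numpy as np

text = "1 10\n2 20\n3 30\n"

# A single requested column comes back as a 1-D array of length N.
col = np.loadtxt(io.StringIO(text), usecols=(1,))   # shape (3,)

# The caller decides explicitly which "fake" 2-D layout is wanted.
as_column = col[:, np.newaxis]   # shape (3, 1)
as_row = col[np.newaxis, :]      # shape (1, 3)
```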


In your example:

a=[[[2,],[],[],],[],[],[]]
foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)

What would the shape of foo be?


/David.


Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Sebastian Berg
On Di, 2015-11-10 at 10:24 -0500, Benjamin Root wrote:
> Just pointing out np.loadtxt(..., ndmin=2) will always return a 2D
> array. Notice that without that option, the result is effectively
> squeezed. So if you don't specify that option, and you load up a CSV
> file with only one row, you will get a very differently shaped array
> than if you load up a CSV file with two rows.
> 

Oh, well I personally think that default squeeze is an abomination :).

Anyway, I just wanted to point out that these are two different possible
logics, and we have to pick one.
I have a slight preference for the indexing/array-like interpretation,
but I am aware that from a usage point of view the sequence one is
likely better.
I could throw in another option: throw an explicit error instead of the
general one.

Anyway, I *really* do not have an opinion about what is better.

Array-like would only suggest that you also accept buffer interface
objects or array_interface stuff. Which in this case is really
unnecessary I think.

- Sebastian



Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Irvin Probst

On 10/11/2015 16:52, Daπid wrote:
> 42, is exactly the same as (42,) If you want a tuple of 
> tuples, you have to do ((42,),), but then it raises: TypeError: list 
> indices must be integers, not tuple.


My bad, I wrote that too fast, please forget this.

> I think loadtxt should be a tool to read text files in the least 
> surprising fashion, and a text file is a 1 or 2D container, so it 
> shouldn't return any other shapes.


And I *do* agree with the "shouldn't return any other shapes" part of 
your sentence. What I was trying to say, admittedly with a very bogus 
example, is that either loadtxt() should always output an array whose 
shape matches the shape of the object passed to usecols or it should 
never do it, and I'm in favor of never.
I'm perfectly aware that what I suggest would break the current behavior 
of usecols=(2,) so I know it does not have the slightest probability of 
being accepted but still, I think that the "least surprising fashion" is 
to always return a 2-D array because for many, many, many people a text 
data file has N lines and M columns and N=1 or M=1 is not a special case.


Anyway I will of course modify my PR according to any decision made here.

> In your example:
>
> a=[[[2,],[],[],],[],[],[]]
> foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)
>
> What would the shape of foo be?


As I said in my previous email:

> should just work and return me a 2-D (or 1-D if you like) array with 
> the data I asked for


So, 1-D or 2-D it is up to you, but as long as there is no ambiguity in 
which columns the user is asking for it should imho work.


Regards.


Re: [Numpy-discussion] Question about structure arrays

2015-11-10 Thread aerojockey
Nathaniel Smith wrote
> On Sat, Nov 7, 2015 at 1:18 PM, aerojockey <pythondev1@> wrote:
>> Hello,
>>
>> Recently I made some changes to a program I'm working on, and found that
>> the changes made it four times slower than before.  After some digging, I
>> found out that one of the new costs was that I added structure arrays.
>> Inside a low-level loop, I create a structure array, populate it in Python,
>> then turn it over to some handwritten C code for processing.  It turned out
>> that, when passed a structure array as a dtype, numpy has to parse the
>> dtype, which included calls to re.match and eval.
>>
>> Now, this is not a big deal for me to work around by using ordinary
>> slicing and such, and also I can improve things by reusing arrays.  Since
>> this is inner loop stuff, sacrificing readability for speed is an
>> appropriate tradeoff.
>>
>> Nevertheless, I was curious if there was a way (or any plans for there to
>> be a way) to compile a structure array dtype.  I realize it's not the
>> bread-and-butter of numpy, but it turned out to be a very convenient
>> feature for my use case (populating an array of structures to pass off
>> to C).
> 
> Does it help to turn your dtype string into a dtype object and then
> pass the dtype object around? E.g.
> 
> In [1]: dt = np.dtype("i4,i4")
> 
> In [2]: np.zeros(2, dtype=dt)
> Out[2]:
> array([(0, 0), (0, 0)],
>       dtype=[('f0', '<i4'), ('f1', '<i4')])
> 
> -n


I actually don't know, since I removed the structure array part about ten
minutes after I posted.  However, I did a quick test of your suggestion, and
indeed numpy calls exec and re.match only when creating the dtype object,
not when creating the array.  So certainly it would have helped.
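
In a loop, that suggestion might look like the sketch below. The names `point_dtype` and `buffers` and the loop count are made up for illustration; the claim that the string is parsed only once rests on the quick test described above.

```python
import numpy as np

# Parse the dtype specification once, outside the hot loop.
point_dtype = np.dtype("i4,i4")

buffers = []
for _ in range(3):
    # The dtype object is passed directly, so the "i4,i4" string does
    # not have to be parsed again on every iteration.
    buffers.append(np.zeros(2, dtype=point_dtype))
```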

I wasn't actually aware you could do that with dtypes.  In fact, I was only
vaguely aware that there were dtype types at all.  Thanks for the suggestion.

Carl Banks


