[Numpy-discussion] loadtxt and usecols

2015-11-09 Thread Irvin Probst

Hi,
I've recently seen many students, coming from Matlab, struggling with 
the usecols argument of loadtxt. Most of them tried something like:
loadtxt("foo.bar", usecols=2), or the ones with better documentation-reading 
skills tried loadtxt("foo.bar", usecols=(2)) (which is still just the int 2), 
but none of them understood they had to write usecols=[2] or usecols=(2,).
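
For illustration, here is a minimal sketch of what they run into (assuming a 
hypothetical whitespace-separated file foo.bar with at least three columns, 
and a numpy version predating the change discussed below):

import numpy as np

np.loadtxt("foo.bar", usecols=2)     # fails: the int is not iterable
np.loadtxt("foo.bar", usecols=(2))   # same failure, (2) is still just the int 2
np.loadtxt("foo.bar", usecols=[2])   # works, reads only the third column
np.loadtxt("foo.bar", usecols=(2,))  # works too, (2,) is a one-element tuple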


Is there a policy in numpy stating that this kind of argument must be a 
sequence? I think that being able to pass an int or a sequence when a 
single column is needed would make this function a bit more 
user-friendly for beginners. I would gladly submit a PR if no one disagrees.


Regards.

--
Irvin


Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Irvin Probst

On 10/11/2015 09:19, Sebastian Berg wrote:

since a scalar row (so just one row) is read and not a 2D array. I tend
to say it should be an array-like argument and not a generalized
sequence argument, just wanted to note that, since I am not sure what
matlab does.


Hi,
By default Matlab reads everything, silently fails on whatever can't be 
converted into a float, and the user has to guess what was actually read and what was not.

Say you have a file like this:

2010-01-01 00:00:00 3.026
2010-01-01 01:00:00 4.049
2010-01-01 02:00:00 4.865


>> M=load('CONCARNEAU_2010.txt');
>> M(1:3,:)

ans =

   1.0e+03 *

2.0100         0    0.0030
2.0100    0.0010    0.0040
2.0100    0.0020    0.0049


I think this is a terrible way of doing it, even if newcomers might find 
this handy. There are of course optional arguments (even regexps!) but 
to my knowledge almost no Matlab user even knows these arguments exist.


Anyway, I made a PR here https://github.com/numpy/numpy/pull/6656 with 
usecols as an array-like.
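
With that change, the intended usage would be along these lines (a sketch of 
the behaviour the PR aims for, not something already released):

import numpy as np

# read only the third (float) column of the file shown above;
# the scalar 2 would be accepted as a shortcut for (2,)
values = np.loadtxt("CONCARNEAU_2010.txt", usecols=2)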


Regards.


Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Irvin Probst

On 10/11/2015 14:17, Sebastian Berg wrote:

Actually, it is the "sequence special case" type ;). (matlab does not
have this, since matlab always returns 2-D I realized).

As I said, if usecols is like indexing, the result should mimic:

arr = np.loadtxt(f)
arr = arr[usecols]

in which case a 1-D array is returned if you put in a scalar into
usecols (and you could even generalize usecols to higher dimensional
array-likes).
The way you implemented it -- which is fine, but I want to stress that
there is a real decision being made here --, you always see it as a
sequence but allow a scalar for convenience (i.e. always return a 2-D
array). It is a `sequence of ints or int` type argument and not an
array-like argument in my opinion.
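
To make the two readings concrete, here is roughly what they mean shape-wise 
(a sketch with a stand-in array instead of an actual file, indexing columns 
rather than rows):

import numpy as np

arr = np.arange(12.0).reshape(4, 3)   # stand-in for the full 2-D result of np.loadtxt(f)

# "indexing" reading: a scalar behaves like arr[:, 2] and drops a dimension
indexed = arr[:, 2]        # shape (4,)

# "sequence of ints" reading: the scalar is shorthand for (2,), the result stays 2-D
as_sequence = arr[:, [2]]  # shape (4, 1)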


I think we have two separate problems here:

The first one is whether loadtxt should always return a 2-D array or 
whether it should match the shape of the usecols argument. From a CS guy's point of 
view I do understand your concern here. Now, from a teacher's point of view, 
I know many people expect to get a "matrix" (thank you Matlab...) and 
the "purity" of matching the dimension of the usecols variable will be 
seen by many people [1] as nerdy, useless heaviness no one cares about (no 
offense). So whatever you, seasoned numpy devs from this mailing list, 
decide, I think it should be explained in the docstring with very clear 
wording.


My own opinion on this first problem is that loadtxt() should always 
return a 2-D array, no less, no more. If I write np.loadtxt(f)[42] it 
means I want to read the whole file and then I explicitly ask for 
turning the 2-D array loadtxt() returned into a 1-D array. On the other hand, if 
I write loadtxt(f, usecols=42) it means I don't want to read the other 
columns, I want only this one, but it does not mean that I want to 
change the returned array from 2-D to 1-D. I know this new behavior 
might break a lot of existing code, as usecols=(42,) used to return a 1-D 
array, but usecols=42 also returns a 1-D array, so the current 
behavior is not consistent IMHO.
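
For what it's worth, with the three-line file I posted earlier the current 
behavior looks like this, and the existing ndmin argument already lets the 
user force a 2-D result:

import numpy as np

data = np.loadtxt("CONCARNEAU_2010.txt", usecols=(2,))           # shape (3,), squeezed to 1-D
data = np.loadtxt("CONCARNEAU_2010.txt", usecols=(2,), ndmin=2)  # shape (3, 1), always 2-D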


The second problem is about the wording in the docstring: when I see 
"sequence of int or int" I understand I will have to cast whatever wicked 
N-dimensional object I use to store my column indexes into a 1-D 
Python list, or hope list(my_object) will do it fine. On the other 
hand, when I read "array-like" the function is telling me I don't have to 
worry about my object: as long as numpy knows how to cast it into an 
array it will be fine.


Anyway, I think something like this:

import numpy as np
a=[[[2,],[],[],],[],[],[]]
foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)

should just work and return me a 2-D (or 1-D if you like) array with the 
data I asked for, and I don't think "a" here is an int or a sequence of 
ints (but it's a good example of why loadtxt() should not match the shape 
of the usecols argument).


To make it short: let the reading function read the data in a consistent 
and predictable way, and then let the user explicitly change the data's 
shape into anything they like.


Regards.

[1] read: non-CS people trying to switch to numpy/scipy


Re: [Numpy-discussion] loadtxt and usecols

2015-11-10 Thread Irvin Probst

On 10/11/2015 16:52, Daπid wrote:
42, is exactly the same as (42,). If you want a tuple of 
tuples, you have to do ((42,),), but then it raises: TypeError: list 
indices must be integers, not tuple.


My bad, I wrote that too fast, please forget this.

I think loadtxt should be a tool to read text files in the least 
surprising fashion, and a text file is a 1 or 2D container, so it 
shouldn't return any other shapes.


And I *do* agree with the "shouldn't return any other shapes" part of 
your phrase. What I was trying to say, admittedly with a very bogus 
example, is that either loadtxt() should always output an array whose 
shape matches the shape of the object passed to usecols, or it should 
never do it, and I'm in favor of never.
I'm perfectly aware that what I suggest would break the current behavior 
of usecols=(2,), so I know it does not have the slightest probability of 
being accepted, but still, I think that the "least surprising fashion" is 
to always return a 2-D array, because for many, many, many people a text 
data file has N lines and M columns, and N=1 or M=1 is not a special case.


Anyway I will of course modify my PR according to any decision made here.

In your example:


a=[[[2,],[],[],],[],[],[]]
foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)

What would the shape of foo be?


As I said in my previous email:

> should just work and return me a 2-D (or 1-D if you like) array with 
the data I asked for


So, 1-D or 2-D, it is up to you, but as long as there is no ambiguity about 
which columns the user is asking for, it should IMHO work.


Regards.


Re: [Numpy-discussion] loadtxt and usecols

2015-11-13 Thread Irvin Probst

On 11/11/2015 18:38, Sebastian Berg wrote:


Sounds fine to me, and considering the squeeze logic (which I think is
unfortunate, but it is not something you can easily change), I would be
for simply adding logic to accept a single integral argument and
otherwise not change anything.
[...]

As said before, the other/additional thing that might be very helpful is
trying to give a more useful error message.



I've modified my PR to (hopefully) match these requests.
https://github.com/numpy/numpy/pull/6656
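
For reference, the gist of what I am aiming for is something along these 
lines (a simplified sketch of the argument handling, not the actual code in 
the PR):

import operator

def _normalize_usecols(usecols):
    # sketch only: accept a single int-like or a sequence of int-likes,
    # and fail with a clearer error message otherwise
    if usecols is None:
        return None
    try:
        return [operator.index(usecols)]
    except TypeError:
        pass
    try:
        return [operator.index(col) for col in usecols]
    except TypeError:
        raise TypeError("usecols must be an int or a sequence of ints, got %r"
                        % (usecols,))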

Regards.

--
Irvin


[Numpy-discussion] About inv/lstsq

2016-03-01 Thread Irvin Probst

Hi,
I'm not sure if I should send this here or to scipy-user, feel free to 
redirect me there if I'm off topic.


So, there is something I don't understand using inv and lstsq in numpy.

I've built *on purpose* an ill-conditioned system to fit a quadric 
a*x**2+b*y**2+c*x*y+d*x+e*y+f; the data points are taken on a narrow 
stripe four times longer than wide. My goal is obviously to find 
(a,b,c,d,e,f), so I built the following matrix:


import numpy as np

# assuming `data` is an (N, 2) array holding the (x, y) coordinates of the samples
A = np.empty((data.shape[0], 6))
A[:,0] = data[:,0]**2
A[:,1] = data[:,1]**2
A[:,2] = data[:,1]*data[:,0]
A[:,3] = data[:,0]
A[:,4] = data[:,1]
A[:,5] = 1


The condition number of A is around 2*1e5 but I can make it much bigger 
if needed by scaling the data along an axis.


I then tried to find the best estimate of X in order to minimize the 
norm of A*X - B with B being my data points and X the vector 
(a,b,c,d,e,f). That's a very basic usage of least squares and it works 
fine with lstsq despite the bad condition number.


However, I was expecting to fail to solve it properly using 
inv(A.T.dot(A)).dot(A.T).dot(B), but actually, as I scaled up the 
condition number, lstsq began to give obviously wrong results (that's 
expected) whereas inv consistently gave "visually good" results. I 
have no residuals to show, but lstsq was just plain wrong (again, that is 
expected when cond(A) rises) while inv "worked". I was expecting 
inv to fail well before lstsq.
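
For anyone who wants to poke at it, this is essentially the comparison I am 
making (a self-contained sketch with synthetic data on a narrow stripe, not 
my actual dataset):

import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(0, 4, 500)            # stripe four times longer...
y = rng.uniform(0, 1, 500)            # ...than wide
A = np.column_stack([x**2, y**2, x*y, x, y, np.ones_like(x)])
true_coeffs = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 4.0])
B = A.dot(true_coeffs) + 1e-3 * rng.randn(x.size)

print("cond(A) =", np.linalg.cond(A))
X_lstsq = np.linalg.lstsq(A, B)[0]                 # least squares solve
X_inv = np.linalg.inv(A.T.dot(A)).dot(A.T).dot(B)  # normal equations via inv
print("lstsq:", X_lstsq)
print("inv  :", X_inv)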


Interestingly, the same dataset fails in Matlab using inv, without any 
scaling of the condition number, while it works using \ (mldivide, i.e. 
least squares). On Octave it works fine with both methods on the 
original dataset; I did not try to scale up the condition number.


So my question is very simple: what's going on here? It looks like 
Matlab, NumPy and Octave all use the same LAPACK functions for inv and 
lstsq. As they don't use the same version of LAPACK I can understand 
that they do not exhibit exactly the same behavior, but how can it be possible to 
have lstsq failing before inv(A.T.dot(A)) when I scale up the condition 
number of A? I feel like I'm missing something obvious but I can't find it.


Thanks.



Re: [Numpy-discussion] ndarray.T2 for 2D transpose

2016-04-07 Thread Irvin Probst

On 06/04/2016 04:11, Todd wrote:


When you try to transpose a 1D array, it does nothing.  This is the 
correct behavior, since transposing a 1D array is meaningless.  
However, this can often lead to unexpected errors since this is rarely 
what you want.  You can convert the array to 2D, using `np.atleast_2d` 
or `arr[None]`, but this makes simple linear algebra computations more 
difficult.


I propose adding an argument to transpose, perhaps called `expand` or 
`expanddim`, which if `True` (it is `False` by default) will force the 
array to be at least 2D.  A shortcut property, `ndarray.T2`, would be 
the same as `ndarray.transpose(True)`



Hello,
My two cents here: I've seen hundreds of people (literally hundreds) 
stumbling on this .T trick with 1-D vectors when they were trying to do 
some linear algebra with numpy, so at first I had the same feeling as 
you. But the real issue was that *all* these people were coming from 
Matlab and expected numpy to behave the same way. Once the logic behind 
1-D vectors was explained, it made sense to most of them and there were no 
more problems.


And by the way, I don't see any way to tell apart a 1-D "row vector" from 
a 1-D "column vector". Think of code mixing an Rn=>R Jacobian matrix and 
some data meant to be used as measurements in a linear system: we 
have J=np.array([1,2,3,4]) and B=np.array([5,6,7,8]), so what would the 
output of J.T2 and B.T2 be?


I think it's much better to get used to writing 
J=np.array([1,2,3,4]).reshape(1,4) and 
B=np.array([5,6,7,8]).reshape(4,1); then you can use .T and @ without 
any verbosity, and at least it forces users (read "my students" here) to 
think twice before writing some linear algebra nonsense.
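
For instance, this is the kind of thing I have my students write (a small sketch):

import numpy as np

J = np.array([1, 2, 3, 4]).reshape(1, 4)   # explicit 1x4 row vector (Jacobian of an Rn=>R map)
B = np.array([5, 6, 7, 8]).reshape(4, 1)   # explicit 4x1 column vector of measurements

print((J @ B).shape)   # (1, 1): an inner product, the shapes say unambiguously what is meant
print((B @ J).shape)   # (4, 4): an outer product, something plain 1-D arrays with .T cannot express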


Regards.


Re: [Numpy-discussion] ndarray.T2 for 2D transpose

2016-04-07 Thread Irvin Probst

On Thu, 7 Apr 2016 14:31:17 -0400, josef.p...@gmail.com wrote:

So this discussion brings up that we also need an easy and obvious
way to make a column vector -- 

maybe:

np.col_vector(arr)



FWIW I would give a +1e42 to something like np.colvect and np.rowvect 
(or whatever variant of these names). This is human-readable and does 
not break anything; it's just an explicit shortcut to 
reshape/atleast_2d/etc.
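
Something as small as this would already do (a hypothetical sketch; 
np.colvect and np.rowvect do not exist, the names are only placeholders):

import numpy as np

def colvect(a):
    # hypothetical helper: return `a` reshaped as an n x 1 column vector
    return np.asarray(a).reshape(-1, 1)

def rowvect(a):
    # hypothetical helper: return `a` reshaped as a 1 x n row vector
    return np.asarray(a).reshape(1, -1)

print(colvect([5, 6, 7, 8]).shape)   # (4, 1)
print(rowvect([1, 2, 3, 4]).shape)   # (1, 4)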


Regards.