[Numpy-discussion] loadtxt and usecols
Hi,

I've recently seen many students coming from Matlab struggle with the usecols argument of loadtxt. Most of them tried something like loadtxt("foo.bar", usecols=2), and the ones with better documentation-reading skills tried loadtxt("foo.bar", usecols=(2)), but none of them understood they had to write usecols=[2] or usecols=(2,).

Is there a policy in numpy stating that this kind of argument must be a sequence? I think that being able to pass either an int or a sequence when a single column is needed would make this function a bit more user-friendly for beginners. I would gladly submit a PR if no one disagrees.

Regards.

-- Irvin
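For what it's worth, a minimal sketch of the confusion being described (the file contents and names are invented for illustration):

import numpy as np
from io import StringIO

text = "1 2 3\n4 5 6\n"

# a real sequence works and selects the third column -> array([3., 6.])
print(np.loadtxt(StringIO(text), usecols=[2]))

# (2) is just the integer 2, not a tuple, so this is the same as usecols=2;
# with the loadtxt discussed in this thread that raised
# "TypeError: 'int' object is not iterable" (later releases accept a plain int)
try:
    print(np.loadtxt(StringIO(text), usecols=(2)))
except TypeError as exc:
    print(exc)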
Re: [Numpy-discussion] loadtxt and usecols
On 10/11/2015 09:19, Sebastian Berg wrote:
> since a scalar row (so just one row) is read and not a 2D array. I tend
> to say it should be an array-like argument and not a generalized
> sequence argument, just wanted to note that, since I am not sure what
> matlab does.

Hi,

By default Matlab reads everything, silently fails on whatever can't be converted into a float, and the user has to guess what was read or not. Say you have a file like this:

2010-01-01 00:00:00 3.026
2010-01-01 01:00:00 4.049
2010-01-01 02:00:00 4.865

>> M=load('CONCARNEAU_2010.txt');
>> M(1:3,:)

ans =

   1.0e+03 *

    2.0100         0    0.0030
    2.0100    0.0010    0.0040
    2.0100    0.0020    0.0049

I think this is a terrible way of doing it, even if newcomers might find it handy. There are of course optional arguments (even regexps!), but to my knowledge almost no Matlab user even knows these arguments are there.

Anyway, I made a PR here https://github.com/numpy/numpy/pull/6656 with usecols as an array-like.

Regards.
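As a rough illustration of what the PR proposes at this point in the thread (the exact semantics are still being discussed below, and the file contents here are invented):

import numpy as np
from io import StringIO

text = "2010 0 3.026\n2010 1 4.049\n2010 2 4.865\n"

# a plain int, a tuple or any array-like would all be accepted for usecols
col = np.loadtxt(StringIO(text), usecols=2)                  # proposed: same as usecols=(2,)
cols = np.loadtxt(StringIO(text), usecols=np.array([0, 2]))  # array-like selection
print(col)   # the single selected column
print(cols)  # three rows, two columns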
Re: [Numpy-discussion] loadtxt and usecols
On 10/11/2015 14:17, Sebastian Berg wrote:
> Actually, it is the "sequence special case" type ;). (matlab does not
> have this, since matlab always returns 2-D I realized).
> As I said, if usecols is like indexing, the result should mimic:
>
>     arr = np.loadtxt(f)
>     arr = arr[usecols]
>
> in which case a 1-D array is returned if you put in a scalar into
> usecols (and you could even generalize usecols to higher dimensional
> array-likes).
> The way you implemented it -- which is fine, but I want to stress that
> there is a real decision being made here --, you always see it as a
> sequence but allow a scalar for convenience (i.e. always return a 2-D
> array). It is a `sequence of ints or int` type argument and not an
> array-like argument in my opinion.

I think we have two separate problems here.

The first one is whether loadtxt should always return a 2-D array or should match the shape of the usecols argument. From a CS guy's point of view I do understand your concern here. From a teacher's point of view, though, I know many people expect to get a "matrix" (thank you Matlab...), and the "purity" of matching the dimension of the usecols variable will be seen by many people [1] as a nerdy, useless heaviness no one cares about (no offense). So whatever you, seasoned numpy devs from this mailing list, decide, I think it should be explained in the docstring with very clear wording.

My own opinion on this first problem is that loadtxt() should always return a 2-D array, no less, no more. If I write np.loadtxt(f)[42], it means I want to read the whole file and then I explicitly ask for transforming the 2-D array loadtxt() returned into a 1-D array. On the other hand, if I write loadtxt(f, usecols=42), it means I don't want to read the other columns and I want only this one, but it does not mean that I want to change the returned array from 2-D to 1-D. I know this new behavior might break a lot of existing code, as usecols=(42,) used to return a 1-D array, but usecols=42, also returns a 1-D array, so the current behavior is not consistent imho.

The second problem is about the wording in the docstring: when I see "sequence of int or int" I understand I will have to cast into a 1-D Python list whatever wicked N-dimensional object I use to store my column indexes, or hope list(my_object) will do it fine. On the other hand, when I read "array-like" the function is telling me I don't have to worry about my object: as long as numpy knows how to cast it into an array it will be fine.

Anyway, I think something like this:

import numpy as np
a=[[[2,],[],[],],[],[],[]]
foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)

should just work and return me a 2-D (or 1-D if you like) array with the data I asked for, and I don't think "a" here is an int or a sequence of int (but it's a good example of why loadtxt() should not match the shape of the usecols argument).

To make it short: let the reading function read the data in a consistent and predictable way, and then let the user explicitly change the data's shape into anything he likes.

Regards.

[1] read: non-CS people trying to switch to numpy/scipy
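To make the two readings of the argument concrete, a small sketch (file contents invented; the "indexing-like" lines only show what that interpretation would imply, not what loadtxt actually does):

import numpy as np
from io import StringIO

text = "1 2 3\n4 5 6\n"
arr = np.loadtxt(StringIO(text))  # shape (2, 3)

# "indexing-like" reading: the result mimics arr[:, usecols], so the output
# shape follows the shape of usecols
print(arr[:, 2].shape)         # (2,)      scalar index -> 1-D
print(arr[:, [2]].shape)       # (2, 1)    sequence -> 2-D
print(arr[:, [[0, 2]]].shape)  # (2, 1, 2) usecols generalized to N-D

# "always 2-D" reading: loadtxt returns rows x selected-columns no matter how
# usecols is spelled, and the caller reshapes afterwards if needed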
Re: [Numpy-discussion] loadtxt and usecols
On 10/11/2015 16:52, Daπid wrote:
> 42, is exactly the same as (42,). If you want a tuple of tuples, you
> have to do ((42,),), but then it raises:
> TypeError: list indices must be integers, not tuple.

My bad, I wrote that too fast, please forget it.

> I think loadtxt should be a tool to read text files in the least
> surprising fashion, and a text file is a 1 or 2D container, so it
> shouldn't return any other shapes.

And I *do* agree with the "shouldn't return any other shapes" part of your phrase. What I was trying to say, admittedly with a very bogus example, is that either loadtxt() should always output an array whose shape matches the shape of the object passed to usecols, or it should never do it, and I'm in favor of never.

I'm perfectly aware that what I suggest would break the current behavior of usecols=(2,), so I know it does not have the slightest probability of being accepted, but still, I think that the "least surprising fashion" is to always return a 2-D array, because for many, many, many people a text data file has N lines and M columns, and N=1 or M=1 is not a special case. Anyway, I will of course modify my PR according to any decision made here.

> In your example:
>
> a=[[[2,],[],[],],[],[],[]]
> foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)
>
> What would the shape of foo be?

As I said in my previous email:

> should just work and return me a 2-D (or 1-D if you like) array with
> the data I asked for

So, 1-D or 2-D, it is up to you, but as long as there is no ambiguity in which columns the user is asking for, it should imho work.

Regards.
Re: [Numpy-discussion] loadtxt and usecols
On 11/11/2015 18:38, Sebastian Berg wrote:
> Sounds fine to me, and considering the squeeze logic (which I think is
> unfortunate, but it is not something you can easily change), I would be
> for simply adding logic to accept a single integral argument and
> otherwise not change anything.
> [...]
> As said before, the other/additional thing that might be very helpful
> is trying to give a more useful error message.

I've modified my PR to (hopefully) match these requests:
https://github.com/numpy/numpy/pull/6656

Regards.

-- Irvin
[Numpy-discussion] About inv/lstsq
Hi, I'm not sure if I should send this here or to scipy-user, feel free to redirect me there if I'm off topic.

So, there is something I don't understand when using inv and lstsq in numpy. I've built, *on purpose*, an ill-conditioned system to fit a quadric a*x**2 + b*y**2 + c*x*y + d*x + e*y + f; the data points are taken on a narrow stripe four times longer than wide. My goal is obviously to find (a,b,c,d,e,f), so I built the following matrix:

A[:,0] = data[:,0]**2
A[:,1] = data[:,1]**2
A[:,2] = data[:,1]*data[:,0]
A[:,3] = data[:,0]
A[:,4] = data[:,1]
A[:,5] = 1

The condition number of A is around 2e5, but I can make it much bigger if needed by scaling the data along an axis.

I then tried to find the best estimate of X in order to minimize the norm of A*X - B, with B being my data points and X the vector (a,b,c,d,e,f). That's a very basic use of least squares, and it works fine with lstsq despite the bad condition number. However, I was expecting to fail to solve it properly using inv(A.T.dot(A)).dot(A.T).dot(B), but actually, as I scaled up the condition number, lstsq began to give obviously wrong results (that's expected) whereas using inv consistently gave "visually good" results. I have no residuals to show, but lstsq was just plain wrong (again, that is expected when cond(A) rises) while inv "worked". I was expecting to see inv fail well before lstsq.

Interestingly, the same dataset fails in Matlab using inv without any scaling of the condition number, while it works using \ (mldivide, i.e. least squares). On Octave it works fine using both methods with the original dataset; I did not try to scale up the condition number.

So my question is very simple: what's going on here? It looks like Matlab, Numpy and Octave all use the same LAPACK functions for inv and lstsq. As they don't use the same version of LAPACK, I can understand that they do not exhibit exactly the same behavior, but how can it be possible to have lstsq failing before inv(A.T.dot(A)) when I scale up the condition number of A? I feel like I'm missing something obvious but I can't find it.

Thanks.
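A self-contained sketch of the kind of comparison described above (the data are synthetic and the quadric coefficients invented; this is not the original CONCARNEAU dataset):

import numpy as np

rng = np.random.default_rng(0)

# synthetic points on a narrow stripe: x spans 4 units, y only 1
x = rng.uniform(0.0, 4.0, 500)
y = rng.uniform(0.0, 1.0, 500)
a, b, c, d, e, f = 1.0, -2.0, 0.5, 3.0, -1.0, 0.25  # invented "true" quadric
z = a*x**2 + b*y**2 + c*x*y + d*x + e*y + f + rng.normal(0, 1e-3, x.size)

# design matrix for z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f
A = np.column_stack([x**2, y**2, x*y, x, y, np.ones_like(x)])
print("cond(A) =", np.linalg.cond(A))

# least-squares solution (SVD-based)
coef_lstsq, *_ = np.linalg.lstsq(A, z, rcond=None)

# normal equations: inv(A^T A) A^T z -- note this squares the condition number
coef_inv = np.linalg.inv(A.T @ A) @ A.T @ z

print("lstsq:", coef_lstsq)
print("inv  :", coef_inv)

Since cond(A.T A) is roughly cond(A)**2, the normal-equations route is the one usually expected to break down first, which is what makes the observation in the message surprising.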
Re: [Numpy-discussion] ndarray.T2 for 2D transpose
On 06/04/2016 04:11, Todd wrote:
> When you try to transpose a 1D array, it does nothing. This is the
> correct behavior, since transposing a 1D array is meaningless. However,
> this can often lead to unexpected errors since this is rarely what you
> want. You can convert the array to 2D, using `np.atleast_2d` or
> `arr[None]`, but this makes simple linear algebra computations more
> difficult. I propose adding an argument to transpose, perhaps called
> `expand` or `expanddim`, which if `True` (it is `False` by default)
> will force the array to be at least 2D. A shortcut property,
> `ndarray.T2`, would be the same as `ndarray.transpose(True)`

Hello,

My two cents here: I've seen hundreds of people (literally hundreds) stumble on this .T trick with 1D vectors when they were trying to do some linear algebra with numpy, so at first I had the same feeling as you. But the real issue was that *all* these people were coming from Matlab and expected numpy to behave the same way. Once the logic behind 1D vectors was explained, it made sense to most of them and there were no more problems.

And by the way, I don't see any way to tell apart a 1D "row vector" from a 1D "column vector". Think of a code mixing a Rn=>R Jacobian matrix and some data supposed to be used as measurements in a linear system: we have J=np.array([1,2,3,4]) and B=np.array([5,6,7,8]); what would the output of J.T2 and B.T2 be?

I think it's much better to get used to writing J=np.array([1,2,3,4]).reshape(1,4) and B=np.array([5,6,7,8]).reshape(4,1); then you can use .T and @ without any verbosity, and at least it forces users (read "my students" here) to think twice before writing some linear algebra nonsense.

Regards.
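As an illustration of the explicit-shape style advocated above (the numbers are the ones from the message; calling them a Jacobian and measurements is only for the example):

import numpy as np

# explicit 2-D shapes: a 1x4 "row" Jacobian and a 4x1 "column" of measurements
J = np.array([1, 2, 3, 4]).reshape(1, 4)
B = np.array([5, 6, 7, 8]).reshape(4, 1)

print(J @ B)      # (1,4) @ (4,1) -> 1x1 matrix [[70]]
print(J.T @ B.T)  # (4,1) @ (1,4) -> 4x4 outer product
print(J.T.shape)  # (4, 1): .T is meaningful once the array is 2-D

# with 1-D arrays, .T is a no-op and @ gives a plain scalar dot product
v = np.array([1, 2, 3, 4])
print(v.T.shape)                   # (4,), unchanged
print(v @ np.array([5, 6, 7, 8]))  # 70, a scalar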
Re: [Numpy-discussion] ndarray.T2 for 2D transpose
On Thu, 7 Apr 2016 14:31:17 -0400, josef.p...@gmail.com wrote:
> So this discussion brings up that we also need an easy and obvious way
> to make a column vector -- maybe:
> np.col_vector(arr)

FWIW I would give a +1e42 to something like np.colvect and np.rowvect (or whatever variant of these names). This is human-readable and does not break anything; it's just an explicit shortcut to reshape/atleast_2d/etc.

Regards.
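For the sake of discussion, a minimal sketch of what such helpers could look like (the names colvect/rowvect and their exact behaviour are hypothetical; nothing like this exists in numpy):

import numpy as np

def colvect(a):
    # hypothetical helper: return `a` as an (n, 1) column vector
    return np.asarray(a).reshape(-1, 1)

def rowvect(a):
    # hypothetical helper: return `a` as a (1, n) row vector
    return np.asarray(a).reshape(1, -1)

v = [5, 6, 7, 8]
print(colvect(v).shape)  # (4, 1)
print(rowvect(v).shape)  # (1, 4)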