On 04/04/2011 11:20 AM, Charles R Harris wrote:


On Mon, Apr 4, 2011 at 9:59 AM, Bruce Southey wrote:

    On 03/31/2011 12:02 PM, Derek Homeier wrote:
    > On 31 Mar 2011, at 17:03, Bruce Southey wrote:
    >
    >> This is an invalid ticket because the docstring clearly states, in
    >> 3 different yet critical places, that missing values are not handled
    >> here:
    >>
    >> "Each row in the text file must have the same number of values."
    >> "genfromtxt : Load data with missing values handled as specified."
    >> "   This function aims to be a fast reader for simply formatted files.
    >>     The `genfromtxt` function provides more sophisticated handling of,
    >>     e.g., lines with missing values."
    >>
    >> Really I am trying to separate the usage of loadtxt and genfromtxt to
    >> avoid unnecessary duplication and confusion. Part of this is
    >> historical because loadtxt was added in 2007 and genfromtxt was added
    >> in 2009. So really certain features of loadtxt have been 'kept' for
    >> backwards compatibility purposes yet these features can be 'abused' to
    >> handle missing data. But I really consider that any missing values
    >> should cause loadtxt to fail.
    >>
    > OK, I was not aware of the design issues of loadtxt vs. genfromtxt -
    > you could probably say also for historical reasons since I have not
    > used genfromtxt much so far.
    > Anyway the docstring statement "Converters can also be used to
    >           provide a default value for missing data:"
    > then appears quite misleading, or an invitation to abuse, if you will.
    > It would be better to remove this from the documentation, or to
    > explicitly discourage users from using converters instead of genfromtxt
    > (I don't see how you could completely prevent using converters in
    > this way).
    >
    >> The patch is incorrect because it should not include a space in the
    >> split() as indicated in the comment by the original reporter. Of
    > The split('\r\n') alone caused test_dtype_with_object(self) to fail,
    > probably because it relies on stripping the blanks. But maybe the test
    > is ill-formed?
    >
    >> course a corrected patch alone still is not sufficient to address the
    >> problem without the user providing the correct converter. Also you
    >> start to run into problems with multiple delimiters (such as one space
    >> versus two spaces) so you start down the path to add all the features
    >> that duplicate genfromtxt.
    > Given that genfromtxt provides that functionality more conveniently,
    > I agree again users should be encouraged to use this instead of
    > converters.
    > But the actual tab problem in fact causes an issue not related to
    > missing values at all (well, depending on what you call a missing
    > value).
    > I am describing an example on the ticket.
    >
    > Cheers,
    >                                       Derek
    >
    > _______________________________________________
    > NumPy-Discussion mailing list
    > [email protected]
    > http://mail.scipy.org/mailman/listinfo/numpy-discussion
    Okay, I see that ticket 1071 got closed, which I am fine with.

    I think that your following example should be a test because the two
    spaces should not be removed with a tab delimiter:
    np.loadtxt(StringIO("aa\tbb\n \t \ncc\t"), delimiter='\t',
    dtype=np.dtype([('label', 'S4'), ('comment', 'S4')]))


Make a test and we'll put it in.

Chuck


I know!
Trying to write one made me realize that loadtxt is not handling string arrays correctly. So I have to check more on this as I think loadtxt is giving a 1-d array instead of a 2-d array.
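
A minimal sketch of the 1-d versus 2-d question, assuming only standard NumPy array construction (the variable names `records` and `plain` are illustrative and not from the thread). An array with a structured dtype holds one record per row, so it is 1-d with named fields, while a plain string dtype built from a nested list is 2-d:

```python
import numpy as np

# Structured dtype with two 2-byte string fields, as used in this thread.
dt = np.dtype([('label', 'S2'), ('comment', 'S2')])

# Each tuple is one record, so the result is a 1-d array of 3 records.
records = np.array([('aa', 'bb'), (' ', ' '), ('cc', 'dd')], dtype=dt)
print(records.shape)       # shape of the record array
print(records['label'])    # columns are reached by field name

# A plain (non-structured) string dtype built from nested lists is 2-d.
plain = np.array([['aa', 'bb'], [' ', ' '], ['cc', 'dd']], dtype='S2')
print(plain.shape)
```

If this is what loadtxt returns for a structured dtype, a 1-d result would be expected rather than a sign of a bug, but that still needs checking against loadtxt itself.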

I do agree with you, Pierre, but this is a nice corner case that Derek raised, where a space does not necessarily mean a missing value when there is a tab delimiter:

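from io import StringIO
import numpy as np
data = StringIO("aa\tbb\n \t \ncc\tdd")
dt = np.dtype([('label', 'S2'), ('comment', 'S2')])
test = np.loadtxt(data, delimiter="\t", dtype=dt)
# A structured dtype gives a 1-d array of records, so build control from tuples.
control = np.array([('aa', 'bb'), (' ', ' '), ('cc', 'dd')], dtype=dt)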

So 'test' and 'control' should give the same array.
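
Why the whitespace handling matters can be seen with plain string splitting, independent of loadtxt. The sketch below is not loadtxt's actual code; the helpers `split_keep` and `split_strip` are hypothetical names used only to contrast removing just the line ending with stripping all surrounding whitespace before splitting on the tab:

```python
def split_keep(line, delimiter="\t"):
    """Remove only the line ending, then split on the delimiter."""
    return line.rstrip("\r\n").split(delimiter)


def split_strip(line, delimiter="\t"):
    """Strip all surrounding whitespace first, then split on the delimiter."""
    return line.strip().split(delimiter)


rows = ["aa\tbb\n", " \t \n", "cc\tdd\n"]

kept = [split_keep(row) for row in rows]
stripped = [split_strip(row) for row in rows]

print(kept)      # the middle row keeps its two single-space fields
print(stripped)  # the middle row collapses, because tab and space are both whitespace
```

With a tab delimiter, the single-space fields survive only when the line ending alone is removed, which is the behaviour the proposed test is meant to pin down.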

Bruce