[Numpy-discussion] Use of NameValidator in np.genfromtxt is inconsistent with the rules for naming structured array fields

Alistair Muldal Sat, 21 Mar 2015 07:33:24 -0700

Hi all,

I originally posted this to the issue tracker (https://github.com/numpy/numpy/issues/5686), and am posting here as well at the request of charris.

Currently, np.genfromtxt uses a numpy.lib._iotools.NameValidator which mangles field names by replacing spaces, stripping out certain non-alphanumeric characters, etc.:


    import numpy as np
    from io import BytesIO

    s = b'name,name with spaces,2*(x-1)!\n1,2,3\n4,5,6'  # bytes, so BytesIO(s) also works on Python 3
    x = np.genfromtxt(BytesIO(s), delimiter=',', names=True)
    print(repr(x))
    # array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
    #       dtype=[('name', '<f8'), ('name_with_spaces', '<f8'), ('2x1', '<f8')])

This behaviour has been the cause of some confusion in the past, e.g. http://stackoverflow.com/q/29097917/1461210 and http://stackoverflow.com/q/16020137/1461210. Part of the issue is that it's currently not very well covered by the documentation for np.genfromtxt. At best, it's alluded to in the descriptions for some of the keyword arguments ('deletechars', 'autostrip', 'replace_space' etc.).
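To see the mangling in isolation, the validator can be called directly on a list of names. This is only a sketch: numpy.lib._iotools is a private module, so the import path and the default settings are assumptions that may differ between NumPy versions.

```python
# NameValidator lives in a private module; its location and defaults
# are not a stable public API and may change between NumPy releases.
from numpy.lib._iotools import NameValidator

raw_names = ['name', 'name with spaces', '2*(x-1)!']

# With default settings, spaces are replaced by '_' and the characters
# in the default 'deletechars' set are dropped.
validated = NameValidator().validate(raw_names)
print(validated)
```

These are the same names that end up in the dtype of the genfromtxt result above.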

However, I think the more fundamental problem is that this behaviour seems to be inconsistent with the rules for naming the fields in structured arrays. In the example above, all of the original field names are perfectly legal:


    names = ['name', 'name with spaces', '2*(x-1)!']
    types = ('f',) * 3
    dtype = list(zip(names, types))  # list() is needed on Python 3, where zip is lazy

    x2 = np.empty(2, dtype=dtype)
    x2[:] = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
    print(repr(x2))
    # array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
    #       dtype=[('name', '<f4'), ('name with spaces', '<f4'), ('2*(x-1)!', '<f4')])

    print(x2['2*(x-1)!'])
    # [3. 6.]

What is the rationale behind the use of NameValidator here? One possible reason would be to ensure that the field names would also be legal for an np.recarray. However, this doesn't make sense for several reasons:


Firstly, the names above also seem to be legal field names for a recarray:

    xr = x2.view(np.recarray)
    print(repr(xr))
    # rec.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
    #           dtype=[('name', '<f4'), ('name with spaces', '<f4'), ('2*(x-1)!', '<f4')])

Obviously, if the field names aren't valid Python identifiers then it won't be possible to access them via 'xr.fieldname' syntax, but dict-style indexing is still fine, e.g. xr['2*(x-1)!']. Also, if the goal of NameValidator were to ensure that the field names were always valid Python identifiers, then it currently fails at this anyway, since in my first example '2x1' is not a valid Python identifier.
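The identifier point can be checked with the standard library alone. The helper name is_attribute_safe below is hypothetical, introduced only for illustration, and str.isidentifier assumes Python 3:

```python
import keyword

def is_attribute_safe(name):
    """Return True if `name` could be used with dot-attribute access."""
    # A usable attribute name must be an identifier and not a keyword.
    return name.isidentifier() and not keyword.iskeyword(name)

# The names produced by genfromtxt in the first example.
mangled = ['name', 'name_with_spaces', '2x1']
checks = {n: is_attribute_safe(n) for n in mangled}
print(checks)
```

'2x1' fails the check because an identifier cannot begin with a digit.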

What is perhaps most confusing is the fact that np.genfromtxt will even mangle field names that you pass in directly via the 'names' keyword argument. Suppose you wanted to specify field names that NameValidator doesn't like. You might try something like this:

    print(repr(np.genfromtxt(BytesIO(s), delimiter=',', names=names,
                             skip_header=1)))
    # array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
    #       dtype=[('name', '<f8'), ('name_with_spaces', '<f8'), ('2x1', '<f8')])


Or even this:

    print(repr(np.genfromtxt(BytesIO(s), delimiter=',', names=names,
                             skip_header=1, deletechars=[],
                             replace_space=False, excludelist=[],
                             autostrip=False)))
    # array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
    #       dtype=[('name', '<f8'), ('name_with_spaces', '<f8'), ('2x1', '<f8')])

Still no luck! As far as I can tell, there is no option in np.genfromtxt that allows you to preserve field names that don't conform to NameValidator's seemingly arbitrary rules.
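One possible workaround, sketched here as an assumption and not as an option of np.genfromtxt itself, is to let the names be mangled on load and then restore the originals by assigning to dtype.names afterwards:

```python
import numpy as np
from io import BytesIO

s = b'name,name with spaces,2*(x-1)!\n1,2,3\n4,5,6'
names = ('name', 'name with spaces', '2*(x-1)!')

# Load with the mangled names, as genfromtxt insists on doing.
x = np.genfromtxt(BytesIO(s), delimiter=',', names=True)

# Field names can be reassigned after the fact; the new sequence must
# have the same length as the old one.
x.dtype.names = names

print(x.dtype.names)
print(x['2*(x-1)!'])
```

This only renames the fields after the array exists, so it does not address the underlying inconsistency.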

What should be done about this? Personally, I think that either things like spaces and non-alphanumeric characters should be disallowed in structured array field names altogether (and my second example should raise an exception), or np.genfromtxt should leave field names alone by default.

It would also be a good idea to raise a SyntaxWarning in a case where the user creates a recarray containing field names that are not valid Python identifiers (and are therefore incompatible with the dot indexing syntax). This is essentially what PyTables does for non-conforming HDF5 node names: https://github.com/PyTables/PyTables/blob/13047c897d28b7278cbeab732f12feadbfef3f22/tables/exceptions.py#L285-L294.
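A rough sketch of what such a check might look like is given below. The function warn_on_non_identifier_fields is hypothetical and is not part of NumPy or PyTables; it only illustrates the proposed behaviour using the standard library.

```python
import keyword
import warnings

def warn_on_non_identifier_fields(names):
    """Warn for each field name that cannot be used with dot access.

    Hypothetical helper, shown only to illustrate the proposal.
    """
    bad = [n for n in names
           if not n.isidentifier() or keyword.iskeyword(n)]
    for n in bad:
        warnings.warn(
            "field name %r is not a valid Python identifier; it will "
            "only be accessible via ['...'] indexing" % n,
            SyntaxWarning, stacklevel=2)
    return bad

# Example: record the warnings instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bad = warn_on_non_identifier_fields(
        ['name', 'name with spaces', '2*(x-1)!'])

print(bad)
print(len(caught))
```

The names are still accepted; the warning only tells the user that attribute-style access will not work for them.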


Any thoughts on this?

Alistair

