Dave, your example is not a problem with numpy per se; rather, the default generation is in the local timezone (the same as what Python's datetime does). If you localize to UTC, you get the results you expect.
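To make the local-versus-UTC effect concrete without numpy or pandas, here is a minimal standard-library sketch. It assumes a fixed +01:00 "local" offset, which is my own choice for illustration and is not stated anywhere in the thread; the helper name mean_by_date is likewise hypothetical. It groups the same 72 hourly values by calendar date, once in the assumed local zone and once after conversion to UTC.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumed fixed offset standing in for a "local" zone (not taken from the thread).
LOCAL = timezone(timedelta(hours=1))

# 72 hourly values: 24 ones, 24 twos, 24 threes, as in the quoted example.
values = [v for v in (1, 2, 3) for _ in range(24)]

# Hourly stamps from 2014-04-01 00:00 to 2014-04-03 23:00, read as local time.
start = datetime(2014, 4, 1, tzinfo=LOCAL)
stamps = [start + timedelta(hours=h) for h in range(72)]


def mean_by_date(stamps, values, tz):
    """Hypothetical helper: group values by calendar date as seen in tz."""
    groups = defaultdict(list)
    for stamp, value in zip(stamps, values):
        groups[stamp.astimezone(tz).date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(groups.items())}


local_means = mean_by_date(stamps, values, LOCAL)
utc_means = mean_by_date(stamps, values, timezone.utc)

print(local_means)  # three dates
print(utc_means)    # four dates: the first local hour falls on 2014-03-31 in UTC
```

With this assumed offset, grouping in the local zone gives three dates with means 1, 2 and 3, while grouping the same instants by UTC date gives four dates, because each day's first local hour lands on the previous UTC day. A different offset shifts a different number of hours across the date boundary.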

In [49]: dates = pd.date_range('01-Apr-2014', '04-Apr-2014', freq='H')[:-1]

In [50]: pd.TimeSeries(values, dates.tz_localize('UTC')).groupby(lambda d: d.date()).mean()
Out[50]:
2014-04-01    1
2014-04-02    2
2014-04-03    3
dtype: int64

In [51]: records = zip(map(str, dates.tz_localize('UTC')), values)

In [52]: df = pd.DataFrame(np.array(records, dtype=[('dates', 'M8[h]'),('values', float)]))

In [53]: df.set_index('dates').groupby(lambda x: x.date()).mean()
Out[53]:
            values
2014-04-01       1
2014-04-02       2
2014-04-03       3

[3 rows x 1 columns]

On Wed, Mar 19, 2014 at 5:21 AM, Dave Hirschfeld <novi...@gmail.com> wrote:

> Sankarshan Mudkavi <smudkavi <at> uwaterloo.ca> writes:
>
> >
> > Hey all,
> > It's been a while since the last datetime and timezones discussion thread
> > was visited (linked below):
> >
> > http://thread.gmane.org/gmane.comp.python.numeric.general/53805
> >
> > It looks like the best approach to follow is the UTC only approach in the
> > linked thread with an optional flag to indicate the timezone (to avoid
> > confusing applications where they don't expect any timezone info). Since
> > this is slightly more useful than having just a naive datetime64 package
> > and would be open to extension if required, it's probably the best way to
> > start improving the datetime64 library.
> >
> > <snip>
> >
> > I would like to start writing a NEP for this followed by implementation,
> > however I'm not sure what the format etc. is, could someone direct me to a
> > page where this information is provided?
> >
> > Please let me know if there are any ideas, comments etc.
> >
> > Cheers,
> > Sankarshan
> >
>
> See: http://article.gmane.org/gmane.comp.python.numeric.general/55191
>
> You could use a current NEP as a template:
> https://github.com/numpy/numpy/tree/master/doc/neps
>
> I'm a huge +100 on the simplest UTC fix.
>
> As is, using numpy datetimes is likely to silently give incorrect results -
> something I've already seen several times in end-user data analysis code.
>
> Concrete Example:
>
> In [16]: dates = pd.date_range('01-Apr-2014', '04-Apr-2014', freq='H')[:-1]
>     ...: values = np.array([1,2,3]).repeat(24)
>     ...: records = zip(map(str, dates), values)
>     ...: pd.TimeSeries(values, dates).groupby(lambda d: d.date()).mean()
>     ...:
> Out[16]:
> 2014-04-01    1
> 2014-04-02    2
> 2014-04-03    3
> dtype: int32
>
> In [17]: df = pd.DataFrame(np.array(records, dtype=[('dates', 'M8[h]'),
>                                                      ('values', float)]))
>     ...: df.set_index('dates', inplace=True)
>     ...: df.groupby(lambda d: d.date()).mean()
>     ...:
> Out[17]:
>               values
> 2014-03-31  1.000000
> 2014-04-01  1.041667
> 2014-04-02  2.041667
> 2014-04-03  3.000000
>
> [4 rows x 1 columns]
>
> Try it in your timezone and see what you get!
>
> -Dave
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion