BUG: provide for automatic conversion of object -> datetime64 #2595


Closed · wants to merge 4 commits

Conversation

jreback
Contributor

@jreback jreback commented Dec 25, 2012

  • construct Series that have M8[ns] dtype even in the presence of NaN (rather than object);
    see below.
  • allow np.nan to be used in place of NaT in Series construction (requires an explicit dtype to be passed),
    e.g. Series(np.nan, index=range(5), dtype='M8[ns]')
  • bug fix in applymap (tseries/tests/test_timeseries/test_frame_apply_dont_convert_datetime64): because of the explicit
    conversion of the series to M8[ns], the function now operates on datetime64 values, and ufuncs don't work for some reason;
    an explicit conversion to Timestamp values makes this work (same issue as in GH #2591, Localize a column of Timestamps)
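The first two bullets describe behavior that stuck around; a minimal sketch, assuming a reasonably modern pandas where this inference is the default:

```python
import numpy as np
import pandas as pd

# Mixing Timestamps and NaN infers a datetime64 dtype (the NaN becomes NaT)
# rather than falling back to object.
s = pd.Series([pd.Timestamp('20010103'), np.nan])
assert s.dtype.kind == 'M'  # some flavor of datetime64

# np.nan can stand in for NaT when an explicit dtype is passed,
# exactly as in the second bullet above.
t = pd.Series(np.nan, index=range(5), dtype='M8[ns]')
assert t.isna().all()
```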
In [1]: import numpy as np

In [2]: import pandas as pd

This currently works

In [12]: df = pd.DataFrame(np.random.randn(8,2),pd.date_range('20010102',periods=8),columns=['A','B'])

In [13]: df['timestamp'] = pd.Timestamp('20010103')

In [14]: x = df.convert_objects()

In [15]: x
Out[15]: 
                   A         B           timestamp
2001-01-02 -0.228005 -1.376432 2001-01-03 00:00:00
2001-01-03 -0.555251  0.277356 2001-01-03 00:00:00
2001-01-04 -0.295289  0.534981 2001-01-03 00:00:00
2001-01-05 -0.332815 -1.436975 2001-01-03 00:00:00
2001-01-06 -1.848818  0.405773 2001-01-03 00:00:00
2001-01-07 -0.734903 -0.174140 2001-01-03 00:00:00
2001-01-08 -0.388864 -0.765329 2001-01-03 00:00:00
2001-01-09 -0.341015  0.170630 2001-01-03 00:00:00

In [16]: x.get_dtype_counts()
Out[16]: 
datetime64[ns]    1
float64           2

In [20]: from pandas.lib import iNaT   # iNaT: pandas' internal integer NaT sentinel (0.10-era location)

In [21]: x.ix[3,'timestamp'] = iNaT

In [22]: x
Out[22]: 
                   A         B           timestamp
2001-01-02 -0.228005 -1.376432 2001-01-03 00:00:00
2001-01-03 -0.555251  0.277356 2001-01-03 00:00:00
2001-01-04 -0.295289  0.534981 2001-01-03 00:00:00
2001-01-05 -0.332815 -1.436975                 NaT
2001-01-06 -1.848818  0.405773 2001-01-03 00:00:00
2001-01-07 -0.734903 -0.174140 2001-01-03 00:00:00
2001-01-08 -0.388864 -0.765329 2001-01-03 00:00:00
2001-01-09 -0.341015  0.170630 2001-01-03 00:00:00

This leaves you in limbo: you have what is basically a datetime64 column, but with a float NaN in it (so it has to stay object)

In [3]: df = pd.DataFrame(np.random.randn(8,2),pd.date_range('20010102',periods=8),columns=['A','B'])

In [4]: df
Out[4]: 
                   A         B
2001-01-02  0.939393 -2.524448
2001-01-03 -1.059561  0.104651
2001-01-04 -0.842478  1.033888
2001-01-05 -1.009903 -0.334782
2001-01-06 -0.452043 -0.382408
2001-01-07 -0.058516  1.162884
2001-01-08  0.660251  0.688290
2001-01-09  0.069637  0.366915

In [5]: df['timestamp'] = pd.Timestamp('20010103')

In [6]: df
Out[6]: 
                   A         B            timestamp
2001-01-02  0.939393 -2.524448  2001-01-03 00:00:00
2001-01-03 -1.059561  0.104651  2001-01-03 00:00:00
2001-01-04 -0.842478  1.033888  2001-01-03 00:00:00
2001-01-05 -1.009903 -0.334782  2001-01-03 00:00:00
2001-01-06 -0.452043 -0.382408  2001-01-03 00:00:00
2001-01-07 -0.058516  1.162884  2001-01-03 00:00:00
2001-01-08  0.660251  0.688290  2001-01-03 00:00:00
2001-01-09  0.069637  0.366915  2001-01-03 00:00:00

In [7]: df.get_dtype_counts()
Out[7]: 
float64    2
object     1

In [9]: df.ix[3,'timestamp'] = np.nan

In [10]: df
Out[10]: 
                   A         B            timestamp
2001-01-02  0.939393 -2.524448  2001-01-03 00:00:00
2001-01-03 -1.059561  0.104651  2001-01-03 00:00:00
2001-01-04 -0.842478  1.033888  2001-01-03 00:00:00
2001-01-05 -1.009903 -0.334782                  NaN
2001-01-06 -0.452043 -0.382408  2001-01-03 00:00:00
2001-01-07 -0.058516  1.162884  2001-01-03 00:00:00
2001-01-08  0.660251  0.688290  2001-01-03 00:00:00
2001-01-09  0.069637  0.366915  2001-01-03 00:00:00

In [11]: df.convert_objects().get_dtype_counts()
Out[11]: 
float64    2
object     1
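convert_objects has since been removed from pandas; a hedged modern equivalent of the conversion attempted above uses pd.to_datetime, which does handle the float NaN:

```python
import numpy as np
import pandas as pd

# An object column holding Timestamps plus a float NaN, as in the session above.
col = pd.Series([pd.Timestamp('20010103'), np.nan, pd.Timestamp('20010103')],
                dtype=object)
assert col.dtype == object

# Unlike the old convert_objects, to_datetime maps the NaN to NaT and the
# column gets a proper datetime64 dtype.
converted = pd.to_datetime(col)
assert converted.dtype.kind == 'M'
assert converted.isna().sum() == 1
```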

This PR enables this syntax:

In [3]: df = pd.DataFrame(np.random.randn(8,2),pd.date_range('20010102',periods=8),columns=['A','B'])

In [4]: df['timestamp'] = pd.Timestamp('20010103')

# datetime64[ns] out of the box
In [5]: df.get_dtype_counts()
Out[5]: 
datetime64[ns]    1
float64           2

# use the traditional nan, which is mapped to iNaT internally
In [6]: df.ix[2:4,'timestamp'] = np.nan

In [7]: df
Out[7]: 
                   A         B           timestamp
2001-01-02  0.234354 -3.167626 2001-01-03 00:00:00
2001-01-03  0.141300 -0.548670 2001-01-03 00:00:00
2001-01-04  0.738884 -0.104497                 NaT
2001-01-05  0.873046  1.193593                 NaT
2001-01-06  1.174226  1.697421 2001-01-03 00:00:00
2001-01-07 -0.069534 -0.145611 2001-01-03 00:00:00
2001-01-08 -0.951426 -1.158884 2001-01-03 00:00:00
2001-01-09 -0.322041 -0.800181 2001-01-03 00:00:00
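.ix is long gone, but the on-the-fly NaN-to-NaT assignment this PR introduced survives; a sketch of the same session with .loc in current pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 2),
                  index=pd.date_range('20010102', periods=8),
                  columns=['A', 'B'])
df['timestamp'] = pd.Timestamp('20010103')

# Assigning np.nan into a datetime64 column stores NaT; the dtype is preserved.
df.loc[df.index[2:4], 'timestamp'] = np.nan

assert df['timestamp'].dtype.kind == 'M'
assert df['timestamp'].isna().sum() == 2
```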

…es upon creation (in make_block)

     this obviates the need for convert_objects (mostly)
     in addition, enables setting of NaT in datetime64[ns] columns via np.nan (on-the-fly conversion)
…e (and allow np.nan) to be passed

         e.g. Series(np.nan, index=range(5), dtype='M8[ns]')
         bugfix in core/frame for applymap: handle dtype=M8[ns] series explicitly (needed to cast datetime64 to Timestamp)
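The applymap fix revolved around elements arriving as raw datetime64 values instead of Timestamps; in current pandas, element-wise access to a datetime64 column already yields Timestamp objects, which can be checked like so:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['20010103', '20010104']))

# Mapping over (or iterating) a datetime64 Series produces pandas Timestamp
# objects, so Timestamp methods are available inside the mapped function.
kinds = s.map(lambda x: type(x).__name__)
assert kinds.tolist() == ['Timestamp', 'Timestamp']
```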
@jreback
Contributor Author

jreback commented Dec 25, 2012

@wesm as an aside...a similar method to what I did here could be used to support the NaN concept directly in int columns. If one were to create a NaT-like object for ints (maybe NaI), which could be the max int for that type (there would be one for each of int8, int16, int32, int64), then this would be straightforward. The dtypes of the series (and underlying numpy array) would then be correct (e.g. int64 or whatever), the NaI would fit in it, and you would just intercept calls to np.nan to translate both ways.

Not a big believer in explicit type handling, but I think you have to be, to get pure numpy arrays (rather than object).

just a thought - not sure how much folks are clamoring for NaN in pure int columns
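A sketch of that sentinel idea in plain numpy; the names NaI, encode, and decode are hypothetical, invented here for illustration. (What pandas eventually shipped instead is the mask-based nullable Int64 extension dtype.)

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel: the max value of int64 plays the role of "NaI".
# Trade-off: that one legitimate integer value can no longer be stored.
NaI = np.iinfo(np.int64).max

def encode(values):
    """Pack floats-with-NaN into a pure int64 array, NaN -> NaI."""
    arr = np.asarray(values, dtype=np.float64)
    mask = np.isnan(arr)
    out = np.empty(arr.shape, dtype=np.int64)
    out[~mask] = arr[~mask].astype(np.int64)
    out[mask] = NaI
    return out

def decode(arr):
    """Unpack back to float64, NaI -> NaN."""
    out = arr.astype(np.float64)
    out[arr == NaI] = np.nan
    return out

ints = encode([1.0, np.nan, 3.0])   # dtype stays pure int64, no object
roundtrip = decode(ints)            # the NaN comes back out

# The mask-based approach pandas adopted years later (pandas >= 0.24):
nullable = pd.array([1, None, 3], dtype='Int64')
```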

@wesm
Member

wesm commented Dec 28, 2012

Merged, thanks. More generally, all of the data management / NA handling needs to get revamped at some point to prevent a mountain of little hacks from accumulating. NA values in int arrays are important to a lot of people, but the problem is the reliance on numpy internally. pandas needs to take ownership of its own data. A big project.

@wesm wesm closed this Dec 28, 2012
@jreback
Contributor Author

jreback commented Dec 28, 2012

yep....this was a 'specific' fix.....:)
