BUG: provide for automatic conversion of object -> datetime64 #2595


Closed · wants to merge 4 commits

Conversation

jreback
Contributor

@jreback jreback commented Dec 25, 2012

  • construct Series that have M8[ns] dtype even in the presence of NaN (rather than object);
    see below.
  • allow np.nan to be used in place of NaT in Series construction (requires an explicit dtype to be passed),
    e.g. Series(np.nan, index=range(5), dtype='M8[ns]')
  • bug fix in applymap (tseries/tests/test_timeseries/test_frame_apply_dont_convert_datetime64): because of the explicit
    conversion of the series to M8[ns], the function now operates on datetime64 values, and ufuncs don't work for some reason;
    an explicit conversion to Timestamp values makes this work (same issue as in GH #2591, Localize a column of Timestamps)
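The first two bullets describe behavior that stuck around; a minimal sketch, assuming a reasonably modern pandas where this inference is the default:

```python
import numpy as np
import pandas as pd

# Mixing Timestamps and NaN infers a datetime64 dtype (the NaN becomes NaT)
# rather than falling back to object.
s = pd.Series([pd.Timestamp('20010103'), np.nan])
assert s.dtype.kind == 'M'  # some flavor of datetime64

# np.nan can stand in for NaT when an explicit dtype is passed,
# exactly as in the second bullet above.
t = pd.Series(np.nan, index=range(5), dtype='M8[ns]')
assert t.isna().all()
```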
In [1]: import numpy as np

In [2]: import pandas as pd

This currently works

In [12]: df = pd.DataFrame(np.random.randn(8,2),pd.date_range('20010102',periods=8),columns=['A','B'])

In [13]: df['timestamp'] = pd.Timestamp('20010103')

In [14]: x = df.convert_objects()

In [15]: x
Out[15]: 
                   A         B           timestamp
2001-01-02 -0.228005 -1.376432 2001-01-03 00:00:00
2001-01-03 -0.555251  0.277356 2001-01-03 00:00:00
2001-01-04 -0.295289  0.534981 2001-01-03 00:00:00
2001-01-05 -0.332815 -1.436975 2001-01-03 00:00:00
2001-01-06 -1.848818  0.405773 2001-01-03 00:00:00
2001-01-07 -0.734903 -0.174140 2001-01-03 00:00:00
2001-01-08 -0.388864 -0.765329 2001-01-03 00:00:00
2001-01-09 -0.341015  0.170630 2001-01-03 00:00:00

In [16]: x.get_dtype_counts()
Out[16]: 
datetime64[ns]    1
float64           2

In [20]: from pandas.lib import iNaT   # iNaT: pandas' internal integer NaT sentinel (0.10-era location)

In [21]: x.ix[3,'timestamp'] = iNaT

In [22]: x
Out[22]: 
                   A         B           timestamp
2001-01-02 -0.228005 -1.376432 2001-01-03 00:00:00
2001-01-03 -0.555251  0.277356 2001-01-03 00:00:00
2001-01-04 -0.295289  0.534981 2001-01-03 00:00:00
2001-01-05 -0.332815 -1.436975                 NaT
2001-01-06 -1.848818  0.405773 2001-01-03 00:00:00
2001-01-07 -0.734903 -0.174140 2001-01-03 00:00:00
2001-01-08 -0.388864 -0.765329 2001-01-03 00:00:00
2001-01-09 -0.341015  0.170630 2001-01-03 00:00:00

This leaves you in limbo: you have what is basically a datetime64 column, but with a float NaN in it (so it has to stay object)

In [3]: df = pd.DataFrame(np.random.randn(8,2),pd.date_range('20010102',periods=8),columns=['A','B'])

In [4]: df
Out[4]: 
                   A         B
2001-01-02  0.939393 -2.524448
2001-01-03 -1.059561  0.104651
2001-01-04 -0.842478  1.033888
2001-01-05 -1.009903 -0.334782
2001-01-06 -0.452043 -0.382408
2001-01-07 -0.058516  1.162884
2001-01-08  0.660251  0.688290
2001-01-09  0.069637  0.366915

In [5]: df['timestamp'] = pd.Timestamp('20010103')

In [6]: df
Out[6]: 
                   A         B            timestamp
2001-01-02  0.939393 -2.524448  2001-01-03 00:00:00
2001-01-03 -1.059561  0.104651  2001-01-03 00:00:00
2001-01-04 -0.842478  1.033888  2001-01-03 00:00:00
2001-01-05 -1.009903 -0.334782  2001-01-03 00:00:00
2001-01-06 -0.452043 -0.382408  2001-01-03 00:00:00
2001-01-07 -0.058516  1.162884  2001-01-03 00:00:00
2001-01-08  0.660251  0.688290  2001-01-03 00:00:00
2001-01-09  0.069637  0.366915  2001-01-03 00:00:00

In [7]: df.get_dtype_counts()
Out[7]: 
float64    2
object     1

In [9]: df.ix[3,'timestamp'] = np.nan

In [10]: df
Out[10]: 
                   A         B            timestamp
2001-01-02  0.939393 -2.524448  2001-01-03 00:00:00
2001-01-03 -1.059561  0.104651  2001-01-03 00:00:00
2001-01-04 -0.842478  1.033888  2001-01-03 00:00:00
2001-01-05 -1.009903 -0.334782                  NaN
2001-01-06 -0.452043 -0.382408  2001-01-03 00:00:00
2001-01-07 -0.058516  1.162884  2001-01-03 00:00:00
2001-01-08  0.660251  0.688290  2001-01-03 00:00:00
2001-01-09  0.069637  0.366915  2001-01-03 00:00:00

In [11]: df.convert_objects().get_dtype_counts()
Out[11]: 
float64    2
object     1
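convert_objects has since been removed from pandas; a hedged modern equivalent of the conversion attempted above uses pd.to_datetime, which does handle the float NaN:

```python
import numpy as np
import pandas as pd

# An object column holding Timestamps plus a float NaN, as in the session above.
col = pd.Series([pd.Timestamp('20010103'), np.nan, pd.Timestamp('20010103')],
                dtype=object)
assert col.dtype == object

# Unlike the old convert_objects, to_datetime maps the NaN to NaT and the
# column gets a proper datetime64 dtype.
converted = pd.to_datetime(col)
assert converted.dtype.kind == 'M'
assert converted.isna().sum() == 1
```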

This PR enables this syntax:

In [3]: df = pd.DataFrame(np.random.randn(8,2),pd.date_range('20010102',periods=8),columns=['A','B'])

In [4]: df['timestamp'] = pd.Timestamp('20010103')

# datetime64[ns] out of the box
In [5]: df.get_dtype_counts()
Out[5]: 
datetime64[ns]    1
float64           2

# use the traditional nan, which is mapped to iNaT internally
In [6]: df.ix[2:4,'timestamp'] = np.nan

In [7]: df
Out[7]: 
                   A         B           timestamp
2001-01-02  0.234354 -3.167626 2001-01-03 00:00:00
2001-01-03  0.141300 -0.548670 2001-01-03 00:00:00
2001-01-04  0.738884 -0.104497                 NaT
2001-01-05  0.873046  1.193593                 NaT
2001-01-06  1.174226  1.697421 2001-01-03 00:00:00
2001-01-07 -0.069534 -0.145611 2001-01-03 00:00:00
2001-01-08 -0.951426 -1.158884 2001-01-03 00:00:00
2001-01-09 -0.322041 -0.800181 2001-01-03 00:00:00
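.ix is long gone, but the on-the-fly NaN-to-NaT assignment this PR introduced survives; a sketch of the same session with .loc in current pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 2),
                  index=pd.date_range('20010102', periods=8),
                  columns=['A', 'B'])
df['timestamp'] = pd.Timestamp('20010103')

# Assigning np.nan into a datetime64 column stores NaT; the dtype is preserved.
df.loc[df.index[2:4], 'timestamp'] = np.nan

assert df['timestamp'].dtype.kind == 'M'
assert df['timestamp'].isna().sum() == 2
```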

…es upon creation (in make_block)

     this obviates the need for convert_objects (mostly)
     in addition, enables setting of NaT in datetime64[ns] columns via np.nan (on-the-fly conversion)
…e (and allow np.nan) to be passed

         e.g. Series(np.nan, index=range(5), dtype='M8[ns]')
         bugfix in core/frame for applymap: handle dtype=M8[ns] series explicitly (needed to cast datetime64 to Timestamp)
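The applymap fix revolved around elements arriving as raw datetime64 values instead of Timestamps; in current pandas, element-wise access to a datetime64 column already yields Timestamp objects, which can be checked like so:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['20010103', '20010104']))

# Mapping over (or iterating) a datetime64 Series produces pandas Timestamp
# objects, so Timestamp methods are available inside the mapped function.
kinds = s.map(lambda x: type(x).__name__)
assert kinds.tolist() == ['Timestamp', 'Timestamp']
```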
@jreback
Contributor Author

jreback commented Dec 25, 2012

@wesm as an aside...a similar method to what I did here could be used to support the NaN concept directly in int columns. If one were to create a NaT-like object for ints (maybe NaI), which could be the max int for that type (there would be one for each of int8, int16, int32, int64), then this would be straightforward. The dtypes of the series (and underlying numpy array) would then be correct (e.g. int64 or whatever), the NaI would fit in it, and you would just intercept calls to np.nan to translate both ways.

Not a big believer in explicit type handling, but I think you have to be, to get pure numpy arrays (rather than object).

just a thought - not sure how much folks are clamoring for NaN in pure int columns
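A sketch of that sentinel idea in plain numpy; the names NaI, encode, and decode are hypothetical, invented here for illustration. (What pandas eventually shipped instead is the mask-based nullable Int64 extension dtype.)

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel: the max value of int64 plays the role of "NaI".
# Trade-off: that one legitimate integer value can no longer be stored.
NaI = np.iinfo(np.int64).max

def encode(values):
    """Pack floats-with-NaN into a pure int64 array, NaN -> NaI."""
    arr = np.asarray(values, dtype=np.float64)
    mask = np.isnan(arr)
    out = np.empty(arr.shape, dtype=np.int64)
    out[~mask] = arr[~mask].astype(np.int64)
    out[mask] = NaI
    return out

def decode(arr):
    """Unpack back to float64, NaI -> NaN."""
    out = arr.astype(np.float64)
    out[arr == NaI] = np.nan
    return out

ints = encode([1.0, np.nan, 3.0])   # dtype stays pure int64, no object
roundtrip = decode(ints)            # the NaN comes back out

# The mask-based approach pandas adopted years later (pandas >= 0.24):
nullable = pd.array([1, None, 3], dtype='Int64')
```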

@wesm
Member

wesm commented Dec 28, 2012

Merged, thanks. More generally, all of the data management / NA handling needs to get revamped at some point to prevent a mountain of little hacks from accumulating. NA values in int arrays are important to a lot of people, but the problem is the reliance on numpy internally. pandas needs to take ownership of its own data. A big project.

@wesm wesm closed this Dec 28, 2012
@jreback
Contributor Author

jreback commented Dec 28, 2012

yep....this was a 'specific' fix.....:)
