Skip to content

Numpy dtype weirdness in TimedeltaIndex._simple_new #9462

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ischwabacher opened this issue Feb 10, 2015 · 6 comments · Fixed by #23433
Closed

Numpy dtype weirdness in TimedeltaIndex._simple_new #9462

ischwabacher opened this issue Feb 10, 2015 · 6 comments · Fixed by #23433
Labels
Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation Timedelta Timedelta data type
Milestone

Comments

@ischwabacher
Copy link
Contributor

I was poking around in tdi.py with pdb and I noticed this code at line 255 of current master:

    @classmethod
    def _simple_new(cls, values, name=None, freq=None, **kwargs):
-->     if not getattr(values,'dtype',None):   # This doesn't do what it looks like it does
            values = np.array(values,copy=False)
        if values.dtype == np.object_:
            values = tslib.array_to_timedelta64(values)
        if values.dtype != _TD_DTYPE:
            values = com._ensure_int64(values).view(_TD_DTYPE)

        result = object.__new__(cls)
        result._data = values
        result.name = name
        result.freq = freq
        result._reset_identity()
        return result

This looks to me like it's testing whether values is already an ndarray, and if it isn't, it's taking a view into its values. But that's not what it's doing at all. If values is a Series, it has a dtype, so the view-making code should be skipped! Except that it isn't:

In [1]: import numpy as np

In [2]: map(np.dtype, [None, 'i8', 'i4,i4', ('m8[ns]', (2, 2)), ('S', 4),
   ...:                [('name', 'u1', (2, 2))],
   ...:                {'names': ['name'], 'formats': ['f8']},
   ...:                ('i4,i4', (2, 2))])
Out[2]: 
[dtype('float64'),                                # note that dtype(None) is float!
 dtype('int64'),                                  # fixed dtype
 dtype([('f0', '<i4'), ('f1', '<i4')]),           # record dtype
 dtype(('<m8[ns]', (2, 2))),                      # array dtype
 dtype('S4'),                                     # flexible dtype
 dtype([('name', 'u1', (2, 2))]),                 # record dtype with array fields
 dtype([('name', '<f8')]),                        # another record dtype
 dtype(([('f0', '<i4'), ('f1', '<i4')], (2, 2)))] # array dtype with record fields

In [3]: map(bool, _)
Out[3]: [False, False, True, False, False, True, True, False]   # madness

My tentative guess is that if the outermost layer of the dtype is a record type, then it is Truey, and otherwise it's Falsey. This behavior seems too pathological to depend on.

Unfortunately, that's exactly what's happening when constructing a TimedeltaIndex from a Series of Timedeltas:

In [1]: import pandas as pd

In [2]: import pdb

In [3]: pd.Series([pd.Timedelta(1, 'h')])
Out[3]: 
0   01:00:00
dtype: timedelta64[ns]

In [4]: pdb.run('pd.Index(_)')
> <string>(1)<module>()
(Pdb) b pd.TimedeltaIndex._simple_new
Breakpoint 1 at /Users/afni/ActivityMonitorProcessing/sojourns/source/lib/python2.7/site-packages/pandas/tseries/tdi.py:253
(Pdb) c
> /Users/afni/ActivityMonitorProcessing/sojourns/source/lib/python2.7/site-packages/pandas/tseries/tdi.py(255)_simple_new()
-> if not getattr(values,'dtype',None):
(Pdb) p values
0   01:00:00
dtype: timedelta64[ns]   # it's still a Series
(Pdb) n
> /Users/afni/ActivityMonitorProcessing/sojourns/source/lib/python2.7/site-packages/pandas/tseries/tdi.py(256)_simple_new()
-> values = np.array(values,copy=False)
(Pdb) n
> /Users/afni/ActivityMonitorProcessing/sojourns/source/lib/python2.7/site-packages/pandas/tseries/tdi.py(257)_simple_new()
-> if values.dtype == np.object_:
(Pdb) p values
array([3600000000000], dtype='timedelta64[ns]')   # now it's an ndarray

Everything seems to work, but what's actually going on is so different from what I would expect to be going on that it seems like a time bomb.

@jreback
Copy link
Contributor

jreback commented Feb 10, 2015

this is just to make sure it's' m8[ns] dtyped as the data has to be s view (and not actual objects)
this is only if an index is constructed and is not hit with a series

so not sure what your issue is about

@jreback
Copy link
Contributor

jreback commented Feb 10, 2015

the np.array(,...) coerces the series to its base numpy repr

@ischwabacher
Copy link
Contributor Author

So not getattr(values,'dtype',None) is expected to always be True?

@jreback
Copy link
Contributor

jreback commented Feb 10, 2015

this is an internal routine
record types are meaningless here
this is only a simple dtype or scalar iirc

@ischwabacher
Copy link
Contributor Author

record types are meaningless here

...which means that bool(dtype) is always False, because numpy is nuts.

this is only a simple dtype or scalar iirc

The pdb session in the OP shows that calling Index on a Series of Timedeltas ends up calling this internal routine with the input Series as the values.

It looks like copy=False is a problem too:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: idx = pd.Index([pd.Timedelta(1, 'h')])

In [4]: idx[0] *= 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-7a30aa64d2ec> in <module>()
----> 1 idx[0] *= 2

/Users/afni/ActivityMonitorProcessing/sojourns/source/lib/python2.7/site-packages/pandas/core/index.pyc in __setitem__(self, key, value)
    894 
    895     def __setitem__(self, key, value):
--> 896         raise TypeError("Indexes does not support mutable operations")
    897 
    898     def __getitem__(self, key):

TypeError: Indexes does not support mutable operations

In [5]: val = idx.values

In [6]: idx2 = pd.Index(val)

In [7]: val
Out[7]: array([3600000000000], dtype='timedelta64[ns]')

In [8]: idx
Out[8]: 
<class 'pandas.tseries.tdi.TimedeltaIndex'>
['01:00:00']
Length: 1, Freq: None

In [9]: idx2
Out[9]: 
<class 'pandas.tseries.tdi.TimedeltaIndex'>
['01:00:00']
Length: 1, Freq: None

In [10]: val[:] *= 2

In [11]: idx
Out[11]: 
<class 'pandas.tseries.tdi.TimedeltaIndex'>
['02:00:00']   # This was supposed to be immutable!
Length: 1, Freq: None

In [12]: idx2
Out[12]: 
<class 'pandas.tseries.tdi.TimedeltaIndex'>
['02:00:00']
Length: 1, Freq: None

@jreback
Copy link
Contributor

jreback commented Feb 10, 2015

The Index class is not meant to prevent what you are doing here. If the user grabs an internal accessor (.values) and then does stuff with it nothing can be done. In fact this property is used internally. It doesn't support mutable ops thru its accessors.

I am happy that you are looking at this class. Feel free to propose some fixes. I did rip off some code from tseries/index.py (so some of the same idioms are prob there).

There are certainly some edge cases that are not tested / caught.

@jreback jreback added Timedelta Timedelta data type Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation labels Feb 10, 2015
@jreback jreback added this to the 0.24.0 milestone Oct 31, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants