Numpy dtype weirdness in TimedeltaIndex._simple_new #9462

ischwabacher · 2015-02-10T22:46:50Z

I was poking around in tdi.py with pdb and I noticed this code at line 255 of current master:

    @classmethod
    def _simple_new(cls, values, name=None, freq=None, **kwargs):
-->     if not getattr(values,'dtype',None):   # This doesn't do what it looks like it does
            values = np.array(values,copy=False)
        if values.dtype == np.object_:
            values = tslib.array_to_timedelta64(values)
        if values.dtype != _TD_DTYPE:
            values = com._ensure_int64(values).view(_TD_DTYPE)

        result = object.__new__(cls)
        result._data = values
        result.name = name
        result.freq = freq
        result._reset_identity()
        return result

This looks to me like it's testing whether values is already an ndarray, and if it isn't, it's taking a view into its values. But that's not what it's doing at all. If values is a Series, it has a dtype, so the view-making code should be skipped! Except that it isn't:

In [1]: import numpy as np

In [2]: map(np.dtype, [None, 'i8', 'i4,i4', ('m8[ns]', (2, 2)), ('S', 4),
   ...:                [('name', 'u1', (2, 2))],
   ...:                {'names': ['name'], 'formats': ['f8']},
   ...:                ('i4,i4', (2, 2))])
Out[2]: 
[dtype('float64'),                                # note that dtype(None) is float!
 dtype('int64'),                                  # fixed dtype
 dtype([('f0', '<i4'), ('f1', '<i4')]),           # record dtype
 dtype(('<m8[ns]', (2, 2))),                      # array dtype
 dtype('S4'),                                     # flexible dtype
 dtype([('name', 'u1', (2, 2))]),                 # record dtype with array fields
 dtype([('name', '<f8')]),                        # another record dtype
 dtype(([('f0', '<i4'), ('f1', '<i4')], (2, 2)))] # array dtype with record fields

In [3]: map(bool, _)
Out[3]: [False, False, True, False, False, True, True, False]   # madness

My tentative guess is that if the outermost layer of the dtype is a record type, then it is Truey, and otherwise it's Falsey. This behavior seems too pathological to depend on.

Unfortunately, that's exactly what's happening when constructing a TimedeltaIndex from a Series of Timedeltas:

In [1]: import pandas as pd

In [2]: import pdb

In [3]: pd.Series([pd.Timedelta(1, 'h')])
Out[3]: 
0   01:00:00
dtype: timedelta64[ns]

In [4]: pdb.run('pd.Index(_)')
> <string>(1)<module>()
(Pdb) b pd.TimedeltaIndex._simple_new
Breakpoint 1 at /Users/afni/ActivityMonitorProcessing/sojourns/source/lib/python2.7/site-packages/pandas/tseries/tdi.py:253
(Pdb) c
> /Users/afni/ActivityMonitorProcessing/sojourns/source/lib/python2.7/site-packages/pandas/tseries/tdi.py(255)_simple_new()
-> if not getattr(values,'dtype',None):
(Pdb) p values
0   01:00:00
dtype: timedelta64[ns]   # it's still a Series
(Pdb) n
> /Users/afni/ActivityMonitorProcessing/sojourns/source/lib/python2.7/site-packages/pandas/tseries/tdi.py(256)_simple_new()
-> values = np.array(values,copy=False)
(Pdb) n
> /Users/afni/ActivityMonitorProcessing/sojourns/source/lib/python2.7/site-packages/pandas/tseries/tdi.py(257)_simple_new()
-> if values.dtype == np.object_:
(Pdb) p values
array([3600000000000], dtype='timedelta64[ns]')   # now it's an ndarray

Everything seems to work, but what's actually going on is so different from what I would expect to be going on that it seems like a time bomb.

The text was updated successfully, but these errors were encountered:

jreback · 2015-02-10T22:51:06Z

this is just to make sure it's' m8[ns] dtyped as the data has to be s view (and not actual objects)
this is only if an index is constructed and is not hit with a series

so not sure what your issue is about

jreback · 2015-02-10T22:52:42Z

the np.array(,...) coerces the series to its base numpy repr

ischwabacher · 2015-02-10T23:05:36Z

So not getattr(values,'dtype',None) is expected to always be True?

jreback · 2015-02-10T23:25:46Z

this is an internal routine
record types are meaningless here
this is only a simple dtype or scalar iirc

ischwabacher · 2015-02-10T23:46:27Z

record types are meaningless here

...which means that bool(dtype) is always False, because numpy is nuts.

this is only a simple dtype or scalar iirc

The pdb session in the OP shows that calling Index on a Series of Timedeltas ends up calling this internal routine with the input Series as the values.

It looks like copy=False is a problem too:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: idx = pd.Index([pd.Timedelta(1, 'h')])

In [4]: idx[0] *= 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-7a30aa64d2ec> in <module>()
----> 1 idx[0] *= 2

/Users/afni/ActivityMonitorProcessing/sojourns/source/lib/python2.7/site-packages/pandas/core/index.pyc in __setitem__(self, key, value)
    894 
    895     def __setitem__(self, key, value):
--> 896         raise TypeError("Indexes does not support mutable operations")
    897 
    898     def __getitem__(self, key):

TypeError: Indexes does not support mutable operations

In [5]: val = idx.values

In [6]: idx2 = pd.Index(val)

In [7]: val
Out[7]: array([3600000000000], dtype='timedelta64[ns]')

In [8]: idx
Out[8]: 
<class 'pandas.tseries.tdi.TimedeltaIndex'>
['01:00:00']
Length: 1, Freq: None

In [9]: idx2
Out[9]: 
<class 'pandas.tseries.tdi.TimedeltaIndex'>
['01:00:00']
Length: 1, Freq: None

In [10]: val[:] *= 2

In [11]: idx
Out[11]: 
<class 'pandas.tseries.tdi.TimedeltaIndex'>
['02:00:00']   # This was supposed to be immutable!
Length: 1, Freq: None

In [12]: idx2
Out[12]: 
<class 'pandas.tseries.tdi.TimedeltaIndex'>
['02:00:00']
Length: 1, Freq: None

jreback · 2015-02-10T23:50:29Z

The Index class is not meant to prevent what you are doing here. If the user grabs an internal accessor (.values) and then does stuff with it nothing can be done. In fact this property is used internally. It doesn't support mutable ops thru its accessors.

I am happy that you are looking at this class. Feel free to propose some fixes. I did rip off some code from tseries/index.py (so some of the same idioms are prob there).

There are certainly some edge cases that are not tested / caught.

jreback added Timedelta Timedelta data type Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation labels Feb 10, 2015

jreback mentioned this issue Jul 24, 2015

Drop & insert on subtypes of index return their subtypes #10620

Closed

jbrockmendel mentioned this issue Oct 31, 2018

REF: strictness and checks for Timedelta _simple_new #23433

Merged

4 tasks

jreback added this to the 0.24.0 milestone Oct 31, 2018

jreback closed this as completed in #23433 Nov 3, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Numpy dtype weirdness in TimedeltaIndex._simple_new #9462

Numpy dtype weirdness in TimedeltaIndex._simple_new #9462

ischwabacher commented Feb 10, 2015

jreback commented Feb 10, 2015

jreback commented Feb 10, 2015

ischwabacher commented Feb 10, 2015

jreback commented Feb 10, 2015

ischwabacher commented Feb 10, 2015

jreback commented Feb 10, 2015

Numpy dtype weirdness in TimedeltaIndex._simple_new #9462

Numpy dtype weirdness in TimedeltaIndex._simple_new #9462

Comments

ischwabacher commented Feb 10, 2015

jreback commented Feb 10, 2015

jreback commented Feb 10, 2015

ischwabacher commented Feb 10, 2015

jreback commented Feb 10, 2015

ischwabacher commented Feb 10, 2015

jreback commented Feb 10, 2015