BUG: ensure_index inconsistently coerces np.nan to pd.NaT depending if timezones are present #27011

Closed
jschendel opened this issue Jun 23, 2019 · 1 comment · Fixed by #27556
Labels
Bug Index Related to the Index class or subclasses Timezones Timezone data dtype
@jschendel
Member

Code Sample, a copy-pastable example if possible

On master:

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+783.g2b9b58dad'

In [2]: ts_list = [pd.Timestamp('2018-01-01'), np.nan] 
   ...: tstz_list = [pd.Timestamp('2018-01-01', tz='UTC'), np.nan]

In [3]: pd.core.indexes.base.ensure_index(ts_list)
Out[3]: DatetimeIndex(['2018-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)

In [4]: pd.core.indexes.base.ensure_index(tstz_list)
Out[4]: Index([2018-01-01 00:00:00+00:00, nan], dtype='object')

Problem description

Out[4] does not coerce np.nan to pd.NaT and results in an Index with object dtype instead of a DatetimeIndex.
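As a possible workaround until this is addressed, the index can be built explicitly instead of relying on inference. This is only a sketch: it assumes the public pd.DatetimeIndex constructor coerces np.nan to pd.NaT for tz-aware input, which has not been checked on every pandas version.

```python
import numpy as np
import pandas as pd

tstz_list = [pd.Timestamp('2018-01-01', tz='UTC'), np.nan]

# Construct the DatetimeIndex directly rather than going through
# ensure_index, so no dtype inference on the mixed list is required.
idx = pd.DatetimeIndex(tstz_list)

print(type(idx).__name__)   # class of the resulting index
print(idx.tz)               # timezone carried by the index
print(idx.isna().tolist())  # which positions are missing
```

This sidesteps ensure_index rather than fixing it, so it only helps callers that control how the index is constructed.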

This causes downstream issues with IntervalIndex/IntervalArray, because a valid IntervalIndex/IntervalArray may fail to round-trip from its equivalent list/np.array representation:

In [5]: left = pd.DatetimeIndex(['2018-01-01', pd.NaT], tz='UTC') 
   ...: right = pd.DatetimeIndex(['2018-01-02', pd.NaT], tz='UTC') 
   ...: ii = pd.IntervalIndex.from_arrays(left, right) 
   ...: ii
Out[5]: 
IntervalIndex([(2018-01-01, 2018-01-02], nan],
              closed='right',
              dtype='interval[datetime64[ns, UTC]]')

In [6]: pd.IntervalIndex(ii.tolist())
---------------------------------------------------------------------------
TypeError: category, object, and string subtypes are not supported for IntervalIndex

In [7]: pd.IntervalIndex(ii.to_numpy())
---------------------------------------------------------------------------
TypeError: category, object, and string subtypes are not supported for IntervalIndex

Under the hood the list/np.array is being converted to left/right components, which are then passed to ensure_index, resulting in an Index with object dtype, hence the error message.
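The left/right split described above can also be done by hand, which avoids the inference step for the tz-aware case. The following is a minimal sketch, not the internal pandas code path: rebuild_interval_index is a hypothetical helper name, and it assumes every non-missing element is a pd.Interval with Timestamp endpoints sharing one timezone.

```python
import pandas as pd

def rebuild_interval_index(intervals, closed='right'):
    """Hypothetical helper: rebuild an IntervalIndex from a list of
    pd.Interval objects and missing values (e.g. np.nan)."""
    # Replace every missing element with pd.NaT on both sides.
    lefts = [iv.left if isinstance(iv, pd.Interval) else pd.NaT
             for iv in intervals]
    rights = [iv.right if isinstance(iv, pd.Interval) else pd.NaT
              for iv in intervals]
    # Build the endpoint indexes explicitly instead of relying on inference.
    return pd.IntervalIndex.from_arrays(
        pd.DatetimeIndex(lefts), pd.DatetimeIndex(rights), closed=closed
    )

left = pd.DatetimeIndex(['2018-01-01', pd.NaT], tz='UTC')
right = pd.DatetimeIndex(['2018-01-02', pd.NaT], tz='UTC')
ii = pd.IntervalIndex.from_arrays(left, right)

rebuilt = rebuild_interval_index(ii.tolist(), closed=ii.closed)
print(rebuilt)
```

If every element is missing, no timezone can be recovered from the list, so the sketch would return a tz-naive result in that case.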

Note that the equivalent roundtrip without a tz works fine, as expected based on the inconsistency noted in the ensure_index example:

In [8]: left = pd.DatetimeIndex(['2018-01-01', pd.NaT]) 
    ...: right = pd.DatetimeIndex(['2018-01-02', pd.NaT]) 
    ...: ii = pd.IntervalIndex.from_arrays(left, right) 
    ...: ii
Out[8]: 
IntervalIndex([(2018-01-01, 2018-01-02], nan],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [9]: pd.IntervalIndex(ii.tolist())
Out[9]: 
IntervalIndex([(2018-01-01, 2018-01-02], nan],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [10]: pd.IntervalIndex(ii.to_numpy())
Out[10]: 
IntervalIndex([(2018-01-01, 2018-01-02], nan],
              closed='right',
              dtype='interval[datetime64[ns]]')

Expected Output

I'd expect Out[4] to be coerced to a DatetimeIndex with pd.NaT and the appropriate tz:

Out[4]: DatetimeIndex(['2018-01-01 00:00:00+00:00', 'NaT'], dtype='datetime64[ns, UTC]', freq=None)

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2b9b58d
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.14-041914-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0.dev0+783.g2b9b58dad
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 40.8.0
Cython : 0.29.10
pytest : 4.6.2
hypothesis : 4.23.6
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.3
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : 0.3.0
gcsfs : None
matplotlib : 3.1.0
numexpr : 2.6.9
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : 0.11.1
pytables : None
s3fs : 0.2.1
scipy : 1.2.1
sqlalchemy : 1.3.4
tables : 3.5.2
xarray : 0.12.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

@jschendel jschendel added Bug Timezones Timezone data dtype Index Related to the Index class or subclasses labels Jun 23, 2019
@jschendel jschendel added this to the Contributions Welcome milestone Jun 23, 2019
@mroeschke
Member

This hits the Cython layer, which tends not to play nicely with timezone-aware datetimes (and, I suspect, other Extension Arrays too).

def clean_index_list(obj: list):

But there may be a fix in the inference layer of Index.
