
Minimum output differs between timezone-aware and timezone-naive datetimes #28768

Closed
benmatwil opened this issue Oct 3, 2019 · 1 comment
Labels
Duplicate Report Duplicate issue or pull request

Comments

benmatwil commented Oct 3, 2019

Code Sample

import pandas as pd
import datetime

# set up a DataFrame with some dates in it
df_dates = pd.DataFrame({
    'a': [datetime.datetime(2010, 1, i) for i in range(2, 7)],
    'b': [datetime.datetime(2010, 1, i) for i in range(1, 6)],
})

# set the first date in the first column to NaT
# (.loc avoids chained assignment; pd.NaT replaces the deprecated pd.np.nan)
df_dates.loc[0, 'a'] = pd.NaT

In [13]: df_dates.dtypes
Out[13]:
a    datetime64[ns]
b    datetime64[ns]
dtype: object

# pandas is happy to take the minimum as I'd like, even dealing with NaT
In [14]: df_dates.min(axis=1)
Out[14]:
0   2010-01-01
1   2010-01-02
2   2010-01-03
3   2010-01-04
4   2010-01-05
dtype: datetime64[ns]

In [15]: df_dates.min(axis=0)
Out[15]:
a   2010-01-03
b   2010-01-01
dtype: datetime64[ns]

# now set the datetimes to UTC
for col in df_dates.columns:
    df_dates[col] = df_dates[col].dt.tz_localize('utc')

# now pandas doesn't seem to be able to deal with NaT
# No minimum for first axis
In [19]: df_dates.min(axis=0)
Out[19]: Series([], dtype: float64)

# NaNs everywhere for second axis
In [20]: df_dates.min(axis=1)
Out[20]:
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
dtype: float64

# skipna kwarg doesn't seem to do anything
In [21]: df_dates.min(axis=1, skipna=True)
Out[21]:
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
dtype: float64

Problem description

I'd like to be able to take the minimum of DataFrames that hold timezone-aware datetimes. I found this issue when a NaT was hidden in the data. It works fine when there are no NaTs, but the behaviour changes as shown above once the datetimes are timezone-aware. Shouldn't .min() give the same output for timezone-aware and timezone-naive data? This looks like a bug to me. I haven't tried .max(), .mean(), etc.
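A possible workaround, sketched here rather than taken from pandas itself: since the naive case above reduces correctly, the columns can be converted to a common zone, stripped of their tz info, reduced, and then re-localized. The helper name `row_min_tz_aware` and its `tz` parameter are hypothetical, and the sketch assumes every column is timezone-aware and that converting all of them to one zone is acceptable.

```python
import datetime

import pandas as pd

# Rebuild the frame from the code sample, with one NaT in column 'a'.
df_dates = pd.DataFrame({
    'a': [datetime.datetime(2010, 1, i) for i in range(2, 7)],
    'b': [datetime.datetime(2010, 1, i) for i in range(1, 6)],
})
df_dates.loc[0, 'a'] = pd.NaT
for col in df_dates.columns:
    df_dates[col] = df_dates[col].dt.tz_localize('UTC')


def row_min_tz_aware(df, tz='UTC'):
    """Hypothetical helper: row-wise minimum of tz-aware datetime columns."""
    # Convert each column to the common zone, then drop the tz info so the
    # reduction runs on timezone-naive datetime values.
    naive = df.apply(lambda s: s.dt.tz_convert(tz).dt.tz_localize(None))
    # Reduce on the naive values (NaT is skipped by default), then
    # reattach the zone to the result.
    return naive.min(axis=1).dt.tz_localize(tz)


row_min = row_min_tz_aware(df_dates)
```

The result is a timezone-aware Series again, so it can be compared directly with the original columns. Whether the detour is still needed depends on the pandas version in use.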

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 7
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.5
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : 0.29.13
pytest : 5.2.0
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.14.0
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.8
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1

@mroeschke
Member

Duplicate of #27794. Investigations welcome!

mroeschke added the Duplicate Report label Oct 3, 2019