Min/max does not work for dates with timezones if there are missing values in the data frame #27794

Closed
ghost opened this issue Aug 7, 2019 · 10 comments
Labels
Bug; Reduction Operations (sum, mean, min, max, etc.); Timezones (Timezone data dtype)

Comments

@ghost
ghost commented Aug 7, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.concat([
  pd.date_range(start='1/1/2018', end='1/30/2018').to_series().dt.tz_localize("UTC"),
  pd.date_range(start='1/1/2018', end='1/20/2018').to_series().dt.tz_localize("UTC")
], axis=1)

df.min(axis=1)

Problem description

Without tz_localize, everything works as expected. Also, if both columns have the same length and time zones, it works just fine. However, the combination of time zones and different lengths results in:

2018-01-01   NaN
2018-01-02   NaN
2018-01-03   NaN
2018-01-04   NaN
2018-01-05   NaN
2018-01-06   NaN
2018-01-07   NaN
2018-01-08   NaN
2018-01-09   NaN
2018-01-10   NaN
2018-01-11   NaN
...

of type float64.
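
The missing values mentioned in the title come from index alignment: pd.concat joins the two series on the union of their indexes, so the shorter one is padded with NaT. The sketch below repeats the construction from the report and only inspects the frame instead of reducing it. The names long_s and short_s are illustrative; the column labels 0 and 1 are what concat assigns to unnamed series.

```python
import pandas as pd

# Same construction as in the report: two tz-aware series of different lengths.
long_s = pd.date_range(start="1/1/2018", end="1/30/2018").to_series().dt.tz_localize("UTC")
short_s = pd.date_range(start="1/1/2018", end="1/20/2018").to_series().dt.tz_localize("UTC")

# concat aligns on the union of both indexes, so the shorter column is padded with NaT.
df = pd.concat([long_s, short_s], axis=1)

print(df.shape)            # rows x columns of the aligned frame
print(df.dtypes)           # both columns stay tz-aware
print(df[1].isna().sum())  # number of NaT values padded into the shorter column
```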

Expected Output

List of dates

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-17134-Microsoft
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.17.0
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.10
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : None
pymysql : None
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.3
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.1.8

I have found some related fixed issues, but not exactly this one.

mroeschke added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and Timezones (Timezone data dtype) labels on Aug 7, 2019
@MYUNGJE
MYUNGJE commented Aug 15, 2019

May I take a look at it?

@TomAugspurger
Contributor

That'd be great!

@MaximeWeyl

Here is a workaround I'm using, but it is slow:

def __column_max_without_nans(df):
    """Takes the row-wise max of the columns, avoiding a bug in pandas

    Link to github issue : https://github.com/pandas-dev/pandas/issues/27794
    """
    return df[df.notnull().all(axis=1)].max(axis=1).reindex(df.index)

@randomgambit

Great! My workaround is to get rid of the timezones altogether with dt.tz_localize(None). This might not be ideal in certain cases, though.
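
A minimal sketch of the timezone-stripping idea, assuming every column can be expressed in UTC. The frame and the names naive and row_min are illustrative, and the tz_convert("UTC") step is an addition not mentioned in the comment above: it keeps the round trip unambiguous for zones with daylight saving time.

```python
import pandas as pd

# Illustrative frame: tz-aware columns with one missing value.
df = pd.DataFrame({
    "s1": pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"),
                     pd.Timestamp("2017-01-01T11", tz="UTC")]),
    "s2": pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT]),
})

# Normalise each column to UTC, then drop the timezone so the columns are naive.
naive = df.apply(lambda col: col.dt.tz_convert("UTC").dt.tz_localize(None))

# Reduce on naive timestamps (NaT is skipped), then attach UTC again.
row_min = naive.min(axis=1).dt.tz_localize("UTC")
print(row_min)
```

The result is always expressed in UTC, so a frame that started in another zone would need a final tz_convert back to it.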

@jkukul
jkukul commented Mar 9, 2020

@MaximeWeyl, your workaround only solves the issue partially, i.e. when all timestamps (with timezone) are present in an axis.

Consider the following example:

s1 = pd.Series([pd.Timestamp('2017-01-01T10', tz='UTC'), pd.Timestamp('2017-01-01T11', tz='UTC')])
s2 = pd.Series([pd.Timestamp('2017-01-01T12', tz='UTC'), pd.NaT])

df = pd.DataFrame(dict(s1=s1, s2=s2))

Expected output of df.min(axis=1) (without the bug) is:

0   2017-01-01 10:00:00+00:00
1   2017-01-01 11:00:00+00:00
dtype: datetime64[ns, UTC]

Applying your workaround:

df[df.notnull().all(axis=1)].max(axis=1).reindex(df.index)

gives:

0   2017-01-01 12:00:00+00:00
1                         NaT
dtype: datetime64[ns, UTC]

As you can see, your workaround helps for the 0th row, but returns an incorrect value for the 1st row which has an NaT value in one of the columns.

@jkukul
jkukul commented Mar 9, 2020

Here's my workaround - instead of using:

df.min(axis=1)

I'm using:

df.apply(lambda s: s[s.notnull()].min(), axis=1)
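
For reference, a self-contained run of this workaround on the example frame from the previous comment. Only the name row_min is new here; everything else is taken from the thread.

```python
import pandas as pd

s1 = pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"), pd.Timestamp("2017-01-01T11", tz="UTC")])
s2 = pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT])
df = pd.DataFrame(dict(s1=s1, s2=s2))

# Row by row: drop the missing entries, then take the minimum of what is left.
row_min = df.apply(lambda s: s[s.notnull()].min(), axis=1)
print(row_min)
```

The lambda is called once per row, so this is likely to be slow on large frames.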

jbrockmendel added the Reduction Operations (sum, mean, min, max, etc.) label on Sep 21, 2020
@jbrockmendel
Member

The underlying issue here is that df.min(axis=1) is going through a path that accesses df.values instead of operating block-wise.
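
The sketch below does not reproduce the internal reduction path. It only illustrates, on an illustrative frame, what df.values looks like for tz-aware columns: NumPy has no timezone-aware dtype, so the 2D array is an object array of Timestamp values rather than a datetime array.

```python
import pandas as pd

df = pd.DataFrame({
    "s1": pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"),
                     pd.Timestamp("2017-01-01T11", tz="UTC")]),
    "s2": pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT]),
})

print(df.dtypes)           # each column keeps its tz-aware datetime dtype

values = df.values         # single 2D NumPy array covering all columns
print(values.dtype)        # dtype of the combined array
print(type(values[0, 0]))  # element type stored in the combined array
```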

@LanSaid
LanSaid commented Mar 19, 2021

Sadly, it's still alive in 1.2.3. Any idea how we can fix it?

@jbrockmendel
Member

Part of the way there: passing numeric_only=False gives the expected result for the example in #27794 (comment).
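
A short sketch of that call, assuming the referenced example is the two-column frame with one NaT from earlier in the thread. Behavior depends on the pandas version, so treat the printed result as something to check locally rather than a guarantee.

```python
import pandas as pd

s1 = pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"), pd.Timestamp("2017-01-01T11", tz="UTC")])
s2 = pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT])
df = pd.DataFrame(dict(s1=s1, s2=s2))

# Explicitly ask for a reduction over all columns, not only numeric ones.
row_min = df.min(axis=1, numeric_only=False)
print(row_min)
```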

@jbrockmendel
Member

Fixed by the removal of numeric_only=None, closing.
