Min/max does not work for dates with timezones if there are missing values in the data frame #27794

ghost · 2019-08-07T06:41:53Z

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.concat([
  pd.date_range(start='1/1/2018', end='1/30/2018').to_series().dt.tz_localize("UTC"),
  pd.date_range(start='1/1/2018', end='1/20/2018').to_series().dt.tz_localize("UTC")
], axis=1)

df.min(axis=1)

Problem description

Without tz_localize, everything works as expected. Also, if both columns have the same length and time zones, it works just fine. However, the combination of time zones and different lengths results in:

2018-01-01   NaN
2018-01-02   NaN
2018-01-03   NaN
2018-01-04   NaN
2018-01-05   NaN
2018-01-06   NaN
2018-01-07   NaN
2018-01-08   NaN
2018-01-09   NaN
2018-01-10   NaN
2018-01-11   NaN
...

of type float64.

Expected Output

List of dates

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-17134-Microsoft
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.17.0
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.10
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : None
pymysql : None
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.3
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.1.8

I have found some related fixed issues, but not exactly this one.


MYUNGJE · 2019-08-15T04:42:10Z

May I take a look at it?

TomAugspurger · 2019-08-16T11:39:50Z

That'd be great!

MaximeWeyl · 2019-10-23T13:15:33Z

Here is a workaround I'm using, but it is slow:

def __column_max_without_nans(df):
    """Takes the row-wise max across columns, avoiding a bug in pandas.

    Only rows with no missing values are reduced; every other row
    comes back as NaT after the reindex.

    Link to github issue : https://github.com/pandas-dev/pandas/issues/27794
    """
    return df[df.notnull().all(axis=1)].max(axis=1).reindex(df.index)

randomgambit · 2020-01-22T16:45:01Z

Great! My workaround is to get rid of the timezones altogether with dt.tz_localize(None). This might not be ideal in some cases, though.
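A minimal sketch of that timezone-stripping idea, using the frame from the original report. The helper name `min_via_naive` is hypothetical, and converting to UTC before dropping the timezone is an assumption added here so that the result can be re-localized without ambiguity:

```python
import pandas as pd

# Frame from the original report: the second column is shorter, so its
# last ten rows are NaT after the outer join done by concat.
df = pd.concat([
    pd.date_range(start="1/1/2018", end="1/30/2018").to_series().dt.tz_localize("UTC"),
    pd.date_range(start="1/1/2018", end="1/20/2018").to_series().dt.tz_localize("UTC"),
], axis=1)


def min_via_naive(frame):
    """Row-wise min computed on tz-naive copies, then re-localized to UTC."""
    # Convert each column to UTC and drop the timezone information.
    naive = frame.apply(lambda col: col.dt.tz_convert("UTC").dt.tz_localize(None))
    # Reduce the naive frame, then attach UTC again.
    return naive.min(axis=1).dt.tz_localize("UTC")


row_min = min_via_naive(df)
print(row_min.tail())
```

The result is always expressed in UTC, so columns in another zone would need a `tz_convert` back afterwards.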

jkukul · 2020-03-09T20:24:13Z

@MaximeWeyl, your workaround only solves the issue partially, i.e. when all timestamps (with timezone) are present in an axis.

Consider the following example:

s1 = pd.Series([pd.Timestamp('2017-01-01T10', tz='UTC'), pd.Timestamp('2017-01-01T11', tz='UTC')])
s2 = pd.Series([pd.Timestamp('2017-01-01T12', tz='UTC'), pd.NaT])

df = pd.DataFrame(dict(s1=s1, s2=s2))

Expected output of df.min(axis=1) (without the bug) is:

0   2017-01-01 10:00:00+00:00
1   2017-01-01 11:00:00+00:00
dtype: datetime64[ns, UTC]

Applying your workaround:

df[df.notnull().all(axis=1)].max(axis=1).reindex(df.index)

gives:

0   2017-01-01 12:00:00+00:00
1                         NaT
dtype: datetime64[ns, UTC]

As you can see, your workaround helps for the 0th row, but returns an incorrect value for the 1st row which has an NaT value in one of the columns.

jkukul · 2020-03-09T20:28:52Z

Here's my workaround - instead of using:

df.min(axis=1)

I'm using:

df.apply(lambda s: s[s.notnull()].min(), axis=1)
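For convenience, here is the row-wise workaround as a self-contained script, applied to the two-column example from the previous comment. The `max` variant is an addition for illustration only; it follows the same pattern but was not part of the original comment.

```python
import pandas as pd

s1 = pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"),
                pd.Timestamp("2017-01-01T11", tz="UTC")])
s2 = pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT])
df = pd.DataFrame(dict(s1=s1, s2=s2))

# Drop the missing values of each row before reducing it.
row_min = df.apply(lambda s: s[s.notnull()].min(), axis=1)
row_max = df.apply(lambda s: s[s.notnull()].max(), axis=1)

print(row_min)
print(row_max)
```

Because `apply` with `axis=1` calls the lambda once per row, this is likely to be slow on large frames.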

jbrockmendel · 2020-09-24T03:11:33Z

The underlying issue here is that df.min(axis=1) is going through a path that accesses df.values instead of operating block-wise.
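To illustrate the idea of reducing without going through df.values, the sketch below folds over the columns one at a time, so every intermediate result stays a tz-aware Series. This is not how pandas implements or fixed the reduction; `columnwise_min` is a hypothetical helper, and it assumes all columns share the same timezone.

```python
import pandas as pd

s1 = pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"),
                pd.Timestamp("2017-01-01T11", tz="UTC")])
s2 = pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT])
df = pd.DataFrame(dict(s1=s1, s2=s2))


def columnwise_min(frame):
    """Row-wise min that skips NaT, computed column by column."""
    columns = [frame[name] for name in frame.columns]
    result = columns[0]
    for col in columns[1:]:
        # Comparisons against NaT are False, so a missing value in `col`
        # never replaces a valid running minimum; a missing running
        # minimum is always replaced by whatever `col` holds.
        take_col = (col < result) | result.isna()
        result = result.where(~take_col, col)
    return result


row_min = columnwise_min(df)
print(row_min)
```

A row in which every column is NaT stays NaT, which matches the usual skipna behaviour of min.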

LanSaid · 2021-03-19T13:12:19Z

Sadly, it's still alive in 1.2.3. Any idea how we can fix it?

jbrockmendel · 2021-03-20T02:21:50Z

Part of the way there: passing numeric_only=False gives the expected result for #27794 (comment).
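A short sketch of that call, assuming the linked comment refers to the two-column example with a NaT earlier in this thread, and assuming a pandas version in which numeric_only=False is accepted and behaves as described here (the result may differ on older releases):

```python
import pandas as pd

s1 = pd.Series([pd.Timestamp("2017-01-01T10", tz="UTC"),
                pd.Timestamp("2017-01-01T11", tz="UTC")])
s2 = pd.Series([pd.Timestamp("2017-01-01T12", tz="UTC"), pd.NaT])
df = pd.DataFrame(dict(s1=s1, s2=s2))

# Explicitly opt out of the numeric-only path for the row-wise reduction.
row_min = df.min(axis=1, numeric_only=False)
print(row_min)
```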

jbrockmendel · 2023-03-03T20:45:45Z

Fixed by the removal of numeric_only=None, closing.

mroeschke added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and Timezones (Timezone data dtype) labels Aug 7, 2019

mroeschke mentioned this issue Sep 16, 2019

DataFrame.min/max with axis=1, datetime64[ns, tz] types and NaT values returns NaNs #28468

Closed

This was referenced Oct 3, 2019

Minimum output differs between timezone-aware and timezone-naive datetimes #28768

Closed

Call .max method on two timezone datetime columns results in nan #28821

Closed

Maximum of time series with NaT returns a dtype float64 with only NaN's #28868

Closed

mroeschke mentioned this issue Jan 22, 2020

min(axis = 1) propagates NAN when encountering missing timestamps #31214

Closed

jbrockmendel added the Reduction Operations (sum, mean, min, max, etc.) label Sep 21, 2020

mroeschke added the Bug label and removed the Numeric Operations (Arithmetic, Comparison, and Logical operations) label Jul 10, 2021

jbrockmendel mentioned this issue Oct 26, 2021

BUG: .max(axis=1) is incorrect when DataFrame contains tz-aware timestamps and NaT #44196

Closed

3 tasks

CloseChoice mentioned this issue Oct 28, 2021

BUG: Min/max does not work for dates with timezones if there are missing values in the data frame #44222

Closed

4 tasks

jbrockmendel closed this as completed Mar 3, 2023
