BUG: Min/max does not work for dates with timezones if there are missing values in the data frame #44222


Closed
wants to merge 14 commits

Conversation

CloseChoice
Member

@CloseChoice CloseChoice commented Oct 28, 2021

This is an attempt at fixing the bug of wrong min/max aggregation on timezone-aware data when at least one pd.NaT is present in the data to aggregate. While I still see quite a few weaknesses in my approach, I want to elaborate on why I did what I did, what the weaknesses of this approach are, how this might be handled better, and why I didn't follow that path.

Problem description

To fully understand the problem at hand, let's look at three different cases:

import pandas as pd
# CASE 1
# No Problem without NaT
rng_with_tz = pd.date_range(start='2021-10-01T12:00:00+02:00', end='2021-10-02T12:00:00+02:00', freq='4H')
df_with_tz = pd.DataFrame(data={'A': rng_with_tz, 'B': rng_with_tz + pd.Timedelta(minutes=20)})
df_with_tz.max(axis=1)

# No problem with timezone naive dataframe
# CASE 2
rng_tz_naive = pd.date_range(start='2021-10-01T12:00:00', end='2021-10-02T12:00:00', freq='4H')
df_tz_naive = pd.DataFrame(data={'A': rng_tz_naive, 'B': rng_tz_naive + pd.Timedelta(minutes=20)})
df_tz_naive.iloc[2, 1] = pd.NaT
df_tz_naive.max(axis=1)

# Incorrect result and warning with timezone aware df
# CASE 3
rng_with_tz = pd.date_range(start='2021-10-01T12:00:00+02:00', end='2021-10-02T12:00:00+02:00', freq='4H')
df_with_tz = pd.DataFrame(data={'A': rng_with_tz, 'B': rng_with_tz + pd.Timedelta(minutes=20)})
df_with_tz.iloc[2, 1] = pd.NaT
df_with_tz.max(axis=1)

Drilling down into the first case, no problems arise: the max call is handed down to pandas/core/nanops.py:reduction and runs through fine. In the second case, in pandas/core/nanops.py:reduction right after the call to pandas/core/nanops.py:_get_values, one realizes that all values are cast to integers, even the pd.NaT.
In the third case the values aren't cast, and pd.NaT is replaced by -np.inf, which comes from this line. This replacement is the reason for our error:

TypeError: '>=' not supported between instances of 'Timestamp' and 'float'

when we try to apply the max from numpy, which results from comparing pd.Timestamps with floats (-np.inf). This error is caught and all elements are converted into np.nan.
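The comparison failure can be reproduced in isolation; this is a minimal sketch of the underlying problem, not code from the PR:

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("2021-10-01 12:00:00+02:00")
try:
    # Timestamp defines no ordering against plain floats, so this raises
    ts >= -np.inf
except TypeError as e:
    print(e)  # '>=' not supported between instances of 'Timestamp' and 'float'
```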
The reason behind the error is that

>>> import pandas as pd

>>> rng_with_tz = pd.date_range(start='2021-10-01T12:00:00+02:00', end='2021-10-02T12:00:00+02:00', freq='4H')
>>> rng_with_tz
DatetimeIndex(['2021-10-01 12:00:00+02:00', '2021-10-01 16:00:00+02:00',
               '2021-10-01 20:00:00+02:00', '2021-10-02 00:00:00+02:00',
               '2021-10-02 04:00:00+02:00', '2021-10-02 08:00:00+02:00',
               '2021-10-02 12:00:00+02:00'],
              dtype='datetime64[ns, pytz.FixedOffset(120)]', freq='4H')
>>> rng_with_tz.values.dtype
dtype('<M8[ns]')
 
>>> rng_with_tz = pd.date_range(start='2021-10-01T12:00:00+02:00', end='2021-10-02T12:00:00+02:00', freq='4H')
>>> df_with_tz = pd.DataFrame(data={'A': rng_with_tz, 'B': rng_with_tz + pd.Timedelta(minutes=20)})
>>> df_with_tz.values.dtype
dtype('O')

the dtype changes when using timezone-aware data, even though the dtypes of the individual Series are still datetime dtypes. This has serious implications: in pandas/core/nanops.py:_get_values the if condition in this line (

if needs_i8_conversion(values.dtype):

) is False, so no cast happens, but pd.NaT is still replaced by -np.inf, which is the reason for the error stated above.
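The two code paths can be seen directly. This sketch uses the internal helper `needs_i8_conversion`, so its import location may vary between pandas versions:

```python
import pandas as pd
from pandas.core.dtypes.common import needs_i8_conversion

rng = pd.date_range("2021-10-01", periods=3, tz="UTC")
df = pd.DataFrame({"A": rng, "B": rng})

# per-column dtype is tz-aware datetime64 -> the i8-conversion branch applies
print(needs_i8_conversion(df["A"].dtype))    # True
# the 2D .values array is upcast to object -> the i8 branch is skipped
print(needs_i8_conversion(df.values.dtype))  # False
```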

Possible improvements on this PR

If we could change the dtype of the timezone-aware DataFrame with missing values to the numpy dtype dtype('<M8[ns]'), we should be fine. But I have no idea how to achieve that, since these are numpy rather than pandas types we are talking about, and I don't really understand how numpy determines its types.
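Until the root cause is fixed, a user-level workaround (not part of this PR) is to strip the timezone before reducing, which lands the data on the well-behaved tz-naive path (case 2 above), and re-localize afterwards:

```python
import pandas as pd

rng = pd.date_range("2021-10-01 12:00", periods=4, freq="4h", tz="UTC")
df = pd.DataFrame({"A": rng, "B": rng + pd.Timedelta(minutes=20)})
df.iloc[2, 1] = pd.NaT

# tz-naive frames handle NaT correctly in axis=1 reductions
naive = df.apply(lambda s: s.dt.tz_localize(None))
result = naive.max(axis=1).dt.tz_localize("UTC")
print(result)
```

This only works when all columns share the same timezone, since the offset is dropped and re-applied uniformly.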

Weaknesses of this PR

The assumptions made while catching the exceptions aren't well tested (though all tests pass), and there might be a better, more generic solution.

tm.assert_series_equal(result, expected)


def test_min_max_timezone_nat2():
these will go in tests.frame.test_reductions

except TypeError as e:
# the only case when this can happen is for timezone aware
# Timestamps where a NaT value is cast to np.inf
from pandas.core.dtypes.common import is_float
just import this at the top

@CloseChoice
Member Author

@jbrockmendel one more thing: I don't quite understand how working blockwise could have solved this problem as you hinted. I checked the blocks and one of the blocks also contains the pd.NaT value with dtype('O') and the same problem will arise here. Could you give me an idea how this might work?

@jreback jreback changed the title Fix 44196 agg with nat BUG: Min/max does not work for dates with timezones if there are missing values in the data frame Oct 29, 2021
@jreback jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Oct 29, 2021
@jbrockmendel
Member

Could you give me an idea how this might work?

bc we don't fully support 2D EAs (@jreback: drink!), calling .values on a DataFrame with dt64tz dtypes upcasts to object, at which point things go off the rails. If we were to operate directly on the blocks (as happens in the axis=0 case when we call df._mgr.reduce) this would work correctly.
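To illustrate the upcast (a sketch that pokes at the internal `_mgr` attribute, so details vary by pandas version): `.values` loses the tz-aware dtype, while the underlying blocks retain it.

```python
import pandas as pd

rng = pd.date_range("2021-10-01", periods=3, tz="UTC")
df = pd.DataFrame({"A": rng, "B": rng})

print(df.values.dtype)  # object: a 2D numpy array can't hold dt64tz
# the arrays underneath the blocks still carry the tz-aware dtype
for arr in df._mgr.arrays:
    print(arr.dtype)    # datetime64[ns, UTC]
```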

Three options come to mind here:

  1. recognize that the numeric_only=None case is deprecated so this use case will go away completely in 2.0. Live with this bug in the interim.

  2. Patch the affected nanops function to handle NaT correctly when it sees object-dtype (best guess is this will involve passing `initial` to `np.minimum.reduce` circa nanops L1042)

  3. Re-write DataFrame._reduce axis=1 case to first iterate over self._mgr.arrays calling blk_func, then do the reduction on the resulting column arrays.
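For option 2, the `initial` keyword of numpy ufunc reductions seeds the reduction with a starting value, which could stand in for the injected -np.inf; a minimal illustration on plain numbers:

```python
import numpy as np

arr = np.array([3, 7, 5])
# without initial: a plain max
print(np.maximum.reduce(arr))              # 7
# initial seeds the reduction; the seed wins here since it exceeds every element
print(np.maximum.reduce(arr, initial=10))  # 10
# initial also makes reductions over empty arrays well-defined
print(np.minimum.reduce(np.array([], dtype=float), initial=np.inf))  # inf
```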

# Timestamps where a NaT value is cast to np.inf
from pandas.core.dtypes.common import is_float

vfunc = np.vectorize(lambda x: is_float(x))
i think we'd be better off fixing _get_values so as to not incorrectly insert a float

@pep8speaks

pep8speaks commented Nov 6, 2021

Hello @CloseChoice! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-07 16:08:07 UTC

@CloseChoice
Member Author

@github-actions pre-commit

@CloseChoice
Member Author

@jbrockmendel: I rewrote the reduce function for the min and max cases but had to adjust some tests, since we weren't consistent before, e.g.

import pandas as pd
from pandas import Timedelta, Timestamp

# test_operators_timedelta64
df = pd.DataFrame({'A': {0: Timedelta('0 days 00:05:05'), 1: Timedelta('1 days 00:05:05'), 2: Timedelta('2 days 00:05:05')},
                   'B': {0: Timedelta('-1 days +00:00:00'), 1: Timedelta('-1 days +00:00:00'), 2: Timedelta('-1 days +00:00:00')},
                   'C': {0: 'foo', 1: 'foo', 2: 'foo'},
                   'D': {0: 1, 1: 1, 2: 1},
                   'E': {0: 1.0, 1: 1.0, 2: 1.0},
                   'F': {0: Timestamp('2013-01-01 00:00:00'), 1: Timestamp('2013-01-01 00:00:00'), 2: Timestamp('2013-01-01 00:00:00')}})
# the following two lines should result in the same output but this isn't the case on master
df.max(axis=1)
df.T.max(axis=0)

This would also be fixed by the current state of the PR. I'm not sure it would be possible to rewrite all reduction operations like this, but I could give it a try. Currently one test fails, which is connected to issue xref #44336

[
pd.date_range(start="2018-01-01", end="2018-01-03")
.to_series()
.dt.tz_localize("UTC"),

can just pass `tz="UTC"` to date_range

tm.assert_series_equal(result, expected)


def test_timezone_min_max_both_axis():
same bugs probably also affect timedelta64 and PeriodDtype? can you test those too?

@CloseChoice
Member Author
I added tests for these cases but am not sure if I understood you correctly. This bug only appears for timezone-aware data; afaik timedelta and PeriodDtype don't have timezone information, so I don't expect the bug to exist there. Do you mean adding these types to a datetime column?

@CloseChoice
Member Author

@github-actions pre-commit

@jbrockmendel
Member

have you tried looking at core.nanops._get_values? If I disable the masking L328-334 for object dtype, the motivating case works. (15 other tests fail, so there's some more kinks to work out)

df = DataFrame(
[
[PeriodDtype(freq="20m"), PeriodDtype(freq="1h"), PeriodDtype(freq="1d")],
[PeriodDtype(freq="25m"), PeriodDtype(freq="2h"), pd.NaT],
Try period_range, or even just take the frame from the dt64tz case and do a .to_period("D") on it

@github-actions
Contributor

github-actions bot commented Dec 8, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 8, 2021
@jbrockmendel
Member

have you tried looking at core.nanops._get_values? If I disable the masking L328-334 for object dtype, the motivating case works. (15 other tests fail, so there's some more kinks to work out)

can you try this

@jreback
Contributor

jreback commented Jan 16, 2022

iirc this was close if you can merge main and move the note to 1.5

@mroeschke
Member

Thanks for the PR, but it appears to have gone stale. If interested in continuing please merge the main branch, address the review, and we can reopen.

@mroeschke mroeschke closed this Jan 25, 2022
Successfully merging this pull request may close these issues.

BUG: .max(axis=1) is incorrect when DataFrame contains tz-aware timestamps and NaT