BUG: Min/max does not work for dates with timezones if there are missing values in the data frame #44222


Closed
wants to merge 14 commits

Conversation

CloseChoice
Member

@CloseChoice CloseChoice commented Oct 28, 2021

This is an attempt at fixing the bug of wrong min/max aggregation on timezone-aware data when at least one pd.NaT is present in the data to aggregate. While I still see quite a few weaknesses in my approach, I want to elaborate on why I did what I did, what the weaknesses of this approach are, how this might be handled better, and why I didn't follow that path.

Problem description

To fully understand the problem at hand, let's look at three different cases:

import pandas as pd
# CASE 1
# No Problem without NaT
rng_with_tz = pd.date_range(start='2021-10-01T12:00:00+02:00', end='2021-10-02T12:00:00+02:00', freq='4H')
df_with_tz = pd.DataFrame(data={'A': rng_with_tz, 'B': rng_with_tz + pd.Timedelta(minutes=20)})
df_with_tz.max(axis=1)

# No problem with timezone naive dataframe
# CASE 2
rng_tz_naive = pd.date_range(start='2021-10-01T12:00:00', end='2021-10-02T12:00:00', freq='4H')
df_tz_naive = pd.DataFrame(data={'A': rng_tz_naive, 'B': rng_tz_naive + pd.Timedelta(minutes=20)})
df_tz_naive.iloc[2, 1] = pd.NaT
df_tz_naive.max(axis=1)

# Incorrect result and warning with timezone aware df
# CASE 3
rng_with_tz = pd.date_range(start='2021-10-01T12:00:00+02:00', end='2021-10-02T12:00:00+02:00', freq='4H')
df_with_tz = pd.DataFrame(data={'A': rng_with_tz, 'B': rng_with_tz + pd.Timedelta(minutes=20)})
df_with_tz.iloc[2, 1] = pd.NaT
df_with_tz.max(axis=1)

Drilling down into the first case, no problems arise: the max call is handed down to pandas/core/nanops.py:reduction and runs through fine. In the second case, in pandas/core/nanops.py:reduction right after the call to pandas/core/nanops.py:_get_values, one realizes that all values are cast to integers, even the pd.NaT.
In the third case the values aren't cast, and pd.NaT is replaced by -np.inf, which comes from this line. This replacement is the reason for our error:

TypeError: '>=' not supported between instances of 'Timestamp' and 'float'

when we try to apply the max from numpy, which results from comparing pd.Timestamps with floats (-np.inf). This error is caught and all elements are converted into np.nan.
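The comparison failure can be reproduced in isolation; this is a minimal sketch of the underlying problem, not code from the PR:

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("2021-10-01 12:00:00+02:00")
try:
    # Timestamp defines no ordering against plain floats, so this raises
    ts >= -np.inf
except TypeError as e:
    print(e)  # '>=' not supported between instances of 'Timestamp' and 'float'
```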
The reason behind the error is that

>>> import pandas as pd

>>> rng_with_tz = pd.date_range(start='2021-10-01T12:00:00+02:00', end='2021-10-02T12:00:00+02:00', freq='4H')
>>> rng_with_tz
DatetimeIndex(['2021-10-01 12:00:00+02:00', '2021-10-01 16:00:00+02:00',
               '2021-10-01 20:00:00+02:00', '2021-10-02 00:00:00+02:00',
               '2021-10-02 04:00:00+02:00', '2021-10-02 08:00:00+02:00',
               '2021-10-02 12:00:00+02:00'],
              dtype='datetime64[ns, pytz.FixedOffset(120)]', freq='4H')
>>> rng_with_tz.values.dtype
dtype('<M8[ns]')
 
>>> rng_with_tz = pd.date_range(start='2021-10-01T12:00:00+02:00', end='2021-10-02T12:00:00+02:00', freq='4H')
>>> df_with_tz = pd.DataFrame(data={'A': rng_with_tz, 'B': rng_with_tz + pd.Timedelta(minutes=20)})
>>> df_with_tz.values.dtype
dtype('O')

the dtype changes when using timezone-aware data, even though the dtypes of the individual Series are still datetime dtypes. This has serious implications: in pandas/core/nanops.py:_get_values the if condition in this line (

if needs_i8_conversion(values.dtype):

) is False, so no cast happens, but pd.NaT is still replaced by -np.inf, which is the reason for the error stated above.
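The two code paths can be seen directly. This sketch uses the internal helper `needs_i8_conversion`, so its import location may vary between pandas versions:

```python
import pandas as pd
from pandas.core.dtypes.common import needs_i8_conversion

rng = pd.date_range("2021-10-01", periods=3, tz="UTC")
df = pd.DataFrame({"A": rng, "B": rng})

# per-column dtype is tz-aware datetime64 -> the i8-conversion branch applies
print(needs_i8_conversion(df["A"].dtype))    # True
# the 2D .values array is upcast to object -> the i8 branch is skipped
print(needs_i8_conversion(df.values.dtype))  # False
```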

Possible improvements on this PR

If we could change the dtype of the timezone-aware DataFrame with missing values to the numpy dtype dtype('<M8[ns]'), we should be fine. But I have no idea how to achieve that, since these are numpy rather than pandas types we are talking about, and I don't really understand how numpy determines its types.
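Until the root cause is fixed, a user-level workaround (not part of this PR) is to strip the timezone before reducing, which lands the data on the well-behaved tz-naive path (case 2 above), and re-localize afterwards:

```python
import pandas as pd

rng = pd.date_range("2021-10-01 12:00", periods=4, freq="4h", tz="UTC")
df = pd.DataFrame({"A": rng, "B": rng + pd.Timedelta(minutes=20)})
df.iloc[2, 1] = pd.NaT

# tz-naive frames handle NaT correctly in axis=1 reductions
naive = df.apply(lambda s: s.dt.tz_localize(None))
result = naive.max(axis=1).dt.tz_localize("UTC")
print(result)
```

This only works when all columns share the same timezone, since the offset is dropped and re-applied uniformly.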

Weaknesses of this PR

The assumptions made while catching the exceptions aren't well tested (though all tests pass), and there might be a better, more generic solution.

tm.assert_series_equal(result, expected)


def test_min_max_timezone_nat2():
these will go in tests.frame.test_reductions

except TypeError as e:
# the only case when this can happen is for timezone aware
# Timestamps where a NaT value is cast to np.inf
from pandas.core.dtypes.common import is_float
just import this at the top

@CloseChoice
Member Author

@jbrockmendel one more thing: I don't quite understand how working blockwise could have solved this problem as you hinted. I checked the blocks and one of the blocks also contains the pd.NaT value with dtype('O') and the same problem will arise here. Could you give me an idea how this might work?

@jreback jreback changed the title Fix 44196 agg with nat BUG: Min/max does not work for dates with timezones if there are missing values in the data frame Oct 29, 2021
@jreback jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Oct 29, 2021
@jbrockmendel
Member

Could you give me an idea how this might work?

bc we don't fully support 2D EAs (@jreback: drink!), calling .values on a DataFrame with dt64tz dtypes upcasts to object, at which point things go off the rails. If we were to operate directly on the blocks (as happens in the axis=0 case when we call df._mgr.reduce) this would work correctly.
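To illustrate the upcast (a sketch that pokes at the internal `_mgr` attribute, so details vary by pandas version): `.values` loses the tz-aware dtype, while the underlying blocks retain it.

```python
import pandas as pd

rng = pd.date_range("2021-10-01", periods=3, tz="UTC")
df = pd.DataFrame({"A": rng, "B": rng})

print(df.values.dtype)  # object: a 2D numpy array can't hold dt64tz
# the arrays underneath the blocks still carry the tz-aware dtype
for arr in df._mgr.arrays:
    print(arr.dtype)    # datetime64[ns, UTC]
```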

Three options come to mind here:

  1. recognize that the numeric_only=None case is deprecated so this use case will go away completely in 2.0. Live with this bug in the interim.

  2. Patch the affected nanops function to handle NaT correctly when it sees object-dtype (best guess is this will involve passing `initial` to `np.minimum.reduce` circa nanops L1042)

  3. Re-write DataFrame._reduce axis=1 case to first iterate over self._mgr.arrays calling blk_func, then do the reduction on the resulting column arrays.
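For option 2, the `initial` keyword of numpy ufunc reductions seeds the reduction with a starting value, which could stand in for the injected -np.inf; a minimal illustration on plain numbers:

```python
import numpy as np

arr = np.array([3, 7, 5])
# without initial: a plain max
print(np.maximum.reduce(arr))              # 7
# initial seeds the reduction; the seed wins here since it exceeds every element
print(np.maximum.reduce(arr, initial=10))  # 10
# initial also makes reductions over empty arrays well-defined
print(np.minimum.reduce(np.array([], dtype=float), initial=np.inf))  # inf
```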

# Timestamps where a NaT value is cast to np.inf
from pandas.core.dtypes.common import is_float

vfunc = np.vectorize(lambda x: is_float(x))
i think we'd be better off fixing _get_values so as to not incorrectly insert a float

@pep8speaks

pep8speaks commented Nov 6, 2021

Hello @CloseChoice! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-07 16:08:07 UTC

@CloseChoice
Member Author

@github-actions pre-commit

@CloseChoice
Member Author

@jbrockmendel: I rewrote the reduce function for the min and max cases but had to adjust some tests, since we weren't consistent before, e.g.

import pandas as pd
from pandas import Timedelta, Timestamp

# test_operators_timedelta64
df = pd.DataFrame({'A': {0: Timedelta('0 days 00:05:05'), 1: Timedelta('1 days 00:05:05'), 2: Timedelta('2 days 00:05:05')},
                   'B': {0: Timedelta('-1 days +00:00:00'), 1: Timedelta('-1 days +00:00:00'), 2: Timedelta('-1 days +00:00:00')},
                   'C': {0: 'foo', 1: 'foo', 2: 'foo'},
                   'D': {0: 1, 1: 1, 2: 1},
                   'E': {0: 1.0, 1: 1.0, 2: 1.0},
                   'F': {0: Timestamp('2013-01-01 00:00:00'), 1: Timestamp('2013-01-01 00:00:00'), 2: Timestamp('2013-01-01 00:00:00')}})
# the following two lines should result in the same output but this isn't the case on master
df.max(axis=1)
df.T.max(axis=0)

This would also be fixed by the current state of the PR. I'm not sure it would be possible to rewrite all reduction operations like this, but I could give it a try. Currently one test fails, which is connected to issue xref #44336

[
pd.date_range(start="2018-01-01", end="2018-01-03")
.to_series()
.dt.tz_localize("UTC"),

can just pass `tz="UTC"` to date_range

tm.assert_series_equal(result, expected)


def test_timezone_min_max_both_axis():
same bugs probably also affect timedelta64 and PeriodDtype? can you test those too?

@CloseChoice
Member Author
I added tests for these cases but am not sure if I understood you correctly. This bug only appears for timezone-aware data; afaik timedelta and PeriodDtype don't have timezone information, so I don't expect the bug to exist there. Do you mean adding these types to a datetime column?

@CloseChoice
Member Author

@github-actions pre-commit

@jbrockmendel
Member

have you tried looking at core.nanops._get_values? If I disable the masking L328-334 for object dtype, the motivating case works. (15 other tests fail, so there's some more kinks to work out)

df = DataFrame(
[
[PeriodDtype(freq="20m"), PeriodDtype(freq="1h"), PeriodDtype(freq="1d")],
[PeriodDtype(freq="25m"), PeriodDtype(freq="2h"), pd.NaT],
Try period_range, or even just take the frame from the dt64tz case and do a .to_period("D") on it

@github-actions
Contributor

github-actions bot commented Dec 8, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 8, 2021
@jbrockmendel
Member

have you tried looking at core.nanops._get_values? If I disable the masking L328-334 for object dtype, the motivating case works. (15 other tests fail, so there's some more kinks to work out)

can you try this

@jreback
Contributor

jreback commented Jan 16, 2022

iirc this was close if you can merge main and move the note to 1.5

@mroeschke
Member

Thanks for the PR, but it appears to have gone stale. If interested in continuing please merge the main branch, address the review, and we can reopen.

@mroeschke mroeschke closed this Jan 25, 2022
Successfully merging this pull request may close these issues.

BUG: .max(axis=1) is incorrect when DataFrame contains tz-aware timestamps and NaT