BUG: Min/max does not work for dates with timezones if there are missing values in the data frame #44222
Conversation
```python
    tm.assert_series_equal(result, expected)


def test_min_max_timezone_nat2():
```
these will go in tests.frame.test_reductions
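For context, a minimal sketch of what a test in pandas/tests/frame/test_reductions.py could look like (the test name, data, and assertions here are illustrative, not the PR's actual test):

```python
import pandas as pd


def test_min_max_dt64tz_with_nat():
    # tz-aware column containing a missing value; NaT should be skipped by min/max
    ser = pd.Series(pd.to_datetime(["2018-01-01", None, "2018-01-03"]).tz_localize("UTC"))
    df = pd.DataFrame({"a": ser})

    assert df.max()["a"] == pd.Timestamp("2018-01-03", tz="UTC")
    assert df.min()["a"] == pd.Timestamp("2018-01-01", tz="UTC")
```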
pandas/core/nanops.py
```python
except TypeError as e:
    # the only case when this can happen is for timezone-aware
    # Timestamps where a NaT value is cast to np.inf
    from pandas.core.dtypes.common import is_float
```
just import this at the top
@jbrockmendel one more thing: I don't quite understand how working blockwise could have solved this problem as you hinted. I checked the blocks and one of the blocks also contains the …
bc we don't fully support 2D EAs (@jreback: drink!), calling `.values` on a DataFrame with dt64tz dtypes upcasts to object, at which point things go off the rails. If we were to operate directly on the blocks (as happens in the axis=0 case when we call …). Three options come to mind here: …
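To illustrate the upcast described here (a quick demonstration, not code from the PR):

```python
import pandas as pd

df = pd.DataFrame({"a": pd.date_range("2018-01-01", periods=3, tz="UTC")})

# the column keeps its tz-aware extension dtype ...
print(df.dtypes)        # a    datetime64[ns, UTC]

# ... but materializing a single 2D ndarray falls back to object,
# because numpy has no timezone-aware datetime dtype
print(df.values.dtype)  # object
```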
pandas/core/nanops.py
```python
    # Timestamps where a NaT value is cast to np.inf
    from pandas.core.dtypes.common import is_float

    vfunc = np.vectorize(lambda x: is_float(x))
```
i think we'd be better off fixing _get_values so as to not incorrectly insert a float
Hello @CloseChoice! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-07 16:08:07 UTC
@github-actions pre-commit
@jbrockmendel: I rewrote the reduce function for the …

```python
import pandas as pd
from pandas import Timedelta, Timestamp

# test_operators_timedelta64
df = pd.DataFrame(
    {
        'A': {0: Timedelta('0 days 00:05:05'), 1: Timedelta('1 days 00:05:05'), 2: Timedelta('2 days 00:05:05')},
        'B': {0: Timedelta('-1 days +00:00:00'), 1: Timedelta('-1 days +00:00:00'), 2: Timedelta('-1 days +00:00:00')},
        'C': {0: 'foo', 1: 'foo', 2: 'foo'},
        'D': {0: 1, 1: 1, 2: 1},
        'E': {0: 1.0, 1: 1.0, 2: 1.0},
        'F': {0: Timestamp('2013-01-01 00:00:00'), 1: Timestamp('2013-01-01 00:00:00'), 2: Timestamp('2013-01-01 00:00:00')},
    }
)

# the following two lines should result in the same output, but this isn't the case on master
df.max(axis=1)
df.T.max(axis=0)
```

This would also be fixed by the current state of the PR. Not sure it would be possible to rewrite all reduction operations like this, but I could give it a try. Currently one test will fail, which is connected to issue xref #44336.
```python
[
    pd.date_range(start="2018-01-01", end="2018-01-03")
    .to_series()
    .dt.tz_localize("UTC"),
```
can just pass `tz="UTC"` to date_range
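For reference, the two spellings being contrasted (an illustration of the suggestion, not the PR's final test code):

```python
import pandas as pd

# chained version used in the diff above
ser_chained = pd.date_range(start="2018-01-01", end="2018-01-03").to_series().dt.tz_localize("UTC")

# suggested shorter version: date_range can build the tz-aware index directly
ser_direct = pd.date_range(start="2018-01-01", end="2018-01-03", tz="UTC").to_series()

# both have UTC-localized values (note the index of the chained version stays tz-naive)
print(ser_chained.dtype, ser_direct.dtype)  # datetime64[ns, UTC] datetime64[ns, UTC]
```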
```python
    tm.assert_series_equal(result, expected)


def test_timezone_min_max_both_axis():
```
same bugs probably also affect timedelta64 and PeriodDtype? can you test those too?
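A rough idea of what that extra coverage could look like (test names and values are made up here; the PR's actual tests may differ):

```python
import pandas as pd


def test_min_max_td64_with_nat():
    df = pd.DataFrame({"a": pd.to_timedelta(["1 days", None, "3 days"])})
    assert df.max()["a"] == pd.Timedelta("3 days")


def test_min_max_period_with_nat():
    df = pd.DataFrame({"a": pd.PeriodIndex(["2018-01", None, "2018-03"], freq="M")})
    assert df.max()["a"] == pd.Period("2018-03", freq="M")
```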
I added tests for these cases but am not sure if I understood you correctly. This bug only appears for timezone-aware data; afaik timedelta and PeriodDtype don't have timezone information, therefore I don't expect the bug to exist there. Do you mean adding these types to a datetime column?
@github-actions pre-commit
have you tried looking at core.nanops._get_values? If I disable the masking at L328-334 for object dtype, the motivating case works. (15 other tests fail, so there are some more kinks to work out)
```python
df = DataFrame(
    [
        [PeriodDtype(freq="20m"), PeriodDtype(freq="1h"), PeriodDtype(freq="1d")],
        [PeriodDtype(freq="25m"), PeriodDtype(freq="2h"), pd.NaT],
```
Try period_range, or even just take the frame from the dt64tz case and do a .to_period("D") on it
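Both suggestions in code form (illustrative only, not the PR's actual test data):

```python
import pandas as pd

# build period-dtype test data directly from period_range ...
df = pd.DataFrame({"a": pd.period_range("2018-01-01", periods=3, freq="D")})
print(df.dtypes)  # a    period[D]

# ... or convert an existing datetime column to periods
# (doing this on a tz-aware column also works, but warns that the timezone is dropped)
ser = pd.Series(pd.date_range("2018-01-01", periods=3))
df2 = pd.DataFrame({"a": ser.dt.to_period("D")})
```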
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
can you try this
iirc this was close, if you can merge main and move the note to 1.5
Thanks for the PR, but it appears to have gone stale. If interested in continuing please merge the main branch, address the review, and we can reopen.
This is an attempt at fixing the bug of wrong `min`/`max` aggregation with timezone-aware data and at least one `pd.NaT` in the data to aggregate. While I still see quite a lot of weaknesses in my approach, I want to elaborate why I did what I did, what the weaknesses of this approach are, how this might be handled better, and why I didn't follow that path.

Problem description
To fully understand the problem at hand, let's look at three different cases:
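(The original snippets for the three cases did not survive the page capture. The sketch below is a reconstruction based on the description that follows, so the exact frames are assumptions: roughly, tz-aware data without `pd.NaT`, tz-naive data with `pd.NaT`, and tz-aware data with `pd.NaT`.)

```python
import pandas as pd

# case 1: timezone-aware datetimes, no missing value
df1 = pd.DataFrame({"a": pd.to_datetime(["2018-01-01", "2018-01-02"]).tz_localize("UTC")})

# case 2: timezone-naive datetimes with a missing value
df2 = pd.DataFrame({"a": pd.to_datetime(["2018-01-01", None])})

# case 3: timezone-aware datetimes with a missing value -- the case from the bug report
df3 = pd.DataFrame({"a": pd.to_datetime(["2018-01-01", None]).tz_localize("UTC")})

df1.max()  # fine
df2.max()  # fine
df3.max()  # wrong result on master at the time (the NaT poisons the reduction)
```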
When one drills down into the first case no problems arise: the `max` call is handed down to `pandas/core/nanops.py:reduction` and runs through fine. In the second case one finds in `pandas/core/nanops.py:reduction`, right after the call to `pandas/core/nanops.py:_get_values`, that all values are cast to integers, even the `pd.NaT`. In the third case one can see that the values aren't cast and `pd.NaT` is replaced by `-np.inf`, which comes from this line. This replacement is the reason for our error when we try to apply the max from numpy, which results from comparing `pd.Timestamp`s with `float`s (`np.inf`). This error is caught and all elements are converted into `np.nan`.

The reason behind the error is that the `dtype` changes when using timezone-aware data, though the `dtype` of the single series is still `dtype('<M8[ns]')`. This has serious implications: in `pandas/core/nanops.py:_get_values` the `if` condition in this line (`pandas/core/nanops.py`, line 313 at 66ce5de) evaluates to `False`, so no cast happens, but the `pd.NaT` is still replaced by `np.inf`, which is the reason for the error stated above.
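To make the failure mode concrete, here is a small demonstration of the comparison error described above (not code from the PR, just an illustration):

```python
import numpy as np
import pandas as pd

# what the reduction effectively sees in the third case: an object array of
# tz-aware Timestamps in which the NaT was replaced by -inf
values = np.array([pd.Timestamp("2018-01-01", tz="UTC"), -np.inf], dtype=object)

try:
    np.max(values)
except TypeError as err:
    # ordering a Timestamp against a float is not defined, hence the TypeError
    print(err)
```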
Possible improvements on this PR

If we could change the `dtype` of the timezone-aware DataFrame with missing values to the numpy dtype `dtype('<M8[ns]')`, we should be fine. But I have no idea how we can achieve that, since these are not pandas but numpy types we are talking about, and I didn't really understand how numpy determines its types.
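(For illustration only: one way to obtain a plain `'<M8[ns]'` view of tz-aware data is to drop the timezone first, e.g. via `Series.dt.tz_localize(None)`; whether doing that inside the reduction path would be an acceptable fix is exactly the open question above.)

```python
import pandas as pd

ser = pd.Series(pd.to_datetime(["2018-01-01", None]).tz_localize("UTC"))
print(ser.dtype)                  # datetime64[ns, UTC]

naive = ser.dt.tz_localize(None)  # drop the timezone, keep the wall times
print(naive.to_numpy().dtype)     # datetime64[ns], i.e. dtype('<M8[ns]')
```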
Weaknesses of this PR

The assumptions made while catching the exceptions aren't well tested (though all tests pass) and there might be a better, more generic solution.