-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: make min/max on empty datetime df consistent with datetime serie… #33911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: make min/max on empty datetime df consistent with datetime serie… #33911
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a whatsnew note as well, 1.1 bug fixes datetimelike section
pandas/core/nanops.py
Outdated
@@ -384,8 +384,7 @@ def _na_for_min_count( | |||
else: | |||
assert axis is not None # assertion to make mypy happy | |||
result_shape = values.shape[:axis] + values.shape[axis + 1 :] | |||
result = np.empty(result_shape, dtype=values.dtype) | |||
result.fill(fill_value) | |||
result = np.full(result_shape, fill_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this not have a dtype parameter? does it matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result parameter here is what is making all the trouble. We set the dtype to datetime but our NaT
is treated as NaN
which numpy internally tries to cast into an integer which yields the error:
ValueError: cannot convert float NaN to integer
.
The conversion happens somewhere inside the c-part of numpy, which I'm not familiar with.
But to come back to your question: for our result, specifying the dtype is not necessary. The given outputs are all corresponding to the type of the input. But good point, I should probably add this check to my tests.
def test_sum_empty_df_series(): | ||
# Calling the following defined sum function returned an error for dataframes but | ||
# returned NaT for series. # Check that the API is consistent in this sense when | ||
# operating on empty Series/DataFrames. See GH:33704 for more information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
frame tests go here (you can rename slightly); these should eventually be moved into test_reductions.py
class TestDataFrameReductions:
def test_min_max_dt64_with_NaT
series are here:
tests/series/test_reductions.py
in tests/frame/test_analytics.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
# returned NaT for series. # Check that the API is consistent in this sense when | ||
# operating on empty Series/DataFrames. See GH:33704 for more information | ||
df = pd.DataFrame(dict(x=pd.to_datetime([]))) | ||
series = pd.Series(pd.to_datetime([])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for frame also need to test both axis=
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Hello @CloseChoice! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-05-07 22:16:22 UTC |
assert (df.max(axis=0).x is NaT) == (expected_dt_series.max() is NaT) | ||
|
||
# check axis 1 | ||
tm.assert_series_equal(df.min(axis=1), expected_dt_series) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to mention that calling min
on axis=1
of an empty dataframe does not return the same as calling it on a Series, since this it currently not possible (and I don't think it is desired) with the implementation of reduce-operations. This test checks that when reducing over the index a Dataframe behaves like a series.
…I-consistent-empty-datetime-max-or-min
done |
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -557,6 +557,7 @@ Datetimelike | |||
- Bug in :meth:`DatetimeIndex.intersection` losing ``freq`` and timezone in some cases (:issue:`33604`) | |||
- Bug in :class:`DatetimeIndex` addition and subtraction with some types of :class:`DateOffset` objects incorrectly retaining an invalid ``freq`` attribute (:issue:`33779`) | |||
- Bug in :class:`DatetimeIndex` where setting the ``freq`` attribute on an index could silently change the ``freq`` attribute on another index viewing the same data (:issue:`33552`) | |||
- Bug in :meth:`nanops._na_for_min_count` when called with empty :class:`DataFrame` of ``timedelta64`` dtype (:issue:`33911`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make a userfacing note. IOW a user wants to know, what changed. you are referring to an internal routine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
# calling np.full with dtype parameter throws an ValueError when called | ||
# with np.datetime64 and pd.NaT | ||
try: | ||
result = np.full(result_shape, fill_value, dtype=values.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what hits each of these branches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when np.full
is called with the parameter dtype=np.datetime64
and fill_value=pd.NaT
a ValueError ValueError: cannot convert float NaN to integer
. In this case I don't call np.full
without an explicit dtype
. Giving the array result
an explicit dtype does not have an effect on the DataFrame column but I thought it is cleaner if I at least try to give it the correct one. I hoped the comment explains this. But if not, I'm gonna make it gonna improve the comment.
@@ -1258,3 +1258,26 @@ def test_min_max_dt64_with_NaT(self): | |||
res = df.max() | |||
exp = pd.Series([pd.NaT], index=["foo"]) | |||
tm.assert_series_equal(res, exp) | |||
|
|||
# Calling the following sum functions returned an error for dataframes but |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a new test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
thanks @CloseChoice |
…s (#33704)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Fixes the following issue:
throws a
ValueError
butresults in
NaT
.With this change, calling an DataFrame on en empty
pd.to_datetime([])
, results in