-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: incorrect EA casting in groubpy.agg #38254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: incorrect EA casting in groubpy.agg #38254
Conversation
pandas/core/groupby/ops.py
Outdated
@@ -725,7 +726,19 @@ def _aggregate_series_pure_python(self, obj: Series, func: F): | |||
result[label] = res | |||
|
|||
result = lib.maybe_convert_objects(result, try_float=0) | |||
result = maybe_cast_result(result, obj, numeric_only=True) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you doing this here, rather than inside maybe_cast_results itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I'm trying to get rid of maybe_cast_result altogether. It uses maybe_cast_to_extension_array
whereas we should be casting yoda-style (do or do not)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(as mentioned in the OP, id actually rather remove this chunk of code entirely)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure but I'd rather NOT do it in groupby at all (any casting like this), and instead push it to dtypes/cast.py this seems like going backwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you're good with just ripping this out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and instead push it to dtypes/cast.py this seems like going backwards.
depends on how we think of dtypes.cast. I think of it as "low-level helper functions related to casting" NOT "anything related to casting". I don't want to put groupby-specific casting code in there. (i also dont like having DataFrame.reset_index code in there)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and instead push it to dtypes/cast.py this seems like going backwards.
depends on how we think of dtypes.cast. I think of it as "low-level helper functions related to casting" NOT "anything related to casting". I don't want to put groupby-specific casting code in there. (i also dont like having DataFrame.reset_index code in there)
i don't think virtually any casting code should be in groupby / frame. but we have to put it somewhere (and of course ideally there isn't any special casing on the class of the parent container).
so i think its the lesser of evils to keep it all together in dtypes/cast.py potentially allowing for re-use / refactoring.
…f-python_agg_general
updated to remove the incorrect casting without special-casing pandas-internal EAs |
@jbrockmendel Can you show an example of what currently gives a wrong output? |
ok code / tests look ok, can you add a whatsnew note in the EA section (or groupby) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs further discussion first
|
…f-python_agg_general
I would personally focus on fixing the actual cause (stricter way to cast back), instead of doing this "fix" (because it's fixing period, but at a cost of causing regressions for other dtypes) |
how much of your opinion is based on IntegerArray/FloatingArray/BooleanArray? We could special-case those similar to what we do in _ea_wrap_cython_operation. For general EAs (and DecimalArray) I think "sometimes giving completely incorrect results" is a much bigger problem than "sometimes giving correct results in a sub-optimal container" |
Yes, quite a bit ;). Also based on geopandas, but our Now, I don't think this is a critical bug fix for 1.2, so assuming it's for 1.3 (I think the rc will be cut any time now), that gives us more time to fix this, in which case I would prefer to focus on getting #38315 done. We could also still merge this, and revert most of the test changes in #38315, but not sure what buys us that (apart from a bigger diff in #38315) |
I'd be OK with re-implementing part of an earlier commit that special-cases those dtypes. That would fix the Period-like bug and only change the Decimal-like cases, which are relatively straightforward conceptually. Would you be OK with that? |
…f-python_agg_general
@@ -454,6 +457,8 @@ def test_agg_tzaware_non_datetime_result(): | |||
result = gb["b"].agg(lambda x: x.iloc[-1] - x.iloc[0]) | |||
expected = Series([pd.Timedelta(days=1), pd.Timedelta(days=1)], name="b") | |||
expected.index.name = "a" | |||
if as_period: | |||
expected = expected.astype(object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the correct expected result? When subtracting Periods, you get offset objects, not Timedelta objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. probably not great that tm.assert_series_equal can't tell the difference either
I still don't fully understand what would buy us that, since #38315 is also fixing the Period-like bug, but without regressing on the decimal case. So I would just need to revert most of this PR in #38315 (except the period test case, which I can add there), making the diff there larger. |
…f-python_agg_general
But this PR is also not "solving the problem" .. |
@@ -750,7 +746,7 @@ def _aggregate_series_pure_python(self, obj: Series, func: F): | |||
result[label] = res | |||
|
|||
result = lib.maybe_convert_objects(result, try_float=0) | |||
result = maybe_cast_result(result, obj, numeric_only=True) | |||
# TODO: cast to EA once _from_sequence is reliably strict GH#38254 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside maybe_cast_result
, we already have a special case check for not casting back when having datetime or categorical data. You could add periods to that check, and then this line can be left as is, while still fixing the period issue.
Then we can add the period test, without needing to change the decimal tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside maybe_cast_result, we already have a special case check for not casting back when having datetime or categorical data.
That's actually what I find most objectionable about maybe_cast_result. It is a kludge that special-cases pandas-internal EAs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But as I said below, it would at least be a consistent "kludge"... ;)
Not calling the method here that was exactly written to be called here, it not making it any less kludgy IMO.
So I think it would be good to add the period test case here, and then focus on #38315 to properly fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not making it any less kludgy
This is the only remaining usage of that kludge, so getting rid of it would certainly make it less kludgy.
and then focus on #38315 to properly fix it.
It's going to be a while before 1.3, so I'm fine sticking a pin in this to see if a better solution is implemented between now and then. If it doesn't, we should do this for 1.3.
I think updating |
…f-python_agg_general
…f-python_agg_general
…f-python_agg_general
…f-python_agg_general
…f-python_agg_general
…f-python_agg_general
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The existing code is implicitly assuming that _from_sequence is strict, so that maybe_cast_to_extension_array will only return an EA when correct. Since that assumption is untrue, the current code will return incorrect results, as in the test this adds.
If we just removed L729-L741 in groupby.ops, we would have 12ish test failures. A few of those would be for float64 result failing to cast back to Float64, and the rest would be for
ndarray[Decimal objects]
failing to cast back to DecimalArray. Until_from_sequence
is reliable, I would rather remove these few lines and return correct-but-suboptimally-casted results than have these kludges.cc @jorisvandenbossche