Skip to content

BUG: incorrect EA casting in groubpy.agg #38254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Dec 3, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

The existing code is implicitly assuming that _from_sequence is strict, so that maybe_cast_to_extension_array will only return an EA when correct. Since that assumption is untrue, the current code will return incorrect results, as in the test this adds.

If we just removed L729-L741 in groupby.ops, we would have 12ish test failures. A few of those would be for float64 result failing to cast back to Float64, and the rest would be for ndarray[Decimal objects] failing to cast back to DecimalArray. Until _from_sequence is reliable, I would rather remove these few lines and return correct-but-suboptimally-casted results than have these kludges.

cc @jorisvandenbossche

@@ -725,7 +726,19 @@ def _aggregate_series_pure_python(self, obj: Series, func: F):
result[label] = res

result = lib.maybe_convert_objects(result, try_float=0)
result = maybe_cast_result(result, obj, numeric_only=True)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you doing this here, rather than inside maybe_cast_results itself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I'm trying to get rid of maybe_cast_result altogether. It uses maybe_cast_to_extension_array whereas we should be casting yoda-style (do or do not)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(as mentioned in the OP, id actually rather remove this chunk of code entirely)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure but I'd rather NOT do it in groupby at all (any casting like this), and instead push it to dtypes/cast.py this seems like going backwards.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you're good with just ripping this out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and instead push it to dtypes/cast.py this seems like going backwards.

depends on how we think of dtypes.cast. I think of it as "low-level helper functions related to casting" NOT "anything related to casting". I don't want to put groupby-specific casting code in there. (i also dont like having DataFrame.reset_index code in there)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and instead push it to dtypes/cast.py this seems like going backwards.

depends on how we think of dtypes.cast. I think of it as "low-level helper functions related to casting" NOT "anything related to casting". I don't want to put groupby-specific casting code in there. (i also dont like having DataFrame.reset_index code in there)

i don't think virtually any casting code should be in groupby / frame. but we have to put it somewhere (and of course ideally there isn't any special casing on the class of the parent container).

so i think its the lesser of evils to keep it all together in dtypes/cast.py potentially allowing for re-use / refactoring.

@jbrockmendel
Copy link
Member Author

updated to remove the incorrect casting without special-casing pandas-internal EAs

@jorisvandenbossche
Copy link
Member

@jbrockmendel Can you show an example of what currently gives a wrong output?
(looking at the diff it seems something with Period? )

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Groupby labels Dec 4, 2020
@jreback jreback added this to the 1.2 milestone Dec 4, 2020
@jreback
Copy link
Contributor

jreback commented Dec 4, 2020

ok code / tests look ok, can you add a whatsnew note in the EA section (or groupby)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs further discussion first

@jbrockmendel
Copy link
Member Author

Can you show an example of what currently gives a wrong output?
(looking at the diff it seems something with Period? )

dti = pd.date_range("2012-01-01", periods=4)
pi = dti.to_period("D")

df = DataFrame({"a": [0, 0, 1, 1], "b": pi})
gb = df.groupby("a")

>>> gb["b"].agg(lambda x: x.iloc[0].year)
a
0    2012-01-01
1    2012-01-01
Name: b, dtype: period[D]

@jorisvandenbossche
Copy link
Member

I would personally focus on fixing the actual cause (stricter way to cast back), instead of doing this "fix" (because it's fixing period, but at a cost of causing regressions for other dtypes)

@jreback jreback removed this from the 1.2 milestone Dec 4, 2020
@jbrockmendel
Copy link
Member Author

(because it's fixing period, but at a cost of causing regressions for other dtypes)

how much of your opinion is based on IntegerArray/FloatingArray/BooleanArray? We could special-case those similar to what we do in _ea_wrap_cython_operation.

For general EAs (and DecimalArray) I think "sometimes giving completely incorrect results" is a much bigger problem than "sometimes giving correct results in a sub-optimal container"

@jorisvandenbossche
Copy link
Member

how much of your opinion is based on IntegerArray/FloatingArray/BooleanArray?

Yes, quite a bit ;). Also based on geopandas, but our _from_sequence is already pretty strict there anyway, so we don't really have this problem.

Now, I don't think this is a critical bug fix for 1.2, so assuming it's for 1.3 (I think the rc will be cut any time now), that gives us more time to fix this, in which case I would prefer to focus on getting #38315 done.
(if we do that, will move your period test there to ensure that is fixed)

We could also still merge this, and revert most of the test changes in #38315, but not sure what buys us that (apart from a bigger diff in #38315)

@jbrockmendel
Copy link
Member Author

Yes, quite a bit ;).

I'd be OK with re-implementing part of an earlier commit that special-cases those dtypes. That would fix the Period-like bug and only change the Decimal-like cases, which are relatively straightforward conceptually. Would you be OK with that?

@@ -454,6 +457,8 @@ def test_agg_tzaware_non_datetime_result():
result = gb["b"].agg(lambda x: x.iloc[-1] - x.iloc[0])
expected = Series([pd.Timedelta(days=1), pd.Timedelta(days=1)], name="b")
expected.index.name = "a"
if as_period:
expected = expected.astype(object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the correct expected result? When subtracting Periods, you get offset objects, not Timedelta objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. probably not great that tm.assert_series_equal can't tell the difference either

@jorisvandenbossche
Copy link
Member

That would fix the Period-like bug and only change the Decimal-like cases, which are relatively straightforward conceptually. Would you be OK with that?

I still don't fully understand what would buy us that, since #38315 is also fixing the Period-like bug, but without regressing on the decimal case. So I would just need to revert most of this PR in #38315 (except the period test case, which I can add there), making the diff there larger.
Your input on the different questions I laid out in #38315 is very welcome, to ensure we can move it forward.

@jbrockmendel
Copy link
Member Author

since #38315 is also fixing the Period-like bug, but without regressing on the decimal case

#38315 is very ambitious and potentially addressing multiple intertwined issues. I'm not ready to consider it a "solved problem".

@jreback jreback modified the milestone: 1.3 Dec 12, 2020
@jorisvandenbossche
Copy link
Member

I'm not ready to consider it a "solved problem".

But this PR is also not "solving the problem" ..
It's IMO rather complicating the situation by incorrecting part of the tests. I am fine if the PR would just focus on fixing the Period case though, then it already fixes this specific case and adds a test, making #38315 actually easier instead of more complex.

@@ -750,7 +746,7 @@ def _aggregate_series_pure_python(self, obj: Series, func: F):
result[label] = res

result = lib.maybe_convert_objects(result, try_float=0)
result = maybe_cast_result(result, obj, numeric_only=True)
# TODO: cast to EA once _from_sequence is reliably strict GH#38254
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside maybe_cast_result, we already have a special case check for not casting back when having datetime or categorical data. You could add periods to that check, and then this line can be left as is, while still fixing the period issue.
Then we can add the period test, without needing to change the decimal tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside maybe_cast_result, we already have a special case check for not casting back when having datetime or categorical data.

That's actually what I find most objectionable about maybe_cast_result. It is a kludge that special-cases pandas-internal EAs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But as I said below, it would at least be a consistent "kludge"... ;)
Not calling the method here that was exactly written to be called here, it not making it any less kludgy IMO.

So I think it would be good to add the period test case here, and then focus on #38315 to properly fix it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not making it any less kludgy

This is the only remaining usage of that kludge, so getting rid of it would certainly make it less kludgy.

and then focus on #38315 to properly fix it.

It's going to be a while before 1.3, so I'm fine sticking a pin in this to see if a better solution is implemented between now and then. If it doesn't, we should do this for 1.3.

@jorisvandenbossche
Copy link
Member

I think updating maybe_cast_result to special case periods as well is actually a nice short term solution to get this fix in (we already do that for other dtypes as well, so while it is not the nicest thing to do (but one of the goals of #38315 will be to clean that up), it's at least consistent ..).

@jbrockmendel jbrockmendel deleted the ref-python_agg_general branch January 24, 2021 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants