BUG: incorrect EA casting in groupby.agg #38254

jbrockmendel · 2020-12-03T01:38:03Z

The existing code is implicitly assuming that _from_sequence is strict, so that maybe_cast_to_extension_array will only return an EA when correct. Since that assumption is untrue, the current code will return incorrect results, as in the test this adds.

If we just removed L729-L741 in groupby.ops, we would have 12ish test failures. A few of those would be for float64 result failing to cast back to Float64, and the rest would be for ndarray[Decimal objects] failing to cast back to DecimalArray. Until _from_sequence is reliable, I would rather remove these few lines and return correct-but-suboptimally-casted results than have these kludges.
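To make the "strict vs. lenient" distinction concrete, here is a toy sketch in plain Python. It does not use pandas or the real `_from_sequence`; the helper names `lenient_from_sequence`, `strict_from_sequence`, and `cast_back_or_keep` are hypothetical and exist only for illustration. It assumes a Decimal-like container whose lenient constructor coerces anything `Decimal()` accepts.

```python
from decimal import Decimal


def lenient_from_sequence(values):
    """Hypothetical lenient constructor: coerces anything Decimal() accepts."""
    return [Decimal(v) for v in values]


def strict_from_sequence(values):
    """Hypothetical strict constructor: only accepts existing Decimal objects."""
    values = list(values)
    if not all(isinstance(v, Decimal) for v in values):
        raise TypeError("values are not all Decimal instances")
    return values


def cast_back_or_keep(values, caster):
    """Try to cast back; if the caster refuses, keep the values untouched."""
    try:
        return caster(values)
    except (TypeError, ValueError):
        return list(values)


# Imagine an aggregation over a Decimal column that returns plain ints,
# for example a per-group count.
counts = [2, 3]

lenient = cast_back_or_keep(counts, lenient_from_sequence)
strict = cast_back_or_keep(counts, strict_from_sequence)
```

With the lenient constructor the integer counts are silently turned into `Decimal` objects, while the strict one leaves them as ints. That is the behaviour difference the cast-back step relies on.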

cc @jorisvandenbossche

jreback · 2020-12-03T15:11:24Z

pandas/core/groupby/ops.py

@@ -725,7 +726,19 @@ def _aggregate_series_pure_python(self, obj: Series, func: F):
            result[label] = res

        result = lib.maybe_convert_objects(result, try_float=0)
-        result = maybe_cast_result(result, obj, numeric_only=True)
+


why are you doing this here, rather than inside maybe_cast_result itself?

Because I'm trying to get rid of maybe_cast_result altogether. It uses maybe_cast_to_extension_array whereas we should be casting yoda-style (do or do not)

(as mentioned in the OP, I'd actually rather remove this chunk of code entirely)

sure, but I'd rather NOT do it in groupby at all (any casting like this), and instead push it to dtypes/cast.py. This seems like going backwards.

so you're good with just ripping this out?

and instead push it to dtypes/cast.py. This seems like going backwards.

It depends on how we think of dtypes.cast. I think of it as "low-level helper functions related to casting", NOT "anything related to casting". I don't want to put groupby-specific casting code in there. (I also don't like having DataFrame.reset_index code in there.)

and instead push it to dtypes/cast.py. This seems like going backwards.

It depends on how we think of dtypes.cast. I think of it as "low-level helper functions related to casting", NOT "anything related to casting". I don't want to put groupby-specific casting code in there. (I also don't like having DataFrame.reset_index code in there.)

I don't think virtually any casting code should be in groupby / frame, but we have to put it somewhere (and of course ideally there isn't any special-casing on the class of the parent container).

So I think it's the lesser of evils to keep it all together in dtypes/cast.py, potentially allowing for re-use / refactoring.

jbrockmendel · 2020-12-04T01:37:48Z

updated to remove the incorrect casting without special-casing pandas-internal EAs

jorisvandenbossche · 2020-12-04T12:49:46Z

@jbrockmendel Can you show an example of what currently gives a wrong output?
(looking at the diff it seems something with Period? )

jreback · 2020-12-04T14:18:32Z

ok code / tests look ok, can you add a whatsnew note in the EA section (or groupby)

jorisvandenbossche

Needs further discussion first

jbrockmendel · 2020-12-04T15:12:11Z

Can you show an example of what currently gives a wrong output?
(looking at the diff it seems something with Period? )

import pandas as pd

dti = pd.date_range("2012-01-01", periods=4)
pi = dti.to_period("D")

df = pd.DataFrame({"a": [0, 0, 1, 1], "b": pi})
gb = df.groupby("a")

>>> gb["b"].agg(lambda x: x.iloc[0].year)
a
0    2012-01-01
1    2012-01-01
Name: b, dtype: period[D]
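For comparison, the values the lambda itself produces can be checked without going through the aggregation's cast-back step. This is a minimal sketch assuming a pandas installation; the name `per_group` is illustrative only.

```python
import pandas as pd

dti = pd.date_range("2012-01-01", periods=4)
pi = dti.to_period("D")
df = pd.DataFrame({"a": [0, 0, 1, 1], "b": pi})

# Apply the same function by hand to each group's "b" values,
# bypassing groupby.agg and any casting it performs afterwards.
per_group = {
    key: df.loc[df["a"] == key, "b"].iloc[0].year
    for key in (0, 1)
}
```

Each value here is the plain integer 2012, so an integer-typed result is what one would expect from the aggregation, not a period[D] Series.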

jorisvandenbossche · 2020-12-04T18:48:53Z

I would personally focus on fixing the actual cause (stricter way to cast back), instead of doing this "fix" (because it's fixing period, but at a cost of causing regressions for other dtypes)

jbrockmendel · 2020-12-04T21:12:34Z

(because it's fixing period, but at a cost of causing regressions for other dtypes)

how much of your opinion is based on IntegerArray/FloatingArray/BooleanArray? We could special-case those similar to what we do in _ea_wrap_cython_operation.

For general EAs (and DecimalArray) I think "sometimes giving completely incorrect results" is a much bigger problem than "sometimes giving correct results in a sub-optimal container"

jorisvandenbossche · 2020-12-05T17:18:45Z

how much of your opinion is based on IntegerArray/FloatingArray/BooleanArray?

Yes, quite a bit ;). Also based on geopandas, but our _from_sequence is already pretty strict there anyway, so we don't really have this problem.

Now, I don't think this is a critical bug fix for 1.2, so assuming it's for 1.3 (I think the rc will be cut any time now), that gives us more time to fix this, in which case I would prefer to focus on getting #38315 done.
(if we do that, will move your period test there to ensure that is fixed)

We could also still merge this, and revert most of the test changes in #38315, but I'm not sure what that buys us (apart from a bigger diff in #38315)

jbrockmendel · 2020-12-05T20:57:05Z

Yes, quite a bit ;).

I'd be OK with re-implementing part of an earlier commit that special-cases those dtypes. That would fix the Period-like bug and only change the Decimal-like cases, which are relatively straightforward conceptually. Would you be OK with that?

jorisvandenbossche · 2020-12-10T21:51:25Z

pandas/tests/groupby/aggregate/test_other.py

@@ -454,6 +457,8 @@ def test_agg_tzaware_non_datetime_result():
    result = gb["b"].agg(lambda x: x.iloc[-1] - x.iloc[0])
    expected = Series([pd.Timedelta(days=1), pd.Timedelta(days=1)], name="b")
    expected.index.name = "a"
+    if as_period:
+        expected = expected.astype(object)


Is this the correct expected result? When subtracting Periods, you get offset objects, not Timedelta objects?

good point. probably not great that tm.assert_series_equal can't tell the difference either
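The point about Period subtraction can be checked directly on scalars. A small sketch, assuming a pandas installation; the variable names are illustrative:

```python
import pandas as pd

start = pd.Period("2012-01-01", freq="D")
end = pd.Period("2012-01-02", freq="D")

# Subtracting two Periods of the same frequency yields an offset object
# (here a Day offset), not a Timedelta.
diff = end - start

is_day_offset = isinstance(diff, pd.offsets.Day)
is_timedelta = isinstance(diff, pd.Timedelta)
```

Under these assumptions `is_day_offset` is true and `is_timedelta` is false, so an object-dtype Series of offsets and an object-dtype Series of Timedeltas hold different values even if a comparison helper treats them as equal.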

jorisvandenbossche · 2020-12-10T21:52:26Z

That would fix the Period-like bug and only change the Decimal-like cases, which are relatively straightforward conceptually. Would you be OK with that?

I still don't fully understand what that would buy us, since #38315 is also fixing the Period-like bug, but without regressing on the decimal case. So I would just need to revert most of this PR in #38315 (except the period test case, which I can add there), making the diff there larger.
Your input on the different questions I laid out in #38315 is very welcome, to ensure we can move it forward.

jbrockmendel · 2020-12-11T18:14:03Z

since #38315 is also fixing the Period-like bug, but without regressing on the decimal case

#38315 is very ambitious and potentially addressing multiple intertwined issues. I'm not ready to consider it a "solved problem".

jorisvandenbossche · 2020-12-13T10:06:56Z

I'm not ready to consider it a "solved problem".

But this PR is also not "solving the problem" ..
It's IMO rather complicating the situation by making part of the tests incorrect. I am fine if the PR would just focus on fixing the Period case though; then it already fixes this specific case and adds a test, making #38315 actually easier instead of more complex.

jorisvandenbossche · 2020-12-13T10:08:51Z

pandas/core/groupby/ops.py

@@ -750,7 +746,7 @@ def _aggregate_series_pure_python(self, obj: Series, func: F):
            result[label] = res

        result = lib.maybe_convert_objects(result, try_float=0)
-        result = maybe_cast_result(result, obj, numeric_only=True)
+        # TODO: cast to EA once _from_sequence is reliably strict GH#38254


Inside maybe_cast_result, we already have a special case check for not casting back when having datetime or categorical data. You could add periods to that check, and then this line can be left as is, while still fixing the period issue.
Then we can add the period test, without needing to change the decimal tests.

Inside maybe_cast_result, we already have a special case check for not casting back when having datetime or categorical data.

That's actually what I find most objectionable about maybe_cast_result. It is a kludge that special-cases pandas-internal EAs.

But as I said below, it would at least be a consistent "kludge"... ;)
Not calling the method here that was written exactly to be called here is not making it any less kludgy IMO.

So I think it would be good to add the period test case here, and then focus on #38315 to properly fix it.

not making it any less kludgy

This is the only remaining usage of that kludge, so getting rid of it would certainly make it less kludgy.

and then focus on #38315 to properly fix it.

It's going to be a while before 1.3, so I'm fine sticking a pin in this to see if a better solution is implemented between now and then. If it isn't, we should do this for 1.3.

jorisvandenbossche · 2020-12-13T10:11:06Z

I think updating maybe_cast_result to special case periods as well is actually a nice short term solution to get this fix in (we already do that for other dtypes as well, so while it is not the nicest thing to do (but one of the goals of #38315 will be to clean that up), it's at least consistent ..).
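The short-term approach discussed here amounts to a dtype skip-list consulted before casting back. The sketch below is only an illustration of that idea and does not reproduce the actual logic of maybe_cast_result; the helper name `should_cast_back` and the tuple `_NO_CAST_BACK` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical skip-list: dtypes for which an aggregation result should
# not be cast back to the original dtype. Period is the proposed addition.
_NO_CAST_BACK = (pd.DatetimeTZDtype, pd.CategoricalDtype, pd.PeriodDtype)


def should_cast_back(dtype) -> bool:
    """Return False for dtypes on the skip-list, True otherwise."""
    if isinstance(dtype, _NO_CAST_BACK):
        return False
    # Naive datetime64 is a NumPy dtype rather than an extension dtype.
    if isinstance(dtype, np.dtype) and dtype.kind == "M":
        return False
    return True
```

The objection raised in the thread applies to this sketch as well: the check hard-codes pandas-internal dtypes, so third-party extension arrays get no equivalent protection.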

jbrockmendel added 6 commits December 2, 2020 16:03

REF: consolidate casting

99073b8

lighter-weight casting

1ed9d8a

simplify casting

21b4f0f

REF: minimize python_agg_general groupby casting

56b42bb

Handle Float64

e01487f

TST: PeriodDtype

cf5bd53

jreback requested changes Dec 3, 2020

jbrockmendel added 2 commits December 3, 2020 15:54

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

02bffdb

…f-python_agg_general

dont cast

f950c75

jreback added the labels Dtype Conversions (Unexpected or buggy dtype conversions), ExtensionArray (Extending pandas with custom dtypes or arrays), and Groupby on Dec 4, 2020

jreback added this to the 1.2 milestone Dec 4, 2020

jorisvandenbossche requested changes Dec 4, 2020

jbrockmendel added 2 commits December 4, 2020 07:29

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

0544c8b

…f-python_agg_general

whatsnew

5a30b93

jbrockmendel mentioned this pull request Dec 4, 2020

REF: avoid try/except in wrapping in cython_agg_blocks #38164

Merged

jreback removed this from the 1.2 milestone Dec 4, 2020

jorisvandenbossche mentioned this pull request Dec 6, 2020

API: add EA._from_scalars / stricter casting of result values back to EA dtype #38315

Closed

2 tasks

jbrockmendel added 3 commits December 8, 2020 08:48

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

3410102

…f-python_agg_general

move whatsnew to 1.3.0

b481049

CLN: remove unused import

2047f3c

jbrockmendel added 2 commits December 8, 2020 12:51

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

b96835f

…f-python_agg_general

retain Float64

7bc78eb

jorisvandenbossche reviewed Dec 10, 2020

jbrockmendel added 2 commits December 11, 2020 10:10

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

643fab6

…f-python_agg_general

fix test

eddb089

jreback modified the milestone: 1.3 Dec 12, 2020

jorisvandenbossche reviewed Dec 13, 2020

jbrockmendel added 6 commits January 3, 2021 18:57

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

2572dda

…f-python_agg_general

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

5590ad7

…f-python_agg_general

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

5b73850

…f-python_agg_general

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

896a97a

…f-python_agg_general

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

5e611b2

…f-python_agg_general

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

961304f

…f-python_agg_general

jbrockmendel mentioned this pull request Jan 23, 2021

BUG: incorrect casting ints to Period in GroupBy.agg #39362

Merged

4 tasks

jreback closed this in #39362 Jan 24, 2021

jbrockmendel deleted the ref-python_agg_general branch January 24, 2021 22:02
