Skip to content

BUG: incorrect EA casting in groubpy.agg #38254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from 8 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
99073b8
REF: consolidate casting
jbrockmendel Dec 1, 2020
1ed9d8a
lighter-weight casting
jbrockmendel Dec 1, 2020
21b4f0f
simplify casting
jbrockmendel Dec 1, 2020
56b42bb
REF: minimize python_agg_general groupby casting
jbrockmendel Dec 2, 2020
e01487f
Handle Float64
jbrockmendel Dec 3, 2020
cf5bd53
TST: PeriodDtype
jbrockmendel Dec 3, 2020
02bffdb
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Dec 3, 2020
f950c75
dont cast
jbrockmendel Dec 4, 2020
0544c8b
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Dec 4, 2020
5a30b93
whatsnew
jbrockmendel Dec 4, 2020
3410102
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Dec 8, 2020
b481049
move whatsnew to 1.3.0
jbrockmendel Dec 8, 2020
2047f3c
CLN: remove unused import
jbrockmendel Dec 8, 2020
b96835f
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Dec 8, 2020
7bc78eb
retain Float64
jbrockmendel Dec 8, 2020
643fab6
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Dec 11, 2020
eddb089
fix test
jbrockmendel Dec 11, 2020
2572dda
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 4, 2021
5590ad7
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 5, 2021
5b73850
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 12, 2021
896a97a
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 16, 2021
5e611b2
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 19, 2021
961304f
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Jan 21, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion pandas/core/groupby/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -718,7 +718,7 @@ def _aggregate_series_pure_python(self, obj: Series, func: F):
result[label] = res

result = lib.maybe_convert_objects(result, try_float=0)
result = maybe_cast_result(result, obj, numeric_only=True)
# TODO: cast to EA once _from_sequence is reliably strict GH#38254
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside maybe_cast_result, we already have a special case check for not casting back when having datetime or categorical data. You could add periods to that check, and then this line can be left as is, while still fixing the period issue.
Then we can add the period test, without needing to change the decimal tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside maybe_cast_result, we already have a special case check for not casting back when having datetime or categorical data.

That's actually what I find most objectionable about maybe_cast_result. It is a kludge that special-cases pandas-internal EAs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But as I said below, it would at least be a consistent "kludge"... ;)
Not calling the method here that was exactly written to be called here, it not making it any less kludgy IMO.

So I think it would be good to add the period test case here, and then focus on #38315 to properly fix it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not making it any less kludgy

This is the only remaining usage of that kludge, so getting rid of it would certainly make it less kludgy.

and then focus on #38315 to properly fix it.

It's going to be a while before 1.3, so I'm fine sticking a pin in this to see if a better solution is implemented between now and then. If it doesn't, we should do this for 1.3.


return result, counts

Expand Down
3 changes: 2 additions & 1 deletion pandas/tests/arrays/floating/test_function.py
Original file line number Diff line number Diff line change
Expand Up @@ -157,8 +157,9 @@ def test_preserve_dtypes(op):
# groupby
result = getattr(df.groupby("A"), op)()

# GH#38254 until _from_sequence is reliably strict, we cannot retain Float64
expected = pd.DataFrame(
{"B": np.array([1.0, 3.0]), "C": pd.array([0.1, 3], dtype="Float64")},
{"B": np.array([1.0, 3.0]), "C": pd.array([0.1, 3], dtype="float64")},
index=pd.Index(["a", "b"], name="A"),
)
tm.assert_frame_equal(result, expected)
Expand Down
15 changes: 11 additions & 4 deletions pandas/tests/extension/decimal/test_decimal.py
Original file line number Diff line number Diff line change
Expand Up @@ -445,7 +445,8 @@ def test_groupby_agg():
)

# single key, selected column
expected = pd.Series(to_decimal([data[0], data[3]]))
# GH#38254 until _from_sequence is reliably strict, we cant retain dtype
expected = pd.Series(to_decimal([data[0], data[3]])).astype(object)
result = df.groupby("id1")["decimals"].agg(lambda x: x.iloc[0])
tm.assert_series_equal(result, expected, check_names=False)
result = df["decimals"].groupby(df["id1"]).agg(lambda x: x.iloc[0])
Expand All @@ -455,14 +456,16 @@ def test_groupby_agg():
expected = pd.Series(
to_decimal([data[0], data[1], data[3]]),
index=pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 1)]),
)
).astype(object)
result = df.groupby(["id1", "id2"])["decimals"].agg(lambda x: x.iloc[0])
tm.assert_series_equal(result, expected, check_names=False)
result = df["decimals"].groupby([df["id1"], df["id2"]]).agg(lambda x: x.iloc[0])
tm.assert_series_equal(result, expected, check_names=False)

# multiple columns
expected = pd.DataFrame({"id2": [0, 1], "decimals": to_decimal([data[0], data[3]])})
expected = pd.DataFrame(
{"id2": [0, 1], "decimals": to_decimal([data[0], data[3]]).astype(object)}
)
result = df.groupby("id1").agg(lambda x: x.iloc[0])
tm.assert_frame_equal(result, expected, check_names=False)

Expand All @@ -478,7 +481,11 @@ def DecimalArray__my_sum(self):

data = make_data()[:5]
df = pd.DataFrame({"id": [0, 0, 0, 1, 1], "decimals": DecimalArray(data)})
expected = pd.Series(to_decimal([data[0] + data[1] + data[2], data[3] + data[4]]))

# GH#38254 until _from_sequence is reliably strict, we cant retain dtype
expected = pd.Series(
to_decimal([data[0] + data[1] + data[2], data[3] + data[4]])
).astype(object)

result = df.groupby("id")["decimals"].agg(lambda x: x.values.my_sum())
tm.assert_series_equal(result, expected, check_names=False)
Expand Down
10 changes: 8 additions & 2 deletions pandas/tests/groupby/aggregate/test_other.py
Original file line number Diff line number Diff line change
Expand Up @@ -432,10 +432,13 @@ def test_agg_over_numpy_arrays():
tm.assert_frame_equal(result, expected)


def test_agg_tzaware_non_datetime_result():
@pytest.mark.parametrize("as_period", [True, False])
def test_agg_tzaware_non_datetime_result(as_period):
# discussed in GH#29589, fixed in GH#29641, operating on tzaware values
# with function that is not dtype-preserving
dti = pd.date_range("2012-01-01", periods=4, tz="UTC")
if as_period:
dti = dti.tz_localize(None).to_period("D")
df = DataFrame({"a": [0, 0, 1, 1], "b": dti})
gb = df.groupby("a")

Expand All @@ -454,6 +457,8 @@ def test_agg_tzaware_non_datetime_result():
result = gb["b"].agg(lambda x: x.iloc[-1] - x.iloc[0])
expected = Series([pd.Timedelta(days=1), pd.Timedelta(days=1)], name="b")
expected.index.name = "a"
if as_period:
expected = expected.astype(object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the correct expected result? When subtracting Periods, you get offset objects, not Timedelta objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. probably not great that tm.assert_series_equal can't tell the difference either

tm.assert_series_equal(result, expected)


Expand Down Expand Up @@ -627,7 +632,8 @@ def test_groupby_agg_err_catching(err_cls):
{"id1": [0, 0, 0, 1, 1], "id2": [0, 1, 0, 1, 1], "decimals": DecimalArray(data)}
)

expected = Series(to_decimal([data[0], data[3]]))
# GH#38254 until _from_sequence is strict, we cannot reliably cast agg results
expected = Series(to_decimal([data[0], data[3]])).astype(object)

def weird_func(x):
# weird function that raise something other than TypeError or IndexError
Expand Down