-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Convert result of group by agg to pyarrow if input is pyarrow #58129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 31 commits
9faa460
969d5b1
66114f3
b0290ed
20c8fa0
97b3d54
932d737
82ddeb5
d510052
62a31d9
a54bf58
64330f0
0647711
affde38
842f561
93b5bf3
6f35c0e
abd0adf
bebc442
bb6343b
3a3f2a2
6dc40f5
4ef96f7
c6a98c0
9181eaf
612d7d0
3b6696b
680e238
3a8597e
6496b15
712c36a
e1ccef6
0ce083d
a1d73f5
57845a8
fa257b0
0a9b83f
139319a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1125,6 +1125,38 @@ def test_comp_masked_numpy(self, masked_dtype, comparison_op): | |
expected = pd.Series(exp, dtype=ArrowDtype(pa.bool_())) | ||
tm.assert_series_equal(result, expected) | ||
|
||
def test_groupby_agg_extension(self, data_for_grouping): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this test should behave the same as the one in the base class. If that's the case, this can be removed. Can you confirm? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. resolved |
||
# GH#38980 groupby agg on extension type fails for non-numeric types | ||
df = pd.DataFrame({"A": [1, 1, 2, 2, 3, 3, 1, 4], "B": data_for_grouping}) | ||
|
||
def expected_inferred_result_dtype(dtype): | ||
pa_dtype = dtype.pyarrow_dtype | ||
if pa.types.is_date(pa_dtype): | ||
return "date32[day][pyarrow]" | ||
elif pa.types.is_time(pa_dtype): | ||
return "time64[us][pyarrow]" | ||
elif pa.types.is_decimal(pa_dtype): | ||
return ArrowDtype(pa.decimal128(4, 3)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On closer look, I think this is a bug being introduced here. This test is using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. resolved |
||
else: | ||
return dtype | ||
|
||
expected_df = pd.DataFrame( | ||
{"A": [1, 1, 2, 2, 3, 3, 1, 4], "B": data_for_grouping} | ||
) | ||
expected = expected_df.iloc[[0, 2, 4, 7]] | ||
expected = expected.set_index("A") | ||
expected_dtype = expected_inferred_result_dtype(expected["B"].dtype) | ||
expected["B"] = expected["B"].astype(expected_dtype) | ||
|
||
result = df.groupby("A").agg({"B": "first"}) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
result = df.groupby("A").agg("first") | ||
tm.assert_frame_equal(result, expected) | ||
|
||
result = df.groupby("A").first() | ||
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
class TestLogicalOps: | ||
"""Various Series and DataFrame logical ops methods.""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a test that hits this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved, the
test_agg_lambda_pyarrow_struct_to_object_dtype_conversion
test hits thisThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel - I was surprised
maybe_cast_pointwise_result
was giving us back a Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go indtypes.cast._maybe_cast_to_extension_array
in a followup. Any thoughts?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give an example? this confuses me.
_maybe_cast_to_extension_array is only used in maybe_cast_pointwise_result, so not a huge deal either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel - sorry for the noise, I was not aware we could support struct dtypes. I think everything is okay here.
@undermyumbrella1 - why go with NumPy object dtype instead of struct dtypes here?