BUG: Groupby not keeping string dtype for empty objects #55619
npvalues = lib.maybe_convert_objects(result, try_float=False)
if preserve_dtype:
out = maybe_cast_pointwise_result(npvalues, obj.dtype, numeric_only=True)
if len(obj) == 0 and len(result) == 0 and isinstance(obj.dtype, ExtensionDtype):
What happens on an empty list of categoricals with observed=False? I think this is the only case where len(obj) == 0 but len(result) > 0.
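To make the case in question concrete, here is a minimal sketch. It is not taken from the PR: the frame and the column names `a` and `b` are made up for illustration, and it uses a plain float column rather than the string dtype. It shows that grouping an empty frame by a categorical with observed=False can still produce result rows, one per declared category:

```python
import pandas as pd

# Hypothetical frame: "a" is categorical with three declared categories.
df = pd.DataFrame({"a": pd.Categorical([1], categories=[1, 2, 3]), "b": [1.0]})
empty = df.iloc[:0]  # drops all rows but keeps dtypes and categories

# observed=False keeps unobserved categories in the result index.
result = empty.groupby("a", observed=False)["b"].min()

print(len(empty), len(result))
```

So a check that requires both len(obj) == 0 and len(result) == 0 would not be taken for this input.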
I didn't track this down specifically, but the test that was mentioned in the comment no longer passes through here.
Running on this PR as-is:
import pandas as pd
from pandas import DataFrame

func = "min"
dtype = "string[pyarrow_numpy]"
df = DataFrame({"a": ["a"], "b": "a", "c": "a"}, dtype=dtype)
df["a"] = pd.Categorical([1], categories=[1, 2, 3])
df = df.iloc[:0]
result = getattr(df.groupby("a", observed=False), func)()
print(result.dtypes)
# b float64
# c float64
# dtype: object
If you remove the condition len(result) == 0 in the line highlighted above, you get string[pyarrow_numpy] for the dtypes instead. However, the values in the DataFrame are still the float NaN. I didn't realize this was possible?
The new string dtype uses NaN as its missing-value representation to keep NumPy semantics, so it will work if you feed it an array that is all NaN.
Ah - thanks
I think this still needs to be fixed (but it's okay if not for 2.1.2).
Thanks @phofl.
Owee, I'm MrMeeseeks, Look at me. There seems to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the … If these instructions are inaccurate, feel free to suggest an improvement.
…5619) * BUG: Groupby not keeping string dtype for empty objects * Fix --------- Co-authored-by: Thomas Li <[email protected]> (cherry picked from commit 8afd868)
…type for empty objects) (#55705) BUG: Groupby not keeping string dtype for empty objects (#55619) * BUG: Groupby not keeping string dtype for empty objects * Fix --------- Co-authored-by: Thomas Li <[email protected]> (cherry picked from commit 8afd868) Co-authored-by: Patrick Hoefler <[email protected]>
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
cc @rhshadrach