-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: transform with nunique should have dtype int64 #35130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -319,6 +319,9 @@ def maybe_cast_result_dtype(dtype: DtypeObj, how: str) -> DtypeObj: | |||
return np.dtype(np.int64) | |||
elif how in ["add", "cumsum", "sum"] and isinstance(dtype, BooleanDtype): | |||
return Int64Dtype() | |||
elif how == "nunique": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know where in nunique this is getting cast to date time? Not sure this generic function is the best place to handle this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual casting is done on the last line of this block:
pandas/pandas/core/dtypes/cast.py
Lines 156 to 166 in 7453810
if dtype.kind in ["M", "m"] and result.dtype.kind in ["i", "f"]: | |
if hasattr(dtype, "tz"): | |
# not a numpy dtype | |
if dtype.tz: | |
# convert to datetime and change timezone | |
from pandas import to_datetime | |
result = to_datetime(result).tz_localize("utc") | |
result = result.tz_convert(dtype.tz) | |
else: | |
result = result.astype(dtype) |
The function maybe_cast_result_dtype was added to allow overrides to the default casting behavior based on input datatype and the operation being performed, which I think is what we want to do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that being hit through _transform_general
? If so seems like something else if off as this should be a reduction function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - I think I understand your comment better now: perhaps there is some case in this block that is casting but shouldn't be. In the case at hand, dtype.kind
is "M"
and result.dtype.kind
is "i"
. Here is an example from the tests where we want to execute the last line:
df = pd.DataFrame(dict(a=[1], b=pd.to_datetime(["2001"])))
df.groupby("a")["b"].cummin()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right; ideally we don't have to special case here after we've already cast things. IIUC something seems off with the code path in the first place if we are going through transform machinery to get to this function, as nunique should be a reduction not a transformation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function maybe_cast_result
is being called by _transform_fast
for both frames and series:
pandas/pandas/core/groupby/generic.py
Lines 1496 to 1502 in 7453810
for i, _ in enumerate(result.columns): | |
res = algorithms.take_1d(result.iloc[:, i].values, ids) | |
# TODO: we have no test cases that get here with EA dtypes; | |
# maybe_cast_result may not be needed if EAs never get here | |
if cast: | |
res = maybe_cast_result(res, obj.iloc[:, i], how=func_nm) | |
output.append(res) |
pandas/pandas/core/groupby/generic.py
Lines 549 to 552 in 7453810
cast = self._transform_should_cast(func_nm) | |
out = algorithms.take_1d(result._values, ids) | |
if cast: | |
out = maybe_cast_result(out, self.obj, how=func_nm) |
It seems that the typical behavior is to have transform cast: _transform_should_cast
returns true any time the result is nonempty and the function name is not in base.cython_cast_blocklist
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your point is well taken though - if we're taking the result of a reduction and broadcasting to the entire frame/series, why do we need to cast? Taking these casts out, the only test that fails is
groupby.transform.test_transform.test_categorical_and_not_categorical_key
.
I'm going to take a deeper look into this case and see what's going on; removing the casting would be a much better solution. Thanks for the comments.
87854e7
to
4700e77
Compare
superseded by #35152 |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff