BUG: groupby transform doesn't respect Series index anymore #45648
Comments
I suspect the Series result is aligning to the index of the input. If that is the case, it isn't clear to me whether this should be the expected behavior. I haven't been able to find any indication in the docs. |
I suspect we may not be seeing your full intended op here, but if you want to sort within groups, you can do |
Thanks for looking into this Richard! Wouldn't you expect
and
to return the same result (besides the obvious Series vs DataFrame type difference)? Regardless, I'd argue that transform is more useful when it does not realign the index. In other words, the old behavior was more useful than the current behavior. |
Sure, but unfortunately that doesn't help determine which one is correct!
I don't necessarily agree here, but the docs don't seem to clarify, e.g.:
It appears to me this is written assuming a non-series or frame result (e.g. tuple, list, or NumPy array), in which case no alignment is (or even, can be) done. Indeed, you can use this to accomplish your goal:
But I think the behavior when the result is a Series should be decided, fixed, and better documented. |
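As a minimal sketch of the point above about non-Series results: if the function passed to transform returns a NumPy array, there is no index to align on, so the sorted values are written back at each group's positions. The DataFrame here is hypothetical (the one from the original report is not reproduced in this thread), and the lambda is one possible way to write it, not necessarily the snippet originally posted.

```python
import pandas as pd

# Hypothetical data; column names A and C follow the issue description.
df = pd.DataFrame({"A": [1, 1, 2, 2, 1], "C": [3, 1, 5, 4, 2]})

# to_numpy() discards the index of the sorted group, so transform cannot
# realign the result to the input; values land in sorted order per group.
sorted_within_groups = df.groupby("A")["C"].transform(
    lambda s: s.sort_values().to_numpy()
)

print(sorted_within_groups.tolist())  # [1, 2, 4, 5, 3]
```

The result keeps the original index of df; only the values within each group of A are reordered.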
Yah this seems ambiguous. In general I expect .transform to return an object indexed the same as the original, while for sort_values I expect the index to be reordered within groups. |
@jbrockmendel Thanks for chiming in. Can you clarify what you mean by "
do you think that
should return option 1 (current behavior)
or option 2 (old behavior)
My arguments for option 2 are
|
Agreed on both counts. At least one place in the code says that getting back the same index is the definition of a transform, and that's the definition used in
My expectation is for pandas to align whenever possible, and so getting a Series result and then not aligning to the input index would be jarring. To me, this outweighs the fact that aligning would make this particular operation a no-op. "Special cases aren't special enough to break the rules."
@ben519 - thanks for the feedback here. Responding to your points immediately above. (1) This isn't an argument for option 2, as it applies equally well to option 1. (2) I disagree here; not aligning is unintuitive. (3) Agreed, option 2 is more useful in this case, but that is not a strong enough reason to create inconsistencies in behavior. |
Can you explain this? (My point is, |
When you have two operations that don't agree, you can change either one to make them consistent. |
Ah, so you're advocating for changing the current behavior of |
Yes. |
Clarifying my original comment "In general I expect .transform to return an object indexed the same as the original" When I think of a "transform", the prototypical examples that come to mind are shift, diff, cumsum. But the sort_values example you give is a fine example where that intuition breaks down. AFAICT this is a Hard Problem (tm) that @rhshadrach is on top of. Which is good, because I don't have any bright ideas. |
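To illustrate the "prototypical" transforms mentioned above, here is a small sketch using a grouped cumulative sum. The data is hypothetical and chosen only for illustration. For operations like this, each output value corresponds to the input row at the same label, so aligning to the original index changes nothing.

```python
import pandas as pd

# Hypothetical data; column names A and C follow the issue description.
df = pd.DataFrame({"A": [1, 1, 2, 2, 1], "C": [3, 1, 5, 4, 2]})

# A running total within each group of A, returned in the original row order.
running_total = df.groupby("A")["C"].transform("cumsum")

print(running_total.tolist())                # [3, 4, 5, 9, 6]
print(running_total.index.equals(df.index))  # True
```

Sorting differs from this case because the function itself reorders the index within each group, which is where the intuition about alignment stops being obvious.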
As expressed above, I do think pandas should default to aligning when possible, and that seems to me to be the case here. On top of this, if one puts null values in the groupers, there are already examples where transform is aligning:
|
Labelling as a regression, as this is an (undocumented) change of behavior from pandas 1.2.5. |
@simonjayhawkins In 1.2.5 both DataFrame and Series did not align. Currently DataFrame does not align and Series aligns (the regression here). In #47244 I deprecated the DataFrame behavior in favor of the Series behavior so that both will align in the future. Is that sufficient to close this issue (assuming there is consensus on the deprecation), or should some other action also be taken? |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Suppose I have the following DataFrame
I used to be able to sort C's values within the groups defined by A like this (proof):
(This was sometime around pandas 1.0.0)
However pandas 1.4.0 produces the following
It's as if the sort_values() function isn't even being applied.
Note that df.groupby('A')[['C']].transform(pd.Series.sort_values) works properly.
Expected Behavior
I would expect df.groupby('A')['C'].transform(pd.Series.sort_values) to actually sort C's values within the groups defined by A (as it once did). My expected output in this example would be a Series like this.
Installed Versions
INSTALLED VERSIONS
commit : bb1f651
python : 3.10.1.final.0
python-bits : 64
OS : Darwin
OS-release : 21.2.0
Version : Darwin Kernel Version 21.2.0: Sun Nov 28 20:28:54 PST 2021; root:xnu-8019.61.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.4.0
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.2
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None