BUG: groupby transform doesn't respect Series index anymore #45648

ben519 · 2022-01-26T22:58:16Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8]
}).convert_dtypes()

print(df)
     A    C
0  foo  2.1
1  bar  1.9
2  foo  3.6
3  bar  4.0
4  bar  1.9
5  foo  7.8
6  foo  2.8

# sort C within groups defined by A
df.groupby('A')['C'].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  3.6
3  4.0
4  1.9
5  7.8
6  2.8

Issue Description

Suppose I have the following DataFrame

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8]
}).convert_dtypes()

print(df)
     A    C
0  foo  2.1
1  bar  1.9
2  foo  3.6
3  bar  4.0
4  bar  1.9
5  foo  7.8
6  foo  2.8

I used to be able to sort C's values within the groups defined by A like this (proof):

df.groupby('A')['C'].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  2.8
3  1.9
4  4.0
5  3.6
6  7.8

(This was sometime around Pandas 1.0.0)

However pandas 1.4.0 produces the following

df.groupby('A')['C'].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  3.6
3  4.0
4  1.9
5  7.8
6  2.8

It's as if the sort_values() function isn't even being applied.

Note that df.groupby('A')[['C']].transform(pd.Series.sort_values) works properly.

df.groupby('A')[['C']].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  2.8
3  1.9
4  4.0
5  3.6
6  7.8

Expected Behavior

I would expect df.groupby('A')['C'].transform(pd.Series.sort_values) to actually sort C's values within the groups defined by A (as it once did). My expected output in this example would be a Series like this.

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.10.1.final.0
python-bits : 64
OS : Darwin
OS-release : 21.2.0
Version : Darwin Kernel Version 21.2.0: Sun Nov 28 20:28:54 PST 2021; root:xnu-8019.61.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.4.0
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.2
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

The text was updated successfully, but these errors were encountered:

rhshadrach · 2022-01-27T04:20:00Z

60f5d8b26a30fb74171d95d7b032ffea1220dc1d is the first bad commit
commit 60f5d8b26a30fb74171d95d7b032ffea1220dc1d
Author: jbrockmendel <[email protected]>
Date:   Mon May 17 12:18:11 2021 -0700

    REF: document casting behavior in groupby (#41376)

cc @jbrockmendel

I suspect the Series result is aligning to the index of the input. If that is the case, it isn't clear to me whether this should be the expected behavior. I haven't been able to find any indication in the docs.

rhshadrach · 2022-01-27T04:22:02Z

I suspect we may not be seeing your full intended op here, but if you want to sort within groups, you can do df.sort_values(['A', C'])['C']. If you want a range index on this, then add .reset_index(drop=True).

ben519 · 2022-01-27T04:28:32Z

Thanks for looking into this Richard! Wouldn't you expect

df.groupby('A')['C'].transform(pd.Series.sort_values)

and

df.groupby('A')[['C']].transform(pd.Series.sort_values)

to return the same result (besides the obvious Series vs DataFrame type difference)?

Regardless, I'd argue that transform is more useful when it does not realign the index. In other words, the old behavior was more useful than the current behavior.

rhshadrach · 2022-01-27T04:44:34Z

Wouldn't you expect...to return the same result

Sure, but unfortunately that doesn't help determine which one is correct!

Regardless, I'd argue that transform is more useful when it does not realign the index. In other words, the old behavior was more useful than the current behavior.

I don't necessarily agree here, but the docs don't seem to clarify, e.g.:

The transform method returns an object that is indexed the same (same size) as the one being grouped.

It appears to me this is written assuming a non-series or frame result (e.g. tuple, list, or NumPy array), in which case no alignment is (or even, can be) done. Indeed, you can use this to accomplish your goal:

df.groupby('A')['C'].transform(lambda x: x.sort_values().to_numpy())

But I think the behavior when the result is a Series should be decided, fixed, and better documented.

jbrockmendel · 2022-01-30T00:46:20Z

But I think the behavior when the result is a Series should be decided, fixed, and better documented.

Yah this seems ambiguous. In general I expect .transform to return an object indexed the same as the original, while for sort_values I expect within-groups for the index to be changed.

ben519 · 2022-01-30T01:22:39Z

@jbrockmendel Thanks for chiming in. Can you clarify what you mean by ".transform should return an object indexed the same as the original"? For example, given this DataFrame,

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8]
}).convert_dtypes()

do you think that

df.groupby('A')['C'].transform(pd.Series.sort_values)

should return

option 1 (current behavior)

or option 2 (old behavior)

My arguments for option 2 are

df.groupby('A')['C'].transform(pd.Series.sort_values) should produce the same result as df.groupby('A')[['C']].transform(pd.Series.sort_values), just in Series form.
option 2 is more intuitive. I've always interpreted transform as applying a function to each sub-Series. In other words, df.groupby('A')['C'].transform(pd.Series.sort_values) should be interpreted as applying transform to each C sub-Series in the groups defined by the A
option 2 is more useful. There are more things you can do with this behavior.

rhshadrach · 2022-01-30T16:29:06Z

@jbrockmendel

In general I expect .transform to return an object indexed the same as the original, while for sort_values I expect within-groups for the index to be changed.

Agreed on both counts. At least one place in the code says that getting back the same index is the definition of a transform, and that's the definition used in apply to determine if an operation is a transform. But all the public docs I've been able to find say "indexed the same (same size)" which to me is (a) not well written (parenthetical remarks should not change meaning) and (b) not correct.

My expectation is for pandas to align whenever possible, and so getting a Series result and then not aligning to the input index would be jarring. To me, this outweighs the fact that aligning would make this particular operation a no-op. "Special cases aren't special enough to break the rules."

@ben519 - thanks for the feedback here. Responding to your points immediately above. (1) This isn't an argument for option 2, as it applies equally as well to option 1. (2) I disagree here, not aligning is unintuitive. (3) agreed option 2 is more useful in this case, but that is not a strong enough to create inconsistencies in behavior.

ben519 · 2022-01-30T16:33:18Z

@rhshadrach

Responding to your points immediately above. (1) This isn't an argument for option 2, as it applies equally as well to option 1.

Can you explain this?

(My point is, df.groupby('A')['C'].transform(func) should equal df.groupby('A')[['C']].transform(func).C as I think this is most intuitive and "natural".)

rhshadrach · 2022-01-30T16:57:05Z

When you have two operations that don't agree, you can change either one to make them consistent.

ben519 · 2022-01-30T16:58:29Z

Ah, so you're advocating for changing the current behavior of df.groupby('A')[['C']].transform(func).?

rhshadrach · 2022-01-30T16:59:56Z

Yes.

jbrockmendel · 2022-02-19T00:47:07Z

Can you clarify what you mean by ".transform should return an object indexed the same as the original"? For example, given this DataFrame [...]

Clarifying my original comment "In general I expect .transform to return an object indexed the same as the original"

When I think of a "transform", the prototypical examples that come to mind are shift, diff, cumsum. But the sort_values example you give is a fine example where that intuition breaks down.

AFAICT this is a Hard Problem (tm) that @rhshadrach is on top of. Which is good, because I don't have any bright ideas.

rhshadrach · 2022-06-05T17:05:36Z

As expressed above, I do think pandas should default to aligning when possible and that seems to me to be the case here. On top of this, If one puts null values in the groupers, there are already examples where transform is aligning:

def func(x):
    result = x.sum()
    print(result)
    return result

df = pd.DataFrame({'a': [1, np.nan, 2, 2], 'b': [1, 2, 3, 4]})
gb = df.groupby('a', dropna=True)
result = gb.transform(func)
print(result)

# Produces
# b    1
# dtype: int64
# b    7
# dtype: int64
#      b
# 0  1.0
# 1  NaN
# 2  7.0
# 3  7.0

simonjayhawkins · 2022-06-10T12:58:18Z

REF: document casting behavior in groupby (#41376)

labelling as regression as (undocumented) change of behavior from pandas-1.2.5

rhshadrach · 2022-06-10T21:22:59Z

@simonjayhawkins In 1.2.5 both DataFrame and Series did not align. Currently DataFrame does not align and Series aligns (the regression here). In #47244 I deprecated the DataFrame behavior in favor of the Series behavior so that both will align in the future. Is that sufficient to close this issue (assuming there is consensus on the deprecation), or should some other action also be taken?

ben519 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 26, 2022

rhshadrach added Apply Apply, Aggregate, Transform, Map Groupby labels Jan 27, 2022

rhshadrach added API - Consistency Internal Consistency of API/Behavior Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 27, 2022

rhshadrach mentioned this issue Jun 5, 2022

DEPR: groupby.transform not aligning with input index #47244

Merged

4 tasks

rhshadrach added this to the 1.5 milestone Jun 5, 2022

simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Jun 10, 2022

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 11, 2022

code sample for pandas-dev#45648

815938b

jreback closed this as completed in #47244 Jun 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: groupby transform doesn't respect Series index anymore #45648

BUG: groupby transform doesn't respect Series index anymore #45648

ben519 commented Jan 26, 2022 •

edited

Loading

INSTALLED VERSIONS

rhshadrach commented Jan 27, 2022

rhshadrach commented Jan 27, 2022

ben519 commented Jan 27, 2022

rhshadrach commented Jan 27, 2022 •

edited

Loading

jbrockmendel commented Jan 30, 2022

ben519 commented Jan 30, 2022 •

edited

Loading

rhshadrach commented Jan 30, 2022

ben519 commented Jan 30, 2022 •

edited

Loading

rhshadrach commented Jan 30, 2022

ben519 commented Jan 30, 2022

rhshadrach commented Jan 30, 2022

jbrockmendel commented Feb 19, 2022

rhshadrach commented Jun 5, 2022 •

edited

Loading

simonjayhawkins commented Jun 10, 2022 •

edited

Loading

rhshadrach commented Jun 10, 2022

BUG: groupby transform doesn't respect Series index anymore #45648

BUG: groupby transform doesn't respect Series index anymore #45648

Comments

ben519 commented Jan 26, 2022 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

rhshadrach commented Jan 27, 2022

rhshadrach commented Jan 27, 2022

ben519 commented Jan 27, 2022

rhshadrach commented Jan 27, 2022 • edited Loading

jbrockmendel commented Jan 30, 2022

ben519 commented Jan 30, 2022 • edited Loading

rhshadrach commented Jan 30, 2022

ben519 commented Jan 30, 2022 • edited Loading

rhshadrach commented Jan 30, 2022

ben519 commented Jan 30, 2022

rhshadrach commented Jan 30, 2022

jbrockmendel commented Feb 19, 2022

rhshadrach commented Jun 5, 2022 • edited Loading

simonjayhawkins commented Jun 10, 2022 • edited Loading

rhshadrach commented Jun 10, 2022

ben519 commented Jan 26, 2022 •

edited

Loading

rhshadrach commented Jan 27, 2022 •

edited

Loading

ben519 commented Jan 30, 2022 •

edited

Loading

ben519 commented Jan 30, 2022 •

edited

Loading

rhshadrach commented Jun 5, 2022 •

edited

Loading

simonjayhawkins commented Jun 10, 2022 •

edited

Loading