Skip to content

BUG: groupby transform doesn't respect Series index anymore #45648

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
ben519 opened this issue Jan 26, 2022 · 15 comments · Fixed by #47244
Closed
3 tasks done

BUG: groupby transform doesn't respect Series index anymore #45648

ben519 opened this issue Jan 26, 2022 · 15 comments · Fixed by #47244
Labels
API - Consistency Internal Consistency of API/Behavior Apply Apply, Aggregate, Transform, Map Bug Groupby Needs Discussion Requires discussion from core team before further action Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@ben519
Copy link

ben519 commented Jan 26, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8]
}).convert_dtypes()

print(df)
     A    C
0  foo  2.1
1  bar  1.9
2  foo  3.6
3  bar  4.0
4  bar  1.9
5  foo  7.8
6  foo  2.8

# sort C within groups defined by A
df.groupby('A')['C'].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  3.6
3  4.0
4  1.9
5  7.8
6  2.8

Issue Description

Suppose I have the following DataFrame

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8]
}).convert_dtypes()

print(df)
     A    C
0  foo  2.1
1  bar  1.9
2  foo  3.6
3  bar  4.0
4  bar  1.9
5  foo  7.8
6  foo  2.8

I used to be able to sort C's values within the groups defined by A like this (proof):

df.groupby('A')['C'].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  2.8
3  1.9
4  4.0
5  3.6
6  7.8

(This was sometime around Pandas 1.0.0)

However pandas 1.4.0 produces the following

df.groupby('A')['C'].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  3.6
3  4.0
4  1.9
5  7.8
6  2.8

It's as if the sort_values() function isn't even being applied.

Note that df.groupby('A')[['C']].transform(pd.Series.sort_values) works properly.

df.groupby('A')[['C']].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  2.8
3  1.9
4  4.0
5  3.6
6  7.8

Expected Behavior

I would expect df.groupby('A')['C'].transform(pd.Series.sort_values) to actually sort C's values within the groups defined by A (as it once did). My expected output in this example would be a Series like this.

     C
0  2.1
1  1.9
2  2.8
3  1.9
4  4.0
5  3.6
6  7.8

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.10.1.final.0
python-bits : 64
OS : Darwin
OS-release : 21.2.0
Version : Darwin Kernel Version 21.2.0: Sun Nov 28 20:28:54 PST 2021; root:xnu-8019.61.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.4.0
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.2
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@ben519 ben519 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 26, 2022
@rhshadrach rhshadrach added Apply Apply, Aggregate, Transform, Map Groupby labels Jan 27, 2022
@rhshadrach
Copy link
Member

60f5d8b26a30fb74171d95d7b032ffea1220dc1d is the first bad commit
commit 60f5d8b26a30fb74171d95d7b032ffea1220dc1d
Author: jbrockmendel <[email protected]>
Date:   Mon May 17 12:18:11 2021 -0700

    REF: document casting behavior in groupby (#41376)

cc @jbrockmendel

I suspect the Series result is aligning to the index of the input. If that is the case, it isn't clear to me whether this should be the expected behavior. I haven't been able to find any indication in the docs.

@rhshadrach
Copy link
Member

I suspect we may not be seeing your full intended op here, but if you want to sort within groups, you can do df.sort_values(['A', C'])['C']. If you want a range index on this, then add .reset_index(drop=True).

@ben519
Copy link
Author

ben519 commented Jan 27, 2022

Thanks for looking into this Richard! Wouldn't you expect

df.groupby('A')['C'].transform(pd.Series.sort_values)

and

df.groupby('A')[['C']].transform(pd.Series.sort_values)

to return the same result (besides the obvious Series vs DataFrame type difference)?

Regardless, I'd argue that transform is more useful when it does not realign the index. In other words, the old behavior was more useful than the current behavior.

@rhshadrach
Copy link
Member

rhshadrach commented Jan 27, 2022

Wouldn't you expect...to return the same result

Sure, but unfortunately that doesn't help determine which one is correct!

Regardless, I'd argue that transform is more useful when it does not realign the index. In other words, the old behavior was more useful than the current behavior.

I don't necessarily agree here, but the docs don't seem to clarify, e.g.:

The transform method returns an object that is indexed the same (same size) as the one being grouped.

It appears to me this is written assuming a non-series or frame result (e.g. tuple, list, or NumPy array), in which case no alignment is (or even, can be) done. Indeed, you can use this to accomplish your goal:

df.groupby('A')['C'].transform(lambda x: x.sort_values().to_numpy())

But I think the behavior when the result is a Series should be decided, fixed, and better documented.

@rhshadrach rhshadrach added API - Consistency Internal Consistency of API/Behavior Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 27, 2022
@jbrockmendel
Copy link
Member

But I think the behavior when the result is a Series should be decided, fixed, and better documented.

Yah this seems ambiguous. In general I expect .transform to return an object indexed the same as the original, while for sort_values I expect within-groups for the index to be changed.

@ben519
Copy link
Author

ben519 commented Jan 30, 2022

@jbrockmendel Thanks for chiming in. Can you clarify what you mean by ".transform should return an object indexed the same as the original"? For example, given this DataFrame,

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8]
}).convert_dtypes()

do you think that

df.groupby('A')['C'].transform(pd.Series.sort_values)

should return

option 1 (current behavior)

     C
0  2.1
1  1.9
2  3.6
3  4.0
4  1.9
5  7.8
6  2.8

or option 2 (old behavior)

     C
0  2.1
1  1.9
2  2.8
3  1.9
4  4.0
5  3.6
6  7.8

My arguments for option 2 are

  1. df.groupby('A')['C'].transform(pd.Series.sort_values) should produce the same result as df.groupby('A')[['C']].transform(pd.Series.sort_values), just in Series form.
  2. option 2 is more intuitive. I've always interpreted transform as applying a function to each sub-Series. In other words, df.groupby('A')['C'].transform(pd.Series.sort_values) should be interpreted as applying transform to each C sub-Series in the groups defined by the A
  3. option 2 is more useful. There are more things you can do with this behavior.

@rhshadrach
Copy link
Member

@jbrockmendel

In general I expect .transform to return an object indexed the same as the original, while for sort_values I expect within-groups for the index to be changed.

Agreed on both counts. At least one place in the code says that getting back the same index is the definition of a transform, and that's the definition used in apply to determine if an operation is a transform. But all the public docs I've been able to find say "indexed the same (same size)" which to me is (a) not well written (parenthetical remarks should not change meaning) and (b) not correct.

My expectation is for pandas to align whenever possible, and so getting a Series result and then not aligning to the input index would be jarring. To me, this outweighs the fact that aligning would make this particular operation a no-op. "Special cases aren't special enough to break the rules."

@ben519 - thanks for the feedback here. Responding to your points immediately above. (1) This isn't an argument for option 2, as it applies equally as well to option 1. (2) I disagree here, not aligning is unintuitive. (3) agreed option 2 is more useful in this case, but that is not a strong enough to create inconsistencies in behavior.

@ben519
Copy link
Author

ben519 commented Jan 30, 2022

@rhshadrach

Responding to your points immediately above. (1) This isn't an argument for option 2, as it applies equally as well to option 1.

Can you explain this?

(My point is, df.groupby('A')['C'].transform(func) should equal df.groupby('A')[['C']].transform(func).C as I think this is most intuitive and "natural".)

@rhshadrach
Copy link
Member

When you have two operations that don't agree, you can change either one to make them consistent.

@ben519
Copy link
Author

ben519 commented Jan 30, 2022

Ah, so you're advocating for changing the current behavior of df.groupby('A')[['C']].transform(func).?

@rhshadrach
Copy link
Member

Yes.

@jbrockmendel
Copy link
Member

Can you clarify what you mean by ".transform should return an object indexed the same as the original"? For example, given this DataFrame [...]

Clarifying my original comment "In general I expect .transform to return an object indexed the same as the original"

When I think of a "transform", the prototypical examples that come to mind are shift, diff, cumsum. But the sort_values example you give is a fine example where that intuition breaks down.

AFAICT this is a Hard Problem (tm) that @rhshadrach is on top of. Which is good, because I don't have any bright ideas.

@rhshadrach
Copy link
Member

rhshadrach commented Jun 5, 2022

As expressed above, I do think pandas should default to aligning when possible and that seems to me to be the case here. On top of this, If one puts null values in the groupers, there are already examples where transform is aligning:

def func(x):
    result = x.sum()
    print(result)
    return result

df = pd.DataFrame({'a': [1, np.nan, 2, 2], 'b': [1, 2, 3, 4]})
gb = df.groupby('a', dropna=True)
result = gb.transform(func)
print(result)

# Produces
# b    1
# dtype: int64
# b    7
# dtype: int64
#      b
# 0  1.0
# 1  NaN
# 2  7.0
# 3  7.0

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Jun 10, 2022

REF: document casting behavior in groupby (#41376)

labelling as regression as (undocumented) change of behavior from pandas-1.2.5

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Jun 10, 2022
@rhshadrach
Copy link
Member

@simonjayhawkins In 1.2.5 both DataFrame and Series did not align. Currently DataFrame does not align and Series aligns (the regression here). In #47244 I deprecated the DataFrame behavior in favor of the Series behavior so that both will align in the future. Is that sufficient to close this issue (assuming there is consensus on the deprecation), or should some other action also be taken?

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Apply Apply, Aggregate, Transform, Map Bug Groupby Needs Discussion Requires discussion from core team before further action Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants