ENH: Bring back the observed argument for groupby on Categorical columns #55237

Closed
1 of 3 tasks
n-splv opened this issue Sep 22, 2023 · 3 comments
Labels
Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

n-splv commented Sep 22, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Hi!
The reasons for deprecation of this parameter are nowhere to be found, so I'm curious whether a discussion took place.

Feature Description

After upgrading to 2.1 I have to replace my old code

sample_distribution = df.sample(n).groupby(categorical_cols).size()

with this:

main_distribution = df.groupby(categorical_cols).size()
sample_distribution = df.sample(n).groupby(categorical_cols).size().reindex(main_distribution.index, fill_value=0)

Alternative Solutions

Changing the default value of observed to True is fine, I guess, but the ability to use False was convenient. Maybe we should bring it back?

Additional Context

No response

n-splv added the Enhancement and Needs Triage labels Sep 22, 2023
@rhshadrach
Member

rhshadrach commented Sep 22, 2023

There are no plans to remove the argument, it is only changing the default.

Is there something in the message that made you think it was going to be removed?
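
To illustrate the point above, here is a minimal sketch of passing observed explicitly. The toy frame and the column name kind are made up for this example and are not from the original report.

```python
import pandas as pd

# Toy data (assumed for illustration): "c" is a declared category
# that never occurs in the rows being grouped.
df = pd.DataFrame(
    {"kind": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])}
)
sample = df.iloc[:2]

# observed=False keeps every declared category, with a zero count for "c".
with_unobserved = sample.groupby("kind", observed=False).size()

# observed=True keeps only the categories that appear in the data.
only_observed = sample.groupby("kind", observed=True).size()

print(with_unobserved)
print(only_observed)
```

Passing the argument explicitly also keeps the result independent of whichever default the installed pandas version uses.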

@n-splv
Author

n-splv commented Sep 22, 2023

@rhshadrach Apologies, the argument is working as intended.
Do you know what made me think that it doesn't? Watch:

import pandas as pd

data = {
    'col1': ['a', 'b', 'c'],
    'col2': [1, 2, 3],
}
df_ = pd.DataFrame(data)

df_.loc[:, 'col1'] = df_['col1'].astype('category')
df_.loc[:, 'col2'] = df_['col2'].astype('category')

df_.iloc[:2].groupby(['col1', 'col2'], observed=False).size()

col1  col2
a     1       1
      2       0
      3       0
b     1       0
      2       1
      3       0

The category c is missing, and I blamed the observed argument for it. But the real source of the problem is:

df_.dtypes

col1      object
col2    category
dtype: object

For some reason, assigning the result of .astype('category') through .loc doesn't convert the object column, and the worst part is that no warning is raised. Should I open a separate issue? As far as I remember, this worked just fine in 1.5.3.

@rhshadrach
Member

I see - this is then a duplicate of #52593. In the meantime, when changing an entire column, everything works if you don't use .loc.

import pandas as pd

data = {
    'col1': ['a', 'b', 'c'],
    'col2': [1, 2, 3],
}
df_ = pd.DataFrame(data)

# Assign whole columns directly instead of through .loc
df_['col1'] = df_['col1'].astype('category')
df_['col2'] = df_['col2'].astype('category')

result = df_.iloc[:2].groupby(['col1', 'col2'], observed=False).size()
print(result)
# col1  col2
# a     1       1
#       2       0
#       3       0
# b     1       0
#       2       1
#       3       0
# c     1       0
#       2       0
#       3       0
# dtype: int64
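
Since the failed conversion raised no warning, a small dtype check before grouping can catch it early. The following is a minimal sketch under that assumption; require_categorical is a hypothetical helper written for this example, not a pandas API, and DataFrame.astype with a dict is used here as one more way to convert whole columns without .loc.

```python
import pandas as pd


def require_categorical(df, cols):
    """Hypothetical helper: raise if any listed column is not categorical."""
    bad = [c for c in cols if not isinstance(df[c].dtype, pd.CategoricalDtype)]
    if bad:
        raise TypeError(f"expected category dtype for columns: {bad}")


df_ = pd.DataFrame({"col1": ["a", "b", "c"], "col2": [1, 2, 3]})

# astype with a dict returns a new frame with both columns converted.
df_ = df_.astype({"col1": "category", "col2": "category"})

# Fail loudly here rather than getting a silently truncated result later.
require_categorical(df_, ["col1", "col2"])

counts = df_.iloc[:2].groupby(["col1", "col2"], observed=False).size()
print(counts)
```

With both columns categorical, the result covers all 3 x 3 category combinations, and the unobserved ones have a count of zero.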
