BUG: CategoricalDtype is not refresh after index categories set #46820

Open
3 tasks done
Yikun opened this issue Apr 21, 2022 · 10 comments
Labels
Bug; Categorical (Categorical Data Type); Index (Related to the Index class or subclasses)

Comments

@Yikun
Contributor

Yikun commented Apr 21, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pdf = pd.DataFrame(
    {
        "a": pd.Categorical([1, 2, 3, 1, 2, 3]),
    },
    index=pd.Categorical([10, 20, 30, 20, 30, 10], categories=[30, 10, 20], ordered=True),
)
pidx = pdf.index
pidx.categories = ["z", "y", "x"]
pidx.dtype

Issue Description

Setting the categories on the index does not refresh its dtype:

>>> pidx.dtype
CategoricalDtype(categories=[30, 10, 20], ordered=True)

Expected Behavior

>>> pidx.dtype
CategoricalDtype(categories=['z', 'y', 'x'], ordered=True)

(this was also the behavior before 1.4.x)

Installed Versions

1.4.0+

Yikun added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Apr 21, 2022
@Yikun
Contributor Author

Yikun commented Apr 21, 2022

First bad commit: 126a19d

also cc @jbrockmendel

@samukweku
Contributor

@Yikun it seems to work with set_categories right?

@Yikun
Contributor Author

Yikun commented Apr 21, 2022

@samukweku Yes, I can make it work with this workaround:

pidx = pidx.set_categories(pidx.categories)
pdf.index = pidx

But I didn't need to do it this way before 1.4.0.
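For readers following along, here is a minimal sketch of a non-mutating alternative to the setter. It assumes the goal is to rename the categories and have the dtype reflect them; `rename_categories` is a public `CategoricalIndex` method, and the variable name `renamed` is only illustrative. The index returned is a new object, so it has to be assigned back to the DataFrame.

```python
import pandas as pd

pdf = pd.DataFrame(
    {"a": pd.Categorical([1, 2, 3, 1, 2, 3])},
    index=pd.Categorical(
        [10, 20, 30, 20, 30, 10], categories=[30, 10, 20], ordered=True
    ),
)

# rename_categories returns a new index; the original index is not modified.
renamed = pdf.index.rename_categories(["z", "y", "x"])

# Assign the new index back so the DataFrame picks up the renamed categories.
pdf.index = renamed

print(list(pdf.index.dtype.categories))
print(pdf.index.dtype.ordered)
```

Because the categories are renamed positionally (30 becomes "z", 10 becomes "y", 20 becomes "x"), the index values should match the ones printed later in this thread after the in-place assignment.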

@jreback
Contributor

jreback commented Apr 21, 2022

you are trying to modify an immutable object - it's possible it accidentally worked before but in no way is this correct

@Yikun
Contributor Author

Yikun commented Apr 26, 2022

@jreback Would we consider supporting an index categories setter in pandas in the future, as a simpler alternative to set_categories?

@Yikun
Contributor Author

Yikun commented Apr 29, 2022

@jreback @jbrockmendel @samukweku Any thoughts? Thanks!

@jbrockmendel
Member

we're actually looking at deprecating all the in-place category-setting behaviors (xref #37643), so unless I'm misunderstanding what you're asking for, I'm -1 on adding this setter

@Yikun
Contributor Author

Yikun commented May 24, 2022

@jbrockmendel @jreback Thanks! And sorry for the late reply.

Should we consider deprecating this setter, or raising an unsupported-operation error? When the categories are set (yes, this is an unexpected set on an immutable object), the dtype is not refreshed, and there is also no plan to fix that.

[1] https://github.com/pandas-dev/pandas/blob/main/pandas/core/arrays/categorical.py#L742-L743

HyukjinKwon pushed a commit to apache/spark that referenced this issue May 25, 2022
### What changes were proposed in this pull request?
pandas changed behavior in pandas-dev/pandas@126a19d.

Before pandas 1.4, pandas refreshed the dtype when categories were set through `categories.setter`; since pandas 1.4, that dtype refresh no longer happens. According to pandas-dev/pandas#46820, full support for `categories.setter` will not come back.

Refreshing only the categories (but not the dtype) is not useful behavior, so we'd better fix only the test and keep the current PS behavior, then remove this setter support when we remove all deprecated methods.

### Why are the changes needed?
Make CI pass with pandas 1.4.x

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
test_categories_setter passed with 1.3.x and also 1.4.x

Closes #36355 from Yikun/SPARK-38982.

Authored-by: Yikun Jiang
Signed-off-by: Hyukjin Kwon
@simonjayhawkins
Member

simonjayhawkins commented May 28, 2022

> you are trying to modify an immutable object

yes. it appears that the index is updated (which should probably raise) but the dtype attribute is not updated.

pidx = pdf.index
print(pidx)
pidx.categories = ["z", "y", "x"]
print(pidx)
print(repr(pidx.dtype))

CategoricalIndex([10, 20, 30, 20, 30, 10], categories=[30, 10, 20], ordered=True, dtype='category')
CategoricalIndex(['y', 'x', 'z', 'x', 'z', 'y'], categories=['z', 'y', 'x'], ordered=True, dtype='category')
CategoricalDtype(categories=[30, 10, 20], ordered=True)

> First bad commit: 126a19d

what code sample was used and what was the behavior change? I'm seeing the same behavior with the snippet above back to pandas 1.0.5

in pandas 0.25.3, it raised AttributeError: can't set attribute
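Since the behavior reported in this thread differs between releases, a small probe can show what the installed pandas does. This is only a sketch: the function name `probe_categories_setter` and the three result strings are made up for illustration, and it assumes that a rejected assignment surfaces as `AttributeError` (as reported above for 0.25.3) or `TypeError`.

```python
import warnings

import pandas as pd


def probe_categories_setter():
    """Report how the installed pandas treats `index.categories = ...`."""
    idx = pd.CategoricalIndex(
        [10, 20, 30, 20, 30, 10], categories=[30, 10, 20], ordered=True
    )
    try:
        with warnings.catch_warnings():
            # Some releases may warn about the setter; silence that here.
            warnings.simplefilter("ignore")
            idx.categories = ["z", "y", "x"]
    except (AttributeError, TypeError):
        return "setter rejected"
    # The assignment went through: compare the dtype with the categories.
    if list(idx.dtype.categories) == list(idx.categories):
        return "dtype refreshed"
    return "dtype stale"


print(pd.__version__, "->", probe_categories_setter())
```

Running this against several pandas versions would give a concrete answer to the question of where the behavior changed, without depending on private testing helpers.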

simonjayhawkins added the Categorical (Categorical Data Type) and Index (Related to the Index class or subclasses) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label May 28, 2022
@Yikun
Contributor Author

Yikun commented Jun 1, 2022

> what code sample was used and what was the behavior change?

@simonjayhawkins

import pandas as pd
import pandas._testing as tm
import numpy as np

pdf = pd.DataFrame(
    {
        "a": pd.Categorical([1, 2, 3, 1, 2, 3]),
    },
    index=pd.Categorical([10, 20, 30, 20, 30, 10], categories=[30, 10, 20], ordered=True),
)
pidx = pdf.index
pidx.categories = ["z", "y", "x"]
# Check `pidx.dtype.categories` is refreshed or not
tm.assert_index_equal(pidx.dtype.categories, pidx.categories)
tm.assert_numpy_array_equal(pidx.dtype.categories._data, pidx.categories._data)
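The same consistency check can be sketched with public APIs only, applied to an index produced by `rename_categories` rather than by the in-place setter. This is an assumption-laden variant, not the test used in the thread: it swaps `pandas._testing` for the public `pandas.testing` module and builds the index directly with `pd.CategoricalIndex`.

```python
import pandas as pd
import pandas.testing as pdt

idx = pd.CategoricalIndex(
    [10, 20, 30, 20, 30, 10], categories=[30, 10, 20], ordered=True
)

# Non-mutating rename: `idx` itself keeps its original categories.
renamed = idx.rename_categories(["z", "y", "x"])

# The dtype of the new index agrees with its categories.
pdt.assert_index_equal(renamed.dtype.categories, renamed.categories)

# The original index and its dtype are unchanged.
pdt.assert_index_equal(idx.dtype.categories, idx.categories)
print(list(idx.categories), list(renamed.categories))
```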
