BUG(?): Allow setitem to category/sparse of the same underlying dtype? #59627

Open
3 tasks done
mroeschke opened this issue Aug 27, 2024 · 4 comments
Labels
Bug PDEP6-related related to PDEP6 (not upcasting during setitem-like Series operations)

Comments

@mroeschke
Member

mroeschke commented Aug 27, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [7]: df = pd.DataFrame(range(2))

In [8]: df.iloc[:, 0] = df.iloc[:, 0].astype("category")

TypeError: Invalid value '0    0
1    1
Name: 0, dtype: category
Categories (2, int64): [0, 1]' for dtype 'int64'
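For anyone who wants to check this on their own install, here is a minimal standalone sketch of the session above. The helper name `try_setitem_category` is made up for illustration, and the outcome is version-dependent: the report says the latest release and main raise `TypeError`, while other versions may behave differently, so the script only records what happened.

```python
import warnings

import pandas as pd


def try_setitem_category():
    """Hypothetical helper: attempt the assignment from the report."""
    df = pd.DataFrame(range(2))
    value = df.iloc[:, 0].astype("category")
    with warnings.catch_warnings():
        # Some versions may warn instead of raising; silence that here.
        warnings.simplefilter("ignore")
        try:
            df.iloc[:, 0] = value
        except TypeError as exc:
            # Keep only the first line of the (multi-line) message.
            return "raised", str(exc).splitlines()[0]
    return "assigned", str(df.dtypes.iloc[0])


outcome, detail = try_setitem_category()
print(pd.__version__, outcome, detail)
```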

Issue Description

In the PDEP 6 discussions (#39584, #50424), I can't find any discussion of whether setting values that use a different "representation" of the same dtype (e.g. category or sparse) should be disallowed. Technically, casting to e.g. category, sparse, or ArrowDtype of the same underlying type doesn't upcast, so should that be allowed?

cc @MarcoGorelli

Expected Behavior

Setitem with a category/sparse/ArrowDtype value that doesn't change the underlying type should be allowed.

Installed Versions

Replace this line with the output of pd.show_versions()

@mroeschke mroeschke added Bug PDEP6-related related to PDEP6 (not upcasting during setitem-like Series operations) labels Aug 27, 2024
@jbrockmendel
Member

For Categorical the exact categories are part of the dtype. I feel pretty strongly that the dtype object should be immutable.
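A small sketch of what "the exact categories are part of the dtype" means in practice. The variable names are illustrative only; the point is just that two categorical dtypes over the same underlying integer type compare unequal when their categories differ.

```python
import pandas as pd

a = pd.Series([0, 1]).astype("category")
b = pd.Series([0, 1, 2]).astype("category")

# Same categories (unordered): the dtypes compare equal.
same_categories = a.dtype == pd.CategoricalDtype([0, 1])

# Both have int64 categories, but the category sets differ,
# so these are two different dtypes.
different_categories = a.dtype == b.dtype

print(same_categories, different_categories)
```

So allowing a setitem to swap in a categorical would presumably also have to decide which categories the resulting dtype carries.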

@MarcoGorelli
Member

@jorisvandenbossche do you have an opinion on this one?

@jorisvandenbossche
Member

I think the simpler rule is that a setitem operation simply never changes the dtype (which is how PDEP 6 describes it). Changing from int64 to category[int64] is clearly changing the dtype IMO (and also the meaning of the dtype in this case, and the underlying memory representation as well).
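To illustrate the point about int64 vs category[int64], here is a rough sketch using only public accessors. It assumes a two-element integer Series, for which pandas stores the categorical as small integer codes plus a separate categories index.

```python
import pandas as pd

ints = pd.Series([0, 1], dtype="int64")
cats = ints.astype("category")

# The dtype objects are not equal, even though the categories are int64.
dtypes_equal = cats.dtype == ints.dtype

# The categorical stores per-row codes plus a categories index,
# not the int64 values themselves.
codes_dtype = str(cats.cat.codes.dtype)
categories_dtype = str(cats.cat.categories.dtype)

print(dtypes_equal, codes_dtype, categories_dtype)
```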

Now, in the context of "logical dtypes", the question is maybe a bit more tricky (should int64 and Int64 be regarded as the same dtype?)
However, at that point, we are setting into an existing array, and if you are setting compatible values with a different dtype into an existing array, I would just expect this setitem operation to work?

@jorisvandenbossche
Member

For example, setting int64 values into an Int64 series still preserves that dtype:

In [16]: s = pd.Series([1, 2, 3], dtype="Int64")

In [17]: s
Out[17]: 
0    1
1    2
2    3
dtype: Int64

In [18]: s.iloc[:] = pd.Series([1, 2, 3], dtype="int64")

In [19]: s
Out[19]: 
0    1
1    2
2    3
dtype: Int64

Same for setting ArrowDtype into it:

In [20]: s.iloc[:] = pd.Series([1, 2, 3], dtype=pd.ArrowDtype(pa.int64()))

In [21]: s
Out[21]: 
0    1
1    2
2    3
dtype: Int64

And the same is also true for category data with integer categories.
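As a script rather than a session, the first of the Series cases above could be checked roughly like this. It covers only the int64-into-Int64 case (the ArrowDtype case additionally needs pyarrow installed), and it uses different values on the right-hand side so the assignment is visible.

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")
dtype_before = s.dtype

# Full-slice positional assignment of plain int64 values.
s.iloc[:] = pd.Series([4, 5, 6], dtype="int64")

dtype_after = s.dtype
print(dtype_before, dtype_after, s.tolist())
```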

Now, the above example uses a Series, and apparently that works differently from a DataFrame ... ? (Specifically in the case of categorical data when the column dtype is int64 and not Int64; in all other cases the DataFrame variant also preserves the column's dtype.)

For example, this works fine:

In [38]: s = pd.Series([1, 2, 3], dtype="Int64")

In [39]: df = pd.DataFrame({"col" : s})

In [40]: df.iloc[:, 0] = pd.Series([1, 2, 3], dtype="category")

In [41]: df
Out[41]: 
   col
0    1
1    2
2    3

In [42]: df.dtypes
Out[42]: 
col    Int64
dtype: object

But when you start with int64 dtype, it fails. Maybe that is a bug, and we should coerce the categorical data to int64?
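Until that is settled, one possible user-side workaround is to do the coercion explicitly before assigning. This is only a sketch of that idea, not a proposed fix for pandas itself. It assumes the categories can be cast to the column's dtype, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col": pd.Series([1, 2, 3], dtype="int64")})
cat_values = pd.Series([3, 2, 1], dtype="category")

# Cast the categorical values to the column's existing dtype first, so the
# right-hand side already matches and no dtype change is requested.
target_dtype = df.dtypes.iloc[0]
df.iloc[:, 0] = cat_values.astype(target_dtype)

result_dtype = df.dtypes.iloc[0]
print(result_dtype, df["col"].tolist())
```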
