QST: Categorical NaN moving to strings. Unexpected behavior? #51282

Closed
2 tasks done
frbelotto opened this issue Feb 10, 2023 · 5 comments
Labels: Categorical, Missing-data, Needs Triage, Usage Question

Comments

@frbelotto
frbelotto commented Feb 10, 2023

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/75406055/weird-behavior-on-string-vs-categorical-dtypes/75406177#75406177

Question about pandas

Hello guys,
I've been struggling with a piece of code where I had to convert a categorical column back to a string column. After I discussed it on StackOverflow, people there helped me understand the behavior:

With the category dtype, the missing value is a float NaN (a typical np.nan) and it can be used in comparisons, so data.partner != 'A' is True for all the rows with NaN.

When the column is converted to the string dtype, the NaN is converted to pandas._libs.missing.NAType (pd.NA), which does not compare to True or False, so now data.partner != 'A' returns <NA>, which is not True, and the result differs.
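
The difference described above can be reproduced with a small sketch. The `partner` series here is a hypothetical stand-in for the `data.partner` column, which is not shown in the question, so the values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `data.partner` column from the question.
partner = pd.Series(pd.Categorical(["A", "B", np.nan]))

# Categorical: the missing entry counts as "not equal", so the mask is True there.
cat_mask = partner != "A"

# StringDtype: the missing entry becomes pd.NA and the comparison propagates it,
# so the mask holds <NA> (neither True nor False) in that position.
str_mask = partner.astype(pd.StringDtype()) != "A"

print(cat_mask.tolist())         # [False, True, True]
print(str_mask.isna().tolist())  # [False, False, True]
```

Filtering with `cat_mask` keeps the missing row, while filtering with `str_mask` does not, which is why the two results differ.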

My question is: wouldn't it be more "common sense", or more expected, for a categorical NaN converted to a string to simply become a blank/empty string instead of a "pandas._libs.missing.NAType"?

Thanks. Sorry for bothering you. I just want to understand the reasons behind the current behavior so I won't miss it again!

@frbelotto added the Needs Triage and Usage Question labels Feb 10, 2023
@rhshadrach added the Missing-data and Categorical labels Feb 11, 2023
@jbrockmendel
Member

@jorisvandenbossche @Dr-Irv getting pd.NA is causing this user confusion. Can you help out?

@Dr-Irv
Contributor

Dr-Irv commented Feb 13, 2023

My question is: wouldn't it be more "common sense", or more expected, for a categorical NaN converted to a string to simply become a blank/empty string instead of a "pandas._libs.missing.NAType"?

There are two representations of "missing values" in pandas. For "traditional" dtypes (float, int, object, Categorical), the value np.nan is used to represent a missing value. For the "nullable" dtypes (StringDtype, Int64Dtype, etc.), the value pd.NA is used to represent a missing value. Since np.nan is used with Categorical, when you convert that to a string using StringDtype, then np.nan is converted to pd.NA. If you wanted to preserve the np.nan value, then you should use object dtype. See below:

In [1]: import pandas as pd

In [2]: s = pd.Series(pd.Categorical(["a", "b", "c", "a", "a", "b"]))

In [3]: s
Out[3]:
0    a
1    b
2    c
3    a
4    a
5    b
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [4]: s2 = s.shift(1)

In [5]: s2
Out[5]:
0    NaN
1      a
2      b
3      c
4      a
5      a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [6]: s2.iloc[0]
Out[6]: nan

In [7]: s3 = s2.astype(pd.StringDtype())

In [8]: s3.iloc[0]
Out[8]: <NA>

In [9]: s4 = s2.astype(object)

In [10]: s4.iloc[0]
Out[10]: nan

See #50711 for how this may change in the future.
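
If an empty string is what you actually want in place of the missing value, one possible approach (a sketch, not the only option and not something pandas does by default) is to fill the missing values explicitly after the conversion. The variable names `as_text` and `mask` are introduced here for illustration:

```python
import pandas as pd

# Same shifted categorical series as `s2` in the session above.
s2 = pd.Series(pd.Categorical(["a", "b", "c", "a", "a", "b"])).shift(1)

# Convert to the nullable string dtype, then replace pd.NA with an empty string.
as_text = s2.astype(pd.StringDtype()).fillna("")

# With no missing values left, the comparison is True or False in every row.
mask = as_text != "a"

print(repr(as_text.iloc[0]))  # ''
print(mask.isna().any())      # False
```

Note that this discards the distinction between "missing" and "empty string", so it is only appropriate when that distinction does not matter for your data.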

@frbelotto
Author

Thanks. Your explanation is great!

One last point to discuss: even though my categorical np.nan becomes pd.NA, comparing it to a plain string value doesn't work because "it is not comparable". But when something is not comparable, maybe the default result should be "not equal", right?

@frbelotto
Author

Anyway, that's fine. Thanks.

@Dr-Irv
Contributor

Dr-Irv commented Feb 14, 2023

One last point to discuss: even though my categorical np.nan becomes pd.NA, comparing it to a plain string value doesn't work because "it is not comparable". But when something is not comparable, maybe the default result should be "not equal", right?

See discussion here: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#propagation-in-arithmetic-and-comparison-operations
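
As a rough illustration of the propagation rule discussed at that link: a comparison involving pd.NA yields pd.NA ("unknown") rather than False. If you want missing values to count as "not equal", you can say so explicitly by filling the mask. This is a minimal sketch; the series `s3` and the other names below are made up for the example:

```python
import pandas as pd

# Comparing pd.NA with anything yields pd.NA, not True or False.
result = pd.NA != "a"
print(result is pd.NA)  # True

s3 = pd.Series(["a", None, "b"], dtype=pd.StringDtype())
mask = s3 != "a"  # nullable boolean mask with <NA> at position 1

# Boolean indexing treats <NA> in the mask as False, so the missing row is dropped.
kept = s3[mask]

# To treat missing values as "not equal", fill the mask explicitly.
kept_with_missing = s3[mask.fillna(True)]

print(kept.index.tolist())               # [2]
print(kept_with_missing.index.tolist())  # [1, 2]
```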


4 participants