BUG: pd.Categorical turns all values into NaN #43334

Closed
2 of 3 tasks
mfstabile opened this issue Aug 31, 2021 · 7 comments · Fixed by #43597
Labels
Bug, Categorical, Internals, Regression

@mfstabile

mfstabile commented Aug 31, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import pandas as pd
data = pd.read_excel('titanic.xlsx')

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)

print(data.head(3))

Problem description

This code sample reads the popular Kaggle Titanic file. Reading titanic.xlsx produces the following DataFrame:

   PassengerId  Survived  Pclass  ...     Fare Cabin Embarked
0            1         0       3  ...   7.2500   NaN        S
1            2         1       1  ...  71.2833   C85        C
2            3         1       3  ...   7.9250   NaN        S

When I execute the code above, the result displayed on the terminal is as follows:

   PassengerId Survived  Pclass  ...     Fare Cabin Embarked
0            1      NaN       3  ...   7.2500   NaN        S
1            2      NaN       1  ...  71.2833   C85        C
2            3      NaN       3  ...   7.9250   NaN        S

As can be seen, all values in the "Survived" Series are now NaN. The expected behavior, however, would be for the values to become "Yes" or "No". Strangely, if I swap the last two pd.Categorical assignments, producing the following code sample:

import pandas as pd
data = pd.read_excel('titanic.xlsx')

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data.head(3))

The result generated by the code sample above is the expected one, as shown below.

   PassengerId Survived  Pclass  ...     Fare Cabin Embarked
0            1       No       3  ...   7.2500   NaN        S
1            2      Yes       1  ...  71.2833   C85        C
2            3      Yes       3  ...   7.9250   NaN        S

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf
python : 3.9.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : pt_BR.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 1.3.2
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.15
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

mfstabile added the Bug and Needs Triage labels Aug 31, 2021
@phofl
Member

phofl commented Sep 1, 2021

Hi, could you please post a minimal and reproducible example? See https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

simonjayhawkins added the Needs Info label Sep 1, 2021
@mfstabile
Author

Hello. I hope the following code sample is helpful.

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                   'Sex': [0, 1, 1]
                   })

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)

print(data)

The code above produces the following output:

  Survived     Sex
0      NaN  female
1      NaN    male
2      NaN    male

If I invert the pd.Categorical lines:

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                   'Sex': [0, 1, 1]
                   })

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data)

The code above produces the following output:

  Survived  Sex
0      Yes  NaN
1       No  NaN
2      Yes  NaN

Changing the Sex Series from int to string:

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                   'Sex': ['female', 'male', 'male']
                   })

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data)

The code above produces the expected output:

  Survived     Sex
0      Yes  female
1       No    male
2      Yes    male
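
A possible workaround for the example above, offered as a sketch and not taken from this thread: instead of assigning to `.cat.categories` in place, build a renamed categorical with `Series.cat.rename_categories` and assign it back as a whole column. The assumption here is that replacing the column, instead of mutating the existing Categorical, sidesteps the behavior reported for pandas 1.3.2; that has not been verified on that exact version.

```python
import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                     'Sex': [0, 1, 1]})

# Create new categoricals with renamed categories and assign them back
# as whole columns, so no existing Categorical is modified in place.
data['Survived'] = data['Survived'].astype('category').cat.rename_categories(['No', 'Yes'])
data['Sex'] = data['Sex'].astype('category').cat.rename_categories(['female', 'male'])

# Same final step as in the report; the order of these two lines
# should no longer matter.
data['Survived'] = pd.Categorical(data['Survived'], categories=['No', 'Yes'], ordered=False)
data['Sex'] = pd.Categorical(data['Sex'], categories=['female', 'male'], ordered=False)

print(data)
```

`rename_categories` returns a new object, which also keeps the snippet usable on later pandas versions where direct assignment to `.cat.categories` is no longer supported.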

@thisisamardeep

Hi All,

I am picking up this task and will post an update soon, around the end of September.

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Sep 4, 2021
simonjayhawkins removed the Needs Info label Sep 4, 2021
simonjayhawkins added this to the 1.3.3 milestone Sep 4, 2021
simonjayhawkins added the Regression, Categorical, and Internals labels and removed the Needs Triage label Sep 4, 2021
@simonjayhawkins
Member

simonjayhawkins commented Sep 4, 2021

code sample worked in 1.2.5

first bad commit: [c68b605] PERF: cache_readonly for Block properties (#40620)

cc @jbrockmendel

xref #43232 for similar issue

@simonjayhawkins
Member

fix suggested in #43232 (comment) also fixes this issue

diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py
index e3fcff1557..ee38b4ce56 100644
--- a/pandas/core/internals/blocks.py
+++ b/pandas/core/internals/blocks.py
@@ -331,7 +331,6 @@ class Block(PandasObject):
     def shape(self) -> Shape:
         return self.values.shape
 
-    @final
     @cache_readonly
     def dtype(self) -> DtypeObj:
         return self.values.dtype
@@ -1867,6 +1866,10 @@ class CategoricalBlock(ExtensionBlock):
     # this Block type is kept for backwards-compatibility
     __slots__ = ()
 
+    @property
+    def dtype(self) -> DtypeObj:
+        return self.values.dtype
+
 
 # -----------------------------------------------------------------
 # Constructor Helpers
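
To illustrate why the diff above swaps a cached `dtype` for a plain property, here is a minimal pure-Python sketch. It does not use pandas internals: `functools.cached_property` stands in for pandas' `cache_readonly`, and the `Values`, `Block`, and `CategoricalBlock` classes below are simplified, hypothetical stand-ins that only mimic the shape of the real ones.

```python
from functools import cached_property


class Values:
    """Hypothetical stand-in for an array whose dtype can change in place."""

    def __init__(self, dtype):
        self.dtype = dtype


class Block:
    def __init__(self, values):
        self.values = values

    @cached_property  # stand-in for cache_readonly: computed once, then stored
    def dtype(self):
        return self.values.dtype


class CategoricalBlock(Block):
    @property  # overrides the cached version: recomputed on every access
    def dtype(self):
        return self.values.dtype


def dtype_after_mutation(block_cls):
    values = Values("category[0, 1]")
    block = block_cls(values)
    _ = block.dtype                          # first access (fills the cache, if any)
    values.dtype = "category['No', 'Yes']"   # in-place change, like renaming categories
    return block.dtype


cached_result = dtype_after_mutation(Block)
uncached_result = dtype_after_mutation(CategoricalBlock)
print(cached_result)    # still reports the dtype seen on first access
print(uncached_result)  # reflects the in-place change
```

In this toy model the cached version keeps returning the dtype it saw first, while the subclass with a plain property always reads the current one, which is the behavior the suggested patch restores for `CategoricalBlock`.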

@Ljupco7

Ljupco7 commented Jul 7, 2023

Can anyone confirm this bug has been fixed?
I am introducing category as a dtype to a few of my dataframe columns to improve performance.
I have the following code and I keep getting nan values for actual strings in a dataframe column.
This is the way I set a column dtype in a dataframe to be of category type (I did not find another way to do this. If there is a better solution please let me know).

import pandas as pd

acc_domain_values = ['ACTIVE, INACTIVE', 'BLOCKED', 'UNKNOWN']
acc_code_Dtype = pd.api.types.CategoricalDtype(categories=acc_domain_values, ordered=False)
# file_name is the path to the CSV file (not shown in this comment)
df = pd.read_csv(file_name,
                 engine='c',
                 delimiter=';',
                 on_bad_lines='warn',
                 low_memory=True,
                 dtype={'ACC_CODE': acc_code_Dtype})

The purpose of the following code is to get every value in that column that does not belong to acc_domain_values.

acc_code_incorrect_val_list = []
acc_code_incorrect_val_list = df.query('ACC_CODE not in @acc_domain_values')

What is weird is that I get the following results in the console when I print the unique values of the ACC_CODE column like this:

vari = df['ACC_CODE'].unique()

for element in vari :
    print(element)

Results:

nan
BLOCKED
UNKNOWN

It seems as though pandas treats these two values from the acc_domain_values list above as NaN (nan):
'ACTIVE, INACTIVE'
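
One possible explanation, which is an inference from the snippet and not something confirmed in this thread: as written, `'ACTIVE, INACTIVE'` is a single string, so the list defines three categories, and neither `'ACTIVE'` nor `'INACTIVE'` is among them. Values that are not in a `CategoricalDtype`'s categories become NaN on conversion. The sketch below reproduces this without `read_csv`; the sample values, including `'CLOSED'`, are hypothetical.

```python
import pandas as pd

# Hypothetical column contents standing in for the CSV data
raw = pd.Series(['ACTIVE', 'INACTIVE', 'BLOCKED', 'UNKNOWN', 'CLOSED'])

# Categories as written in the comment: the first entry is ONE string
as_written = pd.api.types.CategoricalDtype(
    categories=['ACTIVE, INACTIVE', 'BLOCKED', 'UNKNOWN'], ordered=False)

# Assumed intent: four separate categories
valid_values = ['ACTIVE', 'INACTIVE', 'BLOCKED', 'UNKNOWN']
corrected = pd.api.types.CategoricalDtype(categories=valid_values, ordered=False)

# Count how many values turn into NaN under each dtype
missing_as_written = int(raw.astype(as_written).isna().sum())
missing_corrected = int(raw.astype(corrected).isna().sum())
print(missing_as_written, missing_corrected)

# To list out-of-domain values, check the raw strings before converting:
# once converted, those values are already NaN and cannot be recovered.
invalid_values = raw[~raw.isin(valid_values)].tolist()
print(invalid_values)
```

If this is indeed the cause, the NaN values come from the category list itself and are unrelated to the regression described at the top of this issue.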

@jbrockmendel
Member

The bug in the OP should be fixed, yes. If you think the problem persists, please open a new issue with a reproducible example (i.e. without read_csv)

6 participants