BUG: pd.Categorical turns all values into NaN #43334

mfstabile · 2021-08-31T20:28:43Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample

import pandas as pd
data = pd.read_excel('titanic.xlsx')

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)

print(data.head(3))

Problem description

This code sample reads the popular Kaggle titanic file. When reading the titanic.xlsx file, the following Data Set is generated:

	PassengerId	Survived	Pclass	...	Fare	Cabin	Embarked
0	1	0	3	...	7.2500	NaN	S
1	2	1	1	...	71.2833	C85	C
2	3	1	3	...	7.9250	NaN	S

When I execute the code above, the result displayed on the terminal is as follows:

	PassengerId	Survived	Pclass	...	Fare	Cabin	Embarked
0	1	NaN	3	...	7.2500	NaN	S
1	2	NaN	1	...	71.2833	C85	C
2	3	NaN	3	...	7.9250	NaN	S

As can be seen, all values in the "Survived" Series are now NaN. The expected behavior, however, would be for the values to become "Yes" or "No". Strangely, if I swap the two pd.Categorical lines (the last two lines before the print), I get the following code sample:

import pandas as pd
data = pd.read_excel('titanic.xlsx')

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data.head(3))

The result generated by the code sample above is the expected one, as shown below.

	PassengerId	Survived	Pclass	...	Fare	Cabin	Embarked
0	1	No	3	...	7.2500	NaN	S
1	2	Yes	1	...	71.2833	C85	C
2	3	Yes	3	...	7.9250	NaN	S

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 5f648bf
python : 3.9.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : pt_BR.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 1.3.2
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.15
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

phofl · 2021-09-01T07:57:13Z

Hi, could you please post a minimal and reproducible example? See https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

mfstabile · 2021-09-01T13:54:17Z

Hello. I hope the following code sample is helpful.

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                   'Sex': [0, 1, 1]
                   })

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)

print(data)

The code above produces the following output:

	Survived	Sex
0	NaN	female
1	NaN	male
2	NaN	male

If I invert the pd.Categorical lines:

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                   'Sex': [0, 1, 1]
                   })

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data)

The code above produces the following output:

	Survived	Sex
0	Yes	NaN
1	No	NaN
2	Yes	NaN

Changing the Sex Series from int to string:

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                   'Sex': ['female', 'male', 'male']
                   })

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data)

The code above produces the expected output:

	Survived	Sex
0	Yes	female
1	No	male
2	Yes	male
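
For readers hitting this on an affected release, one possible way to sidestep the in-place `cat.categories` assignment is `Series.cat.rename_categories`, which returns a new Series instead of mutating the existing one. This is only a sketch on the same toy data as above; it is not a fix confirmed in this thread, and it has not been checked against pandas 1.3.2 specifically. The 0/1 to label mapping is taken from the samples above.

```python
import pandas as pd

data = pd.DataFrame({"Survived": [1, 0, 1], "Sex": [0, 1, 1]})

# Convert to category and rename in one chain; each step returns a new Series,
# so no existing categorical is modified in place.
data["Survived"] = (
    data["Survived"].astype("category").cat.rename_categories({0: "No", 1: "Yes"})
)
data["Sex"] = (
    data["Sex"].astype("category").cat.rename_categories({0: "female", 1: "male"})
)

print(data)
```

With this approach the later `pd.Categorical(...)` re-wrapping is not needed, because the columns already carry the renamed categories.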

thisisamardeep · 2021-09-04T07:27:13Z

Hi All,

I am picking up this task and will update soon, around the end of September.

simonjayhawkins · 2021-09-04T16:35:17Z

code sample worked in 1.2.5

first bad commit: [c68b605] PERF: cache_readonly for Block properties (#40620)

cc @jbrockmendel

xref #43232 for similar issue

simonjayhawkins · 2021-09-11T09:00:01Z

fix suggested in #43232 (comment) also fixes this issue

diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py
index e3fcff1557..ee38b4ce56 100644
--- a/pandas/core/internals/blocks.py
+++ b/pandas/core/internals/blocks.py
@@ -331,7 +331,6 @@ class Block(PandasObject):
     def shape(self) -> Shape:
         return self.values.shape
 
-    @final
     @cache_readonly
     def dtype(self) -> DtypeObj:
         return self.values.dtype
@@ -1867,6 +1866,10 @@ class CategoricalBlock(ExtensionBlock):
     # this Block type is kept for backwards-compatibility
     __slots__ = ()
 
+    @property
+    def dtype(self) -> DtypeObj:
+        return self.values.dtype
+
 
 # -----------------------------------------------------------------
 # Constructor Helpers
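
The idea behind the diff can be illustrated with a toy model. The sketch below is not pandas internals: `Values`, `Block` and `CategoricalBlock` are hypothetical stand-ins, and the standard library's `functools.cached_property` is used in place of pandas' `cache_readonly`. It shows how a cached `dtype` can go stale when the wrapped values are changed in place, and how overriding it with a plain property in a subclass avoids that.

```python
from functools import cached_property


class Values:
    """Hypothetical stand-in for an array whose dtype can be swapped in place."""

    def __init__(self, dtype: str) -> None:
        self.dtype = dtype


class Block:
    """Toy block that caches the dtype of the values it wraps."""

    def __init__(self, values: Values) -> None:
        self.values = values

    @cached_property
    def dtype(self) -> str:
        return self.values.dtype


class CategoricalBlock(Block):
    """Toy subclass that recomputes dtype on every access instead of caching."""

    @property
    def dtype(self) -> str:
        return self.values.dtype


cached = Block(Values("category[0, 1]"))
uncached = CategoricalBlock(Values("category[0, 1]"))

# First access: both report the original dtype; Block stores it in its cache.
before = (cached.dtype, uncached.dtype)

# Simulate an in-place category rename on the wrapped values.
cached.values.dtype = "category[No, Yes]"
uncached.values.dtype = "category[No, Yes]"

stale = cached.dtype    # still the cached, pre-rename value
fresh = uncached.dtype  # reflects the rename
print(stale, "|", fresh)
```

The trade-off is the usual one: the uncached property costs an attribute lookup on every access, but it cannot disagree with the values it describes.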

Ljupco7 · 2023-07-07T15:08:58Z

Can anyone confirm this bug has been fixed?
I am introducing category as a dtype to a few of my dataframe columns to improve performance.
I have the following code and I keep getting nan values for actual strings in a dataframe column.
This is the way I set a column dtype in a dataframe to be of category type (I did not find another way to do this. If there is a better solution please let me know).

acc_domain_values = ['ACTIVE, INACTIVE', 'BLOCKED', 'UNKNOWN']
acc_code_Dtype = pd.api.types.CategoricalDtype(categories=acc_domain_values, ordered=False)
df = pd.read_csv(file_name,
                 engine='c',
                 delimiter=';',
                 on_bad_lines='warn',
                 low_memory=True,
                 dtype={'ACC_CODE': acc_code_Dtype})

The purpose of the following code is to get every value in that column that does not belong to acc_domain_values.

acc_code_incorrect_val_list = []
acc_code_incorrect_val_list = df.query('ACC_CODE not in @acc_domain_values')

What is weird is that I get the following results in the console when I print the unique values of the ACC_CODE column like this:

vari = df['ACC_CODE'].unique()

for element in vari :
    print(element)

Results:

nan
BLOCKED
UNKNOWN

It seems as though pandas treats these two values from the acc_domain_values list above as NaN (nan):
'ACTIVE, INACTIVE'
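
One possible explanation, not confirmed anywhere in this thread: in the category list as posted, `'ACTIVE, INACTIVE'` is a single string rather than two separate categories, and pandas turns any value that is not among the declared categories into NaN. The sketch below demonstrates that behavior with `pd.Categorical`; the `values` list is a hypothetical stand-in for the CSV contents, and `assumed_intent` is a guess at the intended category list.

```python
import pandas as pd

# Category list as posted: the first element is the single string 'ACTIVE, INACTIVE'.
as_posted = ['ACTIVE, INACTIVE', 'BLOCKED', 'UNKNOWN']
# Assumed intent: four separate categories.
assumed_intent = ['ACTIVE', 'INACTIVE', 'BLOCKED', 'UNKNOWN']

# Hypothetical column values standing in for the CSV contents.
values = ['ACTIVE', 'INACTIVE', 'BLOCKED', 'UNKNOWN']

posted_result = pd.Categorical(values, categories=as_posted)
intended_result = pd.Categorical(values, categories=assumed_intent)

print(posted_result.isna().tolist())    # [True, True, False, False]
print(intended_result.isna().tolist())  # [False, False, False, False]
```

If that is what happened, it would be unrelated to the regression reported at the top of this issue.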

jbrockmendel · 2023-07-07T19:36:59Z

The bug in the OP should be fixed, yes. If you think the problem persists, please open a new issue with a reproducible example (i.e. without read_csv)

mfstabile added the Bug and Needs Triage labels Aug 31, 2021

simonjayhawkins added the Needs Info label Sep 1, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Sep 4, 2021

code sample for pandas-dev#43334

a9afce6

simonjayhawkins removed the Needs Info label Sep 4, 2021

simonjayhawkins added this to the 1.3.3 milestone Sep 4, 2021

simonjayhawkins added the Regression, Categorical, and Internals labels and removed the Needs Triage label Sep 4, 2021

simonjayhawkins mentioned this issue Sep 11, 2021

BUG: reorder_categories with inplace=True is not changing the dtype.categories #43232

Closed

3 tasks

simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021

jbrockmendel mentioned this issue Sep 16, 2021

BUG: .cat changing dtype inplace #43597

Merged

5 tasks

jreback closed this as completed in #43597 Sep 17, 2021
