BUG: pd.Categorical turns all values into NaN #43334
Comments
Hi, could you please post a minimal and reproducible example? See https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
Hello. I hope the following code sample is helpful.

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                     'Sex': [0, 1, 1]
                     })
data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')
data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
print(data)

The code above produces the following output:
If I invert the pd.Categorical lines:

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                     'Sex': [0, 1, 1]
                     })
data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')
data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
print(data)

The code above produces the following output:
Changing the Sex Series from int to string:

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                     'Sex': ['female', 'male', 'male']
                     })
data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')
data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
print(data)

The code above produces the expected output:
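For comparison, here is a minimal sketch of one way to get the labelled columns without assigning to `.cat.categories` and then re-wrapping the column in `pd.Categorical`. It assumes the integer codes 0 and 1 should map to the labels in that order, and it relies on the existing pandas method `Series.cat.rename_categories`. It is a possible workaround only, not the fix adopted in pandas.

```python
import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                     'Sex': [0, 1, 1]})

# Assumption: astype('category') on integers yields the sorted categories [0, 1],
# so the new labels line up as 0 -> first label, 1 -> second label.
data['Survived'] = data['Survived'].astype('category').cat.rename_categories(['No', 'Yes'])
data['Sex'] = data['Sex'].astype('category').cat.rename_categories(['female', 'male'])

print(data)
```

Because `rename_categories` returns a new Series that is assigned back in a single step, the frame never holds a Categorical column whose categories were changed in place.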
Hi all, I am picking up this task and will post an update soon, around the end of September.
The fix suggested in #43232 (comment) also fixes this issue:

diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py
index e3fcff1557..ee38b4ce56 100644
--- a/pandas/core/internals/blocks.py
+++ b/pandas/core/internals/blocks.py
@@ -331,7 +331,6 @@ class Block(PandasObject):
     def shape(self) -> Shape:
         return self.values.shape
-    @final
     @cache_readonly
     def dtype(self) -> DtypeObj:
         return self.values.dtype
@@ -1867,6 +1866,10 @@ class CategoricalBlock(ExtensionBlock):
     # this Block type is kept for backwards-compatibility
     __slots__ = ()
+    @property
+    def dtype(self) -> DtypeObj:
+        return self.values.dtype
+
     # -----------------------------------------------------------------
     # Constructor Helpers
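The thread does not spell out why this diff helps. One plausible reading is that a cached `dtype` on the block can go stale when the underlying values change their dtype in place, and that the uncached override avoids this. The sketch below illustrates only that general mechanism. The classes are hypothetical stand-ins, not pandas internals, and `functools.cached_property` is used in place of pandas' `cache_readonly`.

```python
from functools import cached_property


class Values:
    """Hypothetical stand-in for an array whose dtype can be replaced in place."""

    def __init__(self, dtype):
        self.dtype = dtype


class Block:
    def __init__(self, values):
        self.values = values

    @cached_property  # stand-in for cache_readonly: computed once, then stored
    def dtype(self):
        return self.values.dtype


class CategoricalBlock(Block):
    @property  # uncached override, mirroring the shape of the diff
    def dtype(self):
        return self.values.dtype


cached = Block(Values("category[0, 1]"))
uncached = CategoricalBlock(Values("category[0, 1]"))
_ = cached.dtype, uncached.dtype  # first access fills the cache on Block only

# Simulate an in-place change of the categories on the underlying values.
cached.values.dtype = "category['No', 'Yes']"
uncached.values.dtype = "category['No', 'Yes']"

stale = cached.dtype    # still the value stored at first access
fresh = uncached.dtype  # recomputed from the values on every access
print(stale, fresh)
```

The trade-off in this toy version is one extra attribute lookup per access in exchange for a `dtype` that always matches the current values.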
Can anyone confirm this bug has been fixed?

acc_domain_values = ['ACTIVE, INACTIVE', 'BLOCKED', 'UNKNOWN']
acc_code_Dtype = pd.api.types.CategoricalDtype(categories=acc_domain_values, ordered=False)
df = pd.read_csv(file_name,
                 engine='c',
                 delimiter=';',
                 on_bad_lines='warn',
                 low_memory=True,
                 dtype={
                     'ACC_CODE': acc_code_Dtype,
                 })

My purpose in the following code is to get every value in that column that does not belong to acc_domain_values.

acc_code_incorrect_val_list = []
acc_code_incorrect_val_list = df.query('ACC_CODE not in @acc_domain_values')

What is weird is that I get the following results in the console when I print the contents of acc_code_incorrect_val_list like this:

vari = df['ACC_CODE'].unique()
for element in vari:
    print(element)

Results:
It seems as though pandas treats these two values from the acc_domain_values list above as NaN (nan) values:
The bug in the OP should be fixed, yes. If you think the problem persists, please open a new issue with a reproducible example (i.e. without read_csv).
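A reproducible example without read_csv might look like the sketch below. It uses a hypothetical in-memory Series in place of the ACC_CODE column, with made-up sample values including 'CLOSED'. It also assumes that the first element of the posted list, 'ACTIVE, INACTIVE', was meant to be two separate strings. That is a guess about the poster's intent, not something the thread confirms. What is documented pandas behavior is that values absent from a CategoricalDtype's categories become NaN on conversion.

```python
import pandas as pd

# As posted: the first element is ONE string, 'ACTIVE, INACTIVE'.
posted_values = ['ACTIVE, INACTIVE', 'BLOCKED', 'UNKNOWN']
# Assumed intent (not confirmed in the thread): four separate categories.
intended_values = ['ACTIVE', 'INACTIVE', 'BLOCKED', 'UNKNOWN']

# Hypothetical stand-in for the ACC_CODE column of the CSV file.
raw = pd.Series(['ACTIVE', 'INACTIVE', 'BLOCKED', 'UNKNOWN', 'CLOSED'],
                name='ACC_CODE')

posted_dtype = pd.api.types.CategoricalDtype(categories=posted_values, ordered=False)
as_posted = raw.astype(posted_dtype)

# Values that are not among the dtype's categories were converted to NaN.
nan_under_posted = raw[as_posted.isna()].tolist()

# To list out-of-domain values, test membership BEFORE casting to the
# categorical dtype; after the cast they are NaN and the originals are lost.
invalid = raw[~raw.isin(intended_values)].tolist()

print(nan_under_posted)
print(invalid)
```

Under these assumptions the NaN values come from the category list rather than from the bug reported in this issue, which is why checking membership on the raw strings is the safer way to find invalid codes.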
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample
Problem description
This code sample reads the popular Kaggle titanic file. When reading the titanic.xlsx file, the following Data Set is generated:
When I execute the code above, the result displayed on the terminal is as follows:
As can be seen, all values in the "Survived" Series are now NaN. The expected behavior, however, would be for the values to become "Yes" or "No". Strangely, if I swap the penultimate and antepenultimate lines, I get the following code sample:
The result generated by the code sample above is the expected one, as shown below.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 5f648bf
python : 3.9.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : pt_BR.UTF-8
LOCALE : pt_BR.UTF-8
pandas : 1.3.2
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.15
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None