Force boolean column to category while reading a csv #20498
Comments
@svgsponer should the line Hmm perhaps not, because that won't get your expected output. Regardless, something strange seems to be going on here.
Ahh, note the difference:
In [15]: pd.read_csv(StringIO("A\nTrue\nFalse\nTrue"), dtype={"A": pd.api.types.CategoricalDtype(['True', 'False'])})
Out[15]:
       A
0   True
1  False
2   True
In [16]: pd.read_csv(StringIO("A\nTrue\nFalse\nTrue"), dtype={"A": pd.api.types.CategoricalDtype([True, False])})
Out[16]:
     A
0  NaN
1  NaN
2  NaN
We do some checking for numeric dtypes, I'm not sure about booleans.
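For anyone hitting this in the meantime, one possible workaround is to parse the column as a plain boolean first and only then cast it to a categorical with boolean categories. This is a minimal sketch, not the eventual fix: it reuses the CSV text from the session above, and the names CSV_TEXT and bool_cat are made up for illustration.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

# Same data as in the session above.
CSV_TEXT = "A\nTrue\nFalse\nTrue"

# Step 1: have the parser produce a real boolean column.
df = pd.read_csv(StringIO(CSV_TEXT), dtype={"A": bool})

# Step 2: cast to a categorical after parsing, so the boolean categories
# are matched against boolean values rather than the raw strings.
bool_cat = CategoricalDtype([False, True])
df["A"] = df["A"].astype(bool_cat)

print(df["A"].tolist())
print(df["A"].dtype)
```

The cast happens after parsing, so it should not depend on how read_csv treats boolean categories in a given pandas version.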
The issue is likely in pandas/pandas/core/arrays/categorical.py Lines 490 to 508 in 687cbe2
Everything read by the CSV parser is a string. We do some checking for whether those strings should be converted to a specialized type (numeric, datetime-like). Pandas considers In pandas/pandas/core/arrays/categorical.py Line 516 in 687cbe2
if dtype.categories.is_boolean() .
I'll get to this next week unless someone beats me to it.
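To make the mechanism in the comment above concrete, here is a small standard-library sketch of why string values never match boolean categories unless they are converted first. It is not pandas' implementation: parse_bool, encode_column, and the coerce flag are hypothetical names, and the accepted true/false spellings are an assumption.

```python
import csv
from io import StringIO

# Assumed spellings; a real parser may accept more or fewer.
TRUE_STRINGS = {"True", "TRUE", "true"}
FALSE_STRINGS = {"False", "FALSE", "false"}


def parse_bool(text):
    """Map a raw CSV string to True/False, or None if it is neither."""
    if text in TRUE_STRINGS:
        return True
    if text in FALSE_STRINGS:
        return False
    return None


def encode_column(raw_values, categories, coerce=True):
    """Return one category code per value, with -1 meaning 'missing'."""
    values = list(raw_values)
    # Only convert when every category is a boolean and coercion is on.
    if coerce and all(isinstance(c, bool) for c in categories):
        values = [parse_bool(v) for v in values]
    lookup = {category: code for code, category in enumerate(categories)}
    return [lookup.get(v, -1) for v in values]


rows = list(csv.reader(StringIO("A\nTrue\nFalse\nTrue")))
raw = [row[0] for row in rows[1:]]  # everything arrives as a string

print(encode_column(raw, ["True", "False"]))          # string categories match
print(encode_column(raw, [True, False], coerce=False))  # strings vs booleans: no match
print(encode_column(raw, [True, False]))              # converted first: match
```

The middle call mirrors the all-NaN result reported above: the strings are looked up against boolean categories and nothing matches.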
Actually, I might just claim this issue to use for a talk I'm giving next week on how to contribute to pandas.
@TomAugspurger Sorry, forgot to add the 'c' column here after I edited my test code. It is updated now.
Sorry, dropped the ball on this. Will try to resurrect my PR.
Previously, was being parsed as object instead of boolean. Closes pandas-devgh-20498. Original Author: @TomAugspurger Rebased by @gfyoung due to merge conflicts.
Problem description
Reading a CSV file with dtypes specified in a dictionary produces NaNs for boolean columns read as category. The output of df.dtypes appears to have changed between versions 0.20 and 0.21, so the code below now produces NaNs for the second column.
This might not really be a bug, but it seems quite unintuitive to me: I would expect the output of df.dtypes to be usable for reading the file back correctly, and I would expect interpreting booleans as a category to be valid.
Code to reproduce the issue
Produced Output
Expected Output
Output of df.dtypes version 0.22
{'a': dtype('int64'), 'b': CategoricalDtype(categories=[False, True], ordered=False), 'c': CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)}
Output of df.dtypes version 0.20
{'a': dtype('int64'), 'b': category, 'c': category}
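A possible way to round-trip the saved dtypes without passing the categorical ones to the parser is sketched below. This is not the original reproduction code: the CSV contents are an assumption chosen to match the dtypes listed above, and names such as saved_dtypes, categorical, and plain are illustrative only.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed data: an int column, a boolean column, and a string column.
csv_text = "a,b,c\n1,True,A\n2,False,B\n3,True,C\n"

# Build a frame with categorical columns and save its dtypes.
original = pd.read_csv(StringIO(csv_text)).astype({"b": "category", "c": "category"})
saved_dtypes = original.dtypes.to_dict()

# Split the saved dtypes: categorical ones are applied after parsing,
# everything else is handed to the parser directly.
categorical = {
    col: dtype
    for col, dtype in saved_dtypes.items()
    if isinstance(dtype, CategoricalDtype)
}
plain = {col: dtype for col, dtype in saved_dtypes.items() if col not in categorical}

restored = pd.read_csv(StringIO(csv_text), dtype=plain).astype(categorical)

print(restored.dtypes)
print(restored["b"].tolist())
```

Because column b is inferred as boolean before the cast, its values line up with the boolean categories instead of turning into NaN.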
Version
Pandas version 0.22
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.6-040806-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
LOCALE: en_IE.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.2
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None