-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: read_csv converts NaN to 1.0 in certain circumstances #42808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The problem is related to the C parser and it is present on master (01076c6). from io import StringIO
import pandas as pd
data = '''0
""
True
False
'''
c = pd.read_csv(StringIO(data), dtype='float', engine='c', skip_blank_lines=False)
p = pd.read_csv(StringIO(data), dtype='float', engine='python', skip_blank_lines=False) The expected result can be obtained from the Python parser:
|
Thanks @tristanlmiller and @dicristina! Further investigations and a PR to fix are most welcome! |
The C parser has the same behavior when from io import StringIO
import pandas as pd
data = '''0
""
True
False
'''
c = pd.read_csv(StringIO(data), dtype='int', engine='c', skip_blank_lines=False) Instead we get:
|
Thanks for pointing this out @dicristina, I've added it to my fix linked in the PR above. This problem comes up in an older issue #34120. The issue is caused by some funky handling of bools-with-nans in the C parser. I've added tests for converting to bool, int, and float. Can you think of any other dtypes that could behave poorly here? |
@joelgibson I could not find any more examples where the NA values were misrepresented in the result but I found that for Now I have some doubts as to whether this is something that should work at all. At first glance the answer seems to be affirmative because we think of boolean values as a subset of the integers so it is reasonable to be able to do this kind of upcasting. But is this something that should be triggered by our choice of the |
@dicristina I don't know what the right answer is - this change brings the C parser in-line with what the Python parser is already doing, which. As to whether the actual behaviour is sensible, I find it hard to tell because I have no real mental model of the casting rules when reading from a CSV. For example I find the following behaviour bizarre: import pandas as pd
from io import StringIO
csv_data = """0
NaN
True
False
"""
df1 = pd.read_csv(StringIO(csv_data))
print(repr(df1["0"].values))
df2 = pd.read_csv(StringIO(csv_data), dtype=object)
print(repr(df2["0"].values)) Which prints out
I would expect that explicitly passing the same type which is eventually inferred would always get the same result, but it doesn't. When you say "is this something that should be triggered by our choice of the dtype parameter of pd.read_csv?" I don't know the answer, since I can't find the actual logic for how the dtype parameter is supposed to work written down anywhere (and examples like the above show it might be complicated, or perhaps simple with some exceptions). Now that true nullable types exist in pandas, the CSV parsing code (especially the dtype conversions) could potentially be simplified a lot, but it's hard to do without some mental model of how the conversions are meant to work. |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Output:
Problem description
When reading a csv file containing NaNs mixed with booleans, and when told to convert to a float, pandas converts NaN to 1.0.
Expected Output
I'd expect NaN to be read as NaN. It should behave the same as when I do this:
Output:
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : 2cb9652
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.232-123.381.amzn1.x86_64
Version : #1 SMP Wed May 19 18:00:09 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1
setuptools : 49.6.0.post20210108
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.04.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.0
pyxlsb : None
s3fs : 2021.04.0
scipy : 1.6.3
sqlalchemy : 1.4.11
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1
The text was updated successfully, but these errors were encountered: