BUG: python and c engines for read_csv treat blank spaces differently, when using converters #13576
Comments
I guess. You are doing something really really odd here by converting an explicitly float column. To be honest the cc @gfyoung
Fair enough. I don't think it is a pressing bug, but probably a bug nonetheless.... I came across it when I was trying to add converters to
No I haven't. I'm not exactly sure what the best fix would be. @gfyoung, please go ahead.
For future reference, here is a (more) minimal example to reproduce this:
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a,b\n,1'
>>> read_csv(StringIO(data), converters={0: str}, engine='c').values # correct
array([['', 1]], dtype=object)
>>> read_csv(StringIO(data), converters={0: str}, engine='python').values # incorrect
array([[nan, 1]], dtype=object)
Sigh... I thought it was simple. Then I thought about it again (and saw tests fail), and I see now that it is another manifestation of the major discrepancy between the Python and the C engines. I thought the Python engine was bugged, but now I realise that it isn't. The reason the logic is written that way is because we want to fully control which values are So why the C engine does not convert to Arguably, this is a dupe of my issue here. @jreback, you can be the judge whether this one here can be closed in light of this other issue.
Code Sample, a copy-pastable example if possible
Expected Output
Notice that the output for the python engine and the c engine are different. I am not sure which one is preferable/expected.
output of pd.show_versions()
The culprit seems to be this function: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1299
Before that call, results and values both contain the '' string in the python engine version of the parser. That call changes the '' value in values to nan and thus changes the value of results as well. Specifically, the change for the python engine happens here: https://github.com/pydata/pandas/blob/0c6226cbbc319ec22cf4c957bdcc055eaa7aea99/pandas/src/inference.pyx#L1009
It seems that na_values contains '' by default for certain parsing calls. This means that val == '' triggers the nan conversion whether or not convert_empty is true.
I haven't tracked down the change in the c engine.
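To illustrate the behaviour described above, here is a rough pure-Python sketch. It is not the pandas source: the function name sanitize_sketch and its exact condition are assumptions made for illustration, modelled only on the description in this issue (an in-place pass that replaces a value with nan when it is empty and convert_empty is set, or when it appears in na_values).

```python
import math


def sanitize_sketch(values, na_values, convert_empty=True):
    """Hypothetical, simplified model of the NA-replacement step (not pandas code).

    Mutates `values` in place and returns how many entries were replaced.
    """
    na_count = 0
    for i, val in enumerate(values):
        # Either condition alone is enough to trigger the replacement.
        if (convert_empty and val == '') or val in na_values:
            values[i] = math.nan
            na_count += 1
    return na_count


# `results` and `values` refer to the same list, as in the report above,
# so an in-place change to one is visible through the other.
values = ['', '1']
results = values

# '' is in na_values, so it is replaced even though convert_empty is False.
count = sanitize_sketch(values, na_values={''}, convert_empty=False)

# With an empty na_values set and convert_empty=False, nothing is replaced.
untouched = ['', '1']
untouched_count = sanitize_sketch(untouched, na_values=set(), convert_empty=False)
```

Under these assumptions, the converter's '' result survives only when '' is absent from na_values and convert_empty is false, which would explain why turning off convert_empty alone does not help in the python engine.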