BUG: python and c engines for read_csv treat blank spaces differently, when using converters #13576

Closed
gte620v opened this issue Jul 7, 2016 · 7 comments
Labels
Bug Compat pandas objects compatability with Numpy or Python functions Duplicate Report Duplicate issue or pull request IO CSV read_csv, to_csv

Comments

@gte620v
Contributor

gte620v commented Jul 7, 2016

Code Sample, a copy-pastable example if possible

In [109]:
import numpy as np
import pandas as pd

data = np.array([['c1', 'c2'],
                 ['', 0.285],
                 [10.1, 0.285]], dtype=object)

In [110]:
pd.DataFrame(data).to_csv('test.csv',header=False,index=False)

In [111]:
!cat test.csv
Out[111]:
c1,c2
,0.285
10.1,0.285

In [113]:
pd.read_csv('test.csv',converters={'c1':str},engine='c').values
Out[113]:
array([['', 0.285],
       ['10.1', 0.285]], dtype=object)

In [114]:
pd.read_csv('test.csv',converters={'c1':str},engine='python').values
Out[114]:
array([[nan, 0.285],
       ['10.1', 0.285]], dtype=object)

Expected Output

Notice that the python engine and the c engine produce different output. I am not sure which one is preferable/expected.

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 13.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 21.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.39.0

Details

The culprit seems to be this function: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1299

Before that call, results and values both contain the '' string in the python engine version of the parser. That call changes the '' value in values to nan and thus changes the value of results as well. Specifically, the change for the python engine happens here: https://github.com/pydata/pandas/blob/0c6226cbbc319ec22cf4c957bdcc055eaa7aea99/pandas/src/inference.pyx#L1009

if (convert_empty and val == '') or (val in na_values)

It seems that na_values contains '' by default for certain parsing calls. This means that val == '' triggers the nan conversion whether or not convert_empty is true.
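
To make the effect of that condition concrete, here is a small pure-Python sketch. It is not the actual Cython code in inference.pyx: the function name `becomes_nan` and the sample `na_values` sets are made up for illustration, and only the boolean expression is taken from the line quoted above.

```python
def becomes_nan(val, na_values, convert_empty):
    """Mirror the quoted condition: True means the value would be replaced by NaN."""
    return (convert_empty and val == '') or (val in na_values)


# Hypothetical set of NA markers that, like the default, includes the empty string.
na_values = {'', 'NA', 'NaN'}

# convert_empty=False does not protect '' once '' is itself listed in na_values.
print(becomes_nan('', na_values, convert_empty=False))      # True
# With '' removed from the NA markers, the empty string survives.
print(becomes_nan('', {'NA', 'NaN'}, convert_empty=False))  # False
# Ordinary values are unaffected either way.
print(becomes_nan('10.1', na_values, convert_empty=False))  # False
```

So under this reading, the second half of the `or` is what turns '' into nan in the python engine, independent of convert_empty.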

I haven't tracked down the change in the c engine.

@jreback
Contributor

jreback commented Jul 7, 2016

I guess. You are doing something really odd here by converting an explicitly float column. To be honest, the converters argument is not very idiomatic and is non-performant. These types of conversions (if you really wanted to do them) are much better performed after reading/parsing.

cc @gfyoung

@jreback jreback added Bug IO CSV read_csv, to_csv Compat pandas objects compatability with Numpy or Python functions Difficulty Intermediate labels Jul 7, 2016
@jreback jreback added this to the Next Major Release milestone Jul 7, 2016
@jreback jreback changed the title from "BUG: python and c engines for read_csv treat blank spaces differently" to "BUG: python and c engines for read_csv treat blank spaces differently, when using converters" Jul 7, 2016
@gte620v
Contributor Author

gte620v commented Jul 8, 2016

Fair enough. I don't think it is a pressing bug, but probably a bug nonetheless. I came across it when I was trying to add converters to read_html.

@gfyoung
Member

gfyoung commented Jul 8, 2016

This is definitely a bug in the Python engine (C engine looks fine AFAICT) and can be easily fixed as @gte620v pointed out. PR should be on the way unless @gte620v you've already started.

@gte620v
Contributor Author

gte620v commented Jul 8, 2016

No I haven't. I'm not exactly sure what the best fix would be. @gfyoung, please go ahead.

@gfyoung
Member

gfyoung commented Jul 8, 2016

For future reference, here is a (more) minimal example to reproduce this:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = 'a,b\n,1'
>>> read_csv(StringIO(data), converters={0: str}, engine='c').values  # correct
array([['', 1]], dtype=object)
>>> read_csv(StringIO(data), converters={0: str}, engine='python').values  # incorrect
array([[nan, 1]], dtype=object)

@gfyoung
Member

gfyoung commented Jul 8, 2016

Sigh...I thought it was simple. Then I thought about it again (and saw tests fail), and I see now that it is another manifestation of the major discrepancy between the Python and the C engines.

I thought the Python engine was bugged, but now I realise that it isn't. The logic is written that way because we want to fully control which values are NaN when we pass in na_values, which is why we pass in convert_empty=False.

So the reason the C engine does not convert to NaN is this issue with converters here. Notice that no NaN processing is done if you have a converter.
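
As a rough illustration of the discrepancy described above, here is a toy pure-Python model of how each engine treats a single cell. None of this is pandas code: the function names, the `NA_VALUES` set, and the per-cell structure are assumptions made for the sketch. It only encodes the two behaviours reported in this thread, namely that the Python engine still applies the NA check after the converter runs, while the C engine does no NaN processing for a column that has a converter.

```python
import math

# Assumed stand-in for the default NA markers; note that it includes ''.
NA_VALUES = {'', 'NA', 'NaN'}


def python_engine_cell(raw, converter=None, na_values=NA_VALUES):
    """Toy model: the converter runs, then the NA check still applies."""
    val = converter(raw) if converter is not None else raw
    return math.nan if val in na_values else val


def c_engine_cell(raw, converter=None, na_values=NA_VALUES):
    """Toy model: a column with a converter skips NA processing entirely."""
    if converter is not None:
        return converter(raw)  # early exit, no NaN handling for this column
    return math.nan if raw in na_values else raw


print(repr(c_engine_cell('', converter=str)))           # ''
print(repr(python_engine_cell('', converter=str)))      # nan
print(repr(python_engine_cell('10.1', converter=str)))  # '10.1'
```

In this model the two engines agree whenever no converter is passed, which matches the observation that the difference only shows up when using converters.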

Arguably, this is a dupe of my issue here. @jreback, you can be the judge whether this one here can be closed in light of this other issue.

@jreback
Contributor

jreback commented Jul 8, 2016

@gfyoung yes I agree, #13302 is a generalization of this. I don't know why that continue is there in the c-engine, but I suspect it will be non-trivial to remove :>

@jreback jreback closed this as completed Jul 8, 2016
@jreback jreback added the Duplicate Report Duplicate issue or pull request label Jul 8, 2016
@jreback jreback modified the milestones: No action, Next Major Release Jul 8, 2016
3 participants