
DOC: document the perils of reading mixed dtypes / how to handle #13746


Closed
pkch opened this issue Jul 22, 2016 · 4 comments
Labels
Docs Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv
Comments


pkch commented Jul 22, 2016

Code Sample, a copy-pastable example if possible

See for example http://stackoverflow.com/questions/18471859/pandas-read-csv-dtype-inference-issue
(the behavior hasn't changed in the 3 years since, except that a warning is now issued)

import pandas as pd

df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')  # DtypeWarning issued here
print(df2['a'].unique())  # ['1' 'X' 1]
for a in df2['a'][262140:262150]:
    print(repr(a))  # alternates between '1' and 1

Similar examples: the column that contains '1' and 1 is inferred as string; one that contains 1 and '1' is inferred as numeric.

import io

df = pd.read_csv(io.StringIO('1, "1"\n"1",1'), header=None)  # no warning
df.dtypes  # column 0 is int64, column 1 is object (str)

Furthermore, pd.to_numeric(df[1]) won't actually work in the above case.

While the behavior is understandable, it is unexpected to many users and isn't documented (beyond the general warning that type inference isn't perfect). Given how many people use pandas to read csv files without knowing the history of the library or the mechanics of type inference, this results in a lot of wasted time identifying and fixing a problem that could have been prevented.

I suggest at least adding clear documentation on this, but preferably changing the default behavior to fail with an error stating that type inference failed and that dtypes should be explicitly provided. (I don't think it's possible to use two passes, since the input data may be a generator that is exhausted after the first read.)
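For reference, explicitly providing dtypes already sidesteps the inference problem today. A minimal sketch (not from the issue; the inline CSV is a made-up stand-in for the example file above), assuming the dtype keyword of read_csv:

```python
import io
import pandas as pd

# A column that would otherwise come back with mixed str/int values.
csv = io.StringIO('a\n1\nX\n1\n')

# Passing an explicit dtype forces every value to be read as a string,
# so no chunk-by-chunk inference happens.
df = pd.read_csv(csv, dtype={'a': str})
print(df['a'].tolist())  # ['1', 'X', '1'] -- uniformly strings
```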

Expected Output

the column


@chris-b1
Contributor

I could see an argument for the first case raising - although it's been this way forever and the warning is decent and gives explicit corrective action?

df2 = pd.read_csv('test', sep='\t')
C:\Users\Chris\Anaconda\lib\site-packages\IPython\core\interactiveshell.py:2705: DtypeWarning: 
Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

Case 2 is a different issue altogether - you have a space in your data and need to pass skipinitialspace=True

In [26]: from pandas.compat import StringIO
    ...: df = pd.read_csv(StringIO('1, "1"\n"1",1'), header=None, skipinitialspace=True)
    ...: df
Out[26]: 
   0  1
0  1  1
1  1  1

In [27]: df.dtypes
Out[27]: 
0    int64
1    int64
dtype: object

@pkch
Author

pkch commented Jul 22, 2016

@chris-b1 Understood. Perhaps then clarify the docs?
The most common case I see where this was a problem is this:

df = pd.read_csv(io.StringIO('1,\n"",\n3a,\n4,'), header=None)

Now df[0] is inferred as object because some values couldn't be converted. But there's no easy way to identify which rows caused the problem. Applying pd.to_numeric(df[0], errors='ignore') leaves the entire column as strings as long as at least one value couldn't be parsed (is that a bug, btw?). Applying pd.to_numeric(df[0], errors='coerce') would not separate the (usually numerous) empty strings that are interpreted as NaN from the actual data issues that cause conversion to fail.

The ideal solution would be to produce an optional log of parse errors. Barring that, perhaps modify pd.to_numeric to ignore only the unconvertible values rather than the entire column.
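The element-wise behavior asked for here can be approximated in user code today. A sketch, assuming the to_numeric_elementwise helper is hypothetical (it is not a pandas API) and using the example column from above:

```python
import io
import pandas as pd

def to_numeric_elementwise(s):
    """Convert each value that parses as a number; keep the rest as-is."""
    def convert(v):
        try:
            return pd.to_numeric(v)
        except (ValueError, TypeError):
            return v  # leave only the unconvertible value untouched
    return s.map(convert)

df = pd.read_csv(io.StringIO('1,\n"",\n3a,\n4,'), header=None)
print(to_numeric_elementwise(df[0]).tolist())  # [1, nan, '3a', 4]
```

Calling to_numeric per element is slow on large columns, but it shows the semantics: numbers are converted, empty strings stay NaN, and only '3a' survives as a string.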

@sinhrks sinhrks added the IO CSV read_csv, to_csv label Jul 22, 2016
@jorisvandenbossche
Member

@pkch For converting the object column to numeric, you need to set errors='coerce':

In [61]: pd.to_numeric(df[0], errors='coerce')
Out[61]:
0    1.0
1    NaN
2    NaN
3    4.0
Name: 0, dtype: float64

As the docstring says, with errors='ignore' "invalid parsing will return the input". But maybe it can be made clearer that the entire input is returned, not just the single element that failed parsing. (PRs always welcome!)

If you want to find the values that caused the conversion to fail, you can drop the NaN values before applying to_numeric, or compare where there were NaN values before and after. E.g. this identifies the problematic values:

df0_converted = pd.to_numeric(df[0], errors='coerce')
df[0].notnull() & df0_converted.isnull()
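Put together as a runnable sketch, using the example column from earlier in the thread:

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO('1,\n"",\n3a,\n4,'), header=None)
converted = pd.to_numeric(df[0], errors='coerce')

# Rows that were non-null before conversion but NaN after are the real
# parse failures; the empty string was already NaN before converting.
bad = df[0].notnull() & converted.isnull()
print(df[0][bad])  # index 2 -> '3a'
```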

It would maybe also be nice if to_numeric(df[0], errors='raise') actually showed the failing string.

@sinhrks sinhrks added the Dtype Conversions Unexpected or buggy dtype conversions label Jul 22, 2016
@jreback
Contributor

jreback commented Jul 22, 2016

yeah at best this is a doc issue - a nice note section under dtypes in io.rst / cookbook on the perils of having mixed dtypes.

@jreback jreback added this to the Next Major Release milestone Jul 22, 2016
@jreback jreback changed the title read_csv type inference too brittle, needs documentation or redesign DOC: document the perils of reading mixed dtypes / how to handle Jul 22, 2016
@jreback jreback modified the milestones: 0.19.0, Next Major Release Jul 26, 2016