
DOC: document the perils of reading mixed dtypes / how to handle #13746


Closed
pkch opened this issue Jul 22, 2016 · 4 comments
Labels
Docs Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv
Comments


pkch commented Jul 22, 2016

Code Sample, a copy-pastable example if possible

See for example http://stackoverflow.com/questions/18471859/pandas-read-csv-dtype-inference-issue
(the behavior hasn't changed in the 3 years since, except that a warning is now issued)

import pandas as pd

df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')  # DtypeWarning issued here
print(df2['a'].unique())  # ['1' 'X' 1]
for a in df2['a'][262140:262150]:
    print(repr(a))  # alternates between '1' and 1

Similar examples: the column that contains '1' and 1 is inferred as string; one that contains 1 and '1' is inferred as numeric.

import io

df = pd.read_csv(io.StringIO('1, "1"\n"1",1'), header=None)  # no warning
df.dtypes  # column 0 is int64, column 1 is object (str)

Furthermore, pd.to_numeric(df[1]) won't actually work in the above case.

While the behavior is understandable, it is unexpected to many users and isn't documented (beyond the general warning that type inference isn't perfect). Given how many people use pandas to read csv files without knowing the history of the library or the mechanics of type inference, this results in a lot of wasted time identifying and fixing a problem that could have been prevented.

I suggest at least adding clear documentation on this, but preferably changing the default behavior to fail with an error stating that type inference failed and that dtypes should be explicitly provided. (I don't think it's possible to use two passes, since the input data may be a generator that is exhausted after the first read.)
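For reference, explicitly providing dtypes already sidesteps the inference problem today. A minimal sketch (not from the issue; the inline CSV is a made-up stand-in for the example file above), assuming the dtype keyword of read_csv:

```python
import io
import pandas as pd

# A column that would otherwise come back with mixed str/int values.
csv = io.StringIO('a\n1\nX\n1\n')

# Passing an explicit dtype forces every value to be read as a string,
# so no chunk-by-chunk inference happens.
df = pd.read_csv(csv, dtype={'a': str})
print(df['a'].tolist())  # ['1', 'X', '1'] -- uniformly strings
```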

Expected Output

the column


@chris-b1
Contributor

I could see an argument for the first case raising - although it's been this way forever and the warning is decent and gives explicit corrective action?

df2 = pd.read_csv('test', sep='\t')
C:\Users\Chris\Anaconda\lib\site-packages\IPython\core\interactiveshell.py:2705: DtypeWarning: 
Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

Case 2 is a different issue altogether - you have a space in your data and need to pass skipinitialspace=True

In [26]: from pandas.compat import StringIO
    ...: df = pd.read_csv(StringIO('1, "1"\n"1",1'), header=None, skipinitialspace=True)
    ...: df
Out[26]: 
   0  1
0  1  1
1  1  1

In [27]: df.dtypes
Out[27]: 
0    int64
1    int64
dtype: object

@pkch
Author

pkch commented Jul 22, 2016

@chris-b1 Understood. Perhaps then clarify the docs?
The most common case I see where this was a problem is this:

df = pd.read_csv(io.StringIO('1,\n"",\n3a,\n4,'), header=None)

Now df[0] is inferred as object because some values couldn't be converted. But there's no easy way to identify which rows caused the problem. Applying pd.to_numeric(df[0], errors='ignore') leaves the entire column as strings as long as at least one value couldn't be parsed (is that a bug, btw?). Applying pd.to_numeric(df[0], errors='coerce') would not separate the (usually numerous) empty strings that are interpreted as NaN from the actual data issues that cause conversion to fail.

The ideal solution would be to produce an optional log of parse errors. Barring that, perhaps modify pd.to_numeric to ignore only the unconvertible values rather than the entire column.
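The element-wise behavior asked for here can be approximated in user code today. A sketch, assuming the to_numeric_elementwise helper is hypothetical (it is not a pandas API) and using the example column from above:

```python
import io
import pandas as pd

def to_numeric_elementwise(s):
    """Convert each value that parses as a number; keep the rest as-is."""
    def convert(v):
        try:
            return pd.to_numeric(v)
        except (ValueError, TypeError):
            return v  # leave only the unconvertible value untouched
    return s.map(convert)

df = pd.read_csv(io.StringIO('1,\n"",\n3a,\n4,'), header=None)
print(to_numeric_elementwise(df[0]).tolist())  # [1, nan, '3a', 4]
```

Calling to_numeric per element is slow on large columns, but it shows the semantics: numbers are converted, empty strings stay NaN, and only '3a' survives as a string.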

@sinhrks sinhrks added the IO CSV read_csv, to_csv label Jul 22, 2016
@jorisvandenbossche
Member

@pkch For converting the object column to numeric, you need to set errors='coerce':

In [61]: pd.to_numeric(df[0], errors='coerce')
Out[61]:
0    1.0
1    NaN
2    NaN
3    4.0
Name: 0, dtype: float64

As the docstring says, with errors='ignore' "invalid parsing will return the input". But maybe it can be made clearer that the entire input is returned, not just the single element that failed parsing. (PRs always welcome!)

If you want to find the values that caused the conversion to fail, you can drop the NaN values before applying to_numeric, or compare where there were NaN values before and after. E.g. this identifies the problematic values:

df0_converted = pd.to_numeric(df[0], errors='coerce')
df[0].notnull() & df0_converted.isnull()
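Put together as a runnable sketch, using the example column from earlier in the thread:

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO('1,\n"",\n3a,\n4,'), header=None)
converted = pd.to_numeric(df[0], errors='coerce')

# Rows that were non-null before conversion but NaN after are the real
# parse failures; the empty string was already NaN before converting.
bad = df[0].notnull() & converted.isnull()
print(df[0][bad])  # index 2 -> '3a'
```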

It would maybe also be nice if to_numeric(df[0], errors='raise') actually showed the failing string.

@sinhrks sinhrks added the Dtype Conversions Unexpected or buggy dtype conversions label Jul 22, 2016
@jreback
Contributor

jreback commented Jul 22, 2016

yeah at best this is a doc issue - a nice note section under dtypes in io.rst / cookbook on the perils of having mixed dtypes.

@jreback jreback added this to the Next Major Release milestone Jul 22, 2016
@jreback jreback changed the title read_csv type inference too brittle, needs documentation or redesign DOC: document the perils of reading mixed dtypes / how to handle Jul 22, 2016
@jreback jreback modified the milestones: 0.19.0, Next Major Release Jul 26, 2016