Commit 335d043

DOC: Added note to io.rst regarding reading in mixed dtypes
1 parent a3cddfa commit 335d043

1 file changed, 39 additions and 0 deletions

doc/source/io.rst

@@ -440,6 +440,45 @@ individual columns:
 Specifying ``dtype`` with ``engine`` other than 'c' raises a
 ``ValueError``.

+.. note::
+
+   Reading in data with mixed dtypes and relying on ``pandas``
+   to infer them is not recommended. When doing so, the parsing engine
+   loops over the candidate dtypes, trying to convert the data to each
+   one; if a conversion fails partway, the engine moves on to the next
+   ``dtype`` and leaves the data modified in place. For example,
+
+   .. ipython:: python
+
+      from collections import Counter
+      df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
+      df.to_csv('foo')
+      mixed_df = pd.read_csv('foo')
+      Counter(mixed_df['col_1'].apply(lambda x: type(x)))
+
+   results in ``mixed_df`` containing ``int`` values for the first
+   262,143 rows and ``str`` values for the others, due to a problem during
+   parsing. Fortunately, ``pandas`` offers a few ways to ensure that a column
+   contains only one ``dtype``. For instance, you could use the ``converters``
+   argument of :func:`~pandas.read_csv`:
+
+   .. ipython:: python
+
+      fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
+      Counter(fixed_df1['col_1'].apply(lambda x: type(x)))
+
+   Or you could use the :func:`~pandas.to_numeric` function to coerce the
+   dtypes after reading in the data,
+
+   .. ipython:: python
+
+      fixed_df2 = pd.read_csv('foo')
+      fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
+      Counter(fixed_df2['col_1'].apply(lambda x: type(x)))
+
+   which converts every valid value to a number (``float``, since the column
+   also holds ``NaN``), leaving the values that fail to parse as ``NaN``.
+
 Naming and Using Columns
 ''''''''''''''''''''''''
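Outside the diff, the two workarounds described in the note can be tried on a small in-memory CSV. This is a minimal sketch, not part of the commit: ``csv_text`` and the variable names are hypothetical stand-ins for the ``foo`` file, and on input this small ``read_csv`` reads the whole column as strings rather than reproducing the mixed-type result.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the 'foo' file: a numeric column with two bad rows.
csv_text = "col_1\n1\n2\na\nb\n3\n"

# Workaround 1: force every value through str() while parsing.
as_str = pd.read_csv(io.StringIO(csv_text), converters={'col_1': str})
str_types = Counter(as_str['col_1'].apply(type))

# Workaround 2: read first, then coerce; unparseable values become NaN.
raw = pd.read_csv(io.StringIO(csv_text))
coerced = pd.to_numeric(raw['col_1'], errors='coerce')

print(str_types)
print(coerced.tolist())
```

Which one fits depends on whether the bad rows should be kept as text (``converters``) or turned into ``NaN`` (``to_numeric``).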
