@@ -440,6 +440,45 @@ individual columns:
Specifying ``dtype`` with ``engine`` other than 'c' raises a
``ValueError``.
+ .. note::
+
+    Reading in data with mixed dtypes and relying on ``pandas``
+    to infer them is not recommended. In doing so, the parsing engine will
+    loop over all the dtypes, trying to convert them to an actual
+    type; if something breaks during that process, the engine will go to the
+    next ``dtype`` and the data is left modified in place. For example,
+
+    .. ipython:: python
+
+        from collections import Counter
+        df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
+        df.to_csv('foo')
+        mixed_df = pd.read_csv('foo')
+        Counter(mixed_df['col_1'].apply(lambda x: type(x)))
+
+    will result in ``mixed_df`` containing an ``int`` dtype for the first
+    262,143 values, and ``str`` for the others due to a problem during
+    parsing. Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
+    contain only one ``dtype``. For instance, you could use the ``converters``
+    argument of :func:`~pandas.read_csv`:
+
+    .. ipython:: python
+
+        fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
+        Counter(fixed_df1['col_1'].apply(lambda x: type(x)))
+
+    Or you could use the :func:`~pandas.to_numeric` function to coerce the
+    dtypes after reading in the data,
+
+    .. ipython:: python
+
+        fixed_df2 = pd.read_csv('foo')
+        fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
+        Counter(fixed_df2['col_1'].apply(lambda x: type(x)))
+
+    which would convert all valid parsing to floats, leaving the invalid parsing
+    as ``NaN``.
+
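+    One more way, sketched here with the purely illustrative name
+    ``fixed_df3``, would be to pass the ``dtype`` argument discussed above so
+    that :func:`~pandas.read_csv` reads the column in as ``str`` from the
+    start:
+
+    .. ipython:: python
+
+        fixed_df3 = pd.read_csv('foo', dtype={'col_1': str})
+        Counter(fixed_df3['col_1'].apply(lambda x: type(x)))
+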
Naming and Using Columns
''''''''''''''''''''''''