Commit 63285a4

wcwagner authored and jreback committed
DOC: Added note to io.rst regarding reading in mixed dtypes
closes pandas-dev#13746

Author: wcwagner <[email protected]>

Closes pandas-dev#13782 from wcwagner/doc/13746 and squashes the following commits:

7400607 [wcwagner] DOC: Added refs to basics.dtypes and basics.object_conversion, added whatsnew entry
8112ad5 [wcwagner] DOC: Shortened note, moved alternatives to main text
ba4c2ce [wcwagner] DOC: Added short commentary on alternatives
b6e2b64 [wcwagner] DOC: Swtiched Counter to value_counts, added low_memory alternative example, clarified type inference process
335d043 [wcwagner] DOC: Added note to io.rst regarding reading in mixed dtypes
1 parent 10da3ae commit 63285a4

3 files changed: +64 -0 lines changed

doc/source/basics.rst

+2
@@ -1751,6 +1751,8 @@ Convert a subset of columns to a specified type using :meth:`~DataFrame.astype`
    dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)
    dft.dtypes
 
+.. _basics.object_conversion:
+
 object conversion
 ~~~~~~~~~~~~~~~~~

doc/source/io.rst

+60
@@ -435,11 +435,71 @@ individual columns:
    df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64})
    df.dtypes
 
+Fortunately, ``pandas`` offers more than one way to ensure that your column(s)
+contain only one ``dtype``. If you're unfamiliar with these concepts, you can
+see :ref:`here<basics.dtypes>` to learn more about dtypes, and
+:ref:`here<basics.object_conversion>` to learn more about ``object`` conversion in
+``pandas``.
+
+
+For instance, you can use the ``converters`` argument
+of :func:`~pandas.read_csv`:
+
+.. ipython:: python
+
+   data = "col_1\n1\n2\n'A'\n4.22"
+   df = pd.read_csv(StringIO(data), converters={'col_1': str})
+   df
+   df['col_1'].apply(type).value_counts()
+
+Or you can use the :func:`~pandas.to_numeric` function to coerce the
+dtypes after reading in the data,
+
+.. ipython:: python
+
+   df2 = pd.read_csv(StringIO(data))
+   df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')
+   df2
+   df2['col_1'].apply(type).value_counts()
+
+which converts every value that parses as a number to a float, leaving the
+values that do not parse as ``NaN``.
+
+Ultimately, how you deal with reading in columns containing mixed dtypes
+depends on your specific needs. In the case above, if you want to replace the
+data anomalies with ``NaN``, then :func:`~pandas.to_numeric` is probably your best
+option. However, if you want all of the data to be coerced, no matter the type,
+then using the ``converters`` argument of :func:`~pandas.read_csv` would certainly
+be worth trying.
+
 .. note::
    The ``dtype`` option is currently only supported by the C engine.
    Specifying ``dtype`` with ``engine`` other than 'c' raises a
    ``ValueError``.
 
+.. note::
+   In some cases, reading in abnormal data with columns containing mixed dtypes
+   will result in an inconsistent dataset. If you rely on pandas to infer the
+   dtypes of your columns, the parsing engine infers the dtypes for different
+   chunks of the data separately, rather than for the whole dataset at once.
+   Consequently, you can end up with column(s) with mixed dtypes. For example,
+
+   .. ipython:: python
+      :okwarning:
+
+      df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
+      df.to_csv('foo')
+      mixed_df = pd.read_csv('foo')
+      mixed_df['col_1'].apply(type).value_counts()
+      mixed_df['col_1'].dtype
+
+   will result in ``mixed_df`` containing ``int`` values for certain chunks
+   of the column, and ``str`` values for others, due to the mixed dtypes in the
+   data that was read in. It is important to note that the overall column will be
+   marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes.
+
+
+
 Naming and Using Columns
 ''''''''''''''''''''''''
 
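The note added to io.rst describes dtype inference happening chunk by chunk. The sketch below is not part of the commit; it illustrates two ways a reader might avoid the resulting mixed column, assuming a reasonably recent pandas. The squashed commit message mentions a ``low_memory`` alternative, but that example is not in the final diff, so the ``low_memory=False`` call here is an illustration rather than the author's text. The helper ``python_types`` and the in-memory CSV (used instead of the ``'foo'`` file) are hypothetical names introduced only for this example.

```python
import warnings
from io import StringIO

import pandas as pd

# Same shape as the data in the note: integers with two stray strings in the middle.
values = list(range(500000)) + ['a', 'b'] + list(range(500000))
csv_text = "col_1\n" + "\n".join(str(v) for v in values) + "\n"


def python_types(series):
    """Return the set of Python types found among the values of a column."""
    return set(series.map(type))


# Default read: dtypes may be inferred per chunk, which can emit a DtypeWarning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    inferred = pd.read_csv(StringIO(csv_text))

# Option 1: state the dtype up front so no inference takes place.
as_str = pd.read_csv(StringIO(csv_text), dtype={'col_1': str})

# Option 2: parse in a single pass so inference sees the whole column at once.
one_pass = pd.read_csv(StringIO(csv_text), low_memory=False)

print(python_types(inferred['col_1']))
print(python_types(as_str['col_1']))
print(python_types(one_pass['col_1']))
```

Both options trade memory or an up-front decision for consistency: an explicit ``dtype`` skips inference entirely, while ``low_memory=False`` still infers but does so over the whole column. What the default read prints may vary between pandas versions, so no output is shown for it.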

doc/source/whatsnew/v0.19.0.txt

+2
@@ -323,6 +323,8 @@ Other enhancements
                      index=['row1', 'row2'])
    df.sort_values(by='row2', axis=1)
 
+- Added documentation to :ref:`I/O<io.dtypes>` regarding the perils of reading in columns with mixed dtypes and how to handle them (:issue:`13746`)
+
 .. _whatsnew_0190.api:
 
