Commit ba4c2ce

DOC: Added short commentary on alternatives
1 parent b6e2b64 commit ba4c2ce

1 file changed, +19 −14 lines

doc/source/io.rst
@@ -442,24 +442,24 @@ individual columns:
 
 .. note::
 
-    Reading in data with mixed dtypes and relying on ``pandas``
-    to infer them is not recommended. In doing so, the parsing engine will
-    infer the dtypes for different chunks of the data, rather than the whole
-    dataset at once. Consequently, you can end up with column(s) with mixed
-    dtypes. For example,
+    Reading in data with columns containing mixed dtypes and relying
+    on ``pandas`` to infer them is not recommended. In doing so, the
+    parsing engine will infer the dtypes for different chunks of the data,
+    rather than the whole dataset at once. Consequently, you can end up with
+    column(s) with mixed dtypes. For example,
 
     .. ipython:: python
        :okwarning:
 
        df = pd.DataFrame({'col_1':range(500000) + ['a', 'b'] + range(500000)})
        df.to_csv('foo')
        mixed_df = pd.read_csv('foo')
-       mixed_df['col_1'].apply(lambda x: type(x)).value_counts()
+       mixed_df['col_1'].apply(type).value_counts()
        mixed_df['col_1'].dtype
 
-    will result with `mixed_df` containing an ``int`` dtype for the first
-    262,143 values, and ``str`` for others due to a problem during
-    parsing. It is important to note that the overall column will be marked with a
+    will result in ``mixed_df`` containing an ``int`` dtype for certain chunks
+    of the column, and ``str`` for others due to a problem during parsing.
+    It is important to note that the overall column will be marked with a
     ``dtype`` of ``object``, which is used for columns with mixed dtypes.
 
     Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
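The chunked-inference behavior described above can be reproduced with a small Python 3 sketch (the diff's Python 2 ``range`` concatenation is adapted with ``list(...)``, and ``foo.csv`` is a throwaway filename, not one used by the docs):

```python
import pandas as pd

# A column that is mostly integers with two stray strings in the middle --
# large enough that the default C parser reads it in several chunks.
df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
df.to_csv('foo.csv', index=False)

# With the default low_memory=True, dtypes are inferred per chunk, so the
# parsed column can hold a mixture of int and str values.
mixed_df = pd.read_csv('foo.csv')

print(mixed_df['col_1'].dtype)                      # object
print(mixed_df['col_1'].apply(type).value_counts())
```

Because the two strings make the column unconvertible as a whole, the column dtype is always ``object``; exactly where the int/str boundary falls depends on the parser's internal chunk size.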
@@ -469,7 +469,7 @@ individual columns:
     .. ipython:: python
 
        fixed_df1 = pd.read_csv('foo', converters={'col_1':str})
-       fixed_df1['col_1'].apply(lambda x: type(x)).value_counts()
+       fixed_df1['col_1'].apply(type).value_counts()
 
     Or you could use the :func:`~pandas.to_numeric` function to coerce the
     dtypes after reading in the data,
@@ -479,9 +479,9 @@ individual columns:
 
        fixed_df2 = pd.read_csv('foo')
        fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
-       fixed_df2['col_1'].apply(lambda x: type(x)).value_counts()
+       fixed_df2['col_1'].apply(type).value_counts()
 
-    which would convert all valid parsing to ints, leaving the invalid parsing
+    which would convert all valid parsing to floats, leaving the invalid parsing
     as ``NaN``.
 
     Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv`
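The ``to_numeric`` coercion can be checked on a tiny in-memory series, without the CSV round-trip (the sample values here are made up for illustration):

```python
import pandas as pd

s = pd.Series(['1', '2', 'a', '3'])  # 'a' stands in for a data anomaly

# errors='coerce' replaces unparseable entries with NaN; because NaN is a
# float, the valid numbers are promoted to a float dtype as well.
coerced = pd.to_numeric(s, errors='coerce')
print(coerced.dtype)      # float64
print(coerced.tolist())   # [1.0, 2.0, nan, 3.0]
```

This is why the corrected prose says the valid values become floats rather than ints: the ``NaN`` placeholder forces the float promotion.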
@@ -490,9 +490,14 @@ individual columns:
     .. ipython:: python
 
        fixed_df3 = pd.read_csv('foo', low_memory=False)
-       fixed_df2['col_1'].apply(lambda x: type(x)).value_counts()
+       fixed_df3['col_1'].apply(type).value_counts()
+
+    Ultimately, how you deal with reading in columns containing mixed dtypes
+    depends on your specific needs. In the case above, if you wanted to ``NaN`` out
+    the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
+    However, if you wanted all of the data to be coerced, no matter the type, then
+    using the ``converters`` argument of :func:`~pandas.read_csv` would certainly work.
 
-    which achieves a similar result.
 Naming and Using Columns
 ''''''''''''''''''''''''
 