Commit b6e2b64

DOC: Switched Counter to value_counts, added low_memory alternative example, clarified type inference process

1 parent 335d043 commit b6e2b64

doc/source/io.rst (1 file changed, +22 -8)

@@ -444,41 +444,55 @@ individual columns:
 
 Reading in data with mixed dtypes and relying on ``pandas``
 to infer them is not recommended. In doing so, the parsing engine will
-loop over all the dtypes, trying to convert them to an actual
-type; if something breaks during that process, the engine will go to the
-next ``dtype`` and the data is left modified in place. For example,
+infer the dtypes for different chunks of the data, rather than the whole
+dataset at once. Consequently, you can end up with column(s) with mixed
+dtypes. For example,
 
 .. ipython:: python
+   :okwarning:
 
-   from collections import Counter
    df = pd.DataFrame({'col_1': range(500000) + ['a', 'b'] + range(500000)})
    df.to_csv('foo')
    mixed_df = pd.read_csv('foo')
-   Counter(mixed_df['col_1'].apply(lambda x: type(x)))
+   mixed_df['col_1'].apply(lambda x: type(x)).value_counts()
+   mixed_df['col_1'].dtype
 
 will result in ``mixed_df`` containing an ``int`` dtype for the first
 262,143 values, and ``str`` for the rest due to a problem during
-parsing. Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
+parsing. It is important to note that the overall column will be marked with a
+``dtype`` of ``object``, which is used for columns with mixed dtypes.
+
+Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
 contain only one ``dtype``. For instance, you could use the ``converters``
 argument of :func:`~pandas.read_csv`
 
 .. ipython:: python
 
    fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
-   Counter(fixed_df1['col_1'].apply(lambda x: type(x)))
+   fixed_df1['col_1'].apply(lambda x: type(x)).value_counts()
 
 Or you could use the :func:`~pandas.to_numeric` function to coerce the
 dtypes after reading in the data,
 
 .. ipython:: python
+   :okwarning:
 
    fixed_df2 = pd.read_csv('foo')
    fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
-   Counter(fixed_df2['col_1'].apply(lambda x: type(x)))
+   fixed_df2['col_1'].apply(lambda x: type(x)).value_counts()
 
 which would convert all valid parsing to ints, leaving the invalid parsing
 as ``NaN``.
 
+Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv`
+to ``False``. For example,
+
+.. ipython:: python
+
+   fixed_df3 = pd.read_csv('foo', low_memory=False)
+   fixed_df3['col_1'].apply(lambda x: type(x)).value_counts()
+
+which achieves a similar result.
 
 Naming and Using Columns
 ''''''''''''''''''''''''
 
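Besides the ``converters``, ``to_numeric``, and ``low_memory`` approaches shown in the diff, a column's type can also be declared up front with the ``dtype`` argument of ``read_csv``, which skips inference for that column entirely. The sketch below is not part of the commit; it rebuilds the example file from the diff (the file name ``foo``, the column ``col_1``, and the variable ``fixed_df4`` are illustrative, and the ``range`` results are wrapped in ``list`` since the original snippet is Python 2 syntax):

```python
import pandas as pd

# Rebuild the example data from the diff; list() is needed on Python 3,
# where range() no longer supports concatenation with +.
df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
df.to_csv('foo', index=False)

# Declaring the dtype up front avoids chunk-wise inference altogether,
# so every value in col_1 comes back as a plain string.
fixed_df4 = pd.read_csv('foo', dtype={'col_1': str})
print(fixed_df4['col_1'].apply(lambda x: type(x)).value_counts())
```

Like the ``converters`` example, this yields a single-``dtype`` column, but the declaration happens in the parser itself rather than through a post-parse conversion function.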