@@ -444,41 +444,55 @@ individual columns:

Reading in data with mixed dtypes and relying on ``pandas``
to infer them is not recommended. In doing so, the parsing engine will
- loop over all the dtypes, trying to convert them to an actual
- type; if something breaks during that process, the engine will go to the
- next ``dtype`` and the data is left modified in place. For example,
+ infer the dtypes for different chunks of the data, rather than the whole
+ dataset at once. Consequently, you can end up with column(s) with mixed
+ dtypes. For example,

.. ipython:: python
+    :okwarning:

-    from collections import Counter
   df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
   df.to_csv('foo')
   mixed_df = pd.read_csv('foo')
-    Counter(mixed_df['col_1'].apply(lambda x: type(x)))
+    mixed_df['col_1'].apply(lambda x: type(x)).value_counts()
+    mixed_df['col_1'].dtype

will result in ``mixed_df`` containing an ``int`` dtype for the first
262,143 values, and ``str`` for the others, due to a problem during
- parsing. Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
+ parsing. It is important to note that the overall column will be marked with a
+ ``dtype`` of ``object``, which is used for columns with mixed dtypes.
+
+ Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
contain only one ``dtype``. For instance, you could use the ``converters``
argument of :func:`~pandas.read_csv`

.. ipython:: python

   fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
-    Counter(fixed_df1['col_1'].apply(lambda x: type(x)))
+    fixed_df1['col_1'].apply(lambda x: type(x)).value_counts()

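As a small, self-contained illustration of the ``converters`` approach (a sketch only, not part of the patch): the in-memory CSV below is a made-up miniature stand-in for ``foo``, and the names ``csv_text``, ``small_df`` and ``value_types`` are hypothetical. Passing ``str`` as the converter should leave every value in ``col_1`` as a string, whether or not it looks numeric.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for the 'foo' file used in the docs.
csv_text = "col_1\n1\n2\na\nb\n3\n"

# The converter is applied to every raw cell in col_1, so no type
# inference is left to the parser for that column.
small_df = pd.read_csv(io.StringIO(csv_text), converters={"col_1": str})

# Collect the distinct Python types found in the column.
value_types = {type(v) for v in small_df["col_1"]}
print(value_types)
```

The trade-off is that numeric-looking values stay as strings, so any arithmetic on the column needs an explicit conversion afterwards.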
Or you could use the :func:`~pandas.to_numeric` function to coerce the
dtypes after reading in the data,

.. ipython:: python
+    :okwarning:

   fixed_df2 = pd.read_csv('foo')
   fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
-    Counter(fixed_df2['col_1'].apply(lambda x: type(x)))
+    fixed_df2['col_1'].apply(lambda x: type(x)).value_counts()

which would convert all valid parsing to floats, leaving the invalid parsing
as ``NaN``.

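A minimal sketch of the coercion step on a tiny, made-up CSV (the names ``csv_text``, ``small_df`` and ``coerced`` are illustrative and not from the patch). Because ``NaN`` is a floating-point value, a column that contains coerced entries is expected to come back with a floating-point dtype rather than an integer one.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for the 'foo' file used in the docs.
csv_text = "col_1\n1\n2\na\n3\n"
small_df = pd.read_csv(io.StringIO(csv_text))

# 'a' cannot be parsed as a number, so errors='coerce' replaces it
# with a missing value instead of raising.
coerced = pd.to_numeric(small_df["col_1"], errors="coerce")

print(coerced.dtype)
print(coerced.isna().sum())
```

Unlike ``converters``, this approach silently discards the unparseable entries, so it suits cases where those values are known to be noise.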
+ Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv`
+ to ``False``. For example,
+
+ .. ipython:: python
+
+    fixed_df3 = pd.read_csv('foo', low_memory=False)
+    fixed_df3['col_1'].apply(lambda x: type(x)).value_counts()
+
+ which achieves a similar result.
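The following sketch reproduces the ``low_memory=False`` case without touching the filesystem. It assumes the same shape of data as the ``foo`` example (integers with two strings in the middle), built in memory; ``values``, ``csv_text``, ``whole_df`` and ``value_types`` are hypothetical names. With the whole column inferred at once, every value should end up as a string.

```python
import io

import pandas as pd

# Same shape of data as the 'foo' example: integers with two strings
# in the middle. Built in memory here, which is an assumption made
# purely to keep the sketch self-contained.
values = list(range(500000)) + ["a", "b"] + list(range(500000))
csv_text = "col_1\n" + "\n".join(str(v) for v in values) + "\n"

# low_memory=False asks the parser to infer each column's dtype from
# the whole file rather than chunk by chunk.
whole_df = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Collect the distinct Python types found in the column.
value_types = {type(v) for v in whole_df["col_1"]}
print(value_types)
```

This costs more memory during parsing, which is the reason chunked inference is the default.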
Naming and Using Columns
''''''''''''''''''''''''