@@ -442,24 +442,24 @@ individual columns:
 
 .. note::
 
-    Reading in data with mixed dtypes and relying on ``pandas``
-    to infer them is not recommended. In doing so, the parsing engine will
-    infer the dtypes for different chunks of the data, rather than the whole
-    dataset at once. Consequently, you can end up with column(s) with mixed
-    dtypes. For example,
+    Reading in data with columns containing mixed dtypes and relying
+    on ``pandas`` to infer them is not recommended. In doing so, the
+    parsing engine will infer the dtypes for different chunks of the data,
+    rather than the whole dataset at once. Consequently, you can end up with
+    column(s) with mixed dtypes. For example,
 
     .. ipython:: python
        :okwarning:
 
        df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
        df.to_csv('foo')
        mixed_df = pd.read_csv('foo')
-       mixed_df['col_1'].apply(lambda x: type(x)).value_counts()
+       mixed_df['col_1'].apply(type).value_counts()
        mixed_df['col_1'].dtype
 
-    will result with `mixed_df` containing an ``int`` dtype for the first
-    262,143 values, and ``str`` for others due to a problem during
-    parsing. It is important to note that the overall column will be marked with a
+    will result in ``mixed_df`` containing an ``int`` dtype for certain chunks
+    of the column, and ``str`` for others due to a problem during parsing.
+    It is important to note that the overall column will be marked with a
     ``dtype`` of ``object``, which is used for columns with mixed dtypes.
 
     Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
@@ -469,7 +469,7 @@ individual columns:
 
     .. ipython:: python
 
        fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
-       fixed_df1['col_1'].apply(lambda x: type(x)).value_counts()
+       fixed_df1['col_1'].apply(type).value_counts()
 
     Or you could use the :func:`~pandas.to_numeric` function to coerce the
     dtypes after reading in the data,
@@ -479,9 +479,9 @@ individual columns:
 
        fixed_df2 = pd.read_csv('foo')
        fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
-       fixed_df2['col_1'].apply(lambda x: type(x)).value_counts()
+       fixed_df2['col_1'].apply(type).value_counts()
 
-    which would convert all valid parsing to ints, leaving the invalid parsing
+    which would convert all valid parsing to floats, leaving the invalid parsing
     as ``NaN``.
 
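The float-not-int behaviour the corrected sentence describes is easy to confirm on a small inline CSV (again an illustrative stand-in for ``foo``): ``errors='coerce'`` turns the unparseable value into ``NaN``, which forces the whole column to ``float64``.

```python
import io
import pandas as pd

csv = io.StringIO("col_1\n1\n2\na\n3\n")
df = pd.read_csv(csv)

# errors='coerce' turns anything unparseable ('a') into NaN; the presence
# of NaN means the valid values come back as floats, not ints.
df['col_1'] = pd.to_numeric(df['col_1'], errors='coerce')

print(df['col_1'])        # 1.0, 2.0, NaN, 3.0
print(df['col_1'].dtype)  # float64
```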
     Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv`
@@ -490,9 +490,14 @@ individual columns:
 
     .. ipython:: python
 
        fixed_df3 = pd.read_csv('foo', low_memory=False)
-       fixed_df2['col_1'].apply(lambda x: type(x)).value_counts()
+       fixed_df3['col_1'].apply(type).value_counts()
+
+    Ultimately, how you deal with reading in columns containing mixed dtypes
+    depends on your specific needs. In the case above, if you wanted to ``NaN`` out
+    the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
+    However, if you wanted all the data to be coerced, no matter the type, then
+    using the ``converters`` argument of :func:`~pandas.read_csv` would certainly work.
 
-    which achieves a similar result.
 
 Naming and Using Columns
 ''''''''''''''''''''''''