@@ -440,6 +440,45 @@ individual columns:
Specifying ``dtype`` with ``engine`` other than 'c' raises a
``ValueError``.
+ .. note::
+
+    Reading in data with mixed dtypes and relying on ``pandas``
+    to infer them is not recommended. In doing so, the parsing engine will
+    loop over all the dtypes, trying to convert them to an actual
+    type; if something breaks during that process, the engine will go to the
+    next ``dtype`` and the data is left modified in place. For example,
+
+    .. ipython:: python
+
+        from collections import Counter
+        df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
+        df.to_csv('foo')
+        mixed_df = pd.read_csv('foo')
+        Counter(mixed_df['col_1'].apply(lambda x: type(x)))
+
+    will result in ``mixed_df`` containing an ``int`` dtype for the first
+    262,143 values, and ``str`` for the others due to a problem during
+    parsing. Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
+    contain only one ``dtype``. For instance, you could use the ``converters``
+    argument of :func:`~pandas.read_csv`:
+
+    .. ipython:: python
+
+        fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
+        Counter(fixed_df1['col_1'].apply(lambda x: type(x)))
+
+    Or you could use the :func:`~pandas.to_numeric` function to coerce the
+    dtypes after reading in the data,
+
+    .. ipython:: python
+
+        fixed_df2 = pd.read_csv('foo')
+        fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
+        Counter(fixed_df2['col_1'].apply(lambda x: type(x)))
+
+    which would convert all valid parsing to floats, leaving the invalid parsing
+    as ``NaN``.
+
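+    One more way, sketched here with the purely illustrative name
+    ``fixed_df3``, would be to pass the ``dtype`` argument discussed above so
+    that :func:`~pandas.read_csv` reads the column in as ``str`` from the
+    start:
+
+    .. ipython:: python
+
+        fixed_df3 = pd.read_csv('foo', dtype={'col_1': str})
+        Counter(fixed_df3['col_1'].apply(lambda x: type(x)))
+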
Naming and Using Columns
''''''''''''''''''''''''