Commit 50c1ee8

Merge pull request #10265 from bashtage/enforce-coercion-conversion
BUG: Ensure 'coerce' actually coerces datatypes
2 parents 35c0863 + e9d6678 commit 50c1ee8

20 files changed, +356 -154 lines changed

doc/source/basics.rst

+21-8
@@ -1522,23 +1522,31 @@ then the more *general* one will be used as the result of the operation.
 object conversion
 ~~~~~~~~~~~~~~~~~

-:meth:`~DataFrame.convert_objects` is a method to try to force conversion of types from the ``object`` dtype to other types.
-To force conversion of specific types that are *number like*, e.g. could be a string that represents a number,
-pass ``convert_numeric=True``. This will force strings and numbers alike to be numbers if possible, otherwise
-they will be set to ``np.nan``.
+.. note::
+
+   The syntax of :meth:`~DataFrame.convert_objects` changed in 0.17.0. See
+   :ref:`API changes <whatsnew_0170.api_breaking.convert_objects>`
+   for more details.
+
+:meth:`~DataFrame.convert_objects` is a method to try to force conversion of
+types from the ``object`` dtype to other types. To try converting specific
+types that are *number like*, e.g. could be a string that represents a number,
+pass ``numeric=True``. To force the conversion, add the keyword argument
+``coerce=True``. This will force strings and number-like objects to be numbers if
+possible, otherwise they will be set to ``np.nan``.

 .. ipython:: python

    df3['D'] = '1.'
    df3['E'] = '1'
-   df3.convert_objects(convert_numeric=True).dtypes
+   df3.convert_objects(numeric=True).dtypes

    # same, but specific dtype conversion
    df3['D'] = df3['D'].astype('float16')
    df3['E'] = df3['E'].astype('int32')
    df3.dtypes

-To force conversion to ``datetime64[ns]``, pass ``convert_dates='coerce'``.
+To force conversion to ``datetime64[ns]``, pass ``datetime=True`` and ``coerce=True``.
 This will convert any datetime-like object to dates, forcing other values to ``NaT``.
 This might be useful if you are reading in data which is mostly dates,
 but occasionally has non-dates intermixed and you want to represent as missing.
@@ -1550,10 +1558,15 @@ but occasionally has non-dates intermixed and you want to represent as missing.
                   'foo', 1.0, 1, pd.Timestamp('20010104'),
                   '20010105'], dtype='O')
    s
-   s.convert_objects(convert_dates='coerce')
+   s.convert_objects(datetime=True, coerce=True)

-In addition, :meth:`~DataFrame.convert_objects` will attempt the *soft* conversion of any *object* dtypes, meaning that if all
+Without passing ``coerce=True``, :meth:`~DataFrame.convert_objects` will attempt
+*soft* conversion of any *object* dtypes, meaning that if all
 the objects in a Series are of the same type, the Series will have that dtype.
+Note that setting ``coerce=True`` does not *convert* arbitrary types to either
+``datetime64[ns]`` or ``timedelta64[ns]``. For example, a series containing string
+dates will not be converted to a series of datetimes. To convert between types,
+see :ref:`converting to timestamps <timeseries.converting>`.

 gotchas
 ~~~~~~~
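
Reviewer note: the revised text says that ``coerce=True`` forces values to numbers where possible and sets the rest to ``np.nan``. The pure-Python sketch below imitates only that rule. ``coerce_numeric_sketch`` is a hypothetical helper written for illustration; it is not part of pandas and does not reproduce pandas dtype handling.

```python
import math


def coerce_numeric_sketch(values):
    """Hypothetical helper: convert each value with float(), using NaN on failure."""
    out = []
    for value in values:
        try:
            out.append(float(value))
        except (TypeError, ValueError):
            # Anything float() rejects (e.g. 'apple' or None) becomes NaN.
            out.append(math.nan)
    return out


result = coerce_numeric_sketch(['1.', '1', 'apple', 4.2, None])
```

Here ``'1.'`` and ``'1'`` both convert, while ``'apple'`` and ``None`` end up as NaN, mirroring the "numbers if possible, otherwise ``np.nan``" wording in the documentation change.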

doc/source/whatsnew/v0.17.0.txt

+66
@@ -48,13 +48,79 @@ Backwards incompatible API changes

 .. _whatsnew_0170.api_breaking.other:

+.. _whatsnew_0170.api_breaking.convert_objects:
+Changes to convert_objects
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+- ``DataFrame.convert_objects`` keyword arguments have been shortened. (:issue:`10265`)
+
+  ===================== =============
+  Old                   New
+  ===================== =============
+  ``convert_dates``     ``datetime``
+  ``convert_numeric``   ``numeric``
+  ``convert_timedelta`` ``timedelta``
+  ===================== =============
+
+- Coercing types with ``DataFrame.convert_objects`` is now implemented using the
+  keyword argument ``coerce=True``. Previously types were coerced by setting a
+  keyword argument to ``'coerce'`` instead of ``True``, as in ``convert_dates='coerce'``.
+
+  .. ipython:: python
+
+     df = pd.DataFrame({'i': ['1','2'],
+                        'f': ['apple', '4.2'],
+                        's': ['apple','banana']})
+     df
+
+  The old usage of ``DataFrame.convert_objects`` used `'coerce'` along with the
+  type.
+
+  .. code-block:: python
+
+     In [2]: df.convert_objects(convert_numeric='coerce')
+
+  Now the ``coerce`` keyword must be explicitly used.
+
+  .. ipython:: python
+
+     df.convert_objects(numeric=True, coerce=True)
+
+- In earlier versions of pandas, ``DataFrame.convert_objects`` would not coerce
+  numeric types when there were no values convertible to a numeric type. For example,
+
+  .. code-block:: python
+
+     In [1]: df = pd.DataFrame({'s': ['a','b']})
+     In [2]: df.convert_objects(convert_numeric='coerce')
+     Out[2]:
+        s
+     0  a
+     1  b
+
+  returns the original DataFrame with no conversion. This change alters
+  this behavior so that
+
+  .. ipython:: python
+
+     pd.DataFrame({'s': ['a','b']})
+     df.convert_objects(numeric=True, coerce=True)
+
+  converts all non-number-like strings to ``NaN``.
+
+- In earlier versions of pandas, the default behavior was to try and convert
+  datetimes and timestamps. The new default is for ``DataFrame.convert_objects``
+  to do nothing, and so it is necessary to pass at least one conversion target
+  in the method call.
+
+
 Other API Changes
 ^^^^^^^^^^^^^^^^^
 - Enable writing Excel files in :ref:`memory <_io.excel_writing_buffer>` using StringIO/BytesIO (:issue:`7074`)
 - Enable serialization of lists and dicts to strings in ExcelWriter (:issue:`8188`)
 - Allow passing `kwargs` to the interpolation methods (:issue:`10378`).
 - Serialize metadata properties of subclasses of pandas objects (:issue:`10553`).

+
 .. _whatsnew_0170.deprecations:

 Deprecations
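
Reviewer note: the release notes describe two changes at once, namely the shortened keyword names and the move from a ``'coerce'`` string value to a separate ``coerce=True`` flag. The sketch below shows one way a caller might translate old-style keyword arguments. ``migrate_kwargs`` and ``OLD_TO_NEW`` are hypothetical names, not pandas API. The mapping uses ``convert_timedeltas``, the spelling in the old signature in ``pandas/core/generic.py``, whereas the table above lists ``convert_timedelta``.

```python
# Hypothetical mapping from the pre-0.17 keyword names to the new ones.
OLD_TO_NEW = {
    'convert_dates': 'datetime',
    'convert_numeric': 'numeric',
    'convert_timedeltas': 'timedelta',
}


def migrate_kwargs(old_kwargs):
    """Translate old-style convert_objects kwargs into the new style."""
    new_kwargs = {}
    for name, value in old_kwargs.items():
        new_name = OLD_TO_NEW.get(name, name)
        if value == 'coerce':
            # The old API overloaded the flag value; the new API splits it
            # into a boolean target plus an explicit coerce=True.
            new_kwargs[new_name] = True
            new_kwargs['coerce'] = True
        else:
            new_kwargs[new_name] = value
    return new_kwargs


migrated = migrate_kwargs({'convert_numeric': 'coerce'})
```

With this helper, ``convert_numeric='coerce'`` becomes ``numeric=True, coerce=True``, which matches the before and after calls shown in the release notes.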

pandas/core/common.py

+57-54
@@ -1887,65 +1887,68 @@ def _maybe_box_datetimelike(value):

 _values_from_object = lib.values_from_object

-def _possibly_convert_objects(values, convert_dates=True,
-                              convert_numeric=True,
-                              convert_timedeltas=True):
+
+def _possibly_convert_objects(values,
+                              datetime=True,
+                              numeric=True,
+                              timedelta=True,
+                              coerce=False,
+                              copy=True):
     """ if we have an object dtype, try to coerce dates and/or numbers """

-    # if we have passed in a list or scalar
+    conversion_count = sum((datetime, numeric, timedelta))
+    if conversion_count == 0:
+        import warnings
+        warnings.warn('Must explicitly pass type for conversion. Defaulting to '
+                      'pre-0.17 behavior where datetime=True, numeric=True, '
+                      'timedelta=True and coerce=False', DeprecationWarning)
+        datetime = numeric = timedelta = True
+        coerce = False
+
     if isinstance(values, (list, tuple)):
+        # List or scalar
         values = np.array(values, dtype=np.object_)
-    if not hasattr(values, 'dtype'):
+    elif not hasattr(values, 'dtype'):
         values = np.array([values], dtype=np.object_)
-
-    # convert dates
-    if convert_dates and values.dtype == np.object_:
-
-        # we take an aggressive stance and convert to datetime64[ns]
-        if convert_dates == 'coerce':
-            new_values = _possibly_cast_to_datetime(
-                values, 'M8[ns]', coerce=True)
-
-            # if we are all nans then leave me alone
-            if not isnull(new_values).all():
-                values = new_values
-
-        else:
-            values = lib.maybe_convert_objects(
-                values, convert_datetime=convert_dates)
-
-    # convert timedeltas
-    if convert_timedeltas and values.dtype == np.object_:
-
-        if convert_timedeltas == 'coerce':
-            from pandas.tseries.timedeltas import to_timedelta
-            values = to_timedelta(values, coerce=True)
-
-            # if we are all nans then leave me alone
-            if not isnull(new_values).all():
-                values = new_values
-
-        else:
-            values = lib.maybe_convert_objects(
-                values, convert_timedelta=convert_timedeltas)
-
-    # convert to numeric
-    if values.dtype == np.object_:
-        if convert_numeric:
-            try:
-                new_values = lib.maybe_convert_numeric(
-                    values, set(), coerce_numeric=True)
-
-                # if we are all nans then leave me alone
-                if not isnull(new_values).all():
-                    values = new_values
-
-            except:
-                pass
-        else:
-
-            # soft-conversion
-            values = lib.maybe_convert_objects(values)
+    elif not is_object_dtype(values.dtype):
+        # If not object, do not attempt conversion
+        values = values.copy() if copy else values
+        return values
+
+    # If 1 flag is coerce, ensure 2 others are False
+    if coerce:
+        if conversion_count > 1:
+            raise ValueError("Only one of 'datetime', 'numeric' or "
+                             "'timedelta' can be True when when coerce=True.")
+
+        # Immediate return if coerce
+        if datetime:
+            return pd.to_datetime(values, coerce=True, box=False)
+        elif timedelta:
+            return pd.to_timedelta(values, coerce=True, box=False)
+        elif numeric:
+            return lib.maybe_convert_numeric(values, set(), coerce_numeric=True)
+
+    # Soft conversions
+    if datetime:
+        values = lib.maybe_convert_objects(values,
+                                           convert_datetime=datetime)
+
+    if timedelta and is_object_dtype(values.dtype):
+        # Object check to ensure only run if previous did not convert
+        values = lib.maybe_convert_objects(values,
+                                           convert_timedelta=timedelta)
+
+    if numeric and is_object_dtype(values.dtype):
+        try:
+            converted = lib.maybe_convert_numeric(values,
+                                                  set(),
+                                                  coerce_numeric=True)
+            # If all NaNs, then do not-alter
+            values = converted if not isnull(converted).all() else values
+            values = values.copy() if copy else values
+        except:
+            pass

     return values

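
Reviewer note: the flag handling at the top of the new ``_possibly_convert_objects`` can be read in isolation. The sketch below restates that logic in plain Python, under the assumption that only the flag checks matter and no conversion is performed. ``resolve_flags`` is a hypothetical name used for illustration and does not exist in pandas.

```python
import warnings


def resolve_flags(datetime=True, numeric=True, timedelta=True, coerce=False):
    """Hypothetical restatement of the flag checks; performs no conversion."""
    conversion_count = sum((datetime, numeric, timedelta))
    if conversion_count == 0:
        # No target requested: warn and fall back to the pre-0.17 defaults.
        warnings.warn('Must explicitly pass type for conversion.',
                      DeprecationWarning)
        return True, True, True, False
    if coerce and conversion_count > 1:
        # Coercion is only well defined for a single target type.
        raise ValueError("Only one of 'datetime', 'numeric' or "
                         "'timedelta' can be True when coerce=True.")
    return datetime, numeric, timedelta, coerce


flags = resolve_flags(datetime=False, numeric=True, timedelta=False, coerce=True)
```

Requesting coercion with more than one target, for example ``resolve_flags(datetime=True, numeric=True, timedelta=False, coerce=True)``, raises ``ValueError`` in this sketch, as the corresponding branch of the diff does.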

pandas/core/frame.py

+8-3
@@ -3351,7 +3351,7 @@ def combine(self, other, func, fill_value=None, overwrite=True):
         return self._constructor(result,
                                  index=new_index,
                                  columns=new_columns).convert_objects(
-            convert_dates=True,
+            datetime=True,
             copy=False)

     def combine_first(self, other):
@@ -3830,7 +3830,9 @@ def _apply_standard(self, func, axis, ignore_failures=False, reduce=True):

             if axis == 1:
                 result = result.T
-            result = result.convert_objects(copy=False)
+            result = result.convert_objects(datetime=True,
+                                            timedelta=True,
+                                            copy=False)

         else:

@@ -3958,7 +3960,10 @@ def append(self, other, ignore_index=False, verify_integrity=False):
             combined_columns = self.columns.tolist() + self.columns.union(other.index).difference(self.columns).tolist()
             other = other.reindex(combined_columns, copy=False)
             other = DataFrame(other.values.reshape((1, len(other))),
-                              index=index, columns=combined_columns).convert_objects()
+                              index=index,
+                              columns=combined_columns)
+            other = other.convert_objects(datetime=True, timedelta=True)
+
             if not self.columns.equals(combined_columns):
                 self = self.reindex(columns=combined_columns)
         elif isinstance(other, list) and not isinstance(other[0], DataFrame):

pandas/core/generic.py

+19-14
@@ -2433,22 +2433,26 @@ def copy(self, deep=True):
         data = self._data.copy(deep=deep)
         return self._constructor(data).__finalize__(self)

-    def convert_objects(self, convert_dates=True, convert_numeric=False,
-                        convert_timedeltas=True, copy=True):
+    @deprecate_kwarg(old_arg_name='convert_dates', new_arg_name='datetime')
+    @deprecate_kwarg(old_arg_name='convert_numeric', new_arg_name='numeric')
+    @deprecate_kwarg(old_arg_name='convert_timedeltas', new_arg_name='timedelta')
+    def convert_objects(self, datetime=False, numeric=False,
+                        timedelta=False, coerce=False, copy=True):
         """
         Attempt to infer better dtype for object columns

         Parameters
         ----------
-        convert_dates : boolean, default True
-            If True, convert to date where possible. If 'coerce', force
-            conversion, with unconvertible values becoming NaT.
-        convert_numeric : boolean, default False
-            If True, attempt to coerce to numbers (including strings), with
+        datetime : boolean, default False
+            If True, convert to date where possible.
+        numeric : boolean, default False
+            If True, attempt to convert to numbers (including strings), with
             unconvertible values becoming NaN.
-        convert_timedeltas : boolean, default True
-            If True, convert to timedelta where possible. If 'coerce', force
-            conversion, with unconvertible values becoming NaT.
+        timedelta : boolean, default False
+            If True, convert to timedelta where possible.
+        coerce : boolean, default False
+            If True, force conversion with unconvertible values converted to
+            nulls (NaN or NaT)
         copy : boolean, default True
             If True, return a copy even if no copy is necessary (e.g. no
             conversion was done). Note: This is meant for internal use, and
@@ -2459,9 +2463,10 @@ def convert_objects(self, convert_dates=True, convert_numeric=False,
         converted : same as input object
         """
         return self._constructor(
-            self._data.convert(convert_dates=convert_dates,
-                               convert_numeric=convert_numeric,
-                               convert_timedeltas=convert_timedeltas,
+            self._data.convert(datetime=datetime,
+                               numeric=numeric,
+                               timedelta=timedelta,
+                               coerce=coerce,
                                copy=copy)).__finalize__(self)

     #----------------------------------------------------------------------
@@ -2859,7 +2864,7 @@ def replace(self, to_replace=None, value=None, inplace=False, limit=None,
                                 '{0!r}').format(type(to_replace).__name__)
                     raise TypeError(msg)  # pragma: no cover

-            new_data = new_data.convert(copy=not inplace, convert_numeric=False)
+            new_data = new_data.convert(copy=not inplace, numeric=False)

         if inplace:
             self._update_inplace(new_data)
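
Reviewer note: ``convert_objects`` now stacks three ``@deprecate_kwarg`` decorators so that the old keyword names keep working. The sketch below shows the general idea of such a renaming decorator. It is a simplified stand-in written for illustration, not the pandas implementation; ``rename_kwarg`` and ``convert`` are hypothetical names, and the warning category is an assumption.

```python
import functools
import warnings


def rename_kwarg(old_arg_name, new_arg_name):
    """Hypothetical decorator: accept old_arg_name as an alias for new_arg_name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old_arg_name in kwargs:
                if new_arg_name in kwargs:
                    raise TypeError('Pass only one of %r and %r'
                                    % (old_arg_name, new_arg_name))
                warnings.warn('%r is deprecated, use %r instead'
                              % (old_arg_name, new_arg_name),
                              FutureWarning, stacklevel=2)
                # Forward the value unchanged under the new name.
                kwargs[new_arg_name] = kwargs.pop(old_arg_name)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@rename_kwarg('convert_numeric', 'numeric')
def convert(numeric=False, coerce=False):
    # Stand-in for a method body; just echoes the resolved arguments.
    return {'numeric': numeric, 'coerce': coerce}
```

Calling ``convert(convert_numeric=True)`` emits a warning and behaves like ``convert(numeric=True)``.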
