
Commit 0727803

bashtage (Kevin Sheppard) authored and Kevin Sheppard committed
BUG: Ensure 'coerce' actually coerces datatypes
Changes the behavior of convert_objects so that passing 'coerce' ensures that data of the correct type is returned, even if all values are null-types (NaN or NaT). Closes #9589.
1 parent 35c0863 commit 0727803

19 files changed, +301 -153 lines changed

doc/source/basics.rst

+17 -8
@@ -1522,23 +1522,29 @@ then the more *general* one will be used as the result of the operation.
15221522
object conversion
15231523
~~~~~~~~~~~~~~~~~
15241524

1525-
:meth:`~DataFrame.convert_objects` is a method to try to force conversion of types from the ``object`` dtype to other types.
1526-
To force conversion of specific types that are *number like*, e.g. could be a string that represents a number,
1527-
pass ``convert_numeric=True``. This will force strings and numbers alike to be numbers if possible, otherwise
1528-
they will be set to ``np.nan``.
1525+
.. note::
1526+
1527+
The syntax of :meth:`~DataFrame.convert_objects` changed in 0.17.0.
1528+
1529+
:meth:`~DataFrame.convert_objects` is a method to try to force conversion of
1530+
types from the ``object`` dtype to other types. To try converting specific
1531+
types that are *number like*, e.g. could be a string that represents a number,
1532+
pass ``numeric=True``. To force the conversion, add the keyword argument
1533+
``coerce=True``. This will force strings and numbers alike to be numbers if
1534+
possible, otherwise they will be set to ``np.nan``.
15291535

15301536
.. ipython:: python
15311537
15321538
df3['D'] = '1.'
15331539
df3['E'] = '1'
1534-
df3.convert_objects(convert_numeric=True).dtypes
1540+
df3.convert_objects(numeric=True).dtypes
15351541
15361542
# same, but specific dtype conversion
15371543
df3['D'] = df3['D'].astype('float16')
15381544
df3['E'] = df3['E'].astype('int32')
15391545
df3.dtypes
15401546
1541-
To force conversion to ``datetime64[ns]``, pass ``convert_dates='coerce'``.
1547+
To force conversion to ``datetime64[ns]``, pass ``datetime=True`` and ``coerce=True``.
15421548
This will convert any datetime-like object to dates, forcing other values to ``NaT``.
15431549
This might be useful if you are reading in data which is mostly dates,
15441550
but occasionally has non-dates intermixed and you want to represent as missing.
@@ -1550,10 +1556,13 @@ but occasionally has non-dates intermixed and you want to represent as missing.
15501556
'foo', 1.0, 1, pd.Timestamp('20010104'),
15511557
'20010105'], dtype='O')
15521558
s
1553-
s.convert_objects(convert_dates='coerce')
1559+
s.convert_objects(datetime=True, coerce=True)
15541560
1555-
In addition, :meth:`~DataFrame.convert_objects` will attempt the *soft* conversion of any *object* dtypes, meaning that if all
1561+
Without passing ``coerce=True``, :meth:`~DataFrame.convert_objects` will attempt
1562+
the *soft* conversion of any *object* dtypes, meaning that if all
15561563
the objects in a Series are of the same type, the Series will have that dtype.
1564+
Setting ``coerce=True`` will not *convert* - for example, a series of string
1565+
dates will not be converted to a series of datetimes.
15571566

15581567
gotchas
15591568
~~~~~~~
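
The datetime coercion described in the basics.rst hunk above can be illustrated without pandas. The following is a minimal stdlib-only sketch, not the library's implementation: `coerce_datetimes` is a hypothetical helper, `None` stands in for ``NaT``, and treating plain numbers as non-dates is an assumption of this sketch.

```python
from datetime import datetime


def coerce_datetimes(values, fmt="%Y%m%d"):
    """Hypothetical helper: force every element to a datetime or None.

    None plays the role of NaT. Numbers are treated as non-dates,
    which is an assumption of this sketch.
    """
    result = []
    for value in values:
        if isinstance(value, datetime):
            # Already datetime-like: keep as-is.
            result.append(value)
        elif isinstance(value, str):
            try:
                result.append(datetime.strptime(value, fmt))
            except ValueError:
                # Unparseable string becomes the missing marker.
                result.append(None)
        else:
            result.append(None)
    return result


mixed = [datetime(2001, 1, 1), "foo", 1.0, 1, datetime(2001, 1, 4), "20010105"]
print(coerce_datetimes(mixed))
```

The point of the sketch is that every element of the result is either a datetime or the missing marker, so the output has one uniform type even when most inputs are not dates.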

doc/source/whatsnew/v0.17.0.txt

+39
@@ -48,13 +48,52 @@ Backwards incompatible API changes
4848

4949
.. _whatsnew_0170.api_breaking.other:
5050

51+
Changes to convert_objects
52+
^^^^^^^^^^^^^^^^^^^^^^^^^^
53+
- ``DataFrame.convert_objects`` keyword arguments have been shortened. (:issue:`10265`)
54+
55+
====================== =============
56+
Old                    New
57+
====================== =============
58+
``convert_dates``      ``datetime``
59+
``convert_numeric``    ``numeric``
60+
``convert_timedeltas`` ``timedelta``
61+
====================== =============
62+
63+
- Coercing types with ``DataFrame.convert_objects`` is now implemented using the
64+
keyword argument ``coerce=True``. Previously types were coerced by setting a
65+
keyword argument to ``'coerce'`` instead of ``True``, as in ``convert_dates='coerce'``.
66+
67+
.. ipython:: python
68+
69+
df = pd.DataFrame({'i': ['1','2'], 'f': ['apple', '4.2']})
70+
df
71+
72+
The old usage of ``DataFrame.convert_objects`` used ``'coerce'`` along with the
73+
type.
74+
75+
.. code-block:: python
76+
77+
In [2]: df.convert_objects(convert_numeric='coerce')
78+
79+
Now the ``coerce`` keyword must be explicitly used.
80+
81+
.. ipython:: python
82+
83+
df.convert_objects(numeric=True, coerce=True)
84+
85+
- The new default behavior for ``DataFrame.convert_objects`` is to do nothing,
86+
and so it is necessary to pass at least one conversion target when calling.
87+
88+
5189
Other API Changes
5290
^^^^^^^^^^^^^^^^^
5391
- Enable writing Excel files in :ref:`memory <_io.excel_writing_buffer>` using StringIO/BytesIO (:issue:`7074`)
5492
- Enable serialization of lists and dicts to strings in ExcelWriter (:issue:`8188`)
5593
- Allow passing `kwargs` to the interpolation methods (:issue:`10378`).
5694
- Serialize metadata properties of subclasses of pandas objects (:issue:`10553`).
5795

96+
5897
.. _whatsnew_0170.deprecations:
5998

6099
Deprecations
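
The keyword changes listed in the whatsnew entry above can be summarized as a small translation step. This is a hedged, stdlib-only sketch: `translate_kwargs` and `RENAMED` are hypothetical names, not part of pandas, and the old names are taken from the previous `convert_objects` signature shown in the generic.py hunk.

```python
# Old keyword name -> new keyword name, per the previous signature.
RENAMED = {
    "convert_dates": "datetime",
    "convert_numeric": "numeric",
    "convert_timedeltas": "timedelta",
}


def translate_kwargs(old_kwargs):
    """Hypothetical helper: map old-style keyword arguments to the new style.

    An old value of 'coerce' becomes <new name>=True plus coerce=True.
    """
    new_kwargs = {}
    for name, value in old_kwargs.items():
        new_name = RENAMED.get(name, name)
        if value == "coerce":
            new_kwargs[new_name] = True
            new_kwargs["coerce"] = True
        else:
            new_kwargs[new_name] = value
    return new_kwargs


print(translate_kwargs({"convert_numeric": "coerce"}))
print(translate_kwargs({"convert_dates": True, "copy": False}))
```

Keywords that were not renamed, such as `copy`, pass through unchanged.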

pandas/core/common.py

+52 -54
@@ -1887,65 +1887,63 @@ def _maybe_box_datetimelike(value):
18871887

18881888
_values_from_object = lib.values_from_object
18891889

1890-
def _possibly_convert_objects(values, convert_dates=True,
1891-
convert_numeric=True,
1892-
convert_timedeltas=True):
1890+
1891+
def _possibly_convert_objects(values,
1892+
datetime=True,
1893+
numeric=True,
1894+
timedelta=True,
1895+
coerce=False):
18931896
""" if we have an object dtype, try to coerce dates and/or numbers """
18941897

1895-
# if we have passed in a list or scalar
1898+
conversion_count = sum((datetime, numeric, timedelta))
1899+
if conversion_count == 0:
1900+
import warnings
1901+
warnings.warn('Must explicitly pass type for conversion. Original '
1902+
'value returned.', RuntimeWarning)
1903+
return values
1904+
18961905
if isinstance(values, (list, tuple)):
1906+
# List or scalar
18971907
values = np.array(values, dtype=np.object_)
1898-
if not hasattr(values, 'dtype'):
1908+
elif not hasattr(values, 'dtype'):
18991909
values = np.array([values], dtype=np.object_)
1900-
1901-
# convert dates
1902-
if convert_dates and values.dtype == np.object_:
1903-
1904-
# we take an aggressive stance and convert to datetime64[ns]
1905-
if convert_dates == 'coerce':
1906-
new_values = _possibly_cast_to_datetime(
1907-
values, 'M8[ns]', coerce=True)
1908-
1909-
# if we are all nans then leave me alone
1910-
if not isnull(new_values).all():
1911-
values = new_values
1912-
1913-
else:
1914-
values = lib.maybe_convert_objects(
1915-
values, convert_datetime=convert_dates)
1916-
1917-
# convert timedeltas
1918-
if convert_timedeltas and values.dtype == np.object_:
1919-
1920-
if convert_timedeltas == 'coerce':
1921-
from pandas.tseries.timedeltas import to_timedelta
1922-
values = to_timedelta(values, coerce=True)
1923-
1924-
# if we are all nans then leave me alone
1925-
if not isnull(new_values).all():
1926-
values = new_values
1927-
1928-
else:
1929-
values = lib.maybe_convert_objects(
1930-
values, convert_timedelta=convert_timedeltas)
1931-
1932-
# convert to numeric
1933-
if values.dtype == np.object_:
1934-
if convert_numeric:
1935-
try:
1936-
new_values = lib.maybe_convert_numeric(
1937-
values, set(), coerce_numeric=True)
1938-
1939-
# if we are all nans then leave me alone
1940-
if not isnull(new_values).all():
1941-
values = new_values
1942-
1943-
except:
1944-
pass
1945-
else:
1946-
1947-
# soft-conversion
1948-
values = lib.maybe_convert_objects(values)
1910+
elif not is_object_dtype(values.dtype):
1911+
# If not object, do not attempt conversion
1912+
return values
1913+
1914+
# If 1 flag is coerce, ensure 2 others are False
1915+
if coerce:
1916+
if conversion_count > 1:
1917+
raise ValueError("Only one of 'datetime', 'numeric' or "
1918+
"'timedelta' can be True when coerce=True.")
1919+
1920+
# Immediate return if coerce
1921+
if datetime:
1922+
return pd.to_datetime(values, coerce=True, box=False)
1923+
elif timedelta:
1924+
return pd.to_timedelta(values, coerce=True, box=False)
1925+
elif numeric:
1926+
return lib.maybe_convert_numeric(values, set(), coerce_numeric=True)
1927+
1928+
# Soft conversions
1929+
if datetime:
1930+
values = lib.maybe_convert_objects(values,
1931+
convert_datetime=datetime)
1932+
1933+
if timedelta and is_object_dtype(values.dtype):
1934+
# Object check to ensure only run if previous did not convert
1935+
values = lib.maybe_convert_objects(values,
1936+
convert_timedelta=timedelta)
1937+
1938+
if numeric and is_object_dtype(values.dtype):
1939+
try:
1940+
converted = lib.maybe_convert_numeric(values,
1941+
set(),
1942+
coerce_numeric=True)
1943+
# If all NaNs, then do not alter
1944+
values = converted if not isnull(converted).all() else values
1945+
except:
1946+
pass
19491947

19501948
return values
19511949
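
The flag handling in the rewritten `_possibly_convert_objects` can be sketched in isolation. The function below is a simplified, stdlib-only illustration, not the pandas code: `plan_conversions` is a hypothetical name, and it only reports which conversions would be attempted, skipping the list and dtype checks that the real function performs first.

```python
import warnings


def plan_conversions(datetime=True, numeric=True, timedelta=True, coerce=False):
    """Hypothetical helper: list the conversion steps implied by the flags."""
    # Same order as the soft-conversion path in the hunk above.
    requested = [
        name
        for name, flag in (
            ("datetime", datetime),
            ("timedelta", timedelta),
            ("numeric", numeric),
        )
        if flag
    ]
    if not requested:
        # No target type requested: warn and do nothing.
        warnings.warn(
            "Must explicitly pass type for conversion. Original value returned.",
            RuntimeWarning,
        )
        return []
    if coerce:
        # Coercion forces a single target type, so only one flag may be set.
        if len(requested) > 1:
            raise ValueError(
                "Only one of 'datetime', 'numeric' or 'timedelta' "
                "can be True when coerce=True."
            )
        return [("coerce", requested[0])]
    return [("soft", name) for name in requested]


print(plan_conversions(datetime=True, numeric=False, timedelta=False, coerce=True))
print(plan_conversions())
```

With `coerce=True` the plan has exactly one step, which matches the immediate return in the coerce branch. Without it, each requested soft conversion is tried in turn.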

pandas/core/frame.py

+8 -3
@@ -3351,7 +3351,7 @@ def combine(self, other, func, fill_value=None, overwrite=True):
33513351
return self._constructor(result,
33523352
index=new_index,
33533353
columns=new_columns).convert_objects(
3354-
convert_dates=True,
3354+
datetime=True,
33553355
copy=False)
33563356

33573357
def combine_first(self, other):
@@ -3830,7 +3830,9 @@ def _apply_standard(self, func, axis, ignore_failures=False, reduce=True):
38303830

38313831
if axis == 1:
38323832
result = result.T
3833-
result = result.convert_objects(copy=False)
3833+
result = result.convert_objects(datetime=True,
3834+
timedelta=True,
3835+
copy=False)
38343836

38353837
else:
38363838

@@ -3958,7 +3960,10 @@ def append(self, other, ignore_index=False, verify_integrity=False):
39583960
combined_columns = self.columns.tolist() + self.columns.union(other.index).difference(self.columns).tolist()
39593961
other = other.reindex(combined_columns, copy=False)
39603962
other = DataFrame(other.values.reshape((1, len(other))),
3961-
index=index, columns=combined_columns).convert_objects()
3963+
index=index,
3964+
columns=combined_columns)
3965+
other = other.convert_objects(datetime=True, timedelta=True)
3966+
39623967
if not self.columns.equals(combined_columns):
39633968
self = self.reindex(columns=combined_columns)
39643969
elif isinstance(other, list) and not isinstance(other[0], DataFrame):

pandas/core/generic.py

+19 -14
@@ -2433,22 +2433,26 @@ def copy(self, deep=True):
24332433
data = self._data.copy(deep=deep)
24342434
return self._constructor(data).__finalize__(self)
24352435

2436-
def convert_objects(self, convert_dates=True, convert_numeric=False,
2437-
convert_timedeltas=True, copy=True):
2436+
@deprecate_kwarg(old_arg_name='convert_dates', new_arg_name='datetime')
2437+
@deprecate_kwarg(old_arg_name='convert_numeric', new_arg_name='numeric')
2438+
@deprecate_kwarg(old_arg_name='convert_timedeltas', new_arg_name='timedelta')
2439+
def convert_objects(self, datetime=False, numeric=False,
2440+
timedelta=False, coerce=False, copy=True):
24382441
"""
24392442
Attempt to infer better dtype for object columns
24402443
24412444
Parameters
24422445
----------
2443-
convert_dates : boolean, default True
2444-
If True, convert to date where possible. If 'coerce', force
2445-
conversion, with unconvertible values becoming NaT.
2446-
convert_numeric : boolean, default False
2447-
If True, attempt to coerce to numbers (including strings), with
2446+
datetime : boolean, default False
2447+
If True, convert to date where possible.
2448+
numeric : boolean, default False
2449+
If True, attempt to convert to numbers (including strings), with
24482450
unconvertible values becoming NaN.
2449-
convert_timedeltas : boolean, default True
2450-
If True, convert to timedelta where possible. If 'coerce', force
2451-
conversion, with unconvertible values becoming NaT.
2451+
timedelta : boolean, default False
2452+
If True, convert to timedelta where possible.
2453+
coerce : boolean, default False
2454+
If True, force conversion with unconvertible values converted to
2455+
nulls (NaN or NaT)
24522456
copy : boolean, default True
24532457
If True, return a copy even if no copy is necessary (e.g. no
24542458
conversion was done). Note: This is meant for internal use, and
@@ -2459,9 +2463,10 @@ def convert_objects(self, convert_dates=True, convert_numeric=False,
24592463
converted : same as input object
24602464
"""
24612465
return self._constructor(
2462-
self._data.convert(convert_dates=convert_dates,
2463-
convert_numeric=convert_numeric,
2464-
convert_timedeltas=convert_timedeltas,
2466+
self._data.convert(datetime=datetime,
2467+
numeric=numeric,
2468+
timedelta=timedelta,
2469+
coerce=coerce,
24652470
copy=copy)).__finalize__(self)
24662471

24672472
#----------------------------------------------------------------------
@@ -2859,7 +2864,7 @@ def replace(self, to_replace=None, value=None, inplace=False, limit=None,
28592864
'{0!r}').format(type(to_replace).__name__)
28602865
raise TypeError(msg) # pragma: no cover
28612866

2862-
new_data = new_data.convert(copy=not inplace, convert_numeric=False)
2867+
new_data = new_data.convert(copy=not inplace, numeric=False)
28632868

28642869
if inplace:
28652870
self._update_inplace(new_data)
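
The stacked `deprecate_kwarg` decorators in the generic.py hunk keep the old keyword names working while callers migrate. The sketch below shows the general idea with a hypothetical stand-in, `rename_kwarg`. It is not pandas' implementation, and the warning category and message are assumptions.

```python
import functools
import warnings


def rename_kwarg(old_arg_name, new_arg_name):
    """Hypothetical stand-in: accept a renamed keyword under its old name."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old_arg_name in kwargs:
                if new_arg_name in kwargs:
                    raise TypeError(
                        "Can only specify one of %r or %r"
                        % (old_arg_name, new_arg_name)
                    )
                warnings.warn(
                    "the %r keyword is deprecated, use %r instead"
                    % (old_arg_name, new_arg_name),
                    FutureWarning,
                    stacklevel=2,
                )
                # Forward the value under the new name.
                kwargs[new_arg_name] = kwargs.pop(old_arg_name)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@rename_kwarg("convert_dates", "datetime")
@rename_kwarg("convert_numeric", "numeric")
def convert(datetime=False, numeric=False):
    """Toy function that simply echoes the flags it received."""
    return {"datetime": datetime, "numeric": numeric}


print(convert(convert_numeric=True))
```

Stacking one decorator per renamed keyword, as the hunk does, lets each rename be handled independently.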

pandas/core/groupby.py

+20 -12
@@ -111,7 +111,7 @@ def f(self):
111111
except Exception:
112112
result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
113113
if _convert:
114-
result = result.convert_objects()
114+
result = result.convert_objects(datetime=True)
115115
return result
116116

117117
f.__doc__ = "Compute %s of group values" % name
@@ -2700,7 +2700,7 @@ def aggregate(self, arg, *args, **kwargs):
27002700
self._insert_inaxis_grouper_inplace(result)
27012701
result.index = np.arange(len(result))
27022702

2703-
return result.convert_objects()
2703+
return result.convert_objects(datetime=True)
27042704

27052705
def _aggregate_multiple_funcs(self, arg):
27062706
from pandas.tools.merge import concat
@@ -2939,18 +2939,25 @@ def _wrap_applied_output(self, keys, values, not_indexed_same=False):
29392939

29402940
# if we have date/time like in the original, then coerce dates
29412941
# as we are stacking can easily have object dtypes here
2942-
if (self._selected_obj.ndim == 2
2943-
and self._selected_obj.dtypes.isin(_DATELIKE_DTYPES).any()):
2944-
cd = 'coerce'
2942+
if (self._selected_obj.ndim == 2 and
2943+
self._selected_obj.dtypes.isin(_DATELIKE_DTYPES).any()):
2944+
result = result.convert_objects(numeric=True)
2945+
date_cols = self._selected_obj.select_dtypes(
2946+
include=list(_DATELIKE_DTYPES)).columns
2947+
result[date_cols] = (result[date_cols]
2948+
.convert_objects(datetime=True,
2949+
coerce=True))
29452950
else:
2946-
cd = True
2947-
result = result.convert_objects(convert_dates=cd)
2951+
result = result.convert_objects(datetime=True)
2952+
29482953
return self._reindex_output(result)
29492954

29502955
else:
29512956
# only coerce dates if we find at least 1 datetime
2952-
cd = 'coerce' if any([ isinstance(v,Timestamp) for v in values ]) else False
2953-
return Series(values, index=key_index).convert_objects(convert_dates=cd)
2957+
coerce = True if any([ isinstance(v,Timestamp) for v in values ]) else False
2958+
return (Series(values, index=key_index)
2959+
.convert_objects(datetime=True,
2960+
coerce=coerce))
29542961

29552962
else:
29562963
# Handle cases like BinGrouper
@@ -3053,7 +3060,8 @@ def transform(self, func, *args, **kwargs):
30533060
if any(counts == 0):
30543061
results = self._try_cast(results, obj[result.columns])
30553062

3056-
return DataFrame(results,columns=result.columns,index=obj.index).convert_objects()
3063+
return (DataFrame(results,columns=result.columns,index=obj.index)
3064+
.convert_objects(datetime=True))
30573065

30583066
def _define_paths(self, func, *args, **kwargs):
30593067
if isinstance(func, compat.string_types):
@@ -3246,7 +3254,7 @@ def _wrap_aggregated_output(self, output, names=None):
32463254
if self.axis == 1:
32473255
result = result.T
32483256

3249-
return self._reindex_output(result).convert_objects()
3257+
return self._reindex_output(result).convert_objects(datetime=True)
32503258

32513259
def _wrap_agged_blocks(self, items, blocks):
32523260
if not self.as_index:
@@ -3264,7 +3272,7 @@ def _wrap_agged_blocks(self, items, blocks):
32643272
if self.axis == 1:
32653273
result = result.T
32663274

3267-
return self._reindex_output(result).convert_objects()
3275+
return self._reindex_output(result).convert_objects(datetime=True)
32683276

32693277
def _reindex_output(self, result):
32703278
"""
