BUG: Ensure 'coerce' actually coerces datatypes #10265

Merged
merged 2 commits into from
Jul 14, 2015
29 changes: 21 additions & 8 deletions doc/source/basics.rst
Expand Up @@ -1522,23 +1522,31 @@ then the more *general* one will be used as the result of the operation.
object conversion
~~~~~~~~~~~~~~~~~

:meth:`~DataFrame.convert_objects` is a method to try to force conversion of types from the ``object`` dtype to other types.
To force conversion of specific types that are *number like*, e.g. could be a string that represents a number,
pass ``convert_numeric=True``. This will force strings and numbers alike to be numbers if possible, otherwise
they will be set to ``np.nan``.
.. note::

The syntax of :meth:`~DataFrame.convert_objects` changed in 0.17.0. See
:ref:`API changes <whatsnew_0170.api_breaking.convert_objects>`
for more details.

:meth:`~DataFrame.convert_objects` is a method to try to force conversion of
types from the ``object`` dtype to other types. To try converting specific
types that are *number like*, e.g. could be a string that represents a number,
pass ``numeric=True``. To force the conversion, add the keyword argument
``coerce=True``. This will force strings and number-like objects to be numbers if
possible, otherwise they will be set to ``np.nan``.

.. ipython:: python

df3['D'] = '1.'
df3['E'] = '1'
df3.convert_objects(convert_numeric=True).dtypes
df3.convert_objects(numeric=True).dtypes

# same, but specific dtype conversion
df3['D'] = df3['D'].astype('float16')
df3['E'] = df3['E'].astype('int32')
df3.dtypes

To force conversion to ``datetime64[ns]``, pass ``convert_dates='coerce'``.
To force conversion to ``datetime64[ns]``, pass ``datetime=True`` and ``coerce=True``.
This will convert any datetime-like object to dates, forcing other values to ``NaT``.
This might be useful if you are reading in data which is mostly dates,
but occasionally has non-dates intermixed and you want to represent as missing.
Expand All @@ -1550,10 +1558,15 @@ but occasionally has non-dates intermixed and you want to represent as missing.
'foo', 1.0, 1, pd.Timestamp('20010104'),
'20010105'], dtype='O')
s
s.convert_objects(convert_dates='coerce')
s.convert_objects(datetime=True, coerce=True)
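
The forced conversion to datetimes described above can be pictured with a small pure-Python sketch. It is not the pandas implementation: ``coerce_datetime`` is a hypothetical helper, ``None`` stands in for ``NaT``, only the ``%Y%m%d`` string format is parsed, and treating plain numbers as non-dates is an assumption made for this illustration.

```python
from datetime import datetime

NAT = None  # stand-in for pandas' NaT in this sketch


def coerce_datetime(values, fmt='%Y%m%d'):
    """Hypothetical helper: keep datetimes, parse date strings, null the rest."""
    out = []
    for v in values:
        if isinstance(v, datetime):
            # Already a datetime: keep it unchanged.
            out.append(v)
        elif isinstance(v, str):
            try:
                out.append(datetime.strptime(v, fmt))
            except ValueError:
                # Not a date in the assumed format: becomes missing.
                out.append(NAT)
        else:
            # Assumption for this sketch: any other type is treated as a non-date.
            out.append(NAT)
    return out


values = [datetime(2001, 1, 1), 'foo', 1.0, 1, '20010105']
result = coerce_datetime(values)
```

The point is only that coercion never fails on a bad value. Every position in the output is either a datetime or the missing-value marker.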

In addition, :meth:`~DataFrame.convert_objects` will attempt the *soft* conversion of any *object* dtypes, meaning that if all
Without passing ``coerce=True``, :meth:`~DataFrame.convert_objects` will attempt
*soft* conversion of any *object* dtypes, meaning that if all
the objects in a Series are of the same type, the Series will have that dtype.
Note that setting ``coerce=True`` does not *convert* arbitrary types to either
``datetime64[ns]`` or ``timedelta64[ns]``. For example, a series containing string
dates will not be converted to a series of datetimes. To convert between types,
see :ref:`converting to timestamps <timeseries.converting>`.

gotchas
~~~~~~~
66 changes: 66 additions & 0 deletions doc/source/whatsnew/v0.17.0.txt
Expand Up @@ -48,13 +48,79 @@ Backwards incompatible API changes

.. _whatsnew_0170.api_breaking.other:

.. _whatsnew_0170.api_breaking.convert_objects:

Changes to convert_objects
^^^^^^^^^^^^^^^^^^^^^^^^^^
- ``DataFrame.convert_objects`` keyword arguments have been shortened. (:issue:`10265`)

====================== =============
Old                    New
====================== =============
``convert_dates``      ``datetime``
``convert_numeric``    ``numeric``
``convert_timedeltas`` ``timedelta``
====================== =============

- Coercing types with ``DataFrame.convert_objects`` is now implemented using the
keyword argument ``coerce=True``. Previously types were coerced by setting a
keyword argument to ``'coerce'`` instead of ``True``, as in ``convert_dates='coerce'``.

.. ipython:: python

df = pd.DataFrame({'i': ['1','2'],
'f': ['apple', '4.2'],
's': ['apple','banana']})
df

The old usage of ``DataFrame.convert_objects`` used ``'coerce'`` along with the
type.

.. code-block:: python

In [2]: df.convert_objects(convert_numeric='coerce')

Now the ``coerce`` keyword must be explicitly used.

.. ipython:: python

df.convert_objects(numeric=True, coerce=True)
Member

This example seems a bit strange, as for numeric, the setting of coerce does not matter?

Contributor Author

Updated to include case where coerce matters

Contributor

Maybe add an example of what a usage with coerce did when you have 0 and 1 of the values being converted (e.g. no convert / convert). This is the example that could break code, so we want to make it prominent.

Contributor Author

Added one.


- In earlier versions of pandas, ``DataFrame.convert_objects`` would not coerce
numeric types when there were no values convertible to a numeric type. For example,

.. code-block:: python

In [1]: df = pd.DataFrame({'s': ['a','b']})
In [2]: df.convert_objects(convert_numeric='coerce')
Out[2]:
s
0 a
1 b

returns the original DataFrame with no conversion. This change alters
this behavior so that

.. ipython:: python

df = pd.DataFrame({'s': ['a','b']})
df.convert_objects(numeric=True, coerce=True)

converts all non-number-like strings to ``NaN``.

- In earlier versions of pandas, the default behavior was to try to convert
datetimes and timedeltas. The new default is for ``DataFrame.convert_objects``
to do nothing, and so it is necessary to pass at least one conversion target
in the method call.
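
The keyword changes listed above can be summarized as a small translation step. The sketch below is only an illustration of the mapping. ``translate_kwargs`` and ``RENAMED`` are hypothetical names and not part of pandas, and the helper does not restore the old defaults, so a call that relied on them still needs explicit conversion targets.

```python
# Old -> new keyword names, following the rename table in this section.
RENAMED = {
    'convert_dates': 'datetime',
    'convert_numeric': 'numeric',
    'convert_timedeltas': 'timedelta',
}


def translate_kwargs(old_kwargs):
    """Hypothetical helper: rewrite pre-0.17 keyword arguments to the new style."""
    new_kwargs = {}
    for name, value in old_kwargs.items():
        name = RENAMED.get(name, name)
        if value == 'coerce':
            # The old 'coerce' value splits into a boolean flag plus coerce=True.
            new_kwargs[name] = True
            new_kwargs['coerce'] = True
        else:
            new_kwargs[name] = value
    return new_kwargs


translated = translate_kwargs({'convert_numeric': 'coerce'})
# translated == {'numeric': True, 'coerce': True}
```

An old call that set more than one keyword to ``'coerce'`` has no direct equivalent, because the new implementation raises ``ValueError`` when ``coerce=True`` is combined with more than one conversion target.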


Other API Changes
^^^^^^^^^^^^^^^^^
- Enable writing Excel files in :ref:`memory <io.excel_writing_buffer>` using StringIO/BytesIO (:issue:`7074`)
- Enable serialization of lists and dicts to strings in ExcelWriter (:issue:`8188`)
- Allow passing ``kwargs`` to the interpolation methods (:issue:`10378`).
- Serialize metadata properties of subclasses of pandas objects (:issue:`10553`).


.. _whatsnew_0170.deprecations:

Deprecations
111 changes: 57 additions & 54 deletions pandas/core/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -1887,65 +1887,68 @@ def _maybe_box_datetimelike(value):

_values_from_object = lib.values_from_object

def _possibly_convert_objects(values, convert_dates=True,
convert_numeric=True,
convert_timedeltas=True):

def _possibly_convert_objects(values,
datetime=True,
numeric=True,
timedelta=True,
coerce=False,
copy=True):
""" if we have an object dtype, try to coerce dates and/or numbers """

# if we have passed in a list or scalar
conversion_count = sum((datetime, numeric, timedelta))
if conversion_count == 0:
import warnings
warnings.warn('Must explicitly pass type for conversion. Defaulting to '
'pre-0.17 behavior where datetime=True, numeric=True, '
Contributor

nice. I added this to the deprecation issue so can remove this at some point.

'timedelta=True and coerce=False', DeprecationWarning)
datetime = numeric = timedelta = True
coerce = False

if isinstance(values, (list, tuple)):
# List or scalar
values = np.array(values, dtype=np.object_)
if not hasattr(values, 'dtype'):
elif not hasattr(values, 'dtype'):
values = np.array([values], dtype=np.object_)

# convert dates
if convert_dates and values.dtype == np.object_:

# we take an aggressive stance and convert to datetime64[ns]
if convert_dates == 'coerce':
new_values = _possibly_cast_to_datetime(
values, 'M8[ns]', coerce=True)

# if we are all nans then leave me alone
if not isnull(new_values).all():
values = new_values

else:
values = lib.maybe_convert_objects(
values, convert_datetime=convert_dates)

# convert timedeltas
if convert_timedeltas and values.dtype == np.object_:

if convert_timedeltas == 'coerce':
from pandas.tseries.timedeltas import to_timedelta
values = to_timedelta(values, coerce=True)

# if we are all nans then leave me alone
if not isnull(new_values).all():
values = new_values

else:
values = lib.maybe_convert_objects(
values, convert_timedelta=convert_timedeltas)

# convert to numeric
if values.dtype == np.object_:
if convert_numeric:
try:
new_values = lib.maybe_convert_numeric(
values, set(), coerce_numeric=True)

# if we are all nans then leave me alone
if not isnull(new_values).all():
values = new_values

except:
pass
else:

# soft-conversion
values = lib.maybe_convert_objects(values)
elif not is_object_dtype(values.dtype):
# If not object, do not attempt conversion
Contributor

need to handle the copy arg here. All of the other conversions copy I think. (need a test for this as well)

Contributor Author

This function doesn't have a copy keyword - this is handled by the caller.

_possibly_convert_objects(values,datetime=True,numeric=True,
                              timedelta=True,coerce=False)

Contributor

This is a bug. It needs handling either in _possibly_convert_objects or here. In other words, fall through on the if/else, then copy if indicated. If some conversion is done, the result is already a new object, so a copy is only needed if you don't pass ANY options (and the copy flag is True) OR the result is all NaN after conversion (so nothing is actually converted), e.g. L1944 (below).

Contributor Author

I think the bug is in convert -- it is the last function to see copy=? but doesn't do anything with it.

Maybe copy needs to go down a level for this edge case. I don't really understand what copy=False is supposed to do in convert_objects. Is it acting like an inplace on a numpy array so that the same DataFrame is returned?

Contributor

yep could be

Contributor Author

Pushed copy one level deeper so I can trap these.

values = values.copy() if copy else values
return values

# If 1 flag is coerce, ensure 2 others are False
if coerce:
if conversion_count > 1:
raise ValueError("Only one of 'datetime', 'numeric' or "
"'timedelta' can be True when coerce=True.")

# Immediate return if coerce
if datetime:
return pd.to_datetime(values, coerce=True, box=False)
elif timedelta:
return pd.to_timedelta(values, coerce=True, box=False)
elif numeric:
return lib.maybe_convert_numeric(values, set(), coerce_numeric=True)

# Soft conversions
if datetime:
values = lib.maybe_convert_objects(values,
convert_datetime=datetime)

if timedelta and is_object_dtype(values.dtype):
# Object check to ensure only run if previous did not convert
values = lib.maybe_convert_objects(values,
convert_timedelta=timedelta)

if numeric and is_object_dtype(values.dtype):
try:
converted = lib.maybe_convert_numeric(values,
set(),
coerce_numeric=True)
# If all NaNs, then do not alter
values = converted if not isnull(converted).all() else values
Contributor

I think that maybe_convert_numeric should just do this (maybe need to make it an option though if its not necessary all over)

Contributor Author

The issue here is that if it isn't numeric and coerce=False, it will raise and do nothing, but if it is not numeric but is a timestamp then it will get nan-filled. This is why this logic is here.

values = values.copy() if copy else values
except:
pass

return values
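
The control flow of the rewritten function can be sketched on plain Python lists. This is a simplified stand-in and not the pandas code: ``check_flags``, ``to_float_or_nan`` and ``convert_numeric`` are hypothetical names, the fallback with a ``DeprecationWarning`` when no flag is set is left out, and only the numeric branch is modeled.

```python
import math


def check_flags(datetime=False, numeric=False, timedelta=False, coerce=False):
    """Mirror the flag validation: coerce=True allows only one target type."""
    count = sum((datetime, numeric, timedelta))
    if coerce and count > 1:
        raise ValueError("Only one of 'datetime', 'numeric' or 'timedelta' "
                         "can be True when coerce=True.")
    return count


def to_float_or_nan(value):
    """Parse one value as a float, returning NaN when that fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def convert_numeric(values, coerce=False):
    """Stand-in for the numeric branch, operating on a plain list."""
    converted = [to_float_or_nan(v) for v in values]
    if coerce:
        # Forced conversion: return immediately, even if everything is NaN.
        return converted
    # Soft conversion: if nothing converted, leave the input unaltered.
    if all(math.isnan(x) for x in converted):
        return list(values)
    return converted
```

In this sketch ``convert_numeric(['a', 'b'])`` returns the strings unchanged, while ``convert_numeric(['a', 'b'], coerce=True)`` returns two NaN values. That is the behavioral difference the whatsnew entry calls out.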

11 changes: 8 additions & 3 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3351,7 +3351,7 @@ def combine(self, other, func, fill_value=None, overwrite=True):
return self._constructor(result,
index=new_index,
columns=new_columns).convert_objects(
convert_dates=True,
datetime=True,
copy=False)

def combine_first(self, other):
Expand Down Expand Up @@ -3830,7 +3830,9 @@ def _apply_standard(self, func, axis, ignore_failures=False, reduce=True):

if axis == 1:
result = result.T
result = result.convert_objects(copy=False)
result = result.convert_objects(datetime=True,
timedelta=True,
copy=False)

else:

Expand Down Expand Up @@ -3958,7 +3960,10 @@ def append(self, other, ignore_index=False, verify_integrity=False):
combined_columns = self.columns.tolist() + self.columns.union(other.index).difference(self.columns).tolist()
other = other.reindex(combined_columns, copy=False)
other = DataFrame(other.values.reshape((1, len(other))),
index=index, columns=combined_columns).convert_objects()
index=index,
columns=combined_columns)
other = other.convert_objects(datetime=True, timedelta=True)

if not self.columns.equals(combined_columns):
self = self.reindex(columns=combined_columns)
elif isinstance(other, list) and not isinstance(other[0], DataFrame):
33 changes: 19 additions & 14 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -2433,22 +2433,26 @@ def copy(self, deep=True):
data = self._data.copy(deep=deep)
return self._constructor(data).__finalize__(self)

def convert_objects(self, convert_dates=True, convert_numeric=False,
convert_timedeltas=True, copy=True):
@deprecate_kwarg(old_arg_name='convert_dates', new_arg_name='datetime')
@deprecate_kwarg(old_arg_name='convert_numeric', new_arg_name='numeric')
@deprecate_kwarg(old_arg_name='convert_timedeltas', new_arg_name='timedelta')
def convert_objects(self, datetime=False, numeric=False,
timedelta=False, coerce=False, copy=True):
"""
Attempt to infer better dtype for object columns

Parameters
----------
convert_dates : boolean, default True
If True, convert to date where possible. If 'coerce', force
conversion, with unconvertible values becoming NaT.
convert_numeric : boolean, default False
If True, attempt to coerce to numbers (including strings), with
datetime : boolean, default False
If True, convert to date where possible.
numeric : boolean, default False
If True, attempt to convert to numbers (including strings), with
unconvertible values becoming NaN.
convert_timedeltas : boolean, default True
If True, convert to timedelta where possible. If 'coerce', force
conversion, with unconvertible values becoming NaT.
timedelta : boolean, default False
If True, convert to timedelta where possible.
coerce : boolean, default False
If True, force conversion with unconvertible values converted to
nulls (NaN or NaT)
copy : boolean, default True
If True, return a copy even if no copy is necessary (e.g. no
conversion was done). Note: This is meant for internal use, and
Expand All @@ -2459,9 +2463,10 @@ def convert_objects(self, convert_dates=True, convert_numeric=False,
converted : same as input object
"""
return self._constructor(
self._data.convert(convert_dates=convert_dates,
convert_numeric=convert_numeric,
convert_timedeltas=convert_timedeltas,
self._data.convert(datetime=datetime,
numeric=numeric,
timedelta=timedelta,
coerce=coerce,
copy=copy)).__finalize__(self)
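
The stacked ``deprecate_kwarg`` decorators keep the old keyword names working while steering callers to the new ones. A minimal sketch of that pattern follows. ``rename_kwarg`` and ``convert`` are hypothetical stand-ins written for this illustration; the real pandas decorator may differ in its warning category, message and extra options.

```python
import functools
import warnings


def rename_kwarg(old, new):
    """Hypothetical stand-in: accept `old` as a deprecated alias for `new`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old in kwargs:
                if new in kwargs:
                    raise TypeError("cannot pass both %r and %r" % (old, new))
                warnings.warn("%r is deprecated, use %r instead" % (old, new),
                              FutureWarning, stacklevel=2)
                # Forward the value under the new keyword name.
                kwargs[new] = kwargs.pop(old)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@rename_kwarg('convert_dates', 'datetime')
@rename_kwarg('convert_numeric', 'numeric')
def convert(datetime=False, numeric=False):
    """Toy function that just reports which flags it received."""
    return {'datetime': datetime, 'numeric': numeric}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = convert(convert_numeric=True)
```

A plain rename like this forwards the value untouched, so an old-style ``'coerce'`` string would reach the new keyword as a string rather than being split into a flag plus ``coerce=True``.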

#----------------------------------------------------------------------
Expand Down Expand Up @@ -2859,7 +2864,7 @@ def replace(self, to_replace=None, value=None, inplace=False, limit=None,
'{0!r}').format(type(to_replace).__name__)
raise TypeError(msg) # pragma: no cover

new_data = new_data.convert(copy=not inplace, convert_numeric=False)
new_data = new_data.convert(copy=not inplace, numeric=False)

if inplace:
self._update_inplace(new_data)