API: add infer_objects for soft conversions #16915

Merged
merged 4 commits into from
Jul 18, 2017
Changes from 1 commit
29 changes: 29 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -25,6 +25,35 @@ New features
- Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,
and :class:`~pandas.ExcelWriter` to work properly with the file system path protocol (:issue:`13823`)


Contributor

add a sub-section reference

``infer_objects`` type conversion
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``DataFrame`` and ``Series`` have gained an ``infer_objects`` method to perform dtype inference
on object columns, replacing some of the functionality of the deprecated ``convert_objects``
Contributor

maybe want to make these :func:`Series.infer_objects` (and the like), but needs rewording to do that

function (:issue:`11221`)

This function only performs soft conversions on object columns, converting python scalars
to native types, but not any coercive conversions. For example
Member

python --> Python
for example --> For example:

Contributor

python scalars -> python objects

Contributor

use colon after For example


.. ipython:: python

   df = pd.DataFrame({'A': [1, 2, 3],
                      'B': np.array([1, 2, 3], dtype='object'),
                      'C': ['1', '2', '3']})
   df.dtypes
Contributor

should have this same example in the docs (in basics.rst) and a ref to that.

   df.infer_objects().dtypes

Note that column ``'C'`` was not converted - only scalar numeric types
will be inferred to a new type. Other types of conversion should be accomplished
using the :func:`to_numeric` function (or :func:`to_datetime`, :func:`to_timedelta`).

.. ipython:: python

   df = df.infer_objects()
   df['C'] = pd.to_numeric(df['C'], errors='coerce')
   df.dtypes

.. _whatsnew_0210.enhancements.other:

Other Enhancements
48 changes: 45 additions & 3 deletions pandas/core/generic.py
@@ -3634,16 +3634,58 @@ def convert_objects(self, convert_dates=True, convert_numeric=False,
        converted : same as input object
        """
        from warnings import warn
        warn("convert_objects is deprecated. Use the data-type specific "
             "converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.",
             FutureWarning, stacklevel=2)
        msg = ("convert_objects is deprecated. To re-infer data dtypes for "
               "object columns, use {klass}.infer_objects()\nFor all "
               "other conversions use the data-type specific converters "
               "pd.to_datetime, pd.to_timedelta and pd.to_numeric."
               ).format(klass=self.__class__.__name__)
        warn(msg, FutureWarning, stacklevel=2)

        return self._constructor(
            self._data.convert(convert_dates=convert_dates,
                               convert_numeric=convert_numeric,
                               convert_timedeltas=convert_timedeltas,
                               copy=copy)).__finalize__(self)

    def infer_objects(self):
        """
        Attempt to infer better dtypes for only object columns
        Only attempts soft conversions
Contributor

add a version added tag


Member

The beginning of the docstring should be a single line.

        See Also
Contributor

Maybe a note that this is the same as the inference used when a DataFrame is constructed?

        --------
        pandas.to_datetime : Convert argument to datetime.
        pandas.to_timedelta : Convert argument to timedelta.
        pandas.to_numeric : Convert argument to numeric type.

        Returns
        -------
        converted : same as input object
Member (@gfyoung, Jul 13, 2017)

Not quite though, as the dtypes could have been changed, right?
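
To make the point of this comment concrete, here is a minimal sketch (not part of the PR) of what "same as input object" does and does not cover. It assumes a pandas version that provides `infer_objects`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Object-dtype column holding plain Python integers.
original = pd.DataFrame({"A": np.array([1, 2, 3], dtype=object)})
converted = original.infer_objects()

# The container type and the values are unchanged...
same_type = type(converted) is type(original)
same_values = converted["A"].tolist() == original["A"].tolist()

# ...but the column dtype is re-inferred; the input frame is left untouched.
dtype_before = original["A"].dtype
dtype_after = converted["A"].dtype
print(same_type, same_values, dtype_before, dtype_after)
```

So the return value is the same kind of container, while its dtypes may differ from the input's, which is the distinction the comment asks the docstring to spell out.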


        Examples
        --------
        >>> df = pd.DataFrame({"A": ["a", 1, 2, 3]})
        >>> df = df.iloc[1:]
        >>> df
           A
        1  1
        2  2
        3  3
        >>> df.dtypes
        A    object
        dtype: object
        >>> df.infer_objects().dtypes
Contributor

I like blank lines between these sections of code

        A    int64
        dtype: object
        """
        # numeric=False necessary to only soft convert;
        # python objects will still be converted to
        # native numpy numeric types
        return self._constructor(
            self._data.convert(datetime=True, numeric=False,
                               timedelta=True, coerce=False,
                               copy=True)).__finalize__(self)

    # ----------------------------------------------------------------------
    # Filling NA's

22 changes: 22 additions & 0 deletions pandas/tests/frame/test_block_internals.py
@@ -495,6 +495,28 @@ def test_convert_objects_no_conversion(self):
        mixed2 = mixed1._convert(datetime=True)
        assert_frame_equal(mixed1, mixed2)

    def test_infer_objects(self):
        # GH 11221
        df = DataFrame({'a': ['a', 1, 2, 3],
                        'b': ['b', 2.0, 3.0, 4.1],
                        'c': ['c', datetime(2016, 1, 1),
                              datetime(2016, 1, 2),
                              datetime(2016, 1, 3)]},
                       columns=['a', 'b', 'c'])
        df = df.iloc[1:].infer_objects()

        assert df['a'].dtype == 'int64'
        assert df['b'].dtype == 'float64'
        assert df['c'].dtype == 'M8[ns]'

        expected = DataFrame({'a': [1, 2, 3],
                              'b': [2.0, 3.0, 4.1],
                              'c': [datetime(2016, 1, 1),
                                    datetime(2016, 1, 2),
                                    datetime(2016, 1, 3)]},
                             columns=['a', 'b', 'c'])
        tm.assert_frame_equal(df.reset_index(drop=True), expected)
Member

What is the purpose of this part of the test?

Contributor Author

This enforces that the inference rules I'm checking against are the same as DataFrame constructor

Member (@gfyoung, Jul 17, 2017)

Ah, I see. Perhaps a comment above expected might be useful, since it wasn't immediately clear (at least to me) why this was done.
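
The intent discussed in this thread can be sketched outside the test suite roughly as follows. This is an illustration, not the PR's test: it uses the public `pd.testing.assert_frame_equal` rather than the internal `tm` helper, and it leaves out the datetime column so the example does not depend on a particular datetime resolution.

```python
import pandas as pd

# The leading string forces object dtype for both columns at construction.
mixed = pd.DataFrame({"a": ["a", 1, 2, 3], "b": ["b", 2.0, 3.0, 4.1]})

# Dropping the string row leaves numeric values stored in object columns.
sliced = mixed.iloc[1:]
inferred = sliced.infer_objects().reset_index(drop=True)

# Building the frame directly from the clean values lets the constructor infer.
constructed = pd.DataFrame({"a": [1, 2, 3], "b": [2.0, 3.0, 4.1]})

# Raises AssertionError if values or dtypes differ between the two frames.
pd.testing.assert_frame_equal(inferred, constructed)
```

Comparing against a freshly constructed frame, instead of only asserting literal dtype strings, ties the check to whatever rules the constructor applies.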


    def test_stale_cached_series_bug_473(self):

        # this is chained, but ok
16 changes: 16 additions & 0 deletions pandas/tests/series/test_dtypes.py
@@ -268,3 +268,19 @@ def test_series_to_categorical(self):
        expected = Series(['a', 'b', 'c'], dtype='category')

        tm.assert_series_equal(result, expected)

    def test_infer_objects_series(self):
        # GH 11221
        actual = Series(np.array([1, 2, 3], dtype='O')).infer_objects()
        expected = Series([1, 2, 3])
        tm.assert_series_equal(actual, expected)

        actual = Series(np.array([1, 2, 3, None], dtype='O')).infer_objects()
        expected = Series([1., 2., 3., np.nan])
        tm.assert_series_equal(actual, expected)

        actual = (Series(np.array([1, 2, 3, None, 'a'], dtype='O'))
                  .infer_objects())
        expected = Series([1, 2, 3, None, 'a'])
        assert actual.dtype == 'object'
Member

Perhaps a comment above that block of code explaining why this is the case would be useful.

Member (@gfyoung, Jul 13, 2017)

Also, why isn't there a similar test for DataFrame (i.e. one where we actually get object) ?

Member

Nit: add newline between "expected =" and "assert " (for readability)
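
A DataFrame counterpart to the case raised in this thread might look like the sketch below. It is a hypothetical illustration, not a test from the PR; the column names are invented, and the string column is checked only for being non-numeric because its exact dtype depends on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # Python ints stored as objects: a soft conversion applies.
    "ints": np.array([1, 2, 3], dtype=object),
    # Ints mixed with a string: no single better dtype exists.
    "mixed": np.array([1, 2, "a"], dtype=object),
    # Numeric-looking strings: converting these would be coercive.
    "digits": np.array(["1", "2", "3"], dtype=object),
})
result = df.infer_objects()

ints_dtype = result["ints"].dtype
mixed_dtype = result["mixed"].dtype
digits_dtype = result["digits"].dtype

# Strings become numbers only through an explicit converter.
explicit = pd.to_numeric(result["digits"], errors="coerce")
print(ints_dtype, mixed_dtype, digits_dtype, explicit.dtype)
```

The mixed column is the DataFrame analogue of the `[1, 2, 3, None, 'a']` Series case: inference has nothing better to offer, so the column stays `object`.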

        tm.assert_series_equal(actual, expected)