Skip to content

ENH: support cut/qcut for datetime/timedelta (GH14714) #14737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Dec 3, 2016
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ Other enhancements

- ``pd.read_excel`` now preserves sheet order when using ``sheetname=None`` (:issue:`9930`)

- ``pd.cut`` and ``pd.qcut`` now support datetime64 and timedelta64 dtypes (issue:`14714`)

.. _whatsnew_0200.api_breaking:

Expand Down
31 changes: 31 additions & 0 deletions pandas/tools/tests/test_tile.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
from pandas.core.algorithms import quantile
from pandas.tools.tile import cut, qcut
import pandas.tools.tile as tmod
from pandas import to_datetime, DatetimeIndex


class TestCut(tm.TestCase):
Expand Down Expand Up @@ -283,6 +284,36 @@ def test_single_bin(self):
result = cut(s, 1, labels=False)
tm.assert_series_equal(result, expected)

def test_datetime_cut(self):
# GH 14714
# testing for time data to be present as series
data = to_datetime(Series(['2013-01-01', '2013-01-02', '2013-01-03']))
result, bins = cut(data, 3, retbins=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u also test:

  • list of np.datetime64
  • ndarray of datetime64 dtype
  • DatetimeIndex

Copy link
Contributor Author

@aileronajay aileronajay Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks I have added tests for list of np.datetime64 and ndarray. I did not get the DatetimeIndex part, to test this do we want to have a series/dataframe with datetime as the index and datetime as the data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks added test for DatetimeIndex cut in be9b2fd

expected = Series(['(2012-12-31 23:57:07.200000, 2013-01-01 16:00:00]',
'(2013-01-01 16:00:00, 2013-01-02 08:00:00]',
'(2013-01-02 08:00:00, 2013-01-03 00:00:00]'],
).astype("category", ordered=True)
tm.assert_series_equal(result, expected)

# testing for time data to be present as list
data = [np.datetime64('2013-01-01'), np.datetime64('2013-01-02'),
np.datetime64('2013-01-03')]
result, bins = cut(data, 3, retbins=True)
tm.assert_series_equal(Series(result), expected)

# testing for time data to be present as ndarray

data = np.array([np.datetime64('2013-01-01'),
np.datetime64('2013-01-02'),
np.datetime64('2013-01-03')])
result, bins = cut(data, 3, retbins=True)
tm.assert_series_equal(Series(result), expected)

# testing for time data to be present as datetime index
data = DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03'])
result, bins = cut(data, 3, retbins=True)
tm.assert_series_equal(Series(result), expected)


def curpath():
pth, _ = os.path.split(os.path.abspath(__file__))
Expand Down
115 changes: 88 additions & 27 deletions pandas/tools/tile.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,8 +11,9 @@
import pandas.core.algorithms as algos
import pandas.core.nanops as nanops
from pandas.compat import zip

from pandas import to_timedelta, to_datetime
import numpy as np
from pandas.types.common import is_datetime64_dtype, is_timedelta64_dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put this one above the numpy import? (to keep the pandas imports grouped like it was before)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this review comment



def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
Expand Down Expand Up @@ -81,6 +82,11 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
array([1, 1, 1, 1, 1], dtype=int64)
"""
# NOTE: this binning code is changed a bit from histogram for var(x) == 0
# for handling the cut for datetime and timedelta objects
x_is_series, series_index, name, x = _preprocess_for_cut(x)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to convert to array likes here because input can be a list. prepare a private function which can be used in cut and qcut.

Copy link
Contributor Author

@aileronajay aileronajay Nov 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks If i convert to array, would that preserve the index? @jorisvandenbossche has earlier commented that converting to array is not desirable as it would lead to loss of index information

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should preserve the index. So needs below logic before converting to array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly it will be easier to move the x_is_series logic up (to here, so you can then just deal with arrays)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche @sinhrks the logic that preserves the index is present in _bins_to_cuts method. Whereas the error that we get due to date/timedelta data is due to code that is present in the cut method. So we need to convert into int64 in the cut method itself. So i have two questions. The first is that do i need to convert it to array, that i originally started out doing by making a call to x.values.view('i8') ? if yes, do i need to move to index handling code to the cut method instead of _bins_to_cuts method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree preprocessing/postprocessing should be moved to cut/qcut from _bins_to_cuts. what on my mind is something like below.

Actually preprocessing / postprocessing can be wrapped as a private function to reduce code duplications.

def cut(...):
    x_is_series = ...
    ... 
    x = np.asarray(x)    
    if is_datetime64_dtype(...):
        ....

    (core logic)

    if x_is_series:
        ....
    return...

Copy link
Contributor Author

@aileronajay aileronajay Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks I have created two private methods, _preprocess_for_cut and _postprocess_for_cut to move the logic from _bins_to_cuts to cut and qcut methods, present in commit 95e5989

Copy link
Member

@sinhrks sinhrks Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the change. I think _preprocess_for_cut should be called before _coerce_to_type to handle list properly? Or these can be merged.

Copy link
Contributor Author

@aileronajay aileronajay Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i can merge _coerce_to_type code with _preprocess_for_cut and return an additional dtype parameter from the _preprocess_for_cut method. @jreback are there any concerns if i remove the _coerce_to_type code method and move the processing that is being done (by _coerce_to_type) to _preprocess_for_cut method? Also if i merge _coerce_to_type with _preprocess_for_cut method then the call to _preprocess_for_cut would have to be made at the beginning of the cut method so that by the time we reach the line that fails (mn, mx = [mi + 0.0 for mi in rng]), we have already converted time data to int64

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks i have moved the call to _preprocess_for_cut prior to coerce

original, x, dtype = _coerce_to_type(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think original is used? In that case, it's not needed to return it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this review comment


if not np.iterable(bins):
if is_scalar(bins) and bins < 1:
raise ValueError("`bins` should be a positive integer.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try ... except ... below should not be necessary anymore, since x is already an array (asarray is used in _preprocess_for_cut)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche you are referring to the try except in the pd.cut method, specifically these lines

    try:  # for array-like
        sz = x.size
    except AttributeError:
        x = np.asarray(x)
        sz = x.size

Is that correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes (I couldn't comment on those lines)
Normally it should already be assured before that you have an array here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche thanks, that is what i thought!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this review comment

Expand Down Expand Up @@ -114,9 +120,12 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
if (np.diff(bins) < 0).any():
raise ValueError('bins must increase monotonically.')

return _bins_to_cuts(x, bins, right=right, labels=labels,
retbins=retbins, precision=precision,
include_lowest=include_lowest)
fac, bins = _bins_to_cuts(x, bins, right=right, labels=labels,
retbins=retbins, precision=precision,
include_lowest=include_lowest, dtype=dtype)

return _postprocess_for_cut(fac, bins, retbins, x_is_series,
series_index, name)


def qcut(x, q, labels=None, retbins=False, precision=3):
Expand Down Expand Up @@ -166,26 +175,26 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
>>> pd.qcut(range(5), 4, labels=False)
array([0, 0, 1, 2, 3], dtype=int64)
"""
x_is_series, series_index, name, x = _preprocess_for_cut(x)

original, x, dtype = _coerce_to_type(x)

if is_integer(q):
quantiles = np.linspace(0, 1, q + 1)
else:
quantiles = q
bins = algos.quantile(x, quantiles)
return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,
precision=precision, include_lowest=True)

fac, bins = _bins_to_cuts(x, bins, labels=labels, retbins=retbins,
precision=precision, include_lowest=True,
dtype=dtype)

def _bins_to_cuts(x, bins, right=True, labels=None, retbins=False,
precision=3, name=None, include_lowest=False):
x_is_series = isinstance(x, Series)
series_index = None
return _postprocess_for_cut(fac, bins, retbins, x_is_series,
series_index, name)

if x_is_series:
series_index = x.index
if name is None:
name = x.name

x = np.asarray(x)
def _bins_to_cuts(x, bins, right=True, labels=None, retbins=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think retbins arg is not used anymore in this function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the same for name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this review comment

precision=3, name=None, include_lowest=False,
dtype=None):

side = 'left' if right else 'right'
ids = bins.searchsorted(x, side=side)
Expand All @@ -205,7 +214,8 @@ def _bins_to_cuts(x, bins, right=True, labels=None, retbins=False,
while True:
try:
levels = _format_levels(bins, precision, right=right,
include_lowest=include_lowest)
include_lowest=include_lowest,
dtype=dtype)
except ValueError:
increases += 1
precision += 1
Expand All @@ -229,18 +239,12 @@ def _bins_to_cuts(x, bins, right=True, labels=None, retbins=False,
fac = fac.astype(np.float64)
np.putmask(fac, na_mask, np.nan)

if x_is_series:
fac = Series(fac, index=series_index, name=name)

if not retbins:
return fac

return fac, bins


def _format_levels(bins, prec, right=True,
include_lowest=False):
fmt = lambda v: _format_label(v, precision=prec)
include_lowest=False, dtype=None):
fmt = lambda v: _format_label(v, precision=prec, dtype=dtype)
if right:
levels = []
for a, b in zip(bins, bins[1:]):
Expand All @@ -258,12 +262,16 @@ def _format_levels(bins, prec, right=True,
else:
levels = ['[%s, %s)' % (fmt(a), fmt(b))
for a, b in zip(bins, bins[1:])]

return levels


def _format_label(x, precision=3):
def _format_label(x, precision=3, dtype=None):
fmt_str = '%%.%dg' % precision

if is_datetime64_dtype(dtype):
return to_datetime(x, unit='ns')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would like to see thur format_labels so to do all labels in 1 go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback can you please explain, i am not getting what you are trying to say

if is_timedelta64_dtype(dtype):
return to_timedelta(x, unit='ns')
if np.isinf(x):
return str(x)
elif is_float(x):
Expand Down Expand Up @@ -300,3 +308,56 @@ def _trim_zeros(x):
if len(x) > 1 and x[-1] == '.':
x = x[:-1]
return x


def _coerce_to_type(x):
"""
if the passed data is of datetime/timedelta type,
this method converts it to integer so that cut method can
handle it
"""
dtype = None
original = x
if is_timedelta64_dtype(x):
x = to_timedelta(x).view(np.int64)
dtype = np.timedelta64

elif is_datetime64_dtype(x):
x = to_datetime(x).view(np.int64)
dtype = np.datetime64
return original, x, dtype


def _preprocess_for_cut(x):
"""
handles preprocessing for cut where we convert passed
input to array, strip the index information and store it
seperately
"""
x_is_series = isinstance(x, Series)
series_index = None

name = None
if x_is_series:
series_index = x.index
if name is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is not needed, as you defined name = None above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this review comment

name = x.name

x = np.asarray(x)

return x_is_series, series_index, name, x


def _postprocess_for_cut(fac, bins, retbins, x_is_series, series_index, name):
"""
handles post processing for the cut method where
we combine the index information if the originally passed
datatype was a series
"""
if x_is_series:
fac = Series(fac, index=series_index, name=name)

if not retbins:
return fac

return fac, bins