ENH: support cut/qcut for datetime/timedelta (GH14714) #14737

aileronajay · 2016-11-25T06:43:10Z

closes ENH: enable pd.cut to handle i8 convertibles #14714
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

aileronajay · 2016-11-25T06:44:19Z

aileronajay · 2016-11-25T06:47:03Z

@jreback created this PR to verify if i am heading in the right direction with this change

jorisvandenbossche · 2016-11-25T08:21:25Z

@aileronajay Roughly this looks in the right direction, yes. But we will also want this to work with datetime and not only timedelta values, so you will have to pass the dtype somehow to distinguish between the two when making the labels.
Another problem is that you now assume the input is a Series (you access .values), while cut also works for arrays. You also convert the Series to an array and in this way lose the index information.

I would also encourage you to already add tests with some examples and the desired results, so you know what you are working towards.

jreback · 2016-11-25T13:23:50Z

pandas/tools/tile.py

+    # for handling the cut for datetime and timedelta objects
+    if needs_i8_conversion(x):
+        x = x.values.view('i8')
+        time_data = True


I would pass the dtype instead

dtype = x.dtype

then you can act on it below

actually what I would do instead here is pass a formatter

if is_timedelta64_dtype(x):
    x = x.values.view('i8')
    formatter = lambda x: pd.to_timedelta(float(x), unit='ns')
elif is_datetime64_dtype(x):
    x = x.values.view('i8')
    ......
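A minimal sketch of the formatter idea above, assuming the bin edges have already been computed as int64 nanoseconds. The `edges` values and the `format_edge` helper are illustrative names, not code from the PR.

```python
import numpy as np
import pandas as pd

# Hypothetical bin edges, held as int64 nanoseconds after the view('i8') step.
edges = np.array([0, 1500, 3000], dtype=np.int64)


def format_edge(value):
    # Convert one integer edge back to a Timedelta for display.
    return pd.to_timedelta(int(value), unit='ns')


# Build right-closed interval labels from consecutive pairs of edges.
labels = ['(%s, %s]' % (format_edge(a), format_edge(b))
          for a, b in zip(edges[:-1], edges[1:])]
```

The binning itself can then stay purely numeric, and only the label step needs to know that the values were timedeltas.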

@jreback the x.dtype call errors out when I run nosetests on test_tile.py, with the message AttributeError: 'list' object has no attribute 'dtype'. The formatter that I create in the suggested change will be used in lieu of the formatter in the downstream code, in the method _format_levels. Is that correct?

use np.asarray
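A small illustration of why np.asarray helps here, assuming the failing input is a plain list of np.datetime64 values (the variable names are only for this example):

```python
import numpy as np

values = [np.datetime64('2013-01-01'), np.datetime64('2013-01-02')]

# A plain list has no dtype attribute, which is the AttributeError reported above.
has_dtype_before = hasattr(values, 'dtype')

# After conversion the dtype can be inspected like that of any other array.
arr = np.asarray(values)
kind = arr.dtype.kind  # 'M' marks a datetime64 dtype
```

Note that the resulting array has day resolution (datetime64[D]) here, so code that expects nanoseconds would still need an explicit conversion.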

jreback · 2016-11-25T13:25:20Z

pandas/tools/tile.py

@@ -249,16 +259,24 @@ def _format_levels(bins, prec, right=True,
            if a != b and fa == fb:
                raise ValueError('precision too low')

-            formatted = '(%s, %s]' % (fa, fb)
+            if time_data:


look at pandas/core/algorithms.py/factorize for how we handle the differing dtypes. we want to replicate that type of logic (what you are doing is fine, but the code will be more smooth).

@jreback i have moved this handling down to the formatter in the latest iteration of this code

jreback · 2016-11-25T13:27:22Z

pls add some tests

codecov-io · 2016-11-25T16:37:15Z

Current coverage is 85.27% (diff: 93.02%)

No coverage report found for master at 725453d.

Powered by Codecov. Last update 725453d...be9b2fd

aileronajay · 2016-11-25T19:56:42Z

@jorisvandenbossche "And that you convert a series to an array, and in this way lose the index information."
After making the call to cut, will the output still be related to the initial index? I thought that cut splits the data into ranges and that the result does not keep the order/index of the original input.

jorisvandenbossche · 2016-11-25T20:47:54Z

@aileronajay As you can see, yes, the return value of cut retains the index:

In [18]: s = pd.Series(range(5))

In [19]: s
Out[19]: 
0    0
1    1
2    2
3    3
4    4
dtype: int64

In [20]: pd.cut(s, bins=2)
Out[20]: 
0    (-0.004, 2]
1    (-0.004, 2]
2    (-0.004, 2]
3         (2, 4]
4         (2, 4]
dtype: category
Categories (2, object): [(-0.004, 2] < (2, 4]]

Have a look for x_is_series how this is handled currently (eg here: https://github.com/aileronajay/pandas/blob/1b2dd2428d6ed2371cac89e56d474941c7225653/pandas/tools/tile.py#L243)

aileronajay · 2016-11-25T21:17:51Z

@jorisvandenbossche @jreback I have changed how I convert the time objects to integers, so I now get a Series instead of an array: x = x.astype(np.int64). It is similar to what is done in the to_numeric method in pandas.tools.util. This is the output that I get now:

s = pd.Series(pd.to_timedelta(np.random.randint(0,10000,size=10),unit='ns')).sort_values()
pd.cut(s,6)
1 (0 days 00:00:00.000000, 0 days 00:00:00.000001]
9 (0 days 00:00:00.000000, 0 days 00:00:00.000001]
7 (0 days 00:00:00.000001, 0 days 00:00:00.000003]
4 (0 days 00:00:00.000004, 0 days 00:00:00.000006]
2 (0 days 00:00:00.000004, 0 days 00:00:00.000006]
0 (0 days 00:00:00.000004, 0 days 00:00:00.000006]
3 (0 days 00:00:00.000004, 0 days 00:00:00.000006]
5 (0 days 00:00:00.000006, 0 days 00:00:00.000007]
8 (0 days 00:00:00.000006, 0 days 00:00:00.000007]
6 (0 days 00:00:00.000007, 0 days 00:00:00.000008]

aileronajay · 2016-11-28T04:03:51Z

@jreback i have moved the extra processing to the format method

sinhrks · 2016-11-28T06:39:29Z

pandas/tools/tile.py

+
+    dtype = None
+    if is_timedelta64_dtype(x):
+        x = x.astype(np.int64)


pls use view(np.int64)
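A short sketch of the difference between the two calls, shown on a plain NumPy array rather than the Series used in the PR. view reinterprets the existing buffer, while astype allocates a new array.

```python
import numpy as np

td = np.array([1, 2, 3], dtype='timedelta64[ns]')

viewed = td.view(np.int64)    # same buffer, reinterpreted as integers
copied = td.astype(np.int64)  # new array holding the same numbers

# The view shares memory with the original; the astype result does not.
shares = np.shares_memory(td, viewed)

# Viewing back restores the original timedelta values without a copy.
restored = viewed.view('timedelta64[ns]')
```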

@sinhrks incorporated this change

sinhrks · 2016-11-28T06:43:59Z

pandas/tools/tile.py

@@ -81,6 +83,17 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
    array([1, 1, 1, 1, 1], dtype=int64)
    """
    # NOTE: this binning code is changed a bit from histogram for var(x) == 0
+    # for handling the cut for datetime and timedelta objects
+


better to convert to array likes here because input can be a list. prepare a private function which can be used in cut and qcut.

@sinhrks If i convert to array, would that preserve the index? @jorisvandenbossche has earlier commented that converting to array is not desirable as it would lead to loss of index information

Yes, we should preserve the index. So needs below logic before converting to array.

https://github.com/aileronajay/pandas/blob/b21fa237537ff225fca955355000a7b2fc2a2bb4/pandas/tools/tile.py#L204

Possibly it will be easier to move the x_is_series logic up (to here, so you can then just deal with arrays)

@jorisvandenbossche @sinhrks the logic that preserves the index is in the _bins_to_cuts method, whereas the error we get for datetime/timedelta data comes from code in the cut method, so we need to convert to int64 in cut itself. I have two questions. First, do I need to convert to an array, as I originally did with x.values.view('i8')? If yes, do I need to move the index handling code from _bins_to_cuts to the cut method?

Yes, I agree preprocessing/postprocessing should be moved from _bins_to_cuts to cut/qcut. What I have in mind is something like the below.

Actually, preprocessing/postprocessing can be wrapped as private functions to reduce code duplication.

def cut(...):
    x_is_series = ...
    ...
    x = np.asarray(x)
    if is_datetime64_dtype(...):
        ....
    (core logic)
    if x_is_series:
        ....
    return...

@sinhrks I have created two private methods, _preprocess_for_cut and _postprocess_for_cut to move the logic from _bins_to_cuts to cut and qcut methods, present in commit 95e5989
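The pattern discussed above can be sketched roughly as follows. This is not the code from commit 95e5989: the helper names and bodies are assumptions, and np.digitize stands in for the real binning logic.

```python
import numpy as np
import pandas as pd


def preprocess_for_cut(x):
    # Remember the Series metadata, then hand a plain ndarray to the core logic.
    x_is_series = isinstance(x, pd.Series)
    index = x.index if x_is_series else None
    name = x.name if x_is_series else None
    return x_is_series, index, name, np.asarray(x)


def postprocess_for_cut(result, x_is_series, index, name):
    # Re-attach the original index and name if the input was a Series.
    if x_is_series:
        return pd.Series(result, index=index, name=name)
    return result


s = pd.Series([10, 20, 30], index=list('abc'), name='v')
x_is_series, index, name, arr = preprocess_for_cut(s)
codes = np.digitize(arr, [15, 25])  # stand-in for the core binning step
out = postprocess_for_cut(codes, x_is_series, index, name)
```

With this split the core logic only ever sees arrays, so the same two helpers can wrap both cut and qcut.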

Thx for the change. I think _preprocess_for_cut should be called before _coerce_to_type to handle list properly? Or these can be merged.

i think i can merge _coerce_to_type code with _preprocess_for_cut and return an additional dtype parameter from the _preprocess_for_cut method. @jreback are there any concerns if i remove the _coerce_to_type code method and move the processing that is being done (by _coerce_to_type) to _preprocess_for_cut method? Also if i merge _coerce_to_type with _preprocess_for_cut method then the call to _preprocess_for_cut would have to be made at the beginning of the cut method so that by the time we reach the line that fails (mn, mx = [mi + 0.0 for mi in rng]), we have already converted time data to int64

@sinhrks i have moved the call to _preprocess_for_cut prior to coerce

sinhrks · 2016-11-28T06:44:35Z

pandas/tools/tile.py

    fmt_str = '%%.%dg' % precision
+
+    if dtype == np.datetime64:


pls use is_datetime64_dtype here also.

@sinhrks, i can do that, that means that i would not have to pass the dtype as an argument, right? (I can check the dtype using is_datetime64_dtype call). Though it would mean that i make this call twice, once early on, when qcut or cut is called and later in _format_label, i mean if the call is not expensive, it shouldnt make a lot of difference

Shouldn't be expensive. An alternative option is to use a temporary variable. Taking the other answers into account, the following implementation is what I have in mind.

https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L400

@sinhrks made this change now

sinhrks · 2016-11-28T06:44:59Z

pandas/tools/tile.py

+        x = x.astype(np.int64)
+        dtype = np.timedelta64
+
+    if is_datetime64_dtype(x):


It is nice if we can support datetimetz.

And will that be done in the same manner as well (by converting to int64)?

Yes, it should be.
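A small check of why the same integer conversion can work for tz-aware data, assuming current pandas behaviour: the integers behind a tz-aware index are the UTC instants, so the timezone only matters when labels are rebuilt. The fixed +05:00 offset is an arbitrary choice for the example.

```python
import datetime

import pandas as pd

# A fixed-offset timezone avoids depending on a timezone database.
tz = datetime.timezone(datetime.timedelta(hours=5))
idx = pd.date_range('2013-01-01', periods=3, freq='D', tz=tz)

# Integer representation of the tz-aware values.
as_int = idx.asi8

# The same instants expressed as naive UTC timestamps.
utc_naive = idx.tz_convert('UTC').tz_localize(None)

# Both integer representations agree, so binning on them is tz-independent.
same = bool((as_int == utc_naive.asi8).all())
```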

jorisvandenbossche · 2016-11-28T09:04:39Z

pandas/tools/tests/test_tile.py

+        self.assertEqual(result[1],
+                         '(2013-01-01 16:00:00, 2013-01-02 08:00:00]')
+        self.assertEqual(result[2],
+                         '(2013-01-02 08:00:00, 2013-01-03 00:00:00]')


Instead of checking each element, it is better to construct the series, and then use assert_series_equal:

exp = Series(['(2012-12-31 23:57:07.200000, 2013-01-01 16:00:00]',
              '(2013-01-01 16:00:00, 2013-01-02 08:00:00]',
              '(2013-01-02 08:00:00, 2013-01-03 00:00:00]'])
tm.assert_series_equal(result, exp)
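For reference, a self-contained sketch of what the single comparison buys over element-wise checks. It uses the public pd.testing.assert_series_equal rather than the internal tm alias from the test suite, and placeholder label strings instead of the real expected values.

```python
import pandas as pd

result = pd.Series(['(a, b]', '(b, c]', '(c, d]'])
exp = pd.Series(['(a, b]', '(b, c]', '(c, d]'])

# Passes silently when values, dtype and index all match.
pd.testing.assert_series_equal(result, exp)

# A single differing element is enough to raise AssertionError.
wrong = pd.Series(['(a, b]', '(b, c]', '(x, y]'])
try:
    pd.testing.assert_series_equal(result, wrong)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```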

add the issue number as a comment

@jorisvandenbossche @jreback incorporated your review comments

jreback · 2016-11-28T11:08:42Z

pandas/tools/tile.py

@@ -11,8 +11,10 @@
 import pandas.core.algorithms as algos
 import pandas.core.nanops as nanops
 from pandas.compat import zip
-
+from pandas.tseries.timedeltas import to_timedelta
+from pandas import to_datetime


import to_timedelta from pandas directly

@jreback i have made this change

jreback · 2016-11-28T11:09:15Z

pandas/tools/tile.py

+        x = x.view(np.int64)
+        dtype = np.timedelta64
+
+    if is_datetime64_dtype(x):


should be elif

@jreback i have made this change

jreback · 2016-11-28T11:10:22Z

pandas/tools/tile.py

    fmt_str = '%%.%dg' % precision
+
+    if is_datetime64_dtype(dtype):
+        return to_datetime(x, unit='ns')


i would like to see thur format_labels so to do all labels in 1 go

@jreback can you please explain, i am not getting what you are trying to say

jreback · 2016-11-30T00:05:14Z

pandas/tools/tile.py

@@ -11,8 +11,10 @@
 import pandas.core.algorithms as algos
 import pandas.core.nanops as nanops
 from pandas.compat import zip
-
+from pandas import to_timedelta


put on the same line

all import from pandas are on the same line

jreback · 2016-11-30T00:05:23Z

pandas/tools/tile.py

 import numpy as np
+from pandas.types.common import (is_datetime64_dtype, is_timedelta64_dtype)


parens not needed

parenthesis removed now

jreback · 2016-11-30T00:09:16Z

pandas/tools/tile.py

+
+    elif is_datetime64_dtype(x):
+        x = x.view(np.int64)
+        dtype = np.datetime64


make a function in this module like

def _coerce_to_type(x):
    dtype = None
    original = x
    x = np.asarray(x)
    if is_timedelta64_dtype(x):
        ....
    return original, x, dtype

Then call this (in the 2 places u r using it).

return the original data, coerced x, dtype

then also pass the original data (if you need to reconstruct something inside the functions)

i have implemented this change
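One way the suggested helper could be filled in, as a sketch only. The elided branch bodies are a guess based on the diff hunks earlier in the thread, the name coerce_to_type is illustrative, and the dtype checks are imported from the public pandas.api.types module rather than the internal path used in the PR.

```python
import numpy as np
from pandas.api.types import is_datetime64_dtype, is_timedelta64_dtype


def coerce_to_type(x):
    # Returns the untouched input, an int64 view for time-like data, and a
    # marker dtype (None when no coercion was needed).
    dtype = None
    original = x
    x = np.asarray(x)
    if is_timedelta64_dtype(x):
        x = x.view(np.int64)
        dtype = np.timedelta64
    elif is_datetime64_dtype(x):
        x = x.view(np.int64)
        dtype = np.datetime64
    return original, x, dtype


dates = np.array(['2013-01-01', '2013-01-02'], dtype='datetime64[ns]')
original, coerced, dtype = coerce_to_type(dates)

numbers = [1.5, 2.5]
_, unchanged, no_dtype = coerce_to_type(numbers)
```

Returning the original alongside the coerced values lets the caller rebuild labels or an index later without converting back from integers.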

sinhrks · 2016-11-30T02:52:07Z

pandas/tools/tests/test_tile.py

+    def test_datetime_cut(self):
+        # GH 14714
+        data = to_datetime(Series(['2013-01-01', '2013-01-02', '2013-01-03']))
+        result, bins = cut(data, 3, retbins=True)


can u also test:

list of np.datetime64

ndarray of datetime64 dtype

DatetimeIndex

@sinhrks I have added tests for list of np.datetime64 and ndarray. I did not get the DatetimeIndex part, to test this do we want to have a series/dataframe with datetime as the index and datetime as the data?

@sinhrks added test for DatetimeIndex cut in be9b2fd

sinhrks · 2016-11-30T02:53:37Z

aileronajay · 2016-11-30T06:02:29Z

@sinhrks @jreback @jorisvandenbossche i have incorporated all changes pertaining to all the review comments

jorisvandenbossche · 2016-11-30T15:41:33Z

@aileronajay looking good!
Some additional general remarks:

Can you add a notice in the whatsnew file (doc/source/whatsnew/v0.20.0.txt)
For the tests, you nicely structured them. But, if you would combine them in one test (you can keep the structure by using comments), you can shorten the code a large part by not having to repeat the expected values.
Can you add docstrings to the new functions you defined? I know we don't always do that consistently in the codebase but we have to start somewhere (it doesn't always have to be long, eg a one line description can already do a lot)
Not sure if this should be dealt with in this PR, but we should also make it possible to specify your own bins for these data types (for now, you only tested the case where you specify the number of bins, e.g. bins=3, while e.g. bins=[pd.Timestamp(..), ...] should also work)
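For illustration, the explicit-bins case mentioned in the last remark might look like the following. This assumes a recent pandas release in which Timestamp bin edges are accepted by cut; it does not show what this PR supported at the time, and the dates are arbitrary.

```python
import pandas as pd

data = pd.Series(pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-03']))

# Explicit, user-chosen bin edges instead of a number of bins.
bins = [pd.Timestamp('2012-12-31'),
        pd.Timestamp('2013-01-02'),
        pd.Timestamp('2013-01-04')]

result = pd.cut(data, bins=bins)

# Integer code of the bin each value fell into (intervals are right-closed).
codes = result.cat.codes.tolist()
```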

…st, updated whatsnew

aileronajay · 2016-12-02T16:42:49Z

@jreback are there additional changes required in the PR?

jorisvandenbossche

@aileronajay a few more small comments. After that it should be ready to merge I think! (we gave you already enough rounds of comments :-))

jorisvandenbossche · 2016-12-02T22:12:44Z

pandas/tools/tile.py

+    x_is_series, series_index, name, x = _preprocess_for_cut(x)
+
+    original, x, dtype = _coerce_to_type(x)
+
    if not np.iterable(bins):
        if is_scalar(bins) and bins < 1:
            raise ValueError("`bins` should be a positive integer.")


The try ... except ... below should not be necessary anymore, since x is already an array (asarray is used in _preprocess_for_cut)

@jorisvandenbossche you are referring to the try except in the pd.cut method, specifically these lines

try:  # for array-like
    sz = x.size
except AttributeError:
    x = np.asarray(x)
    sz = x.size

Is that correct?

yes (I couldn't comment on those lines)
Normally it should already be assured before that you have an array here

@jorisvandenbossche thanks, that is what i thought!