Skip to content

ENH: support cut/qcut for datetime/timedelta (GH14714) #14737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Dec 3, 2016

Conversation

aileronajay
Copy link
Contributor

@aileronajay aileronajay commented Nov 25, 2016

@aileronajay
Copy link
Contributor Author

#14714

@aileronajay aileronajay changed the title Enhancement to support https://github.com/pandas-dev/pandas/issues/14714 Enhancement to support issue #14714 Nov 25, 2016
@aileronajay
Copy link
Contributor Author

@jreback created this PR to verify if i am heading in the right direction with this change

@jorisvandenbossche
Copy link
Member

@aileronajay Roughly this looks in the right direction, yes. But, we will also want this to work with datetime and not only timedelta values. So you will have to pass the dtype somehow to distuinguish between both when making the labels.
Another problem will be that you now assume this to be a series (you access the .values), while this also works for arrays. And that you convert a series to an array, and in this way loose the index information.

I would also encourage you to already add tests with some examples and the desired results, so you know what you are working towards.

@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement Timedelta Timedelta data type Datetime Datetime data dtype labels Nov 25, 2016
@jorisvandenbossche jorisvandenbossche changed the title Enhancement to support issue #14714 ENH: support cut/qcut for datetime/timedelta (GH14714) Nov 25, 2016
# for handling the cut for datetime and timedelta objects
if needs_i8_conversion(x):
x = x.values.view('i8')
time_data = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would pass the dtype instead

dtype = x.dtype

then you can act on it below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually what I would do instead here is pass a formatter

if is_timedelta64_dtype(x):
    x = x.values.view('i8')
    formatter = lambda x: pd.to_timedelta(float(x), unit='ns')
elif is_datetime64_dtype(x):
  x = x.values.view('i8')
......

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback the x.dtype call errors out with this message when i run nosetests on test_tile.py AttributeError: 'list' object has no attribute 'dtype'. The formatter that i create in the suggested change here will be used in lieu of the formatter that is present in the downstream code, in method _format_levels , is that correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use np.asarray

@@ -249,16 +259,24 @@ def _format_levels(bins, prec, right=True,
if a != b and fa == fb:
raise ValueError('precision too low')

formatted = '(%s, %s]' % (fa, fb)
if time_data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at pandas/core/algorithms.py/factorize for how we handle the differing dtypes. we want to replicate that type of logic (what you are doing is fine, but the code will be more smooth).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback i have moved this handling down to the formatter in the latest iteration of this code

@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

pls add some tests

@codecov-io
Copy link

codecov-io commented Nov 25, 2016

Current coverage is 85.27% (diff: 93.02%)

No coverage report found for master at 725453d.

Powered by Codecov. Last update 725453d...be9b2fd

@aileronajay
Copy link
Contributor Author

@jorisvandenbossche ". And that you convert a series to an array, and in this way loose the index information.
"
After making call to the cut, will the output still be related to the initial index? I thought we split the data into ranges by making a call to cut and the result does not have the same order/index as the original input that we pass to cut

@jorisvandenbossche
Copy link
Member

@aileronajay As you can see, yes, the return value of cut retains the index:

In [18]: s = pd.Series(range(5))

In [19]: s
Out[19]: 
0    0
1    1
2    2
3    3
4    4
dtype: int64

In [20]: pd.cut(s, bins=2)
Out[20]: 
0    (-0.004, 2]
1    (-0.004, 2]
2    (-0.004, 2]
3         (2, 4]
4         (2, 4]
dtype: category
Categories (2, object): [(-0.004, 2] < (2, 4]]

Have a look for x_is_series how this is handled currently (eg here: https://github.com/aileronajay/pandas/blob/1b2dd2428d6ed2371cac89e56d474941c7225653/pandas/tools/tile.py#L243)

@aileronajay
Copy link
Contributor Author

@jorisvandenbossche @jreback i have changed how i convert the time object to integers, so i get a series now instead of array , x = x.astype(np.int64), it is similar to what is done in pandas.tools.util to_numeric method. This is the output that i get now

s = pd.Series(pd.to_timedelta(np.random.randint(0,10000,size=10),unit='ns')).sort_values()
pd.cut(s,6)
1 (0 days 00:00:00.000000, 0 days 00:00:00.000001]
9 (0 days 00:00:00.000000, 0 days 00:00:00.000001]
7 (0 days 00:00:00.000001, 0 days 00:00:00.000003]
4 (0 days 00:00:00.000004, 0 days 00:00:00.000006]
2 (0 days 00:00:00.000004, 0 days 00:00:00.000006]
0 (0 days 00:00:00.000004, 0 days 00:00:00.000006]
3 (0 days 00:00:00.000004, 0 days 00:00:00.000006]
5 (0 days 00:00:00.000006, 0 days 00:00:00.000007]
8 (0 days 00:00:00.000006, 0 days 00:00:00.000007]
6 (0 days 00:00:00.000007, 0 days 00:00:00.000008]

@aileronajay
Copy link
Contributor Author

@jreback i have moved the extra processing to the format method


dtype = None
if is_timedelta64_dtype(x):
x = x.astype(np.int64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls use view(np.int64)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks incorporated this change

@@ -81,6 +83,17 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
array([1, 1, 1, 1, 1], dtype=int64)
"""
# NOTE: this binning code is changed a bit from histogram for var(x) == 0
# for handling the cut for datetime and timedelta objects

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to convert to array likes here because input can be a list. prepare a private function which can be used in cut and qcut.

Copy link
Contributor Author

@aileronajay aileronajay Nov 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks If i convert to array, would that preserve the index? @jorisvandenbossche has earlier commented that converting to array is not desirable as it would lead to loss of index information

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should preserve the index. So needs below logic before converting to array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly it will be easier to move the x_is_series logic up (to here, so you can then just deal with arrays)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche @sinhrks the logic that preserves the index is present in _bins_to_cuts method. Whereas the error that we get due to date/timedelta data is due to code that is present in the cut method. So we need to convert into int64 in the cut method itself. So i have two questions. The first is that do i need to convert it to array, that i originally started out doing by making a call to x.values.view('i8') ? if yes, do i need to move to index handling code to the cut method instead of _bins_to_cuts method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree preprocessing/postprocessing should be moved to cut/qcut from _bins_to_cuts. what on my mind is something like below.

Actually preprocessing / postprocessing can be wrapped as a private function to reduce code duplications.

def cut(...):
    x_is_series = ...
    ... 
    x = np.asarray(x)    
    if is_datetime64_dtype(...):
        ....

    (core logic)

    if x_is_series:
        ....
    return...

Copy link
Contributor Author

@aileronajay aileronajay Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks I have created two private methods, _preprocess_for_cut and _postprocess_for_cut to move the logic from _bins_to_cuts to cut and qcut methods, present in commit 95e5989

Copy link
Member

@sinhrks sinhrks Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the change. I think _preprocess_for_cut should be called before _coerce_to_type to handle list properly? Or these can be merged.

Copy link
Contributor Author

@aileronajay aileronajay Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i can merge _coerce_to_type code with _preprocess_for_cut and return an additional dtype parameter from the _preprocess_for_cut method. @jreback are there any concerns if i remove the _coerce_to_type code method and move the processing that is being done (by _coerce_to_type) to _preprocess_for_cut method? Also if i merge _coerce_to_type with _preprocess_for_cut method then the call to _preprocess_for_cut would have to be made at the beginning of the cut method so that by the time we reach the line that fails (mn, mx = [mi + 0.0 for mi in rng]), we have already converted time data to int64

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks i have moved the call to _preprocess_for_cut prior to coerce

fmt_str = '%%.%dg' % precision

if dtype == np.datetime64:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls use is_datetime64_dtype here also.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks, i can do that, that means that i would not have to pass the dtype as an argument, right? (I can check the dtype using is_datetime64_dtype call). Though it would mean that i make this call twice, once early on, when qcut or cut is called and later in _format_label, i mean if the call is not expensive, it shouldnt make a lot of difference

Copy link
Member

@sinhrks sinhrks Nov 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be expensive. Alternative option is to use temporary variable. Including other answers, following impl is on my mind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks made this change now

x = x.astype(np.int64)
dtype = np.timedelta64

if is_datetime64_dtype(x):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is nice if we can support datetimetz.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and will that done be in the same same manner as well? (by converting to int64?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should be.

self.assertEqual(result[1],
'(2013-01-01 16:00:00, 2013-01-02 08:00:00]')
self.assertEqual(result[2],
'(2013-01-02 08:00:00, 2013-01-03 00:00:00]')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking each element, it is better to construct the series, and then use assert_series_equal:

exp = Series(['(2012-12-31 23:57:07.200000, 2013-01-01 16:00:00]', '(2013-01-01 16:00:00, 2013-01-02 08:00:00]', '(2013-01-02 08:00:00, 2013-01-03 00:00:00]')])
tm.assert_series_equal(result, exp)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche @jreback incorporated your review comments

@@ -11,8 +11,10 @@
import pandas.core.algorithms as algos
import pandas.core.nanops as nanops
from pandas.compat import zip

from pandas.tseries.timedeltas import to_timedelta
from pandas import to_datetime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import to_timedelta from pandas directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback i have made this change

x = x.view(np.int64)
dtype = np.timedelta64

if is_datetime64_dtype(x):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be elseif

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback i have made this change

fmt_str = '%%.%dg' % precision

if is_datetime64_dtype(dtype):
return to_datetime(x, unit='ns')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would like to see thur format_labels so to do all labels in 1 go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback can you please explain, i am not getting what you are trying to say

@@ -11,8 +11,10 @@
import pandas.core.algorithms as algos
import pandas.core.nanops as nanops
from pandas.compat import zip

from pandas import to_timedelta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put on the same line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all import from pandas are on the same line

import numpy as np
from pandas.types.common import (is_datetime64_dtype, is_timedelta64_dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parens not needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parenthesis removed now


elif is_datetime64_dtype(x):
x = x.view(np.int64)
dtype = np.datetime64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make a function in this module like

def _coerce_to_type(x):
     dtype = None
     original = x
     x = np.asarray(x)
     if is_timedelta64_dtype(x):
....

    return original, x, dtype

Then call this (in the 2 places u r using it).

return the orignal data, coerced x, dtype

then also pass the original data (if you need to reconstruct something in side the functions)

Copy link
Contributor Author

@aileronajay aileronajay Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have implemented this change

def test_datetime_cut(self):
# GH 14714
data = to_datetime(Series(['2013-01-01', '2013-01-02', '2013-01-03']))
result, bins = cut(data, 3, retbins=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u also test:

  • list of np.datetime64
  • ndarray of datetime64 dtype
  • DatetimeIndex

Copy link
Contributor Author

@aileronajay aileronajay Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks I have added tests for list of np.datetime64 and ndarray. I did not get the DatetimeIndex part, to test this do we want to have a series/dataframe with datetime as the index and datetime as the data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks added test for DatetimeIndex cut in be9b2fd

@@ -81,6 +83,17 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
array([1, 1, 1, 1, 1], dtype=int64)
"""
# NOTE: this binning code is changed a bit from histogram for var(x) == 0
# for handling the cut for datetime and timedelta objects

Copy link
Member

@sinhrks sinhrks Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the change. I think _preprocess_for_cut should be called before _coerce_to_type to handle list properly? Or these can be merged.

@aileronajay
Copy link
Contributor Author

aileronajay commented Nov 30, 2016

@sinhrks @jreback @jorisvandenbossche i have incorporated all changes pertaining to all the review comments

@jorisvandenbossche
Copy link
Member

@aileronajay looking good!
Some additional general remarks:

  • Can you add a notice in the whatsnew file (doc/source/whatsnew/v0.20.0.txt)
  • For the tests, you nicely structured them. But, if you would combine them in one test (you can keep the structure by using comments), you can shorten the code a large part by not having to repeat the expected values.
  • Can you add docstrings to the new functions you defined? I know we don't always do that consistently in the codebase but we have to start somewhere (it doesn't always have to be long, eg a one line description can already do a lot)
  • Not sure if this should be dealt with in this PR, but we should also make it possible to specify your own bins for these data types (now, you only tested for the case where you specify the number of bins, eg bins=True, while eg bins=[pd.Timestamp(..), ...] should also work)

@aileronajay
Copy link
Contributor Author

@jreback are there additional changes required in the PR?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aileronajay a few more small comments. After that it should be ready to merge I think! (we gave you already enough rounds of comments :-))

x_is_series, series_index, name, x = _preprocess_for_cut(x)

original, x, dtype = _coerce_to_type(x)

if not np.iterable(bins):
if is_scalar(bins) and bins < 1:
raise ValueError("`bins` should be a positive integer.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try ... except ... below should not be necessary anymore, since x is already an array (asarray is used in _preprocess_for_cut)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche you are referring to the try except in the pd.cut method, specifically these lines

    try:  # for array-like
        sz = x.size
    except AttributeError:
        x = np.asarray(x)
        sz = x.size

Is that correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes (I couldn't comment on those lines)
Normally it should already be assured before that you have an array here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche thanks, that is what i thought!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this review comment


x = np.asarray(x)
def _bins_to_cuts(x, bins, right=True, labels=None, retbins=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think retbins arg is not used anymore in this function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the same for name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this review comment

# for handling the cut for datetime and timedelta objects
x_is_series, series_index, name, x = _preprocess_for_cut(x)

original, x, dtype = _coerce_to_type(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think original is used? In that case, it's not needed to return it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this review comment

name = None
if x_is_series:
series_index = x.index
if name is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is not needed, as you defined name = None above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this review comment

import numpy as np
from pandas.types.common import is_datetime64_dtype, is_timedelta64_dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put this one above the numpy import? (to keep the pandas imports grouped like it was before)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this review comment

@jorisvandenbossche jorisvandenbossche merged commit 56c3aae into pandas-dev:master Dec 3, 2016
@jorisvandenbossche
Copy link
Member

@aileronajay Thanks a lot!
Do you want to do a follow-up PR to also add the ability to specify custom bins for datetime/timedelta values?

@aileronajay
Copy link
Contributor Author

@jorisvandenbossche thanks! Yes i would like to do PR. I will need some additional information though as to how the custom bins for datetime/timedelta values would work, i mean what would be the expected functionality

@jorisvandenbossche
Copy link
Member

@aileronajay The idea is the following. Assume the example from your tests:

In [2]: data = pd.to_datetime(pd.Series(['2013-01-01', '2013-01-02', '2013-01-03']))

In [6]: pd.cut(data, bins=3)
Out[6]: 
0    (2012-12-31 23:57:07.200000, 2013-01-01 16:00:00]
1           (2013-01-01 16:00:00, 2013-01-02 08:00:00]
2           (2013-01-02 08:00:00, 2013-01-03 00:00:00]
dtype: category
Categories (3, object): [(2012-12-31 23:57:07.200000, 2013-01-01 16:00:00] < (2013-01-01 16:00:00, 2013-01-02 08:00:00] <
                         (2013-01-02 08:00:00, 2013-01-03 00:00:00]]

Here you specify the number of bins, and the bin edges are generated automatically. But you can also specify the bin edges yourself (for numeric dtypes, this is not yet working for datetimes). So the idea is that you would be able to specify it like this:

In [7]: pd.cut(data, bins=[pd.Timestamp('2012-12-31'), pd.Timestamp('2012-01-02'), pd.Timestamp('2012-01-04')])

@aileronajay
Copy link
Contributor Author

@jorisvandenbossche, i think if we specify the bin (edges) in terms of datetime, the "data" or the input should also be of the type datetime, is that correct? Also we would only be able to specify edges only in terms of datetime and not timedelta, as datetime represent specific time instances and so between the datetime instances we have windows that we can split the input data into. Timedelta objects are difference in time objects, would be able to use them as bins?

@jorisvandenbossche
Copy link
Member

if we specify the bin (edges) in terms of datetime, the "data" or the input should also be of the type datetime, is that correct?

Yes, indeed, the specified bin edges need to match the type of the data

we would only be able to specify edges only in terms of datetime and not timedelta

No, you added support for 'cutting' timedelta data in this PR:

In [6]: data = pd.Series(pd.timedelta_range("0 days", freq='H', periods=5))

In [7]: data
Out[7]: 
0   00:00:00
1   01:00:00
2   02:00:00
3   03:00:00
4   04:00:00
dtype: timedelta64[ns]

In [8]: pd.cut(data, bins=3)
Out[8]: 
0    (-1 days +23:59:45.600000, 0 days 01:20:00]
1    (-1 days +23:59:45.600000, 0 days 01:20:00]
2             (0 days 01:20:00, 0 days 02:40:00]
3             (0 days 02:40:00, 0 days 04:00:00]
4             (0 days 02:40:00, 0 days 04:00:00]
dtype: category
Categories (3, object): [(-1 days +23:59:45.600000, 0 days 01:20:00] < (0 days 01:20:00, 0 days 02:40:00] <
                         (0 days 02:40:00, 0 days 04:00:00]]

so it should also be possible to specify the bin edges manually for this type of data.

Actually, I see you didn't add tests for the timedelta case, and I didn't notice this before merging. Would you also like to do a PR for adding some tests for this (can be separate PR)?

@aileronajay
Copy link
Contributor Author

aileronajay commented Dec 4, 2016

@jorisvandenbossche yes, i will do separate PR for tests for the timedelta test cases

@aileronajay
Copy link
Contributor Author

@jorisvandenbossche i coerced the bins to a numeric type, using the coerce private method that I created earlier and i get the following output (below) , is this on lines of what we desire?

pd.cut(pd.Series([np.datetime64('2013-01-01'),np.datetime64('2013-01-02')]),
... [np.datetime64('2012-12-18'), np.datetime64('2013-01-03'),np.datetime64('2013-01-05')])
0 (2012-12-18 00:00:00, 2013-01-03 00:00:00]
1 (2012-12-18 00:00:00, 2013-01-03 00:00:00]
dtype: category
Categories (2, object): [(2012-12-18 00:00:00, 2013-01-03 00:00:00] < (2013-01-03 00:00:00, 2013-01-05 00:00:00]]

@jorisvandenbossche
Copy link
Member

@aileronajay Yes, that looks like it. But you can just put up a PR, even if the code is only work in progress, then it will be easier to judge.

@aileronajay
Copy link
Contributor Author

@jorisvandenbossche , yeah about to do that now

jreback pushed a commit that referenced this pull request Dec 22, 2016
xref #14714, follow-on to #14737

Author: Ajay Saxena <[email protected]>

Closes #14798 from aileronajay/cut_timetype_bin and squashes the following commits:

82bffa1 [Ajay Saxena] added method for time type bins in pd cut and modified tests
ac919cf [Ajay Saxena] added test for datetime bin type
355e569 [Ajay Saxena]  allowing datetime and timedelta datatype in pd cut bins
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
xref pandas-dev#14714, follow-on to pandas-dev#14737

Author: Ajay Saxena <[email protected]>

Closes pandas-dev#14798 from aileronajay/cut_timetype_bin and squashes the following commits:

82bffa1 [Ajay Saxena] added method for time type bins in pd cut and modified tests
ac919cf [Ajay Saxena] added test for datetime bin type
355e569 [Ajay Saxena]  allowing datetime and timedelta datatype in pd cut bins
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Datetime Datetime data dtype Enhancement Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: enable pd.cut to handle i8 convertibles
5 participants