-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: support cut/qcut for datetime/timedelta (GH14714) #14737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: support cut/qcut for datetime/timedelta (GH14714) #14737
Conversation
@jreback created this PR to verify if i am heading in the right direction with this change |
@aileronajay Roughly this looks in the right direction, yes. But, we will also want this to work with datetime and not only timedelta values. So you will have to pass the dtype somehow to distuinguish between both when making the labels. I would also encourage you to already add tests with some examples and the desired results, so you know what you are working towards. |
# for handling the cut for datetime and timedelta objects | ||
if needs_i8_conversion(x): | ||
x = x.values.view('i8') | ||
time_data = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would pass the dtype instead
dtype = x.dtype
then you can act on it below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually what I would do instead here is pass a formatter
if is_timedelta64_dtype(x):
x = x.values.view('i8')
formatter = lambda x: pd.to_timedelta(float(x), unit='ns')
elif is_datetime64_dtype(x):
x = x.values.view('i8')
......
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback the x.dtype call errors out with this message when i run nosetests on test_tile.py AttributeError: 'list' object has no attribute 'dtype'. The formatter that i create in the suggested change here will be used in lieu of the formatter that is present in the downstream code, in method _format_levels , is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use np.asarray
@@ -249,16 +259,24 @@ def _format_levels(bins, prec, right=True, | |||
if a != b and fa == fb: | |||
raise ValueError('precision too low') | |||
|
|||
formatted = '(%s, %s]' % (fa, fb) | |||
if time_data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look at pandas/core/algorithms.py/factorize
for how we handle the differing dtypes. we want to replicate that type of logic (what you are doing is fine, but the code will be more smooth).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback i have moved this handling down to the formatter in the latest iteration of this code
pls add some tests |
273db34
to
1b2dd24
Compare
Current coverage is 85.27% (diff: 93.02%)
|
@jorisvandenbossche ". And that you convert a series to an array, and in this way loose the index information. |
@aileronajay As you can see, yes, the return value of
Have a look for |
@jorisvandenbossche @jreback i have changed how i convert the time object to integers, so i get a series now instead of array , x = x.astype(np.int64), it is similar to what is done in pandas.tools.util to_numeric method. This is the output that i get now s = pd.Series(pd.to_timedelta(np.random.randint(0,10000,size=10),unit='ns')).sort_values() |
990ad1c
to
7384925
Compare
@jreback i have moved the extra processing to the format method |
|
||
dtype = None | ||
if is_timedelta64_dtype(x): | ||
x = x.astype(np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls use view(np.int64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sinhrks incorporated this change
@@ -81,6 +83,17 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3, | |||
array([1, 1, 1, 1, 1], dtype=int64) | |||
""" | |||
# NOTE: this binning code is changed a bit from histogram for var(x) == 0 | |||
# for handling the cut for datetime and timedelta objects | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to convert to array likes here because input can be a list
. prepare a private function which can be used in cut
and qcut
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sinhrks If i convert to array, would that preserve the index? @jorisvandenbossche has earlier commented that converting to array is not desirable as it would lead to loss of index information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should preserve the index. So needs below logic before converting to array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly it will be easier to move the x_is_series
logic up (to here, so you can then just deal with arrays)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche @sinhrks the logic that preserves the index is present in _bins_to_cuts method. Whereas the error that we get due to date/timedelta data is due to code that is present in the cut method. So we need to convert into int64 in the cut method itself. So i have two questions. The first is that do i need to convert it to array, that i originally started out doing by making a call to x.values.view('i8') ? if yes, do i need to move to index handling code to the cut method instead of _bins_to_cuts method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree preprocessing/postprocessing should be moved to cut/qcut
from _bins_to_cuts
. what on my mind is something like below.
Actually preprocessing / postprocessing can be wrapped as a private function to reduce code duplications.
def cut(...):
x_is_series = ...
...
x = np.asarray(x)
if is_datetime64_dtype(...):
....
(core logic)
if x_is_series:
....
return...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx for the change. I think _preprocess_for_cut
should be called before _coerce_to_type
to handle list
properly? Or these can be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think i can merge _coerce_to_type code with _preprocess_for_cut and return an additional dtype parameter from the _preprocess_for_cut method. @jreback are there any concerns if i remove the _coerce_to_type code method and move the processing that is being done (by _coerce_to_type) to _preprocess_for_cut method? Also if i merge _coerce_to_type with _preprocess_for_cut method then the call to _preprocess_for_cut would have to be made at the beginning of the cut method so that by the time we reach the line that fails (mn, mx = [mi + 0.0 for mi in rng]), we have already converted time data to int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sinhrks i have moved the call to _preprocess_for_cut prior to coerce
fmt_str = '%%.%dg' % precision | ||
|
||
if dtype == np.datetime64: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls use is_datetime64_dtype
here also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sinhrks, i can do that, that means that i would not have to pass the dtype as an argument, right? (I can check the dtype using is_datetime64_dtype call). Though it would mean that i make this call twice, once early on, when qcut or cut is called and later in _format_label, i mean if the call is not expensive, it shouldnt make a lot of difference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't be expensive. Alternative option is to use temporary variable. Including other answers, following impl is on my mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sinhrks made this change now
x = x.astype(np.int64) | ||
dtype = np.timedelta64 | ||
|
||
if is_datetime64_dtype(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is nice if we can support datetimetz.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and will that done be in the same same manner as well? (by converting to int64?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should be.
self.assertEqual(result[1], | ||
'(2013-01-01 16:00:00, 2013-01-02 08:00:00]') | ||
self.assertEqual(result[2], | ||
'(2013-01-02 08:00:00, 2013-01-03 00:00:00]') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of checking each element, it is better to construct the series, and then use assert_series_equal
:
exp = Series(['(2012-12-31 23:57:07.200000, 2013-01-01 16:00:00]', '(2013-01-01 16:00:00, 2013-01-02 08:00:00]', '(2013-01-02 08:00:00, 2013-01-03 00:00:00]')])
tm.assert_series_equal(result, exp)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number as a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche @jreback incorporated your review comments
@@ -11,8 +11,10 @@ | |||
import pandas.core.algorithms as algos | |||
import pandas.core.nanops as nanops | |||
from pandas.compat import zip | |||
|
|||
from pandas.tseries.timedeltas import to_timedelta | |||
from pandas import to_datetime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import to_timedelta from pandas directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback i have made this change
x = x.view(np.int64) | ||
dtype = np.timedelta64 | ||
|
||
if is_datetime64_dtype(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be elseif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback i have made this change
fmt_str = '%%.%dg' % precision | ||
|
||
if is_datetime64_dtype(dtype): | ||
return to_datetime(x, unit='ns') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would like to see thur format_labels so to do all labels in 1 go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback can you please explain, i am not getting what you are trying to say
@@ -11,8 +11,10 @@ | |||
import pandas.core.algorithms as algos | |||
import pandas.core.nanops as nanops | |||
from pandas.compat import zip | |||
|
|||
from pandas import to_timedelta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put on the same line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all import from pandas are on the same line
import numpy as np | ||
from pandas.types.common import (is_datetime64_dtype, is_timedelta64_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parens not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parenthesis removed now
|
||
elif is_datetime64_dtype(x): | ||
x = x.view(np.int64) | ||
dtype = np.datetime64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a function in this module like
def _coerce_to_type(x):
dtype = None
original = x
x = np.asarray(x)
if is_timedelta64_dtype(x):
....
return original, x, dtype
Then call this (in the 2 places u r using it).
return the orignal data, coerced x, dtype
then also pass the original data (if you need to reconstruct something in side the functions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have implemented this change
77ddff1
to
95e5989
Compare
def test_datetime_cut(self): | ||
# GH 14714 | ||
data = to_datetime(Series(['2013-01-01', '2013-01-02', '2013-01-03'])) | ||
result, bins = cut(data, 3, retbins=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u also test:
list
ofnp.datetime64
ndarray
ofdatetime64
dtypeDatetimeIndex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sinhrks I have added tests for list of np.datetime64 and ndarray. I did not get the DatetimeIndex part, to test this do we want to have a series/dataframe with datetime as the index and datetime as the data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -81,6 +83,17 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3, | |||
array([1, 1, 1, 1, 1], dtype=int64) | |||
""" | |||
# NOTE: this binning code is changed a bit from histogram for var(x) == 0 | |||
# for handling the cut for datetime and timedelta objects | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx for the change. I think _preprocess_for_cut
should be called before _coerce_to_type
to handle list
properly? Or these can be merged.
@sinhrks @jreback @jorisvandenbossche i have incorporated all changes pertaining to all the review comments |
@aileronajay looking good!
|
…st, updated whatsnew
f00cfed
to
8e4adbb
Compare
@jreback are there additional changes required in the PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aileronajay a few more small comments. After that it should be ready to merge I think! (we gave you already enough rounds of comments :-))
x_is_series, series_index, name, x = _preprocess_for_cut(x) | ||
|
||
original, x, dtype = _coerce_to_type(x) | ||
|
||
if not np.iterable(bins): | ||
if is_scalar(bins) and bins < 1: | ||
raise ValueError("`bins` should be a positive integer.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The try ... except ...
below should not be necessary anymore, since x
is already an array (asarray is used in _preprocess_for_cut
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche you are referring to the try except in the pd.cut method, specifically these lines
try: # for array-like
sz = x.size
except AttributeError:
x = np.asarray(x)
sz = x.size
Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes (I couldn't comment on those lines)
Normally it should already be assured before that you have an array here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche thanks, that is what i thought!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have addressed this review comment
|
||
x = np.asarray(x) | ||
def _bins_to_cuts(x, bins, right=True, labels=None, retbins=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think retbins
arg is not used anymore in this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the same for name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have addressed this review comment
# for handling the cut for datetime and timedelta objects | ||
x_is_series, series_index, name, x = _preprocess_for_cut(x) | ||
|
||
original, x, dtype = _coerce_to_type(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think original
is used? In that case, it's not needed to return it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have addressed this review comment
name = None | ||
if x_is_series: | ||
series_index = x.index | ||
if name is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check is not needed, as you defined name = None
above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have addressed this review comment
import numpy as np | ||
from pandas.types.common import is_datetime64_dtype, is_timedelta64_dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put this one above the numpy import? (to keep the pandas imports grouped like it was before)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have addressed this review comment
@aileronajay Thanks a lot! |
@jorisvandenbossche thanks! Yes i would like to do PR. I will need some additional information though as to how the custom bins for datetime/timedelta values would work, i mean what would be the expected functionality |
@aileronajay The idea is the following. Assume the example from your tests:
Here you specify the number of bins, and the bin edges are generated automatically. But you can also specify the bin edges yourself (for numeric dtypes, this is not yet working for datetimes). So the idea is that you would be able to specify it like this:
|
@jorisvandenbossche, i think if we specify the bin (edges) in terms of datetime, the "data" or the input should also be of the type datetime, is that correct? Also we would only be able to specify edges only in terms of datetime and not timedelta, as datetime represent specific time instances and so between the datetime instances we have windows that we can split the input data into. Timedelta objects are difference in time objects, would be able to use them as bins? |
Yes, indeed, the specified bin edges need to match the type of the data
No, you added support for 'cutting' timedelta data in this PR:
so it should also be possible to specify the bin edges manually for this type of data. Actually, I see you didn't add tests for the timedelta case, and I didn't notice this before merging. Would you also like to do a PR for adding some tests for this (can be separate PR)? |
@jorisvandenbossche yes, i will do separate PR for tests for the timedelta test cases |
@jorisvandenbossche i coerced the bins to a numeric type, using the coerce private method that I created earlier and i get the following output (below) , is this on lines of what we desire? pd.cut(pd.Series([np.datetime64('2013-01-01'),np.datetime64('2013-01-02')]), |
@aileronajay Yes, that looks like it. But you can just put up a PR, even if the code is only work in progress, then it will be easier to judge. |
@jorisvandenbossche , yeah about to do that now |
xref #14714, follow-on to #14737 Author: Ajay Saxena <[email protected]> Closes #14798 from aileronajay/cut_timetype_bin and squashes the following commits: 82bffa1 [Ajay Saxena] added method for time type bins in pd cut and modified tests ac919cf [Ajay Saxena] added test for datetime bin type 355e569 [Ajay Saxena] allowing datetime and timedelta datatype in pd cut bins
xref pandas-dev#14714, follow-on to pandas-dev#14737 Author: Ajay Saxena <[email protected]> Closes pandas-dev#14798 from aileronajay/cut_timetype_bin and squashes the following commits: 82bffa1 [Ajay Saxena] added method for time type bins in pd cut and modified tests ac919cf [Ajay Saxena] added test for datetime bin type 355e569 [Ajay Saxena] allowing datetime and timedelta datatype in pd cut bins
git diff upstream/master | flake8 --diff