Skip to content

PROPOSAL: use datetime64 and timedelta64 to eval #22063

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

holymonson
Copy link
Contributor

This PR aims to use datetime64 and timedelta64 in evaluation directly, instead of casting to integer and then casting back.

Benefits:

  1. values is saved in ndarray[datetime64], casting to int to eval is non-intuitive. It seems back to the time before numpy 1.7 which implements datetime64.
  2. numpy.ufunc can handle all the dtype change, e.g. datetime64 - datetime64 = timedelta64. Additonally, nonsense operations are prohibited, e.g. datetime64 + datetime64.
  3. numpy implements datetime64('NaT'), seems values_mask and other_mask are no longer needed for nans.
  4. Series can share the same call path with DataFrame in Block. Now Series is using dispatch_to_index_op to wrap as Index, but Index is 1d-array while DataFrame is 2d-array, so DataFrame can't use the same wrapper. ndarray([Index]) is dtype='O' so can't be benefited from numpy.ufunc.

This commit has passed tests/{internals,frame,dtypes} , but not tests/series, because dispatch_to_index_op breaks it.

@jbrockmendel
Copy link
Member

Series can share the same call path with DataFrame in Block

We actually want to go in the other direction. We want to make DataFrame dispatch to the Series (or DatetimeIndex/TimedeltaIndex) implementations. Is there a viable way to do that here? Maybe dispatch to self._holder?



# TODO: np.percentile not support datetime64, should be fixed in upsteam.
def percentile(values, q, **kw):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably belongs in core.nanops, definitely not core.common.

# TODO: np.percentile not support datetime64, should be fixed in upsteam.
def percentile(values, q, **kw):
if values.dtype.kind == 'M':
return np.percentile(values.view('i8'), q, **kw).astype(values.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this handle iNaT correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can handle numpy.datetime64('NaT'), but not iNaT.

@@ -2134,7 +2134,7 @@ def _na_value(self):

@property
def fill_value(self):
return tslibs.iNaT
return tslibs.NaT.asm8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iNaT is the standard usage for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iNaT is an int, but here wants a numpy.datetime64('NaT').

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what problem are you trying to solve here?

(operator.pow, 'bool')}
(operator.pow, 'bool'),
(operator.pow, '<m8[ns]'),
(operator.pow, '<M8[ns]')}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good fixes. Can you take it a step further and, instead of skipping, pytest.raises(TypeError) for the appropriate cases?

@holymonson
Copy link
Contributor Author

what problem are you trying to solve here?

@jreback #22005 will no longer exist. But the main purpose is simplify the code, and make DataFrame and Series operations share same call path in Block.

We actually want to go in the other direction. We want to make DataFrame dispatch to the Series (or DatetimeIndex/TimedeltaIndex) implements. Is there a viable way to do that here? Maybe dispatch to self._holder?

@jbrockmendel
As pointed out in Benefits (4), Series (or DatetimeIndex/TimedeltaIndex) can only handle 1d-array. If have to make DataFrame dispatch into 1d-array, there is 2 possible implementations:

  1. Reduce 2d-ndarray into 1d-ndarray, eval with numpy.ufunc, then raise back. This will need to due with shape changing and possible data copying. And it will break ops between 1d-ndarray and 2d-ndarray (need a different call path).

  2. Make it into ndarray[DatetimeIndex](1d-ndarray of DatetimeIndex). But it will be treat as dtype='O' in numpy, the call path will be

    1. make 2d-ndarray[dtype='m8'] into 1d-ndarray[DatetimeIndex]
    2. ufunc with 1d-ndarray[dtype='O']
    3. eval with DatetimeIndex
    4. ufunc with 1d-ndarray[dtype='m8']

    This PR will simply be

    1. ufunc with 2d-ndarray[dtype='m8']

As numpy already handle data operating well, why do we need an extra implement with less efficiency? In fact, dispatching to DatetimeIndex is ending up with ufunc with 1d-ndarray too, what DatetimeIndex do can be done (already implemented, seems only exception is tz-detection) in Block and ufunc.

Could someone show me the benefits of the Index-dispatch implement?


I have write a 2nd commit,which implements:

  1. .tz info save as DatetimeTZBlock.tz, no longer need a DatetimeIndex. Multi-columns can share a same DatetimeTZBlock if they have same tz. (NotImplemented)
  2. _try_coerce_args is implemented in DatetimeLikeBlockMixin, those override in subclass are removed. So as _try_coerce_result. (Benefited from numpy handing data)
  3. values_mask and other_mask are also removed. (numpy can handle NaT)

Tests haven't pass all, those failed mainly is DataFrame columns converting to Series.

@pep8speaks
Copy link

pep8speaks commented Jul 27, 2018

Hello @holymonson! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 12, 2018 at 08:41 Hours UTC

@holymonson holymonson force-pushed the use_datetime64 branch 2 times, most recently from 31a4a02 to 9bbe377 Compare July 27, 2018 06:47
1. `.tz` info save as `DatetimeTZBlock.tz`, no longer need a `DatetimeIndex`. Multi-columns can share a same `DatetimeTZBlock` if they have same `tz`. (NotImplemented)
2. `_try_coerce_args` is implemented in `DatetimeLikeBlockMixin`, those override in subclass are removed. So as `_try_coerce_result`. (Benefited from `numpy` handing data)
3. `values_mask` and `other_mask` are also removed. (`numpy` can handle `NaT`)
# let higher levels handle
raise TypeError

return values, other
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems values can be removed too. The other 2 _try_coerce_args overrides don't touch values, _try_coerce_args might rename to _try_coerce_other.

@jbrockmendel
Copy link
Member

can only handle 1d-array

Can you expand on why this is a compelling reason? Maybe there is some characteristic np.ufunc that I'm not considering?

so DataFrame can't use the same wrapper

The idea is that DataFrame dispatches column-wise, like it already does in DataFrame._combine_frame and DataFrame._compare_frame

Could someone show me the benefits of the Index-dispatch implement?

See #18824. As you've noticed, there are many operations (particularly arithmetic and comparisons) where DataFrame behavior is incorrect, particularly for datetime64 and datetime64tz dtypes. Until a few months ago, the same was true for Series. We fixed the Series problems by having it dispatch to the Index subclasses, since the DatetimeIndex and TimedeltaIndex implementations are much more careful. Now we're down to two implementations instead of three; we still should get it down to one.

@gfyoung gfyoung added Datetime Datetime data dtype Enhancement labels Jul 30, 2018
@@ -68,6 +68,22 @@ def np_array_datetime64_compat(arr, *args, **kwargs):
return np.array(arr, *args, **kwargs)


# np.percentile should support datetime64 in upsteam. numpy/numpy#11620
# TODO: change the version if the bug fixed
if _nlv >= '99.99.99':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a placeholder for "distant future"?

provide compat for np.percentile supporting datetime64
'''
if values.dtype.kind == 'M':
result = np.percentile(values.view('i8'), q, **kw)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this handle iNaT correctly?

if _nlv >= '99.99.99':
np_percentile_compat = np.percentile
else:
def np_percentile_compat(values, q, **kw):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the iNat concern aside, this looks like a nice function. But I don't see why it should be in compat. Might make more sense in nanops?

'''
provide compat for np.percentile supporting datetime64
'''
if values.dtype.kind == 'M':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does timedelta64 get handled correctly by numpy?

@@ -789,8 +789,7 @@ def replace(self, to_replace, value, inplace=False, filter=None,
# try to replace, if we raise an error, convert to ObjectBlock and
# retry
try:
values, _, to_replace, _ = self._try_coerce_args(self.values,
to_replace)
values, to_replace = self._try_coerce_args(self.values, to_replace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to tell, but a lot of this looks like it might be unrelated to the np_percentile_compat function. Is that the case? If so, that's great (simplifying try_coerce_args etc would be a big win), but should be done separately.

@jbrockmendel
Copy link
Member

@holymonson can you rebase? Arithmetic ops have seen a lot of cleanup in the last couple months.

@WillAyd
Copy link
Member

WillAyd commented Nov 26, 2018

Closing as stale. Ping if you'd like to pick this back up

@WillAyd WillAyd closed this Nov 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants