PROPOSAL: use datetime64 and timedelta64 to eval #22063

holymonson · 2018-07-26T10:06:33Z

This PR aims to use datetime64 and timedelta64 in evaluation directly, instead of casting to integer and then casting back.

Benefits:

values is saved in ndarray[datetime64], casting to int to eval is non-intuitive. It seems back to the time before numpy 1.7 which implements datetime64.
numpy.ufunc can handle all the dtype change, e.g. datetime64 - datetime64 = timedelta64. Additonally, nonsense operations are prohibited, e.g. datetime64 + datetime64.
numpy implements datetime64('NaT'), seems values_mask and other_mask are no longer needed for nans.
Series can share the same call path with DataFrame in Block. Now Series is using dispatch_to_index_op to wrap as Index, but Index is 1d-array while DataFrame is 2d-array, so DataFrame can't use the same wrapper. ndarray([Index]) is dtype='O' so can't be benefited from numpy.ufunc.

This commit has passed tests/{internals,frame,dtypes} , but not tests/series, because dispatch_to_index_op breaks it.

jbrockmendel · 2018-07-26T18:32:37Z

Series can share the same call path with DataFrame in Block

We actually want to go in the other direction. We want to make DataFrame dispatch to the Series (or DatetimeIndex/TimedeltaIndex) implementations. Is there a viable way to do that here? Maybe dispatch to self._holder?

jbrockmendel · 2018-07-26T18:33:24Z

pandas/core/common.py

+
+
+# TODO: np.percentile not support datetime64, should be fixed in upsteam.
+def percentile(values, q, **kw):


Probably belongs in core.nanops, definitely not core.common.

jbrockmendel · 2018-07-26T18:34:00Z

pandas/core/common.py

+# TODO: np.percentile not support datetime64, should be fixed in upsteam.
+def percentile(values, q, **kw):
+    if values.dtype.kind == 'M':
+        return np.percentile(values.view('i8'), q, **kw).astype(values.dtype)


Will this handle iNaT correctly?

Can handle numpy.datetime64('NaT'), but not iNaT.

jbrockmendel · 2018-07-26T18:34:44Z

pandas/core/internals/blocks.py

@@ -2134,7 +2134,7 @@ def _na_value(self):

    @property
    def fill_value(self):
-        return tslibs.iNaT
+        return tslibs.NaT.asm8


iNaT is the standard usage for this.

iNaT is an int, but here wants a numpy.datetime64('NaT').

jreback

what problem are you trying to solve here?

jbrockmendel · 2018-07-26T18:36:28Z

pandas/tests/internals/test_internals.py

-                (operator.pow, 'bool')}
+                (operator.pow, 'bool'),
+                (operator.pow, '<m8[ns]'),
+                (operator.pow, '<M8[ns]')}


These are good fixes. Can you take it a step further and, instead of skipping, pytest.raises(TypeError) for the appropriate cases?

holymonson · 2018-07-27T06:38:49Z

what problem are you trying to solve here?

@jreback #22005 will no longer exist. But the main purpose is simplify the code, and make DataFrame and Series operations share same call path in Block.

We actually want to go in the other direction. We want to make DataFrame dispatch to the Series (or DatetimeIndex/TimedeltaIndex) implements. Is there a viable way to do that here? Maybe dispatch to self._holder?

@jbrockmendel
As pointed out in Benefits (4), Series (or DatetimeIndex/TimedeltaIndex) can only handle 1d-array. If have to make DataFrame dispatch into 1d-array, there is 2 possible implementations:

Reduce 2d-ndarray into 1d-ndarray, eval with numpy.ufunc, then raise back. This will need to due with shape changing and possible data copying. And it will break ops between 1d-ndarray and 2d-ndarray (need a different call path).
Make it into ndarray[DatetimeIndex](1d-ndarray of DatetimeIndex). But it will be treat as dtype='O' in numpy, the call path will be
1. make 2d-ndarray[dtype='m8'] into 1d-ndarray[DatetimeIndex]
2. ufunc with 1d-ndarray[dtype='O']
3. eval with DatetimeIndex
4. ufunc with 1d-ndarray[dtype='m8']
This PR will simply be
1. ufunc with 2d-ndarray[dtype='m8']

As numpy already handle data operating well, why do we need an extra implement with less efficiency? In fact, dispatching to DatetimeIndex is ending up with ufunc with 1d-ndarray too, what DatetimeIndex do can be done (already implemented, seems only exception is tz-detection) in Block and ufunc.

Could someone show me the benefits of the Index-dispatch implement?

I have write a 2nd commit,which implements:

.tz info save as DatetimeTZBlock.tz, no longer need a DatetimeIndex. Multi-columns can share a same DatetimeTZBlock if they have same tz. (NotImplemented)
_try_coerce_args is implemented in DatetimeLikeBlockMixin, those override in subclass are removed. So as _try_coerce_result. (Benefited from numpy handing data)
values_mask and other_mask are also removed. (numpy can handle NaT)

Tests haven't pass all, those failed mainly is DataFrame columns converting to Series.

pep8speaks · 2018-07-27T06:39:21Z

Hello @holymonson! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 12, 2018 at 08:41 Hours UTC

1. `.tz` info save as `DatetimeTZBlock.tz`, no longer need a `DatetimeIndex`. Multi-columns can share a same `DatetimeTZBlock` if they have same `tz`. (NotImplemented) 2. `_try_coerce_args` is implemented in `DatetimeLikeBlockMixin`, those override in subclass are removed. So as `_try_coerce_result`. (Benefited from `numpy` handing data) 3. `values_mask` and `other_mask` are also removed. (`numpy` can handle `NaT`)

holymonson · 2018-07-27T12:21:50Z

pandas/core/internals/blocks.py

+            # let higher levels handle
+            raise TypeError
+
+        return values, other


Seems values can be removed too. The other 2 _try_coerce_args overrides don't touch values, _try_coerce_args might rename to _try_coerce_other.

jbrockmendel · 2018-07-27T15:45:00Z

can only handle 1d-array

Can you expand on why this is a compelling reason? Maybe there is some characteristic np.ufunc that I'm not considering?

so DataFrame can't use the same wrapper

The idea is that DataFrame dispatches column-wise, like it already does in DataFrame._combine_frame and DataFrame._compare_frame

Could someone show me the benefits of the Index-dispatch implement?

See #18824. As you've noticed, there are many operations (particularly arithmetic and comparisons) where DataFrame behavior is incorrect, particularly for datetime64 and datetime64tz dtypes. Until a few months ago, the same was true for Series. We fixed the Series problems by having it dispatch to the Index subclasses, since the DatetimeIndex and TimedeltaIndex implementations are much more careful. Now we're down to two implementations instead of three; we still should get it down to one.

jbrockmendel · 2018-09-29T18:49:44Z

pandas/compat/numpy/__init__.py

@@ -68,6 +68,22 @@ def np_array_datetime64_compat(arr, *args, **kwargs):
    return np.array(arr, *args, **kwargs)


+# np.percentile should support datetime64 in upsteam. numpy/numpy#11620
+# TODO: change the version if the bug fixed
+if _nlv >= '99.99.99':


Is this just a placeholder for "distant future"?

jbrockmendel · 2018-09-29T18:50:14Z

pandas/compat/numpy/__init__.py

+        provide compat for np.percentile supporting datetime64
+        '''
+        if values.dtype.kind == 'M':
+            result = np.percentile(values.view('i8'), q, **kw)


Will this handle iNaT correctly?

jbrockmendel · 2018-09-29T18:51:06Z

pandas/compat/numpy/__init__.py

+if _nlv >= '99.99.99':
+    np_percentile_compat = np.percentile
+else:
+    def np_percentile_compat(values, q, **kw):


the iNat concern aside, this looks like a nice function. But I don't see why it should be in compat. Might make more sense in nanops?

jbrockmendel · 2018-09-29T18:51:23Z

pandas/compat/numpy/__init__.py

+        '''
+        provide compat for np.percentile supporting datetime64
+        '''
+        if values.dtype.kind == 'M':


Does timedelta64 get handled correctly by numpy?

jbrockmendel · 2018-09-29T18:53:42Z

pandas/core/internals/blocks.py

@@ -789,8 +789,7 @@ def replace(self, to_replace, value, inplace=False, filter=None,
        # try to replace, if we raise an error, convert to ObjectBlock and
        # retry
        try:
-            values, _, to_replace, _ = self._try_coerce_args(self.values,
-                                                             to_replace)
+            values, to_replace = self._try_coerce_args(self.values, to_replace)


It's hard to tell, but a lot of this looks like it might be unrelated to the np_percentile_compat function. Is that the case? If so, that's great (simplifying try_coerce_args etc would be a big win), but should be done separately.

jbrockmendel · 2018-11-03T01:19:16Z

@holymonson can you rebase? Arithmetic ops have seen a lot of cleanup in the last couple months.

WillAyd · 2018-11-26T03:48:35Z

Closing as stale. Ping if you'd like to pick this back up

use datetime64 and timedelta64 to eval

6c33b65

holymonson force-pushed the use_datetime64 branch from 60140a5 to 6c33b65 Compare July 26, 2018 10:44

jbrockmendel reviewed Jul 26, 2018

View reviewed changes

jreback requested changes Jul 26, 2018

View reviewed changes

jbrockmendel reviewed Jul 26, 2018

View reviewed changes

holymonson force-pushed the use_datetime64 branch 2 times, most recently from 31a4a02 to 9bbe377 Compare July 27, 2018 06:47

holymonson force-pushed the use_datetime64 branch from 9bbe377 to 7ff27d2 Compare July 27, 2018 06:53

compat: add np_percentile_compat for datetime64

98bd79a

holymonson commented Jul 27, 2018

View reviewed changes

add pytest.raises to test_binop_other

0c90051

gfyoung added Datetime Datetime data dtype Enhancement labels Jul 30, 2018

jbrockmendel reviewed Sep 29, 2018

View reviewed changes

WillAyd closed this Nov 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PROPOSAL: use datetime64 and timedelta64 to eval #22063

PROPOSAL: use datetime64 and timedelta64 to eval #22063

holymonson commented Jul 26, 2018

jbrockmendel commented Jul 26, 2018

jbrockmendel Jul 26, 2018

jbrockmendel Jul 26, 2018

holymonson Jul 27, 2018

jbrockmendel Jul 26, 2018

holymonson Jul 27, 2018

jreback left a comment

jbrockmendel Jul 26, 2018

holymonson commented Jul 27, 2018

pep8speaks commented Jul 27, 2018 •

edited

Loading

holymonson Jul 27, 2018

jbrockmendel commented Jul 27, 2018

jbrockmendel Sep 29, 2018

jbrockmendel Sep 29, 2018

jbrockmendel Sep 29, 2018

jbrockmendel Sep 29, 2018

jbrockmendel Sep 29, 2018

jbrockmendel commented Nov 3, 2018

WillAyd commented Nov 26, 2018



		# TODO: np.percentile not support datetime64, should be fixed in upsteam.
		def percentile(values, q, **kw):

PROPOSAL: use datetime64 and timedelta64 to eval #22063

PROPOSAL: use datetime64 and timedelta64 to eval #22063

Conversation

holymonson commented Jul 26, 2018

jbrockmendel commented Jul 26, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

holymonson commented Jul 27, 2018

pep8speaks commented Jul 27, 2018 • edited Loading

Comment last updated on August 12, 2018 at 08:41 Hours UTC

Choose a reason for hiding this comment

jbrockmendel commented Jul 27, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbrockmendel commented Nov 3, 2018

WillAyd commented Nov 26, 2018

pep8speaks commented Jul 27, 2018 •

edited

Loading