Skip to content

Make Series[datetime64] - pd.NaT behave like DatetimeIndex - pd.NaT #18960

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

jbrockmendel
Copy link
Member

@@ -407,8 +407,12 @@ def _validate_datetime(self, lvalues, rvalues, name):

# if tz's must be equal (same or None)
if getattr(lvalues, 'tz', None) != getattr(rvalues, 'tz', None):
raise ValueError("Incompatible tz's on datetime subtraction "
"ops")
if len(rvalues) == 1 and np.isnat(rvalues[0]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we never use np.isnat, always use isna.

why is this restricted to len(..) == 1? if they are all nat should be enough

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this restricted to len(..) == 1?

Because the bug being addressed is specific to scalar NaT subtraction. Non-scalar cases are handled elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we never use np.isnat, always use isna.

I don't have a strong opinion. np.isnat is more strict, which I figure is preferable since all else equal I'd like to avoid catching extra cases by accident.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well I do. don't use it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You got it buddy. Just changed and pushed.

@@ -505,11 +509,20 @@ def _convert_to_array(self, values, name=None, other=None):
inferred_type = lib.infer_dtype(values)
if (inferred_type in ('datetime64', 'datetime', 'date', 'time') or
is_datetimetz(inferred_type)):

if ovalues is pd.NaT and name == '__sub__':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's wrong this this code that is existing? (obviously the dtype is incorrect if other is dt64 but otherwise this is fine)

            # if we have a other of timedelta, but use pd.NaT here we
            # we are in the wrong path
            if (supplied_dtype is None and other is not None and
                (other.dtype in ('timedelta64[ns]', 'datetime64[ns]')) and
                    isna(values).all()):
                values = np.empty(values.shape, dtype='timedelta64[ns]')
                values[:] = iNaT

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For starters the other.dtype condition excludes datetime64s with timezones. More generally, this PR is specifically about scalar subtraction of pd.NaT and I would rather give that its own block than try to shoe-horn it into an existing block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplify this like the above code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. But in my head I'm going to put sarcasm quotes around the word "simplify".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue. you are repeating code. you only need to select the dtype if other.dtype == 'timedelta64[ns]'
but the construction is regardless

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you're suggesting would catch np.array([np.datetime64('NaT')]) which is unambiguously datetimelike, or consider np.array([pd.NaT, datetime64('NaT')]). _TimeOp is a mess in part because it is not clear about what cases it is trying to handle where. This PR is trying to fix a bug tied to one very specific case. The fix should not catch any cases other than that specific intended case.

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Dec 27, 2017
@codecov
Copy link

codecov bot commented Dec 27, 2017

Codecov Report

Merging #18960 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18960      +/-   ##
==========================================
+ Coverage   91.55%   91.57%   +0.02%     
==========================================
  Files         150      150              
  Lines       48939    48945       +6     
==========================================
+ Hits        44805    44821      +16     
+ Misses       4134     4124      -10
Flag Coverage Δ
#multiple 89.93% <100%> (+0.02%) ⬆️
#single 41.75% <25%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/ops.py 90.33% <100%> (+0.09%) ⬆️
pandas/util/testing.py 84.74% <0%> (-0.22%) ⬇️
pandas/plotting/_converter.py 66.95% <0%> (+1.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e3c797...9ac970d. Read the comment docs.

@jbrockmendel
Copy link
Member Author

Alright, the CI is turnaround is going to keep me off your back for a while. In the interim, this and #18959 are ready to within the margin of error. If the stars align #17746 may be there too.

@jbrockmendel
Copy link
Member Author

Alright! We're down to 3 inconsistencies between Series[datetime64] and DatetimeIndex arithmetic. This, one where DatetimeIndex behavior is wrong, and one where a ruling is needed (but I strongly suspect Series behavior is wrong) #18850.

@jbrockmendel
Copy link
Member Author

Appveyor error is complaining that the assert_produces_warning(PerformanceWarning) are instead giving off DeprecationWarning.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2017

@jbrockmendel yes I see that. trying to figure out why that is happening. we are not catching a deprecation warning on windows somewhere. not sure if you have a windows box?

@@ -407,8 +407,12 @@ def _validate_datetime(self, lvalues, rvalues, name):

# if tz's must be equal (same or None)
if getattr(lvalues, 'tz', None) != getattr(rvalues, 'tz', None):
raise ValueError("Incompatible tz's on datetime subtraction "
"ops")
if len(rvalues) == 1 and isna(rvalues[0]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isna(rvalues).all()

@@ -505,11 +509,20 @@ def _convert_to_array(self, values, name=None, other=None):
inferred_type = lib.infer_dtype(values)
if (inferred_type in ('datetime64', 'datetime', 'date', 'time') or
is_datetimetz(inferred_type)):

if ovalues is pd.NaT and name == '__sub__':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplify this like the above code

# GH#18808
ser = pd.Series([pd.NaT, pd.Timedelta('1s')])
res = ser - pd.NaT
expected = pd.Series([pd.NaT, pd.NaT], dtype='timedelta64[ns]')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convention is not to use pd.

@@ -407,8 +407,12 @@ def _validate_datetime(self, lvalues, rvalues, name):

# if tz's must be equal (same or None)
if getattr(lvalues, 'tz', None) != getattr(rvalues, 'tz', None):
raise ValueError("Incompatible tz's on datetime subtraction "
"ops")
if len(rvalues) == 1 and isna(rvalues).all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this matter if it is exactly len == 1, seems odd

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue at hand is that pd.NaT is special because it can play the role of a datetime or a timedelta depending on context. _TimeOp wraps scalar inputs in an array that becomes rvalues. But at a basic level here what we are interested in is "was the original right argument pd.NaT?" What this is doing (and the same you're asking me to do below) is to undo wrapping done by _TimeOp. It is a much less clear way of checking "is the argument pd.NaT?"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you didn't answer the question, whether I have 1 NaT or all NaT is immaterial, I cann't use the rhs dtype as its ambiguous and must use the left.

isna(values).all()):
values = np.empty(values.shape, dtype='timedelta64[ns]')
if len(values) == 1 and other.dtype == 'timedelta64[ns]':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not simpler at all.

values = np.empty(values.shape, dtye=other.dtype)
values[:] = iNaT

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, hence the sarcasm quotes and the simpler original implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah, even then this is wrong because it isn't conditioning on the op being sub. Gonna revert back to how it was before this commit.

@@ -407,8 +407,12 @@ def _validate_datetime(self, lvalues, rvalues, name):

# if tz's must be equal (same or None)
if getattr(lvalues, 'tz', None) != getattr(rvalues, 'tz', None):
raise ValueError("Incompatible tz's on datetime subtraction "
"ops")
if len(rvalues) == 1 and isna(rvalues).all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you didn't answer the question, whether I have 1 NaT or all NaT is immaterial, I cann't use the rhs dtype as its ambiguous and must use the left.

@@ -505,11 +509,20 @@ def _convert_to_array(self, values, name=None, other=None):
inferred_type = lib.infer_dtype(values)
if (inferred_type in ('datetime64', 'datetime', 'date', 'time') or
is_datetimetz(inferred_type)):

if ovalues is pd.NaT and name == '__sub__':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue. you are repeating code. you only need to select the dtype if other.dtype == 'timedelta64[ns]'
but the construction is regardless

@jbrockmendel
Copy link
Member Author

I'll get started on the new crop of comments. In the meantime, #18831 may be ready.

@jreback
Copy link
Contributor

jreback commented Jan 1, 2018

closing in favor of #19024

@jreback jreback closed this Jan 1, 2018
@jreback jreback added this to the No action milestone Jan 4, 2018
@jbrockmendel jbrockmendel deleted the subnat branch January 23, 2018 04:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Series.__sub__(NaT) vs DatetimeIndex.__sub__(NaT)
2 participants