Skip to content

REF: implement _should_compare #38105

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Dec 2, 2020

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Nov 27, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Index._is_comparable_dtype is limited because it misses a few cases: object dtype, and sub-dtypes within categorical. This implements Indes._should_compare, which handles those correctly. It then implements _get_indexer_non_comparable for cases in which we short-circuit non-comparable dtypes.

As a proof of concept, this then uses _should_compare and _get_indexer_non_comparable for PeriodIndex.get_indexer.

The behavior change, which this tests, is one that IIUC is a bug. That is, when we do get_indexer with a method and non-comparable dtypes, we should raise instead of return all minus-ones.

If implemented, we'll be able to use _should_compare to simplify all of the get_indexer, get_indexer_non_unique, and set op methods.

return no_matches
else:
# This is for get_indexer_non_unique
return no_matches, no_matches
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback theres some ambiguity as to what we're supposed to be returning here. ATM in Index.get_indexer_non_unique we are returning two ndarrays of minus-ones, but in IntervalIndex the second one is an np.arange(len(target))

The other inconsistency is that ATM PeriodIndex.get_indexer defines no_matches = -1 * np.ones(self.shape, dtype=np.intp) whereas i expected it to use target.shape

Can you clear this up for me?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at this more, i increasingly think the second one should be np.arange(target.shape, dtype=np.intp)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think this method should raise.

this is a pretty big change from status quo

@jbrockmendel
Copy link
Member Author

i don't think this method should raise.

this is a pretty big change from status quo

I stumbled onto this in part because of this test:

def test_reindex_datetimeindexes_tz_naive_and_aware():
    # GH 8306
    idx = date_range("20131101", tz="America/Chicago", periods=7)
    newidx = date_range("20131103", periods=10, freq="H")
    s = Series(range(7), index=idx)
    msg = "Cannot compare tz-naive and tz-aware timestamps"
    with pytest.raises(TypeError, match=msg):
        s.reindex(newidx, method="ffill")

and i figured that the way we treat periods with mismatched freq is almost identical to how we treat tzawareness-compat. Or is dt64tz special in this case?

I'd be OK with not-raising in _get_indexer_non_comparable, just want to get this behavior consistent, bc ATM we're all over the place.

continue
# For object dtype we are liable to get a different exception message
with pytest.raises(TypeError):
pi.get_indexer(other2, method=method)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback notice in these cases we are currently raising in master bc the scalar comparisons raise

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this new?

get_indexer is not super public but i believe it will never raise

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right that get_indexer with method=None should never raise (maybe with tzawareness corner cases), but with method="ffill" the following raises on master:

dti = pd.date_range("2016-01-01", periods=3)
rng = pd.Index(range(5))

>>> dti.get_indexer(rng, method="ffill")
TypeError: '<' not supported between instances of 'int' and 'Timestamp'

@jbrockmendel
Copy link
Member Author

xref #36320

@jbrockmendel
Copy link
Member Author

So I'm increasingly convinced that the current PeriodIndex.get_indexer behavior with method != None is wrong:

dti = pd.date_range("2016-01-01", periods=3)
pi = dti.to_period("D")
pi2 = dti.to_period("W")

ser = pd.Series(range(3), pi)

>>> ser.reindex(pi2, method="ffill")
2015-12-28/2016-01-03   NaN
2015-12-28/2016-01-03   NaN
2015-12-28/2016-01-03   NaN
Freq: W-SUN, dtype: float64

But pi vs pi2 are uncomparable just like pi vs dti, so we should raise the same way:

>>> ser.reindex(dti, method="ffill")
[...]
  File "pandas/_libs/algos.pyx", line 450, in pandas._libs.algos.pad
    if nleft == 0 or nright == 0 or new[nright - 1] < old[0]:
TypeError: '<' not supported between instances of 'Timestamp' and 'Period'

pd.IntervalIndex.from_breaks(dti4),
]
)
def non_comparable(request):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non_comparable_idx

@@ -4973,6 +4993,22 @@ def _maybe_promote(self, other: "Index"):

return self, other

def _get_other_deep(self, other: "Index") -> "Index":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should just be a function, no? its also a funny name. I think we have other similar things (e.g. this look like what .to_dense() does but for dtypes and not values).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to _unpack_nested_dtypes function

@jreback jreback added the Index Related to the Index class or subclasses label Nov 29, 2020
@jreback jreback merged commit 044df8c into pandas-dev:master Dec 2, 2020
@jbrockmendel
Copy link
Member Author

alright! get ready to see a bunch of de-duplication and fastpaths coming in.

@jbrockmendel jbrockmendel deleted the ref-maybe_promote-2 branch December 2, 2020 01:57
@jreback
Copy link
Contributor

jreback commented Dec 2, 2020

hit me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants