BUG: Fix reindexing with multi-indexed DataFrames #30766

Merged: 49 commits, Apr 8, 2020

Commits
da10a2c
* add missing method kwargs in MultiIndex get_indexer() impl
ChrisRobo Jan 6, 2020
dc96151
handle multiple trickier cases with indexing
ChrisRobo Jan 6, 2020
fec078f
add small bugfixes and cleanup
ChrisRobo Jan 6, 2020
3e6ca9d
fix more bugs and do more cleanup
ChrisRobo Jan 6, 2020
fbe7684
update tests to reflect changes since 0.24.2
ChrisRobo Jan 6, 2020
aeccfed
handle tuple-ordering correctly across several levels when carrying
ChrisRobo Jan 6, 2020
1472e64
rm testing print statements and fix rebase error
ChrisRobo Jan 6, 2020
44e30f0
address flake8 style complaints
ChrisRobo Jan 6, 2020
1394ebf
apply output of
ChrisRobo Jan 6, 2020
df68ca9
fix test import names and ordering
ChrisRobo Jan 6, 2020
bba7e51
move whatsnew addition from Indexing to MultiIndex section
ChrisRobo Jan 6, 2020
3f3b799
address linting issues
ChrisRobo Jan 7, 2020
3457192
use intermediary variables in tests and add issue link in comments
ChrisRobo Jan 7, 2020
5f16fc8
add one-line summary for _do_backfill_carrying() method
ChrisRobo Jan 7, 2020
7241973
use comments instead of docstrings with issue link in tests
ChrisRobo Jan 7, 2020
917a140
rm unnecessary blank line
ChrisRobo Jan 7, 2020
3ea4bae
use indexer dtypes in tests for compatibility with windows
ChrisRobo Jan 7, 2020
c52d87a
one more such case
ChrisRobo Jan 7, 2020
d72bdd6
cleanup for same such cases
ChrisRobo Jan 7, 2020
6b2edd8
run black pandas again, update formatting of a few lines
ChrisRobo Jan 7, 2020
9014f96
move whatsnew entry from 1.0.0 -> 1.0.1
ChrisRobo Jan 30, 2020
09adf2c
actually, mv it to 1.1.0 since that appears to be next release
ChrisRobo Jan 30, 2020
f994286
sort imports in indexing tests with isort
ChrisRobo Jan 30, 2020
734b685
revert unrelated change to not-used 1.0.1 release notes
ChrisRobo Jan 30, 2020
f69cd13
add example code in whatsnew
ChrisRobo Feb 19, 2020
e91b2cb
re-write backend for reindexing MultiIndexes with a filling method
ChrisRobo Feb 19, 2020
33015a3
* add docstrings, fix comments
ChrisRobo Feb 19, 2020
590c40f
robustify tests and add additional comments
ChrisRobo Feb 19, 2020
701ca63
include before/after examples
ChrisRobo Feb 19, 2020
c3c811e
ensure type assignments correct
ChrisRobo Feb 19, 2020
f5f5531
use preferred method of getting class from object
ChrisRobo Feb 19, 2020
180d901
add newlines to get docs to compile and unindent for clarity
ChrisRobo Feb 20, 2020
edb2e6c
add clarifying comment re: regression
ChrisRobo Feb 20, 2020
a3908ea
add black formatting
ChrisRobo Feb 20, 2020
f5ef69f
clean up after slightly-botched rebase
ChrisRobo Feb 24, 2020
4bc36a2
take suggestions of black pandas
ChrisRobo Feb 24, 2020
fff0223
address comments
ChrisRobo Feb 27, 2020
4f80ced
update docstrings
ChrisRobo Mar 1, 2020
3e825d9
use ipython directive for example with bugfix
ChrisRobo Mar 19, 2020
83b9830
update formatting
ChrisRobo Mar 19, 2020
5349af5
put get_indexer() calls into MultiIndex engine behind single function
ChrisRobo Mar 19, 2020
9753062
add performance benchmark for MultiIndex get_indexer with filling
ChrisRobo Mar 19, 2020
ab36e67
fix MultiIndex reference
ChrisRobo Mar 19, 2020
3231481
take black pandas suggestions
ChrisRobo Mar 19, 2020
5aa7fea
use black formatter on asv profiling tests
ChrisRobo Mar 27, 2020
7e4e975
move whatsnew docs into their own sub-heading in api changes
ChrisRobo Mar 27, 2020
113e3b3
fix syntax error
ChrisRobo Mar 27, 2020
881bc1b
rm duplicate test from failed rebase
ChrisRobo Mar 27, 2020
7836cc9
clean up rebase error
ChrisRobo Apr 7, 2020
28 changes: 28 additions & 0 deletions asv_bench/benchmarks/multiindex_object.py
@@ -74,10 +74,38 @@ def setup(self):
],
dtype=object,
)
self.other_mi_many_mismatches = MultiIndex.from_tuples(
[
(-7, 41),
(-2, 3),
(-0.7, 5),
(0, 0),
(0, 1.5),
(0, 340),
(0, 1001),
(1, -4),
(1, 20),
(1, 1040),
(432, -5),
(432, 17),
(439, 165.5),
(998, -4),
(998, 24065),
(999, 865.2),
(999, 1000),
(1045, -843),
]
)

def time_get_indexer(self):
self.mi_int.get_indexer(self.obj_index)

def time_get_indexer_and_backfill(self):
self.mi_int.get_indexer(self.other_mi_many_mismatches, method="backfill")

def time_get_indexer_and_pad(self):
self.mi_int.get_indexer(self.other_mi_many_mismatches, method="pad")

def time_is_monotonic(self):
self.mi_int.is_monotonic
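
These benchmarks run through asv like the rest of the suite; a typical invocation from the asv_bench directory would be something along the lines of `asv continuous -f 1.1 upstream/master HEAD -b multiindex_object`, which compares this branch against master for just this module (the exact command depends on the local asv setup).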

61 changes: 61 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -115,6 +115,67 @@ Backwards incompatible API changes
Previously a ``UnsupportedFunctionCall`` was raised (``AssertionError`` if ``min_count`` passed into :meth:`~DataFrameGroupby.median`) (:issue:`31485`)
- :meth:`DataFrame.at` and :meth:`Series.at` will raise a ``TypeError`` instead of a ``ValueError`` if an incompatible key is passed, and ``KeyError`` if a missing key is passed, matching the behavior of ``.loc[]`` (:issue:`31722`)
- Passing an integer dtype other than ``int64`` to ``np.array(period_index, dtype=...)`` will now raise ``TypeError`` instead of incorrectly using ``int64`` (:issue:`32255`)

``MultiIndex.get_indexer`` interprets ``method`` argument differently
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This restores the behavior of :meth:`MultiIndex.get_indexer` with ``method='backfill'`` or ``method='pad'`` to the behavior of pandas versions prior to 0.23.0. In particular, a ``MultiIndex`` is treated as a list of tuples, and padding or backfilling is done with respect to the ordering of this list of tuples (:issue:`29896`).

As an example of this, given:

.. ipython:: python

df = pd.DataFrame({
'a': [0, 0, 0, 0],
'b': [0, 2, 3, 4],
'c': ['A', 'B', 'C', 'D'],
}).set_index(['a', 'b'])
mi_2 = pd.MultiIndex.from_product([[0], [-1, 0, 1, 3, 4, 5]])

The differences in reindexing ``df`` with ``mi_2`` and using ``method='backfill'`` can be seen here:

*pandas >= 0.23, < 1.1.0*:

.. code-block:: ipython

In [1]: df.reindex(mi_2, method='backfill')
Out[1]:
      c
0 -1  A
   0  A
   1  D
   3  A
   4  A
   5  C

*pandas < 0.23, >= 1.1.0*:

.. ipython:: python

df.reindex(mi_2, method='backfill')

And the differences in reindexing ``df`` with ``mi_2`` and using ``method='pad'`` can be seen here:

*pandas >= 0.23, < 1.1.0*:

.. code-block:: ipython

In [1]: df.reindex(mi_2, method='pad')
Out[1]:
        c
0 -1  NaN
   0  NaN
   1    D
   3  NaN
   4    A
   5    C

*pandas < 0.23, >= 1.1.0*:

.. ipython:: python

df.reindex(mi_2, method='pad')
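
As an illustration of the restored behaviour at the indexer level (this example is not part of the diff; the expected outputs below follow the restored >= 1.1.0 semantics described above):

df.index.get_indexer(mi_2, method='backfill')
# expected: array([ 0,  0,  1,  2,  3, -1])
# (0, -1) and (0, 1) map to the next greater tuples, (0, 0) and (0, 2);
# (0, 5) has no greater tuple in df.index, so it maps to -1

df.index.get_indexer(mi_2, method='pad')
# expected: array([-1,  0,  0,  2,  3,  3])
# (0, -1) has no smaller tuple, so it maps to -1; (0, 5) falls back to (0, 4)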

-

.. _whatsnew_110.api_breaking.indexing_raises_key_errors:
120 changes: 104 additions & 16 deletions pandas/_libs/index.pyx
@@ -612,25 +612,113 @@ cdef class BaseMultiIndexCodesEngine:
in zip(self.levels, zip(*target))]
return self._codes_to_ints(np.array(level_codes, dtype='uint64').T)

def get_indexer(self, object target, object method=None,
object limit=None):
def get_indexer_no_fill(self, object target) -> np.ndarray:
"""
A reviewer (Contributor) commented:
why are you changing this api? can't you rather just call a helper function if you see method / limit?

ChrisRobo (author) replied on Feb 27, 2020:
rationale is that in order to do reindexing with filling (not necessary when not filling though), without using "carrying" logic, it's necessary to know the full set of tuples, meaning we have a few options, as far as I can tell:

(1) pass the tuples in from the MultiIndex class. The current implementation uses this; but I separated it for w/ and w/o filling to avoid adding new arguments to get_indexer().

(2) since the MultiIndex's _engine knows only of the levels and the int representation of the codes, we could reconstruct the tuples from this, after doing appropriate place-value logic (i.e. inverting self._codes_to_ints(), which is possible but a bit non-trivial), and then proceed with the algorithm in get_indexer_and_fill()

(3) have the MultiIndex's engine store a reference to the tuples as well, but I thought a change like that might be inappropriate for this PR. Happy to do whatever you prefer here!

Update:
some tinkering suggests (2) can be accomplished with something like

np.array([
    [
        (int_code if i == 0 else (int_code % (np.uint64(1) << self.offsets[i - 1]))) >> offset
        for i, offset in enumerate(offsets)
    ]
    for int_code in self.vgetter()
]) - 1

which, if used, would remove the need to pass in the tuples and thus change the interface of get_indexer() here in some way
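
For intuition on the place-value logic mentioned in option (2), here is a self-contained sketch of packing per-level codes into a single integer and inverting that packing. The offsets and helper names are made up for illustration and are not the engine's actual `self.offsets` or `_codes_to_ints()`:

# assume two levels whose (1-shifted) codes each fit in 8 bits
offsets = [8, 0]  # bit offset at which each level's codes are stored

def codes_to_int(codes):
    # per-level codes are shifted by +1 so that 0 can encode a missing value
    return sum((code + 1) << offset for code, offset in zip(codes, offsets))

def int_to_codes(packed):
    # invert the packing: mask away the higher levels, then shift back down
    codes = []
    for i, offset in enumerate(offsets):
        val = packed if i == 0 else packed % (1 << offsets[i - 1])
        codes.append((val >> offset) - 1)
    return tuple(codes)

codes_to_int((3, 5))                # 1030, i.e. ((3 + 1) << 8) | (5 + 1)
int_to_codes(codes_to_int((3, 5)))  # (3, 5)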

Returns an array giving the positions of each value of `target` in
`self.values`, where -1 represents a value in `target` which does not
appear in `self.values`

Parameters
----------
target : list-like of keys
Each key is a tuple, with a label for each level of the index

Returns
-------
np.ndarray[int64_t, ndim=1] of the indexer of `target` into
`self.values`
"""
lab_ints = self._extract_level_codes(target)
return self._base.get_indexer(self, lab_ints)

# All methods (exact, backfill, pad) directly map to the respective
# methods of the underlying (integers) index...
if method is not None:
# but underlying backfill and pad methods require index and keys
# to be sorted. The index already is (checked in
# Index._get_fill_indexer), sort (integer representations of) keys:
order = np.argsort(lab_ints)
lab_ints = lab_ints[order]
indexer = (getattr(self._base, f'get_{method}_indexer')
(self, lab_ints, limit=limit))
indexer = indexer[order]
else:
indexer = self._base.get_indexer(self, lab_ints)

return indexer

def get_indexer(self, object target, object values = None,
object method = None, object limit = None) -> np.ndarray:
"""
Returns an array giving the positions of each value of `target` in
`values`, where -1 represents a value in `target` which does not
appear in `values`

If `method` is "backfill" then the position for a value in `target`
which does not appear in `values` is that of the next greater value
in `values` (if one exists), and -1 if there is no such value.

Similarly, if the method is "pad" then the position for a value in
`target` which does not appear in `values` is that of the next smaller
value in `values` (if one exists), and -1 if there is no such value.

Parameters
----------
target: list-like of tuples
need not be sorted, but all must have the same length, which must be
the same as the length of all tuples in `values`
values : list-like of tuples
must be sorted and all have the same length. Should be the set of
the MultiIndex's values. Needed only if `method` is not None
method: string
"backfill" or "pad"
limit: int, optional
if provided, limit the number of fills to this value

Returns
-------
np.ndarray[int64_t, ndim=1] of the indexer of `target` into `values`,
filled with the `method` (and optionally `limit`) specified
"""
if method is None:
return self.get_indexer_no_fill(target)

assert method in ("backfill", "pad")
cdef:
int64_t i, j, next_code
int64_t num_values, num_target_values
ndarray[int64_t, ndim=1] target_order
ndarray[object, ndim=1] target_values
ndarray[int64_t, ndim=1] new_codes, new_target_codes
ndarray[int64_t, ndim=1] sorted_indexer

target_order = np.argsort(target.values).astype('int64')
target_values = target.values[target_order]
num_values, num_target_values = len(values), len(target_values)
new_codes, new_target_codes = (
np.empty((num_values,)).astype('int64'),
np.empty((num_target_values,)).astype('int64'),
)

# `values` and `target_values` are both sorted, so we walk through them
# and memoize the (ordered) set of indices in the (implicit) merged-and
# sorted list of the two which belong to each of them
# the effect of this is to create a factorization for the (sorted)
# merger of the index values, where `new_codes` and `new_target_codes`
# are the subset of the factors which appear in `values` and `target`,
# respectively
i, j, next_code = 0, 0, 0
while i < num_values and j < num_target_values:
A reviewer (Member) commented:
in python-space a while loop is slower than a for loop. no idea if it matters in cython

ChrisRobo (author) replied:
As far as I can tell, it's just a question of how it compiles. I'd guess that a for-loop with break statements would compile similarly, and that the only difference here would be the location of the jne statements. If you'd like, though, I can either profile it and/or send you a diff of the relevant objdump snippets?

val, target_val = values[i], target_values[j]
if val <= target_val:
new_codes[i] = next_code
i += 1
if target_val <= val:
new_target_codes[j] = next_code
j += 1
next_code += 1

# at this point, at least one should have reached the end
# the remaining values of the other should be added to the end
assert i == num_values or j == num_target_values
while i < num_values:
new_codes[i] = next_code
i += 1
next_code += 1
while j < num_target_values:
new_target_codes[j] = next_code
j += 1
next_code += 1

# get the indexer, and undo the sorting of `target.values`
sorted_indexer = (
algos.backfill if method == "backfill" else algos.pad
)(new_codes, new_target_codes, limit=limit).astype('int64')
return sorted_indexer[np.argsort(target_order)]

def get_loc(self, object key):
if is_definitely_invalid_key(key):
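
For readers following the filling path added above, here is a plain-Python sketch of the same merge-and-factorize idea: walk the two sorted tuple lists once, assign each slot of their implicit merged ordering an integer code, backfill or pad on those codes, and finally undo the sort of the target. The function name is invented for illustration, and unlike the Cython implementation this sketch ignores `limit`:

import numpy as np

def fill_indexer_sketch(values, target, method="backfill"):
    # values: sorted list of tuples (the MultiIndex's values)
    # target: list of tuples to locate in values; need not be sorted
    order = sorted(range(len(target)), key=lambda k: target[k])
    target_sorted = [target[k] for k in order]

    # one pass over both sorted lists: give every position in the implicit
    # merged ordering an integer code
    codes, target_codes = [], []
    i = j = next_code = 0
    while i < len(values) and j < len(target_sorted):
        val, target_val = values[i], target_sorted[j]
        if val <= target_val:
            codes.append(next_code)
            i += 1
        if target_val <= val:
            target_codes.append(next_code)
            j += 1
        next_code += 1
    for _ in range(i, len(values)):
        codes.append(next_code)
        next_code += 1
    for _ in range(j, len(target_sorted)):
        target_codes.append(next_code)
        next_code += 1

    # backfill/pad on the integer codes (the Cython version delegates this
    # step to algos.backfill / algos.pad)
    codes = np.asarray(codes)
    target_codes = np.asarray(target_codes)
    if method == "backfill":
        pos = np.searchsorted(codes, target_codes, side="left")
        filled = np.where(pos < len(codes), pos, -1)
    else:  # "pad"
        filled = np.searchsorted(codes, target_codes, side="right") - 1

    # undo the sorting of target
    out = np.empty(len(target), dtype=np.int64)
    out[order] = filled
    return out

mi_values = [(0, 0), (0, 2), (0, 3), (0, 4)]
new_labels = [(0, -1), (0, 0), (0, 1), (0, 3), (0, 4), (0, 5)]
fill_indexer_sketch(mi_values, new_labels, method="backfill")
# array([ 0,  0,  1,  2,  3, -1])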
4 changes: 3 additions & 1 deletion pandas/core/indexes/multi.py
@@ -2455,7 +2455,9 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):
raise NotImplementedError(
"tolerance not implemented yet for MultiIndex"
)
indexer = self._engine.get_indexer(target, method, limit)
indexer = self._engine.get_indexer(
values=self.values, target=target, method=method, limit=limit
)
elif method == "nearest":
raise NotImplementedError(
"method='nearest' not implemented yet "
75 changes: 75 additions & 0 deletions pandas/tests/frame/indexing/test_indexing.py
@@ -1432,6 +1432,81 @@ def test_set_value_resize(self, float_frame):
with pytest.raises(ValueError, match=msg):
res._set_value("foobar", "baz", "sam")

def test_reindex_with_multi_index(self):
A reviewer (Contributor) commented:
@jbrockmendel we should create a test_multi.py in this dir (@ChrisRobo you can do this if you want here)

Another reviewer (Member) replied:
we're moving towards organizing by method instead of by index subclass, so this will eventually live in a test_reindex file (there are lots of reindex tests in this file, so this is fine for the time being. before long ill batch-move them)

# https://github.com/pandas-dev/pandas/issues/29896
# tests for reindexing a multi-indexed DataFrame with a new MultiIndex
#
# confirms that we can reindex a multi-indexed DataFrame with a new
# MultiIndex object correctly when using no filling, backfilling, and
# padding
#
# The DataFrame, `df`, used in this test is:
# c
# a b
# -1 0 A
# 1 B
# 2 C
# 3 D
# 4 E
# 5 F
# 6 G
# 0 0 A
# 1 B
# 2 C
# 3 D
# 4 E
# 5 F
# 6 G
# 1 0 A
# 1 B
# 2 C
# 3 D
# 4 E
# 5 F
# 6 G
#
# and the other MultiIndex, `new_multi_index`, is:
# 0: 0 0.5
# 1: 2.0
# 2: 5.0
# 3: 5.8
df = pd.DataFrame(
{
"a": [-1] * 7 + [0] * 7 + [1] * 7,
"b": list(range(7)) * 3,
"c": ["A", "B", "C", "D", "E", "F", "G"] * 3,
}
).set_index(["a", "b"])
new_index = [0.5, 2.0, 5.0, 5.8]
new_multi_index = MultiIndex.from_product([[0], new_index], names=["a", "b"])

# reindexing w/o a `method` value
reindexed = df.reindex(new_multi_index)
expected = pd.DataFrame(
{"a": [0] * 4, "b": new_index, "c": [np.nan, "C", "F", np.nan]}
).set_index(["a", "b"])
tm.assert_frame_equal(expected, reindexed)

# reindexing with backfilling
expected = pd.DataFrame(
{"a": [0] * 4, "b": new_index, "c": ["B", "C", "F", "G"]}
).set_index(["a", "b"])
reindexed_with_backfilling = df.reindex(new_multi_index, method="bfill")
tm.assert_frame_equal(expected, reindexed_with_backfilling)

reindexed_with_backfilling = df.reindex(new_multi_index, method="backfill")
tm.assert_frame_equal(expected, reindexed_with_backfilling)

# reindexing with padding
expected = pd.DataFrame(
{"a": [0] * 4, "b": new_index, "c": ["A", "C", "F", "F"]}
).set_index(["a", "b"])
reindexed_with_padding = df.reindex(new_multi_index, method="pad")
tm.assert_frame_equal(expected, reindexed_with_padding)

reindexed_with_padding = df.reindex(new_multi_index, method="ffill")
tm.assert_frame_equal(expected, reindexed_with_padding)

def test_set_value_with_index_dtype_change(self):
df_orig = DataFrame(np.random.randn(3, 3), index=range(3), columns=list("ABC"))
