-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Fix reindexing with multi-indexed DataFrames #30766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
da10a2c
dc96151
fec078f
3e6ca9d
fbe7684
aeccfed
1472e64
44e30f0
1394ebf
df68ca9
bba7e51
3f3b799
3457192
5f16fc8
7241973
917a140
3ea4bae
c52d87a
d72bdd6
6b2edd8
9014f96
09adf2c
f994286
734b685
f69cd13
e91b2cb
33015a3
590c40f
701ca63
c3c811e
f5f5531
180d901
edb2e6c
a3908ea
f5ef69f
4bc36a2
fff0223
4f80ced
3e825d9
83b9830
5349af5
9753062
ab36e67
3231481
5aa7fea
7e4e975
113e3b3
881bc1b
7836cc9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -612,25 +612,113 @@ cdef class BaseMultiIndexCodesEngine: | |
in zip(self.levels, zip(*target))] | ||
return self._codes_to_ints(np.array(level_codes, dtype='uint64').T) | ||
|
||
def get_indexer(self, object target, object method=None, | ||
object limit=None): | ||
def get_indexer_no_fill(self, object target) -> np.ndarray: | ||
""" | ||
Returns an array giving the positions of each value of `target` in | ||
`self.values`, where -1 represents a value in `target` which does not | ||
appear in `self.values` | ||
|
||
Parameters | ||
---------- | ||
target : list-like of keys | ||
Each key is a tuple, with a label for each level of the index | ||
|
||
Returns | ||
------- | ||
np.ndarray[int64_t, ndim=1] of the indexer of `target` into | ||
`self.values` | ||
""" | ||
lab_ints = self._extract_level_codes(target) | ||
return self._base.get_indexer(self, lab_ints) | ||
|
||
# All methods (exact, backfill, pad) directly map to the respective | ||
# methods of the underlying (integers) index... | ||
if method is not None: | ||
# but underlying backfill and pad methods require index and keys | ||
# to be sorted. The index already is (checked in | ||
# Index._get_fill_indexer), sort (integer representations of) keys: | ||
order = np.argsort(lab_ints) | ||
lab_ints = lab_ints[order] | ||
indexer = (getattr(self._base, f'get_{method}_indexer') | ||
(self, lab_ints, limit=limit)) | ||
indexer = indexer[order] | ||
else: | ||
indexer = self._base.get_indexer(self, lab_ints) | ||
def get_indexer(self, object target, object values = None, | ||
object method = None, object limit = None) -> np.ndarray: | ||
""" | ||
Returns an array giving the positions of each value of `target` in | ||
`values`, where -1 represents a value in `target` which does not | ||
appear in `values` | ||
|
||
return indexer | ||
If `method` is "backfill" then the position for a value in `target` | ||
which does not appear in `values` is that of the next greater value | ||
in `values` (if one exists), and -1 if there is no such value. | ||
|
||
Similarly, if the method is "pad" then the position for a value in | ||
`target` which does not appear in `values` is that of the next smaller | ||
value in `values` (if one exists), and -1 if there is no such value. | ||
|
||
Parameters | ||
---------- | ||
target: list-like of tuples | ||
need not be sorted, but all must have the same length, which must be | ||
the same as the length of all tuples in `values` | ||
values : list-like of tuples | ||
must be sorted and all have the same length. Should be the set of | ||
the MultiIndex's values. Needed only if `method` is not None | ||
method: string | ||
"backfill" or "pad" | ||
limit: int, optional | ||
if provided, limit the number of fills to this value | ||
|
||
Returns | ||
------- | ||
np.ndarray[int64_t, ndim=1] of the indexer of `target` into `values`, | ||
filled with the `method` (and optionally `limit`) specified | ||
""" | ||
if method is None: | ||
return self.get_indexer_no_fill(target) | ||
|
||
assert method in ("backfill", "pad") | ||
cdef: | ||
int64_t i, j, next_code | ||
int64_t num_values, num_target_values | ||
ndarray[int64_t, ndim=1] target_order | ||
ndarray[object, ndim=1] target_values | ||
ndarray[int64_t, ndim=1] new_codes, new_target_codes | ||
ndarray[int64_t, ndim=1] sorted_indexer | ||
|
||
target_order = np.argsort(target.values).astype('int64') | ||
target_values = target.values[target_order] | ||
num_values, num_target_values = len(values), len(target_values) | ||
new_codes, new_target_codes = ( | ||
np.empty((num_values,)).astype('int64'), | ||
np.empty((num_target_values,)).astype('int64'), | ||
) | ||
|
||
# `values` and `target_values` are both sorted, so we walk through them | ||
# and memoize the (ordered) set of indices in the (implicit) merged-and | ||
# sorted list of the two which belong to each of them | ||
# the effect of this is to create a factorization for the (sorted) | ||
# merger of the index values, where `new_codes` and `new_target_codes` | ||
# are the subset of the factors which appear in `values` and `target`, | ||
# respectively | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
i, j, next_code = 0, 0, 0 | ||
while i < num_values and j < num_target_values: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. in python-space a while loop is slower than a for loop. no idea if it matters in cython There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As far as I can tell, it's just a question of how it compiles. I'd guess that a for-loop with |
||
val, target_val = values[i], target_values[j] | ||
if val <= target_val: | ||
new_codes[i] = next_code | ||
i += 1 | ||
if target_val <= val: | ||
new_target_codes[j] = next_code | ||
j += 1 | ||
next_code += 1 | ||
|
||
# at this point, at least one should have reached the end | ||
ChrisRobo marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# the remaining values of the other should be added to the end | ||
assert i == num_values or j == num_target_values | ||
while i < num_values: | ||
new_codes[i] = next_code | ||
i += 1 | ||
next_code += 1 | ||
while j < num_target_values: | ||
new_target_codes[j] = next_code | ||
j += 1 | ||
next_code += 1 | ||
|
||
# get the indexer, and undo the sorting of `target.values` | ||
sorted_indexer = ( | ||
algos.backfill if method == "backfill" else algos.pad | ||
)(new_codes, new_target_codes, limit=limit).astype('int64') | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return sorted_indexer[np.argsort(target_order)] | ||
|
||
def get_loc(self, object key): | ||
if is_definitely_invalid_key(key): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1432,6 +1432,81 @@ def test_set_value_resize(self, float_frame): | |
with pytest.raises(ValueError, match=msg): | ||
res._set_value("foobar", "baz", "sam") | ||
|
||
def test_reindex_with_multi_index(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbrockmendel we should create a test_multi.py in this dir (@ChrisRobo you can do this if you want here) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we're moving towards organizing by method instead of by index subclass, so this will eventually live in a test_reindex file (there are lots of reindex tests in this file, so this is fine for the time being. before long ill batch-move them) |
||
# https://github.com/pandas-dev/pandas/issues/29896 | ||
# tests for reindexing a multi-indexed DataFrame with a new MultiIndex | ||
# | ||
# confirms that we can reindex a multi-indexed DataFrame with a new | ||
# MultiIndex object correctly when using no filling, backfilling, and | ||
# padding | ||
# | ||
# The DataFrame, `df`, used in this test is: | ||
# c | ||
# a b | ||
# -1 0 A | ||
# 1 B | ||
# 2 C | ||
# 3 D | ||
# 4 E | ||
# 5 F | ||
# 6 G | ||
# 0 0 A | ||
# 1 B | ||
# 2 C | ||
# 3 D | ||
# 4 E | ||
# 5 F | ||
# 6 G | ||
# 1 0 A | ||
# 1 B | ||
# 2 C | ||
# 3 D | ||
# 4 E | ||
# 5 F | ||
# 6 G | ||
# | ||
# and the other MultiIndex, `new_multi_index`, is: | ||
# 0: 0 0.5 | ||
# 1: 2.0 | ||
# 2: 5.0 | ||
# 3: 5.8 | ||
df = pd.DataFrame( | ||
{ | ||
"a": [-1] * 7 + [0] * 7 + [1] * 7, | ||
"b": list(range(7)) * 3, | ||
"c": ["A", "B", "C", "D", "E", "F", "G"] * 3, | ||
} | ||
).set_index(["a", "b"]) | ||
new_index = [0.5, 2.0, 5.0, 5.8] | ||
new_multi_index = MultiIndex.from_product([[0], new_index], names=["a", "b"]) | ||
|
||
# reindexing w/o a `method` value | ||
reindexed = df.reindex(new_multi_index) | ||
expected = pd.DataFrame( | ||
{"a": [0] * 4, "b": new_index, "c": [np.nan, "C", "F", np.nan]} | ||
).set_index(["a", "b"]) | ||
tm.assert_frame_equal(expected, reindexed) | ||
|
||
# reindexing with backfilling | ||
expected = pd.DataFrame( | ||
{"a": [0] * 4, "b": new_index, "c": ["B", "C", "F", "G"]} | ||
).set_index(["a", "b"]) | ||
reindexed_with_backfilling = df.reindex(new_multi_index, method="bfill") | ||
tm.assert_frame_equal(expected, reindexed_with_backfilling) | ||
|
||
reindexed_with_backfilling = df.reindex(new_multi_index, method="backfill") | ||
tm.assert_frame_equal(expected, reindexed_with_backfilling) | ||
|
||
# reindexing with padding | ||
expected = pd.DataFrame( | ||
{"a": [0] * 4, "b": new_index, "c": ["A", "C", "F", "F"]} | ||
).set_index(["a", "b"]) | ||
reindexed_with_padding = df.reindex(new_multi_index, method="pad") | ||
tm.assert_frame_equal(expected, reindexed_with_padding) | ||
|
||
reindexed_with_padding = df.reindex(new_multi_index, method="ffill") | ||
tm.assert_frame_equal(expected, reindexed_with_padding) | ||
|
||
def test_set_value_with_index_dtype_change(self): | ||
df_orig = DataFrame(np.random.randn(3, 3), index=range(3), columns=list("ABC")) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you changing this api? can't you rather just call a helper function if you see method / limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rationale is that in order to do reindexing with filling (not necessary when not filling though), without using "carrying" logic, it's necessary to know the full set of tuples, meaning we have a few options, as far as I can tell:
(1) pass the tuples in from the
MultiIndex
class. The current implementation uses this; but I separated it for w/ and w/o filling to avoid adding new arguments toget_indexer()
.(2) since the MultiIndex's
_engine
knows only of the levels and the int representation of the codes, we could reconstruct the tuples from this, after doing appropriate place-value logic (i.e. invertingself._codes_to_ints()
, which is possible but a bit non-trivial), and then proceed with the algorithm inget_indexer_and_fill()
(3) have the MultiIndex's engine store a reference to the tuples as well, but I thought a change like that might be inappropriate for this PR. Happy to do whatever you prefer here!
Update:
some tinkering suggests (2) can be accomplished with something like
which, if used, would remove the need to pass in the tuples and thus change the interface of
get_indexer()
here in some way