Skip to content

Fix multiindex loc nan #43943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 51 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
51 commits
Select commit Hold shift + click to select a range
549386c
fix problem without adding tests
CloseChoice Oct 9, 2021
81970c7
add test, still not all tests running
CloseChoice Oct 9, 2021
8443c62
remove print and debug statements
CloseChoice Oct 10, 2021
9eae773
Fixes from pre-commit [automated commit]
CloseChoice Oct 10, 2021
c497e34
add new test
CloseChoice Oct 10, 2021
fde027b
update test as desired in PR comments
CloseChoice Oct 10, 2021
0ef5d04
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Oct 10, 2021
246a421
add whatsnew entry
CloseChoice Oct 10, 2021
bbeeb1b
fix wrong issue number
CloseChoice Oct 10, 2021
6b38ba4
save nan encoding in new state variable and replace it before returni…
CloseChoice Oct 12, 2021
d04b7b8
remove wrong whatsnew entry
CloseChoice Oct 12, 2021
cc75b44
remove debug statement
CloseChoice Oct 12, 2021
c88b2d8
add whatsnew entry
CloseChoice Oct 13, 2021
0a3ce38
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Oct 13, 2021
ed537cc
Merge branch 'master' into FIX-multiindex-loc-nan
CloseChoice Oct 17, 2021
77e7283
fix wrong code quotes in whatsnew
CloseChoice Oct 17, 2021
72fb026
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
cf0355d
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
427a515
Fixes from pre-commit [automated commit]
CloseChoice Oct 22, 2021
29c37b6
fix problem without adding tests
CloseChoice Oct 9, 2021
d274bca
add test, still not all tests running
CloseChoice Oct 9, 2021
f72cbf7
remove print and debug statements
CloseChoice Oct 10, 2021
7ab0c9c
Fixes from pre-commit [automated commit]
CloseChoice Oct 10, 2021
a758d44
add new test
CloseChoice Oct 10, 2021
c3ec7f0
update test as desired in PR comments
CloseChoice Oct 10, 2021
245b141
add whatsnew entry
CloseChoice Oct 10, 2021
ae706c8
fix wrong issue number
CloseChoice Oct 10, 2021
30097ca
save nan encoding in new state variable and replace it before returni…
CloseChoice Oct 12, 2021
ef951a6
remove wrong whatsnew entry
CloseChoice Oct 12, 2021
0fa1784
remove debug statement
CloseChoice Oct 12, 2021
1f78023
add whatsnew entry
CloseChoice Oct 13, 2021
5b3c61d
fix wrong code quotes in whatsnew
CloseChoice Oct 17, 2021
98c8f13
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
ec218d5
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
d92cc60
Merge branch 'master' into FIX-multiindex-loc-nan
CloseChoice Nov 6, 2021
4dda020
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Nov 6, 2021
6850b73
add tests for issue 36060
CloseChoice Nov 6, 2021
e7a52e3
Fixes from pre-commit [automated commit]
CloseChoice Nov 6, 2021
9682350
changes according to PR discussions
CloseChoice Nov 7, 2021
0ad2583
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Nov 7, 2021
cb92f95
use assert_series_equal in test_groupby_dropna
CloseChoice Nov 7, 2021
e3c734d
make _has_na_placeholder cache_readonly
CloseChoice Nov 10, 2021
cf48e70
make _has_na_placeholder cache_readonly
CloseChoice Nov 10, 2021
49b1d65
Fixes from pre-commit [automated commit]
CloseChoice Nov 10, 2021
d00b39f
remove commented out stuff
CloseChoice Nov 10, 2021
a97ef1a
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Nov 10, 2021
d180901
remove unnecessary import
CloseChoice Nov 10, 2021
cc6af9d
WIP: intermediate commit for loop solution
CloseChoice Nov 16, 2021
02bf699
changes for static analysis checks
CloseChoice Nov 16, 2021
7211984
Merge branch 'master' of github.com:pandas-dev/pandas into FIX-multii…
CloseChoice Dec 8, 2021
4520444
fix imports
CloseChoice Dec 8, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions pandas/_libs/hashtable.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,7 @@ cdef class ObjectFactorizer(Factorizer):
uniques = ObjectVector()
uniques.extend(self.uniques.to_array())
self.uniques = uniques
print('WE ARE IN FACTORIZE')
labels = self.table.get_labels(values, self.uniques,
self.count, na_sentinel, na_value)
mask = (labels == na_sentinel)
Expand Down
20 changes: 18 additions & 2 deletions pandas/_libs/index.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,7 @@ cdef class IndexEngine:
# We assume before we get here:
# - val is hashable
self._ensure_mapping_populated()
#print('THIS IS THE MAPPING ' + str(self.mapping))
return val in self.mapping

cpdef get_loc(self, object val):
Expand Down Expand Up @@ -158,6 +159,8 @@ cdef class IndexEngine:
return self._get_loc_duplicates(val)

try:
#print('THIS IS val: ' + str(val))
#print('THIS IS MAPPING: ' + str(self.mapping))
return self.mapping.get_item(val)
except OverflowError as err:
# GH#41775 OverflowError e.g. if we are uint64 and val is -1
Expand Down Expand Up @@ -594,13 +597,16 @@ cdef class BaseMultiIndexCodesEngine:

# Transform labels in a single array, and add 1 so that we are working
# with positive integers (-1 for NaN becomes 0):
#print('\nTHIS ARE LABELS IN INITIALIZATION OF BASEMULTIINDEXCODESENGINE: ' + str(labels) + " THIS IS THE TYPE OF LABELS: " + str(type(labels)))
codes = (np.array(labels, dtype='int64').T + 1).astype('uint64',
copy=False)
#print('\nTHIS ARE THE CODES: ' + str(codes) + " THIS IS THE TYPE OF CODES: " + str(type(codes)) + '\n')

# Map each codes combination in the index to an integer unambiguously
# (no collisions possible), based on the "offsets", which describe the
# number of bits to switch labels for each level:
lab_ints = self._codes_to_ints(codes)
#print('\nTHIS ARE THE LAB INTS: ' + str(lab_ints) + " THIS IS THE TYPE OF LAB INTS: " + str(type(codes)) + '\n')

# Initialize underlying index (e.g. libindex.UInt64Engine) with
# integers representing labels: we will use its get_loc and get_indexer
Expand Down Expand Up @@ -731,13 +737,23 @@ cdef class BaseMultiIndexCodesEngine:
return sorted_indexer[np.argsort(target_order)]

def get_loc(self, object key):
#print('get_loc, we want to get key' + str(key))
if is_definitely_invalid_key(key):
raise TypeError(f"'{key}' is an invalid key")
if not isinstance(key, tuple):
raise KeyError(key)
# print('LEVELS: ' + str(type(self.levels[0])) + 'KEYS: ' + str(key))
try:
indices = [0 if checknull(v) else lev.get_loc(v) + 1
for lev, v in zip(self.levels, key)]
# indices = [0 if checknull(v) else lev.get_loc(v) + 1
# for lev, v in zip(self.levels, key)]
indices = []
for key, lev in zip(key, self.levels):
try:
indices.append(lev.get_loc(key) + 1)
except KeyError:
indices.append(0)
# indices = [lev.get_loc(v) + 1
# for lev, v in zip(self.levels, key)]
except KeyError:
raise KeyError(key)

Expand Down
3 changes: 3 additions & 0 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1835,6 +1835,7 @@ def _get_label_or_level_values(self, key: str, axis: int = 0) -> np.ndarray:
axis = self._get_axis_number(axis)
other_axes = [ax for ax in range(self._AXIS_LEN) if ax != axis]

# import pdb; pdb.set_trace()
if self._is_label_reference(key, axis=axis):
self._check_label_or_level_ambiguity(key, axis=axis)
values = self.xs(key, axis=other_axes[0])._values
Expand Down Expand Up @@ -3846,6 +3847,8 @@ class animal locomotion
self._consolidate_inplace()

if isinstance(index, MultiIndex):
#import pdb; pdb.set_trace()
import traceback as tb;
loc, new_index = index._get_loc_level(key, level=0)
if not drop_level:
if lib.is_integer(loc):
Expand Down
5 changes: 5 additions & 0 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -3460,8 +3460,13 @@ def get_loc(self, key, method=None, tolerance=None):
"tolerance argument only valid if using pad, "
"backfill or nearest lookups"
)
# print(f'key before casting {key}')
casted_key = self._maybe_cast_indexer(key)
try:
#import pdb; pdb.set_trace()
# print(f'casted key {casted_key}')
# import pdb; pdb.set_trace()
# import traceback as tb; tb.print_stack()
return self._engine.get_loc(casted_key)
except KeyError as err:
raise KeyError(key) from err
Expand Down
15 changes: 14 additions & 1 deletion pandas/core/indexes/multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -132,7 +132,15 @@ def _codes_to_ints(self, codes):
"""
# Shift the representation of each level by the pre-calculated number
# of bits:
#print(f'codes {codes}, self.offsets {self.offsets}')
#import pdb; pdb.set_trace()
codes <<= self.offsets
#if len(codes.shape) == 1 and list(codes) == [4,0]:
# import traceback as tb; tb.print_stack()
#codes = np.array([4, 2])
#import pdb; pdb.set_trace()
#print('hi')
#print(f'codes after shifting {codes}')

# Now sum and OR are in fact interchangeable. This is a simple
# composition of the (disjunct) significant bits of each level (i.e.
Expand Down Expand Up @@ -3001,7 +3009,12 @@ def maybe_mi_droplevels(indexer, levels):
if len(key) == self.nlevels and self.is_unique:
# Complete key in unique index -> standard get_loc
try:
return (self._engine.get_loc(key), None)
engine = self._engine
#import pdb; pdb.set_trace()
# print(f'key before handing it to engine {engine} is key: {key}')
# import pdb; pdb.set_trace()
loc_res = engine.get_loc(key)
return (loc_res, None)
except KeyError as err:
raise KeyError(key) from err
except TypeError:
Expand Down
1 change: 1 addition & 0 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -1110,6 +1110,7 @@ def _get_label(self, label, axis: int):

def _handle_lowerdim_multi_index_axis0(self, tup: tuple):
# we have an axis0 multi-index, handle or raise
#import pdb; pdb.set_trace()
axis = self.axis or 0
try:
# fast path for series or for tup devoid of slices
Expand Down
1 change: 1 addition & 0 deletions pandas/tests/indexes/multi/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -625,6 +625,7 @@ def test_get_loc_cast_bool(self):
# GH 19086 : int is casted to bool, but not vice-versa
levels = [[False, True], np.arange(2, dtype="int64")]
idx = MultiIndex.from_product(levels)
import pdb; pdb.set_trace()

assert idx.get_loc((0, 1)) == 1
assert idx.get_loc((1, 0)) == 2
Expand Down
14 changes: 14 additions & 0 deletions pandas/tests/indexing/multiindex/test_getitem.py
Original file line number Diff line number Diff line change
Expand Up @@ -392,3 +392,17 @@ def test_loc_empty_multiindex():
result = df
expected = DataFrame([1, 2, 3, 4], index=index, columns=["value"])
tm.assert_frame_equal(result, expected)

def test_loc_nan_multiindex():
df = DataFrame(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are declaring this a regression, we should maybe have a test that previously passed. In what previous version of pandas did this test pass?

{
"temp_playlist": [0, 0, 0, 0],
"objId": ["o1", np.nan, "o1", np.nan],
"x": [1, 2, 3, 4],
}
)

agg_df = df.groupby(by=['temp_playlist', 'objId'], dropna=False)["x"].agg(list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the test is for MultiIndex.get_loc, then test that directly; shouldn't depend on groupby

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's quite strange but I can't reproduce this without the groupby. When doing something like

    arrays = [[0, 0], ["o1", np.nan]]
    names = ("temp_playlist", "objId")
    index = MultiIndex.from_arrays(arrays, names=names)
    srs = Series([4, 6], index=index)
    result = srs.loc[(0, np.nan)]

I don't get the expected key error on master as I get when using:

    df = DataFrame(
        {
            "temp_playlist": [0, 0, 0, 0],
            "objId": ["o1", np.nan, "o1", np.nan],
            "x": [1, 2, 3, 4],
        }
    )

    agg_df = df.groupby(by=['temp_playlist', 'objId'], dropna=False)["x"].sum()
    result = agg_df.loc[agg_df.index[-1]]

when comparing agg_df with srs using tm.assert_series_equal this results in

*** AssertionError: Series.index are different

Attribute "inferred_type" are different
[left]:  string
[right]: mixed

Any idea on this?

result = agg_df.loc[agg_df.index[-1]]
expected = [2, 4]
assert result == expected