Fix multiindex loc nan #43943

Closed
wants to merge 51 commits into from
51 commits
549386c
fix problem without adding tests
CloseChoice Oct 9, 2021
81970c7
add test, still not all tests running
CloseChoice Oct 9, 2021
8443c62
remove print and debug statements
CloseChoice Oct 10, 2021
9eae773
Fixes from pre-commit [automated commit]
CloseChoice Oct 10, 2021
c497e34
add new test
CloseChoice Oct 10, 2021
fde027b
update test as desired in PR comments
CloseChoice Oct 10, 2021
0ef5d04
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Oct 10, 2021
246a421
add whatsnew entry
CloseChoice Oct 10, 2021
bbeeb1b
fix wrong issue number
CloseChoice Oct 10, 2021
6b38ba4
save nan encoding in new state variable and replace it before returni…
CloseChoice Oct 12, 2021
d04b7b8
remove wrong whatsnew entry
CloseChoice Oct 12, 2021
cc75b44
remove debug statement
CloseChoice Oct 12, 2021
c88b2d8
add whatsnew entry
CloseChoice Oct 13, 2021
0a3ce38
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Oct 13, 2021
ed537cc
Merge branch 'master' into FIX-multiindex-loc-nan
CloseChoice Oct 17, 2021
77e7283
fix wrong code quotes in whatsnew
CloseChoice Oct 17, 2021
72fb026
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
cf0355d
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
427a515
Fixes from pre-commit [automated commit]
CloseChoice Oct 22, 2021
29c37b6
fix problem without adding tests
CloseChoice Oct 9, 2021
d274bca
add test, still not all tests running
CloseChoice Oct 9, 2021
f72cbf7
remove print and debug statements
CloseChoice Oct 10, 2021
7ab0c9c
Fixes from pre-commit [automated commit]
CloseChoice Oct 10, 2021
a758d44
add new test
CloseChoice Oct 10, 2021
c3ec7f0
update test as desired in PR comments
CloseChoice Oct 10, 2021
245b141
add whatsnew entry
CloseChoice Oct 10, 2021
ae706c8
fix wrong issue number
CloseChoice Oct 10, 2021
30097ca
save nan encoding in new state variable and replace it before returni…
CloseChoice Oct 12, 2021
ef951a6
remove wrong whatsnew entry
CloseChoice Oct 12, 2021
0fa1784
remove debug statement
CloseChoice Oct 12, 2021
1f78023
add whatsnew entry
CloseChoice Oct 13, 2021
5b3c61d
fix wrong code quotes in whatsnew
CloseChoice Oct 17, 2021
98c8f13
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
ec218d5
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
d92cc60
Merge branch 'master' into FIX-multiindex-loc-nan
CloseChoice Nov 6, 2021
4dda020
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Nov 6, 2021
6850b73
add tests for issue 36060
CloseChoice Nov 6, 2021
e7a52e3
Fixes from pre-commit [automated commit]
CloseChoice Nov 6, 2021
9682350
changes according to PR discussions
CloseChoice Nov 7, 2021
0ad2583
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Nov 7, 2021
cb92f95
use assert_series_equal in test_groupby_dropna
CloseChoice Nov 7, 2021
e3c734d
make _has_na_placeholder cache_readonly
CloseChoice Nov 10, 2021
cf48e70
make _has_na_placeholder cache_readonly
CloseChoice Nov 10, 2021
49b1d65
Fixes from pre-commit [automated commit]
CloseChoice Nov 10, 2021
d00b39f
remove commented out stuff
CloseChoice Nov 10, 2021
a97ef1a
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Nov 10, 2021
d180901
remove unnecessary import
CloseChoice Nov 10, 2021
cc6af9d
WIP: intermediate commit for loop solution
CloseChoice Nov 16, 2021
02bf699
changes for static analysis checks
CloseChoice Nov 16, 2021
7211984
Merge branch 'master' of github.com:pandas-dev/pandas into FIX-multii…
CloseChoice Dec 8, 2021
4520444
fix imports
CloseChoice Dec 8, 2021
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -784,6 +784,7 @@ Groupby/resample/rolling
 - Bug in :meth:`GroupBy.nth` failing on ``axis=1`` (:issue:`43926`)
 - Fixed bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not respecting right bound on centered datetime-like windows, if the index contain duplicates (:issue:`3944`)
 - Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` when using a :class:`pandas.api.indexers.BaseIndexer` subclass that returned unequal start and end arrays would segfault instead of raising a ``ValueError`` (:issue:`44470`)
+- Bug in :meth:`DataFrame.groupby` when grouping on multiple columns where at least one includes ``np.nan`` which resulted in a ``KeyError`` when the ``np.nan`` containing index was selected with :meth:`Series.loc` (:issue:`43814`)

Contributor:
If you can, move this to 1.5.


Reshaping
^^^^^^^^^
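
The scenario in the whatsnew entry can be reproduced with a small frame borrowed from the test added in this PR. This is a sketch, not part of the diff. Whether the final lookup raises `KeyError` or returns the group sum depends on the installed pandas version, so the example catches the error and makes no claim about which branch runs.

```python
import numpy as np
import pandas as pd

# Frame taken from the test added in this PR (GH 43814).
df = pd.DataFrame(
    {
        "temp_playlist": [0, 0, 0, 0],
        "objId": ["o1", np.nan, "o1", np.nan],
        "x": [1, 2, 3, 4],
    }
)

# Keep the NaN group by passing dropna=False.
result = df.groupby(["temp_playlist", "objId"], dropna=False)["x"].sum()

# How the NaN group is encoded in the MultiIndex is version dependent.
print(result.index.codes)

# Selecting the NaN-containing key is the step the issue reports as failing.
try:
    print(result.loc[(0, np.nan)])
except KeyError as exc:
    print("KeyError:", exc)
```

Printing `result.index.codes` is the quickest way to see whether the NaN group is stored as `-1` or as a regular non-negative code.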
12 changes: 12 additions & 0 deletions pandas/core/groupby/grouper.py
@@ -35,6 +35,7 @@
     Categorical,
     ExtensionArray,
 )
+from pandas.core.dtypes.missing import isna
 import pandas.core.common as com
 from pandas.core.frame import DataFrame
 from pandas.core.groupby import ops
@@ -690,12 +691,23 @@ def _codes_and_uniques(self) -> tuple[np.ndarray, ArrayLike]:
             codes, uniques = algorithms.factorize(
                 self.grouping_vector, sort=self._sort, na_sentinel=na_sentinel
             )
+
         return codes, uniques

     @cache_readonly
     def groups(self) -> dict[Hashable, np.ndarray]:
         return self._index.groupby(Categorical.from_codes(self.codes, self.group_index))
+
+    @cache_readonly
+    def _has_na_placeholder(self) -> bool:
+        # GH43943: flag whether an NA placeholder (np.nan, pd.NaT,
+        # np.datetime64("NaT", "ns"), np.timedelta64("NaT", "ns")) is present; used to
+        # replace its code with -1 in pandas/core/groupby/ops.py:reconstructed_codes
+        if not self._dropna:
+            if isna(self.grouping_vector).any():
+                return True
+        return False


 def get_grouper(
     obj: NDFrameT,
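
The logic of `_has_na_placeholder` can be sketched without pandas. The helper names below (`is_missing`, `has_na_placeholder`) are hypothetical, and `is_missing` only recognises `None` and float NaN, whereas the `isna` used in the diff also covers `pd.NA` and the NaT variants.

```python
import math


def is_missing(value):
    """Rough stand-in for a scalar NA check: None or float NaN only."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def has_na_placeholder(grouping_vector, dropna):
    """Return True only when NA groups are kept and at least one NA is present."""
    if dropna:
        # NA rows are dropped, so no placeholder code is ever created.
        return False
    return any(is_missing(value) for value in grouping_vector)


obj_id = ["o1", float("nan"), "o1", float("nan")]
kept = has_na_placeholder(obj_id, dropna=False)
dropped = has_na_placeholder(obj_id, dropna=True)
clean = has_na_placeholder([0, 0, 0, 0], dropna=False)
print(kept, dropped, clean)
```

The flag is a per-grouping boolean, which is why the diff can cache it with `cache_readonly`.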
11 changes: 10 additions & 1 deletion pandas/core/groupby/ops.py
@@ -869,7 +869,16 @@ def ngroups(self) -> int:
     def reconstructed_codes(self) -> list[np.ndarray]:
         codes = self.codes
         ids, obs_ids, _ = self.group_info
-        return decons_obs_group_ids(ids, obs_ids, self.shape, codes, xnull=True)
+        # get the levels in which to reconstruct codes for na
+        levels_with_na = [
+            idx
+            for idx, grouping in enumerate(self.groupings)
+            if grouping._has_na_placeholder
+        ]
+        reconstructed_codes = decons_obs_group_ids(

Contributor:
Hmm, it would be better to actually do this in decons_obs_group_ids.

Member Author:
Done.

Member:
I don't know; I'm not wild about core.sorting needing to know anything about self._groupings.

Contributor:
Hmm, I take this back: this now makes sorting dependent on groupby.ops, which is very strange.

@CloseChoice So instead I am OK with passing something to decons_obs_group_ids, but that something should not be a list of groupings; rather, a list of codes or similar.

Member Author:
Changed this to pass only a list of the levels where we need to reconstruct the codes for NA.

+            ids, obs_ids, self.shape, codes, xnull=True, levels_with_na=levels_with_na
+        )
+        return reconstructed_codes

     @final
     @cache_readonly
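
The outcome of the review thread above is that only plain level positions cross from `groupby.ops` into `core.sorting`. A minimal sketch of that hand-off follows. `FakeGrouping` is a hypothetical stand-in that exposes just the flag the diff reads; it is not a pandas class.

```python
from dataclasses import dataclass


@dataclass
class FakeGrouping:
    """Hypothetical stand-in for a grouping; carries only the NA flag."""

    name: str
    _has_na_placeholder: bool


groupings = [
    FakeGrouping("temp_playlist", False),
    FakeGrouping("objId", True),
]

# Same comprehension as in the diff: collect positions of levels that hold NA.
levels_with_na = [
    idx for idx, grouping in enumerate(groupings) if grouping._has_na_placeholder
]
print(levels_with_na)
```

Because the result is a list of integers, the sorting helper needs no knowledge of grouping objects.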
26 changes: 24 additions & 2 deletions pandas/core/sorting.py
@@ -243,7 +243,12 @@ def decons_group_index(comp_labels, shape):


 def decons_obs_group_ids(
-    comp_ids: npt.NDArray[np.intp], obs_ids, shape, labels, xnull: bool
+    comp_ids: npt.NDArray[np.intp],
+    obs_ids,
+    shape,
+    labels,
+    xnull: bool,
+    levels_with_na: list[int] | None = None,
 ):
     """
     Reconstruct labels from observed group ids.
@@ -254,17 +259,34 @@ def decons_obs_group_ids(
     xnull : bool
         If nulls are excluded; i.e. -1 labels are passed through.
     """
+
+    def reconstruct_na_in_codes(codes):
+        new_codes = []
+        if levels_with_na:
+            for idx, code_level in enumerate(codes):
+                if idx in levels_with_na:
+                    new_codes.append(
+                        np.where(code_level == max(code_level), -1, code_level)
+                    )
+                else:
+                    new_codes.append(code_level)
+        else:
+            new_codes = codes
+        return new_codes
+
     if not xnull:
         lift = np.fromiter(((a == -1).any() for a in labels), dtype="i8")
         shape = np.asarray(shape, dtype="i8") + lift

     if not is_int64_overflow_possible(shape):
         # obs ids are deconstructable! take the fast route!
         out = decons_group_index(obs_ids, shape)
+        out = reconstruct_na_in_codes(out)
         return out if xnull or not lift.any() else [x - y for x, y in zip(out, lift)]

     indexer = unique_label_indices(comp_ids)
-    return [lab[indexer].astype(np.intp, subok=False, copy=True) for lab in labels]
+    out = [lab[indexer].astype(np.intp, subok=False, copy=True) for lab in labels]
+    return reconstruct_na_in_codes(out)


 def indexer_from_factorized(
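
The nested helper in this hunk can be tried in isolation. The sketch below is a standalone rewrite of it, not the code under review: it takes `levels_with_na` as an explicit argument and uses the array's own `max()` method. Like the diff, it assumes the NA group was assigned the largest code in each flagged level; that assumption is not checked here.

```python
import numpy as np


def reconstruct_na_in_codes(codes, levels_with_na=None):
    """Replace the largest code with -1 in every level listed in levels_with_na."""
    if not levels_with_na:
        return list(codes)
    flagged = set(levels_with_na)
    return [
        # Assumption carried over from the diff: the NA group has the max code.
        np.where(level == level.max(), -1, level) if idx in flagged else level
        for idx, level in enumerate(codes)
    ]


# Codes for two groups, (0, "o1") and (0, NaN), with NaN stored as code 1.
codes = [np.array([0, 0]), np.array([0, 1])]
out = reconstruct_na_in_codes(codes, levels_with_na=[1])
print([level.tolist() for level in out])
```

Levels that are not flagged are returned untouched, which matches the `else` branch of the helper in the diff.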
56 changes: 56 additions & 0 deletions pandas/tests/groupby/test_groupby_dropna.py
@@ -370,3 +370,59 @@ def test_groupby_nan_included():
         tm.assert_numpy_array_equal(result_values, expected_values)
     assert np.isnan(list(result.keys())[2])
     assert list(result.keys())[0:2] == ["g1", "g2"]
+
+
+@pytest.mark.parametrize(
+    "na",
+    [np.nan, pd.NaT, pd.NA, np.datetime64("NaT", "ns"), np.timedelta64("NaT", "ns")],
+)
+def test_groupby_codes_with_nan_in_multiindex(na):
+    # GH 43814
+    df = pd.DataFrame(
+        {
+            "temp_playlist": [0, 0, 0, 0],
+            "objId": ["o1", na, "o1", na],
+            "x": [1, 2, 3, 4],
+        }
+    )
+
+    result = df.groupby(by=["temp_playlist", "objId"], dropna=False)["x"].sum()
+    expected = pd.MultiIndex.from_arrays(
+        [[0, 0], ["o1", na]], names=["temp_playlist", "objId"]
+    )
+    expected = pd.Series(
+        [4, 6],
+        index=pd.MultiIndex(
+            levels=[[0], ["o1", np.nan]],
+            codes=[[0, 0], [0, -1]],
+            names=["temp_playlist", "objId"],
+        ),
+        name="x",
+    )
+    tm.assert_series_equal(result, expected)
+
+
+def test_groupby_multiindex_multiply_with_series():
+    # GH#36060
+    df = pd.DataFrame(
+        {
+            "animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
+            "type": [np.nan, np.nan, np.nan, np.nan],
+            "speed": [380.0, 370.0, 24.0, 26.0],
+        }
+    )
+    speed = df.groupby(["animal", "type"], dropna=False)["speed"].first()
+    # Reconstruct same index to allow for multiplication.
+    ix_wing = pd.MultiIndex.from_tuples(
+        [("Falcon", np.nan), ("Parrot", np.nan)], names=["animal", "type"]
+    )
+    wing = pd.Series([42, 44], index=ix_wing)
+
+    result = wing * speed
+    expected = pd.Series(
+        [15960.0, 1056.0],
+        index=pd.MultiIndex.from_tuples(
+            [("Falcon", np.nan), ("Parrot", np.nan)], names=["animal", "type"]
+        ),
+    )
+    tm.assert_series_equal(result, expected)
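
The first test pins the expected index to `codes=[[0, 0], [0, -1]]`. The pure-Python sketch below shows how levels and codes combine into keys, with `-1` read as the missing-value sentinel. The `decode` function is hypothetical and only illustrates the encoding; it is not how pandas materialises a `MultiIndex`.

```python
import math


def decode(levels, codes):
    """Turn per-level codes into key tuples, treating -1 as a missing value."""
    n_rows = len(codes[0])
    return [
        tuple(
            math.nan if code[row] == -1 else level[code[row]]
            for level, code in zip(levels, codes)
        )
        for row in range(n_rows)
    ]


levels = [[0], ["o1", math.nan]]

# Encoding expected by the test: NaN is stored as the sentinel -1.
keys_sentinel = decode(levels, [[0, 0], [0, -1]])

# Alternative encoding: NaN is stored as a regular code pointing into the level.
keys_regular = decode(levels, [[0, 0], [0, 1]])

print(keys_sentinel)
print(keys_regular)
```

Both encodings decode to the same keys, so the values alone do not show which one a result uses; the test therefore compares against an index built from explicit `levels` and `codes`.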