Skip to content

Fix multiindex loc nan #43943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 51 commits into from
Closed
Show file tree
Hide file tree
Changes from 38 commits
Commits
Show all changes
51 commits
Select commit Hold shift + click to select a range
549386c
fix problem without adding tests
CloseChoice Oct 9, 2021
81970c7
add test, still not all tests running
CloseChoice Oct 9, 2021
8443c62
remove print and debug statements
CloseChoice Oct 10, 2021
9eae773
Fixes from pre-commit [automated commit]
CloseChoice Oct 10, 2021
c497e34
add new test
CloseChoice Oct 10, 2021
fde027b
update test as desired in PR comments
CloseChoice Oct 10, 2021
0ef5d04
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Oct 10, 2021
246a421
add whatsnew entry
CloseChoice Oct 10, 2021
bbeeb1b
fix wrong issue number
CloseChoice Oct 10, 2021
6b38ba4
save nan encoding in new state variable and replace it before returni…
CloseChoice Oct 12, 2021
d04b7b8
remove wrong whatsnew entry
CloseChoice Oct 12, 2021
cc75b44
remove debug statement
CloseChoice Oct 12, 2021
c88b2d8
add whatsnew entry
CloseChoice Oct 13, 2021
0a3ce38
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Oct 13, 2021
ed537cc
Merge branch 'master' into FIX-multiindex-loc-nan
CloseChoice Oct 17, 2021
77e7283
fix wrong code quotes in whatsnew
CloseChoice Oct 17, 2021
72fb026
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
cf0355d
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
427a515
Fixes from pre-commit [automated commit]
CloseChoice Oct 22, 2021
29c37b6
fix problem without adding tests
CloseChoice Oct 9, 2021
d274bca
add test, still not all tests running
CloseChoice Oct 9, 2021
f72cbf7
remove print and debug statements
CloseChoice Oct 10, 2021
7ab0c9c
Fixes from pre-commit [automated commit]
CloseChoice Oct 10, 2021
a758d44
add new test
CloseChoice Oct 10, 2021
c3ec7f0
update test as desired in PR comments
CloseChoice Oct 10, 2021
245b141
add whatsnew entry
CloseChoice Oct 10, 2021
ae706c8
fix wrong issue number
CloseChoice Oct 10, 2021
30097ca
save nan encoding in new state variable and replace it before returni…
CloseChoice Oct 12, 2021
ef951a6
remove wrong whatsnew entry
CloseChoice Oct 12, 2021
0fa1784
remove debug statement
CloseChoice Oct 12, 2021
1f78023
add whatsnew entry
CloseChoice Oct 13, 2021
5b3c61d
fix wrong code quotes in whatsnew
CloseChoice Oct 17, 2021
98c8f13
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
ec218d5
WIP: add support for (pd.NA, pd.NaT, np.datetime64("NaT", "ns"), np.t…
CloseChoice Oct 22, 2021
d92cc60
Merge branch 'master' into FIX-multiindex-loc-nan
CloseChoice Nov 6, 2021
4dda020
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Nov 6, 2021
6850b73
add tests for issue 36060
CloseChoice Nov 6, 2021
e7a52e3
Fixes from pre-commit [automated commit]
CloseChoice Nov 6, 2021
9682350
changes according to PR discussions
CloseChoice Nov 7, 2021
0ad2583
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Nov 7, 2021
cb92f95
use assert_series_equal in test_groupby_dropna
CloseChoice Nov 7, 2021
e3c734d
make _has_na_placeholder cache_readonly
CloseChoice Nov 10, 2021
cf48e70
make _has_na_placeholder cache_readonly
CloseChoice Nov 10, 2021
49b1d65
Fixes from pre-commit [automated commit]
CloseChoice Nov 10, 2021
d00b39f
remove commented out stuff
CloseChoice Nov 10, 2021
a97ef1a
Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/panda…
CloseChoice Nov 10, 2021
d180901
remove unnecessary import
CloseChoice Nov 10, 2021
cc6af9d
WIP: intermediate commit for loop solution
CloseChoice Nov 16, 2021
02bf699
changes for static analysis checks
CloseChoice Nov 16, 2021
7211984
Merge branch 'master' of github.com:pandas-dev/pandas into FIX-multii…
CloseChoice Dec 8, 2021
4520444
fix imports
CloseChoice Dec 8, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.4.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -595,7 +595,7 @@ Groupby/resample/rolling
- Fixed bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` for centered datetimelike windows with uneven nanosecond (:issue:`43997`)
- Bug in :meth:`GroupBy.nth` failing on ``axis=1`` (:issue:`43926`)
- Fixed bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not respecting right bound on centered datetime-like windows, if the index contain duplicates (:issue:`#3944`)

- Bug in :meth:`DataFrame.groupby` when grouping on multiple columns where at least one includes ``np.nan`` which resulted in a ``KeyError`` when the ``np.nan`` containing index was selected with :meth:`Series.loc` (:issue:`43814`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can move to 1.5


Reshaping
^^^^^^^^^
Expand Down
10 changes: 10 additions & 0 deletions pandas/core/groupby/grouper.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@

import numpy as np

from pandas._libs.tslibs.nattype import NaTType
from pandas._typing import (
ArrayLike,
NDFrameT,
Expand All @@ -34,6 +35,7 @@
Categorical,
ExtensionArray,
)
from pandas.core.base import isna
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from pandas.core.dtypes.missing import isna

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

import pandas.core.common as com
from pandas.core.frame import DataFrame
from pandas.core.groupby import ops
Expand Down Expand Up @@ -486,6 +488,7 @@ def __init__(
self._observed = observed
self.in_axis = in_axis
self._dropna = dropna
self._na_placeholder = None

self._passed_categorical = False

Expand Down Expand Up @@ -690,6 +693,13 @@ def _codes_and_uniques(self) -> tuple[np.ndarray, ArrayLike]:
codes, uniques = algorithms.factorize(
self.grouping_vector, sort=self._sort, na_sentinel=na_sentinel
)
# GH43943, store placeholder for (np.nan, pd.NaT, np.datetime64("NaT", "ns"),
# np.timedelta64("NaT", "ns")), will later be replaced by -1 in
# pandas/core/groupby:reconstructed_codes
if not self._dropna:
if isna(self.grouping_vector).any():
self._na_placeholder = max(codes)
Copy link
Member

@phofl phofl Oct 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not just use -2?

Copy link
Member Author

@CloseChoice CloseChoice Oct 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean using -2 as placeholder in here? Replacing that line with

        codes = np.where(code_is_na, -2, codes)

results in a segfault.

python -m pytest -s pandas/tests/groupby/test_groupby_dropna.py 
=========================================================================================================================== test session starts ============================================================================================================================
platform linux -- Python 3.8.11, pytest-6.2.5, py-1.9.0, pluggy-0.13.1
rootdir: /home/tobias/programming/python/pandas_tests/pandas-dev, configfile: pyproject.toml
plugins: hypothesis-5.37.4
collected 164 items                                                                                                                                                                                                                                                        

pandas/tests/groupby/test_groupby_dropna.py .F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.....Fatal Python error: Segmentation fault

Current thread 0x00007f7a7fd9f0c0 (most recent call first):
  File "/home/tobias/programming/python/pandas_tests/pandas-dev/pandas/core/sorting.py", line 645 in get_group_index_sorter
  File "/home/tobias/programming/python/pandas_tests/pandas-dev/pandas/core/sorting.py", line 600 in get_indexer_dict
  File "/home/tobias/programming/python/pandas_tests/pandas-dev/pandas/core/groupby/ops.py", line 778 in indices
  File "/home/tobias/programming/python/pandas_tests/pandas-dev/pandas/core/groupby/groupby.py", line 616 in indices
  File "/home/tobias/programming/python/pandas_tests/pandas-dev/pandas/core/groupby/groupby.py", line 638 in _get_indices
  File "/home/tobias/programming/python/pandas_tests/pandas-dev/pandas/core/groupby/groupby.py", line 1089 in _set_result_index_ordered
  File "/home/tobias/programming/python/pandas_tests/pandas-dev/pandas/core/groupby/generic.py", line 1172 in _transform_general
  File "/home/tobias/programming/python/pandas_tests/pandas-dev/pandas/core/groupby/groupby.py", line 1583 in _transform
  File "/home/tobias/programming/python/pandas_tests/pandas-dev/pandas/core/groupby/generic.py", line 1177 in transform
  File "/home/tobias/programming/python/pandas_tests/pandas-dev/pandas/tests/groupby/test_groupby_dropna.py", line 210 in test_groupby_dataframe_slice_then_transform
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/runner.py", line 255 in <lambda>
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/tobias/.local/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/site-packages/pytest/__main__.py", line 5 in <module>
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/runpy.py", line 87 in _run_code
  File "/home/tobias/anaconda3/envs/pandas-dev-new/lib/python3.8/runpy.py", line 194 in _run_module_as_main

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm why do we need to store this? can we not just do max(codes) on reconstrutcion?


return codes, uniques

@cache_readonly
Expand Down
16 changes: 15 additions & 1 deletion pandas/core/groupby/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -859,7 +859,21 @@ def ngroups(self) -> int:
def reconstructed_codes(self) -> list[np.ndarray]:
codes = self.codes
ids, obs_ids, _ = self.group_info
return decons_obs_group_ids(ids, obs_ids, self.shape, codes, xnull=True)
reconstructed_codes = decons_obs_group_ids(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm would be better to actually do this in decons_obs_group_ids

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dunno, not wild about core.sorting needing to know anything about self._groupings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i take this back, this now makes sorting dependent on groupby.ops which is very strange.

@CloseChoice so instead am ok to pass something do decons_obs_group_ids but the something should not be a list of grouping, rather list of codes or similar

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed this to pass only a list where we need to reconstruct the codes for na

ids, obs_ids, self.shape, codes, xnull=True
)

def transform_codes(code_level, grouping):
if grouping._na_placeholder is not None:
return np.where(code_level == grouping._na_placeholder, -1, code_level)
else:
return code_level

transformed_rec_codes = [
transform_codes(code_level, grouping)
for code_level, grouping in zip(reconstructed_codes, self._groupings)
]
return transformed_rec_codes

@final
@cache_readonly
Expand Down
48 changes: 48 additions & 0 deletions pandas/tests/groupby/test_groupby_dropna.py
Original file line number Diff line number Diff line change
Expand Up @@ -370,3 +370,51 @@ def test_groupby_nan_included():
tm.assert_numpy_array_equal(result_values, expected_values)
assert np.isnan(list(result.keys())[2])
assert list(result.keys())[0:2] == ["g1", "g2"]


@pytest.mark.parametrize(
"na",
[np.nan, pd.NaT, pd.NA, np.datetime64("NaT", "ns"), np.timedelta64("NaT", "ns")],
)
def test_groupby_codes_with_nan_in_multiindex(na):
# GH 43814
df = pd.DataFrame(
{
"temp_playlist": [0, 0, 0, 0],
"objId": ["o1", na, "o1", na],
"x": [1, 2, 3, 4],
}
)

grouped_df = df.groupby(by=["temp_playlist", "objId"], dropna=False)["x"].sum()
expected = pd.MultiIndex.from_arrays(
[[0, 0], ["o1", na]], names=["temp_playlist", "objId"]
)
result = grouped_df.index
assert all((res == ex).all() for res, ex in zip(result.codes, expected.codes))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comes the actual result with an expected frame and use tm.assert_frame_equals



def test_groupby_multiindex_multiply_with_series():
# GH#36060
df = pd.DataFrame(
{
"animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
"type": [np.nan, np.nan, np.nan, np.nan],
"speed": [380.0, 370.0, 24.0, 26.0],
}
)
speed = df.groupby(["animal", "type"], dropna=False)["speed"].first()
# Reconstruct same index to allow for multiplication.
ix_wing = pd.MultiIndex.from_tuples(
[("Falcon", np.nan), ("Parrot", np.nan)], names=["animal", "type"]
)
wing = pd.Series([42, 44], index=ix_wing)

result = wing * speed
expected = pd.Series(
[15960.0, 1056.0],
index=pd.MultiIndex.from_tuples(
[("Falcon", np.nan), ("Parrot", np.nan)], names=["animal", "type"]
),
)
tm.assert_series_equal(result, expected)