Fix multiindex loc nan #43943

CloseChoice · 2021-10-09T21:12:50Z

closes Using .loc with MultiIndex containing np.nan unexpected behavior #43814
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

pep8speaks · 2021-10-09T21:12:56Z

Hello @CloseChoice! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-08 20:18:23 UTC

pandas/_libs/hashtable.pyx

jbrockmendel · 2021-10-10T00:49:31Z

pandas/tests/indexing/multiindex/test_getitem.py

+        }
+    )
+
+    agg_df = df.groupby(by=['temp_playlist', 'objId'], dropna=False)["x"].agg(list)


if the test is for MultiIndex.get_loc, then test that directly; shouldn't depend on groupby

it's quite strange but I can't reproduce this without the groupby. When doing something like

arrays = [[0, 0], ["o1", np.nan]]
names = ("temp_playlist", "objId")
index = MultiIndex.from_arrays(arrays, names=names)
srs = Series([4, 6], index=index)
result = srs.loc[(0, np.nan)]

I don't get the expected key error on master as I get when using:

df = DataFrame(
    {
        "temp_playlist": [0, 0, 0, 0],
        "objId": ["o1", np.nan, "o1", np.nan],
        "x": [1, 2, 3, 4],
    }
)
agg_df = df.groupby(by=['temp_playlist', 'objId'], dropna=False)["x"].sum()
result = agg_df.loc[agg_df.index[-1]]

when comparing agg_df with srs using tm.assert_series_equal this results in

*** AssertionError: Series.index are different

Attribute "inferred_type" are different
[left]:  string
[right]: mixed

Any idea on this?
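For anyone trying to compare the two cases, a small diagnostic sketch along these lines may help. It builds both the directly constructed index and the one coming out of the groupby, then prints their codes and the inferred_type of the objId level side by side. No expected output is shown for the groupby index because it depends on the pandas version; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

names = ["temp_playlist", "objId"]

# Index built directly from arrays (the case that does not raise).
direct_index = pd.MultiIndex.from_arrays([[0, 0], ["o1", np.nan]], names=names)

# Index produced by the groupby (the case from the issue).
df = pd.DataFrame(
    {
        "temp_playlist": [0, 0, 0, 0],
        "objId": ["o1", np.nan, "o1", np.nan],
        "x": [1, 2, 3, 4],
    }
)
agg = df.groupby(by=names, dropna=False)["x"].sum()

# Print the integer codes and the second level of each index for comparison.
for label, idx in [("direct", direct_index), ("groupby", agg.index)]:
    level = idx.levels[1]
    print(label, "codes:", [c.tolist() for c in idx.codes])
    print(label, "objId level:", level.tolist(), "inferred_type:", level.inferred_type)
```

The interesting part is whether the NaN entry shows up as -1 in the codes or as a regular non-negative code pointing at a NaN stored in the level.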

CloseChoice · 2021-10-10T10:23:00Z

@github-actions pre-commit

jreback

can you add a whatsnew note. 1.3.x in indexing; let's try to backport.

jreback · 2021-10-10T18:06:17Z

pandas/tests/indexing/multiindex/test_getitem.py

+        }
+    )
+
+    agg_df = df.groupby(by=["temp_playlist", "objId"], dropna=False)["x"].agg(list)


can you parameterize on dropna=True & False
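A rough sketch of what a test parametrized on dropna could look like. This is not the test from the PR; the test name and the choice of sum() as the aggregation are assumptions made for illustration, and it only checks the aggregated values, not the .loc lookup.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize("dropna, expected", [(True, [4]), (False, [4, 6])])
def test_groupby_nan_key_sum(dropna, expected):
    # Hypothetical test: one of the two grouping columns contains NaN.
    df = pd.DataFrame(
        {
            "temp_playlist": [0, 0, 0, 0],
            "objId": ["o1", np.nan, "o1", np.nan],
            "x": [1, 2, 3, 4],
        }
    )
    result = df.groupby(by=["temp_playlist", "objId"], dropna=dropna)["x"].sum()
    # With dropna=True the NaN group is dropped; with dropna=False it is kept.
    assert result.tolist() == expected
```

With dropna=True only the (0, "o1") group remains; with dropna=False the (0, NaN) group is kept as well and sorts last.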

jreback · 2021-10-10T18:06:34Z

pandas/tests/indexing/multiindex/test_getitem.py

+
+
+def test_loc_nan_multiindex():
+    df = DataFrame(


can you add the issue number as a comment here

since we are declaring this a regression, we should maybe have a test that previously passed. In what previous version of pandas did this test pass?

jreback · 2021-10-10T18:07:52Z

cc @phofl if you can have a look

phofl

This is not a loc bug. The MultiIndex coming from the groupby operation is corrupt. It has the codes:

[[0, 0], [0, 1]]

This should be

[[0, 0], [0, -1]]

The nans are not identified as missing values, hence the incorrect behavior in subsequent operations. We have to fix this in the groupby. I think this is coming from the na_sentinel change in factorize
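To illustrate the expected encoding, here is a minimal sketch of how a MultiIndex built directly from arrays represents the missing value: the NaN is not stored in the level, and its position gets the code -1.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [[0, 0], ["o1", np.nan]], names=["temp_playlist", "objId"]
)

# One list of integer codes per level; -1 marks a missing value.
codes = [level_codes.tolist() for level_codes in index.codes]
# The levels only hold the non-missing unique values.
levels = [level.tolist() for level in index.levels]

print(codes)   # [[0, 0], [0, -1]]
print(levels)  # [[0], ['o1']]
```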

simonjayhawkins · 2021-10-12T11:19:01Z

doc/source/whatsnew/v1.3.4.rst

@@ -26,6 +26,7 @@ Fixed regressions
 - Fixed regression in :meth:`Series.aggregate` attempting to pass ``args`` and ``kwargs`` multiple times to the user supplied ``func`` in certain cases (:issue:`43357`)
 - Fixed regression when iterating over a :class:`DataFrame.groupby.rolling` object causing the resulting DataFrames to have an incorrect index if the input groupings were not sorted (:issue:`43386`)
 - Fixed regression in :meth:`DataFrame.groupby.rolling.cov` and :meth:`DataFrame.groupby.rolling.corr` computing incorrect results if the input groupings were not sorted (:issue:`43386`)
+- Fixed regression in :meth:`DataFrame.loc` when `MultiIndex` contained `np.nan` (:issue`43814`)


Suggested change

- Fixed regression in :meth:`DataFrame.loc` when `MultiIndex` contained `np.nan` (:issue`43814`)

- Fixed regression in :meth:`DataFrame.loc` when ``MultiIndex`` contained ``np.nan`` following a groupby (:issue`43814`)

Loc itself works fine, the MultiIndex from the groupby is incorrect.

CloseChoice · 2021-10-12T12:56:53Z

This is not a loc bug. The MultiIndex coming from the groupby operation is corrupt. It has the codes:
[[0, 0], [0, 1]]
This should be
[[0, 0], [0, -1]]
The nans are not identified as missing values, hence the incorrect behavior in subsequent operations. We have to fix this in the groupby. I think this is coming from the na_sentinel change in factorize

The line that is responsible for setting the codes to 1 is here. However, changing it so that we get the codes [[0, 0], [0, -1]] results in the [0, np.nan] key being dropped. The reason is the function pandas._libs.hashtable.Int64HashTable.get_labels_groupby, which only returns non-negative values. Fixing that alone wouldn't solve the problem either: the issue carries through to result_index (and, inside that, to decons_obs_group_ids). Changing those would set the index correctly, but the result of the aggregation function is somehow still not correct. I'll look deeper into the change in factorize, but at the moment it looks to me like more changes are needed than just in factorize.
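As background for the factorize discussion, a minimal sketch of the default behavior of the public pd.factorize: missing values get the sentinel code -1 and are left out of the uniques. This only shows the public function, not the internal groupby code path discussed above.

```python
import numpy as np
import pandas as pd

values = np.array(["o1", np.nan, "o1", np.nan], dtype=object)

# Default behavior: NaN is encoded as -1 and excluded from the uniques.
codes, uniques = pd.factorize(values)

print(codes.tolist())  # [0, -1, 0, -1]
print(list(uniques))   # ['o1']
```

A path that keeps NaN as its own group needs some non-negative code for it instead, which would be consistent with the 1 observed in the groupby result.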

CloseChoice · 2021-10-12T17:48:31Z

I tried another approach: I save the np.nan encoding in codes in a newly introduced state variable _na_placeholder and after all the transformations are done, I replace the placeholder with -1 again, so that everything should be as expected.

Let me know what you think about that.
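The idea can be sketched roughly as follows. This is a simplified illustration using plain NumPy arrays, not the code from the PR; the function name restore_na_sentinel and its signature are hypothetical.

```python
import numpy as np


def restore_na_sentinel(codes, na_placeholder):
    """Replace the placeholder code that stood in for NaN with -1.

    Hypothetical helper: ``na_placeholder`` is the non-negative code that
    was used for the NaN group during the intermediate transformations,
    or None if the grouping had no NaN.
    """
    codes = np.asarray(codes)
    if na_placeholder is None:
        return codes.copy()
    return np.where(codes == na_placeholder, -1, codes)


# Second level of the example: "o1" -> 0, NaN -> placeholder 1.
level_codes = np.array([0, 1])
print(restore_na_sentinel(level_codes, na_placeholder=1).tolist())  # [0, -1]
```

The placeholder keeps the NaN group alive through the steps that only handle non-negative codes, and the final replacement restores the conventional -1 in the resulting index.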

CloseChoice · 2021-10-12T18:06:12Z

@github-actions pre-commit

simonjayhawkins · 2021-10-13T10:36:41Z

If this is not a regression #43814 (comment), I would prefer to not do this for 1.3.x

CloseChoice · 2021-11-06T15:10:53Z

yes pls

added the tests. Please note one thing:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(
...     {
...         "animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
...         "type": [np.nan, np.nan, np.nan, np.nan],
...         "speed": [380.0, 370.0, 24.0, 26.0],
...     }
... )
>>> speed = df.groupby(["animal", "type"], dropna=False)["speed"].first()
>>> # Reconstruct same index to allow for multiplication.
>>> ix_wing = pd.MultiIndex.from_tuples(
...     [("Falcon", np.nan), ("Parrot", np.nan)], names=["animal", "type"]
... )
>>> wing = pd.Series([42, 44], index=ix_wing)
>>> 
>>> result = wing * speed
>>> wing.index.dtypes
animal    object
type      object
dtype: object
>>> speed.index.dtypes
animal     object
type      float64
dtype: object

# dtype of the resulting MultiIndex depends on the order of the operands
>>> (wing * speed).index.dtypes
animal    object
type      object
dtype: object

>>> (speed * wing).index.dtypes
animal     object
type      float64
dtype: object

Therefore I only multiply wing * speed and not speed * wing. I guess this is a bug.

CloseChoice · 2021-11-06T15:12:04Z

@github-actions pre-commit

jreback · 2021-11-06T22:40:03Z

pandas/core/groupby/ops.py

@@ -859,7 +859,21 @@ def ngroups(self) -> int:
    def reconstructed_codes(self) -> list[np.ndarray]:
        codes = self.codes
        ids, obs_ids, _ = self.group_info
-        return decons_obs_group_ids(ids, obs_ids, self.shape, codes, xnull=True)
+        reconstructed_codes = decons_obs_group_ids(


umm would be better to actually do this in decons_obs_group_ids

i dunno, not wild about core.sorting needing to know anything about self._groupings

hmm i take this back, this now makes sorting dependent on groupby.ops which is very strange.

@CloseChoice so instead I am ok with passing something to decons_obs_group_ids, but that something should not be a list of groupings, rather a list of codes or similar

changed this to pass only a list where we need to reconstruct the codes for na

jreback · 2021-11-06T22:40:44Z

pandas/tests/groupby/test_groupby_dropna.py

+        [[0, 0], ["o1", na]], names=["temp_playlist", "objId"]
+    )
+    result = grouped_df.index
+    assert all((res == ex).all() for res, ex in zip(result.codes, expected.codes))


compare the actual result with an expected frame and use tm.assert_frame_equal

jreback · 2021-11-06T22:42:00Z

pandas/core/groupby/grouper.py

+            # pandas/core/groupby:reconstructed_codes
+            if not self._dropna:
+                if isna(self.grouping_vector).any():
+                    self._na_placeholder = max(codes)


umm why do we need to store this? can we not just do max(codes) on reconstruction?

jbrockmendel · 2021-11-07T22:03:38Z

pandas/core/groupby/grouper.py

+            # pandas/core/groupby:reconstructed_codes
+            if not self._dropna:
+                if isna(self.grouping_vector).any():
+                    self._has_na_placeholder = True


can this be a cache_readonly? we try to avoid statefulness
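A minimal sketch of the difference, using functools.cached_property from the standard library as a stand-in for pandas' internal cache_readonly decorator. The class GroupingSketch and its attributes are hypothetical and only illustrate deriving the flag lazily instead of setting it as a side effect.

```python
import math
from functools import cached_property


class GroupingSketch:
    """Hypothetical stand-in for a grouping object; not the pandas class."""

    def __init__(self, values, dropna=True):
        self._values = list(values)
        self._dropna = dropna

    @cached_property
    def has_na_placeholder(self) -> bool:
        # Computed on first access from immutable inputs and then cached,
        # rather than being assigned while the codes are being built.
        if self._dropna:
            return False
        return any(isinstance(v, float) and math.isnan(v) for v in self._values)


print(GroupingSketch(["o1", float("nan")], dropna=False).has_na_placeholder)  # True
print(GroupingSketch(["o1", float("nan")], dropna=True).has_na_placeholder)   # False
```

Since the value depends only on the inputs, it no longer matters whether the codes were computed before the flag is read.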

CloseChoice · 2021-11-10T16:22:45Z

@github-actions pre-commit

jreback · 2021-11-14T03:12:59Z

pandas/core/groupby/ops.py

@@ -859,7 +859,21 @@ def ngroups(self) -> int:
    def reconstructed_codes(self) -> list[np.ndarray]:
        codes = self.codes
        ids, obs_ids, _ = self.group_info
-        return decons_obs_group_ids(ids, obs_ids, self.shape, codes, xnull=True)
+        reconstructed_codes = decons_obs_group_ids(


hmm i take this back, this now makes sorting dependent on groupby.ops which is very strange.

@CloseChoice so instead I am ok with passing something to decons_obs_group_ids, but that something should not be a list of groupings, rather a list of codes or similar

jbrockmendel · 2021-12-08T05:36:28Z

pandas/core/groupby/grouper.py

@@ -34,6 +34,7 @@
    Categorical,
    ExtensionArray,
 )
+from pandas.core.base import isna


from pandas.core.dtypes.missing import isna

github-actions · 2022-01-08T00:03:45Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback

this was good last time i looked, if you can merge master

jreback · 2022-01-16T18:03:10Z

doc/source/whatsnew/v1.4.0.rst

@@ -784,6 +784,7 @@ Groupby/resample/rolling
 - Bug in :meth:`GroupBy.nth` failing on ``axis=1`` (:issue:`43926`)
 - Fixed bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not respecting right bound on centered datetime-like windows, if the index contain duplicates (:issue:`3944`)
 - Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` when using a :class:`pandas.api.indexers.BaseIndexer` subclass that returned unequal start and end arrays would segfault instead of raising a ``ValueError`` (:issue:`44470`)
+- Bug in :meth:`DataFrame.groupby` when grouping on multiple columns where at least one includes ``np.nan`` which resulted in a ``KeyError`` when the ``np.nan`` containing index was selected with :meth:`Series.loc` (:issue:`43814`)


if you can move to 1.5

mroeschke · 2022-02-13T00:53:04Z

Thanks for the PR, but it appears to have gone stale. If interested in continuing please merge the main branch and move the whatsnew note to 1.5, and we can reopen.

CloseChoice added 2 commits October 9, 2021 22:35

fix problem without adding tests

549386c

add test, still not all tests running

81970c7

jbrockmendel reviewed Oct 10, 2021

jbrockmendel reviewed Oct 10, 2021

remove print and debug statements

8443c62

CloseChoice added 2 commits October 10, 2021 10:23

Fixes from pre-commit [automated commit]

9eae773

add new test

c497e34

jreback added the Bug and Indexing labels Oct 10, 2021

jreback requested changes Oct 10, 2021

jreback added this to the 1.3.4 milestone Oct 10, 2021

CloseChoice added 4 commits October 10, 2021 20:21

update test as desired in PR comments

fde027b

Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/pandas into FIX-multiindex-loc-nan # Conflicts: # pandas/tests/indexing/multiindex/test_getitem.py

0ef5d04

add whatsnew entry

246a421

fix wrong issue number

bbeeb1b

phofl requested changes Oct 10, 2021

phofl added the Groupby and MultiIndex labels and removed the Indexing label Oct 10, 2021

simonjayhawkins reviewed Oct 12, 2021

save nan encoding in new state variable and replace it before returning the output

6b38ba4

CloseChoice added 2 commits October 12, 2021 21:56

remove wrong whatsnew entry

d04b7b8

remove debug statement

cc75b44

Fixes from pre-commit [automated commit]

e7a52e3

jreback requested changes Nov 6, 2021

CloseChoice added 3 commits November 7, 2021 18:58

changes according to PR discussions

9682350

Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/pandas into FIX-multiindex-loc-nan

0ad2583

use assert_series_equal in test_groupby_dropna

cb92f95

jbrockmendel reviewed Nov 7, 2021

CloseChoice added 2 commits November 10, 2021 17:21

make _has_na_placeholder cache_readonly

e3c734d

make _has_na_placeholder cache_readonly

cf48e70

Fixes from pre-commit [automated commit]

49b1d65

CloseChoice requested a review from jbrockmendel November 10, 2021 16:34

CloseChoice added 3 commits November 10, 2021 17:35

remove commented out stuff

d00b39f

Merge branch 'FIX-multiindex-loc-nan' of github.com:CloseChoice/pandas into FIX-multiindex-loc-nan

a97ef1a

remove unnecessary import

d180901

jreback requested changes Nov 14, 2021

CloseChoice added 2 commits November 16, 2021 05:53

WIP: intermediate commit for loop solution

cc6af9d

changes for static analysis checks

02bf699

jbrockmendel reviewed Dec 8, 2021

CloseChoice added 2 commits December 8, 2021 21:15

Merge branch 'master' of github.com:pandas-dev/pandas into FIX-multiindex-loc-nan

7211984

fix imports

4520444

github-actions bot added the Stale label Jan 8, 2022

jreback requested changes Jan 16, 2022

jbrockmendel mentioned this pull request Feb 6, 2022

WIP/BUG: Correct results for groupby(...).transform with null keys #45839

Closed

4 tasks

mroeschke closed this Feb 13, 2022

jbrockmendel mentioned this pull request Mar 15, 2022

BUG: Fix remaining cases of groupby(...).transform with dropna=True #46367

Merged

4 tasks

coroa mentioned this pull request Jun 25, 2023

isna does not work with explicit MultiIndex nan-representation coroa/pandas-indexing#25

Open
