Skip to content

Fix multiindex loc nan #43943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 51 commits into from

Conversation

CloseChoice
Copy link
Member

@CloseChoice CloseChoice commented Oct 9, 2021

@pep8speaks
Copy link

pep8speaks commented Oct 9, 2021

Hello @CloseChoice! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-08 20:18:23 UTC

}
)

agg_df = df.groupby(by=['temp_playlist', 'objId'], dropna=False)["x"].agg(list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the test is for MultiIndex.get_loc, then test that directly; shouldn't depend on groupby

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's quite strange but I can't reproduce this without the groupby. When doing something like

    arrays = [[0, 0], ["o1", np.nan]]
    names = ("temp_playlist", "objId")
    index = MultiIndex.from_arrays(arrays, names=names)
    srs = Series([4, 6], index=index)
    result = srs.loc[(0, np.nan)]

I don't get the expected key error on master as I get when using:

    df = DataFrame(
        {
            "temp_playlist": [0, 0, 0, 0],
            "objId": ["o1", np.nan, "o1", np.nan],
            "x": [1, 2, 3, 4],
        }
    )

    agg_df = df.groupby(by=['temp_playlist', 'objId'], dropna=False)["x"].sum()
    result = agg_df.loc[agg_df.index[-1]]

when comparing agg_df with srs using tm.assert_series_equal this results in

*** AssertionError: Series.index are different

Attribute "inferred_type" are different
[left]:  string
[right]: mixed

Any idea on this?

@CloseChoice
Copy link
Member Author

@github-actions pre-commit

@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves labels Oct 10, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatsnew note. 1.3.x in indexing; let's try to backport.

}
)

agg_df = df.groupby(by=["temp_playlist", "objId"], dropna=False)["x"].agg(list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize on dropna=True & False



def test_loc_nan_multiindex():
df = DataFrame(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are declaring this a regression, we should maybe have a test that previously passed. In what previous version of pandas did this test pass?

@jreback jreback added this to the 1.3.4 milestone Oct 10, 2021
@jreback
Copy link
Contributor

jreback commented Oct 10, 2021

cc @phofl if you can have a look

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a loc bug. The MultiIndex coming from the groupby operation is corrupt. It has the codes:

[[0, 0], [0, 1]]

This should be

[[0, 0], [0, -1]]

The nans are not identified as missing values, hence the incorrect behavior in subsequent operations. We have to fix this in the groupby. I think this is coming from the na_sentinel change in factorize

@phofl phofl added Groupby MultiIndex and removed Indexing Related to indexing on series/frames, not to indexes themselves labels Oct 10, 2021
@@ -26,6 +26,7 @@ Fixed regressions
- Fixed regression in :meth:`Series.aggregate` attempting to pass ``args`` and ``kwargs`` multiple times to the user supplied ``func`` in certain cases (:issue:`43357`)
- Fixed regression when iterating over a :class:`DataFrame.groupby.rolling` object causing the resulting DataFrames to have an incorrect index if the input groupings were not sorted (:issue:`43386`)
- Fixed regression in :meth:`DataFrame.groupby.rolling.cov` and :meth:`DataFrame.groupby.rolling.corr` computing incorrect results if the input groupings were not sorted (:issue:`43386`)
- Fixed regression in :meth:`DataFrame.loc` when `MultiIndex` contained `np.nan` (:issue`43814`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Fixed regression in :meth:`DataFrame.loc` when `MultiIndex` contained `np.nan` (:issue`43814`)
- Fixed regression in :meth:`DataFrame.loc` when ``MultiIndex`` contained ``np.nan`` following a groupby (:issue`43814`)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loc itself works fine, the MultiIndex from the groupby is incorrect.

@CloseChoice
Copy link
Member Author

This is not a loc bug. The MultiIndex coming from the groupby operation is corrupt. It has the codes:

[[0, 0], [0, 1]]

This should be

[[0, 0], [0, -1]]

The nans are not identified as missing values, hence the incorrect behavior in subsequent operations. We have to fix this in the groupby. I think this is coming from the na_sentinel change in factorize

The line that is responsible for setting the codes to 1 is here. But changing this so that we have the codes [[0, 0], [0, -1]] results in dropping the [0, np.nan] key. The reason for this is the function pandas._libs.hashtable.Int64HashTable.get_labels_groupby which actually returns only positive values. But fixing this wouldn't fix the problem either, that drills down to result_index (and inside that to decons_obs_group_ids). Which would set the index correctly but somehow the result of the aggregation function is still not correct. I'll look deeper into the change in factorize but at the moment it looks to me that there are more changes to be made than just in factorize

@CloseChoice
Copy link
Member Author

I tried another approach: I save the np.nan encoding in codes in a newly introduced state variable _na_placeholder and after all the transformations are done, I replace the placeholder with -1 again, so that everything should be as expected.

Let me know what you think about that.

@CloseChoice
Copy link
Member Author

@github-actions pre-commit

@simonjayhawkins
Copy link
Member

If this is not a regression #43814 (comment), I would prefer to not do this for 1.3.x

@CloseChoice
Copy link
Member Author

yes pls

added the tests. Please note one thing:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(
...     {
...         "animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
...         "type": [np.nan, np.nan, np.nan, np.nan],
...         "speed": [380.0, 370.0, 24.0, 26.0],
...     }
... )
>>> speed = df.groupby(["animal", "type"], dropna=False)["speed"].first()
>>> # Reconstruct same index to allow for multiplication.
>>> ix_wing = pd.MultiIndex.from_tuples(
...     [("Falcon", np.nan), ("Parrot", np.nan)], names=["animal", "type"]
... )
>>> wing = pd.Series([42, 44], index=ix_wing)
>>> 
>>> result = wing * speed
>>> wing.index.dtypes
animal    object
type      object
dtype: object
>>> speed.index.dtypes
animal     object
type      float64
dtype: object

# dtype of multiindex is dependent of the order of factors
>>> (wing * speed).index.dtypes
animal    object
type      object
dtype: object

>>> (speed * wing).index.dtypes
animal     object
type      float64
dtype: object

Therefore I only multiply wing * speed and not speed * wing. I guess this is a bug.

@CloseChoice
Copy link
Member Author

@github-actions pre-commit

@@ -859,7 +859,21 @@ def ngroups(self) -> int:
def reconstructed_codes(self) -> list[np.ndarray]:
codes = self.codes
ids, obs_ids, _ = self.group_info
return decons_obs_group_ids(ids, obs_ids, self.shape, codes, xnull=True)
reconstructed_codes = decons_obs_group_ids(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm would be better to actually do this in decons_obs_group_ids

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dunno, not wild about core.sorting needing to know anything about self._groupings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i take this back, this now makes sorting dependent on groupby.ops which is very strange.

@CloseChoice so instead am ok to pass something do decons_obs_group_ids but the something should not be a list of grouping, rather list of codes or similar

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed this to pass only a list where we need to reconstruct the codes for na

[[0, 0], ["o1", na]], names=["temp_playlist", "objId"]
)
result = grouped_df.index
assert all((res == ex).all() for res, ex in zip(result.codes, expected.codes))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comes the actual result with an expected frame and use tm.assert_frame_equals

# pandas/core/groupby:reconstructed_codes
if not self._dropna:
if isna(self.grouping_vector).any():
self._na_placeholder = max(codes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm why do we need to store this? can we not just do max(codes) on reconstrutcion?

# pandas/core/groupby:reconstructed_codes
if not self._dropna:
if isna(self.grouping_vector).any():
self._has_na_placeholder = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be a cache_readonly? we try to avoid statefulness

@CloseChoice
Copy link
Member Author

@github-actions pre-commit

@@ -859,7 +859,21 @@ def ngroups(self) -> int:
def reconstructed_codes(self) -> list[np.ndarray]:
codes = self.codes
ids, obs_ids, _ = self.group_info
return decons_obs_group_ids(ids, obs_ids, self.shape, codes, xnull=True)
reconstructed_codes = decons_obs_group_ids(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i take this back, this now makes sorting dependent on groupby.ops which is very strange.

@CloseChoice so instead am ok to pass something do decons_obs_group_ids but the something should not be a list of grouping, rather list of codes or similar

@@ -34,6 +34,7 @@
Categorical,
ExtensionArray,
)
from pandas.core.base import isna
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from pandas.core.dtypes.missing import isna

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@github-actions
Copy link
Contributor

github-actions bot commented Jan 8, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 8, 2022
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was good last time i looked, if you can merge master

@@ -784,6 +784,7 @@ Groupby/resample/rolling
- Bug in :meth:`GroupBy.nth` failing on ``axis=1`` (:issue:`43926`)
- Fixed bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not respecting right bound on centered datetime-like windows, if the index contain duplicates (:issue:`3944`)
- Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` when using a :class:`pandas.api.indexers.BaseIndexer` subclass that returned unequal start and end arrays would segfault instead of raising a ``ValueError`` (:issue:`44470`)
- Bug in :meth:`DataFrame.groupby` when grouping on multiple columns where at least one includes ``np.nan`` which resulted in a ``KeyError`` when the ``np.nan`` containing index was selected with :meth:`Series.loc` (:issue:`43814`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can move to 1.5

@mroeschke
Copy link
Member

Thanks for the PR, but it appears to have gone stale. If interested in continuing please merge the main branch and move the whatsnew note to 1.5, and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using .loc with MultiIndex containing np.nan unexpected behavior
7 participants