-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix multiindex loc nan #43943
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix multiindex loc nan #43943
Conversation
Hello @CloseChoice! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-12-08 20:18:23 UTC |
} | ||
) | ||
|
||
agg_df = df.groupby(by=['temp_playlist', 'objId'], dropna=False)["x"].agg(list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the test is for MultiIndex.get_loc, then test that directly; shouldn't depend on groupby
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's quite strange but I can't reproduce this without the groupby. When doing something like
arrays = [[0, 0], ["o1", np.nan]]
names = ("temp_playlist", "objId")
index = MultiIndex.from_arrays(arrays, names=names)
srs = Series([4, 6], index=index)
result = srs.loc[(0, np.nan)]
I don't get the expected key error on master as I get when using:
df = DataFrame(
{
"temp_playlist": [0, 0, 0, 0],
"objId": ["o1", np.nan, "o1", np.nan],
"x": [1, 2, 3, 4],
}
)
agg_df = df.groupby(by=['temp_playlist', 'objId'], dropna=False)["x"].sum()
result = agg_df.loc[agg_df.index[-1]]
when comparing agg_df
with srs
using tm.assert_series_equal
this results in
*** AssertionError: Series.index are different
Attribute "inferred_type" are different
[left]: string
[right]: mixed
Any idea on this?
@github-actions pre-commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a whatsnew note. 1.3.x in indexing; let's try to backport.
} | ||
) | ||
|
||
agg_df = df.groupby(by=["temp_playlist", "objId"], dropna=False)["x"].agg(list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize on dropna=True & False
|
||
|
||
def test_loc_nan_multiindex(): | ||
df = DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number as a comment here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we are declaring this a regression, we should maybe have a test that previously passed. In what previous version of pandas did this test pass?
cc @phofl if you can have a look |
…s into FIX-multiindex-loc-nan # Conflicts: # pandas/tests/indexing/multiindex/test_getitem.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a loc bug. The MultiIndex coming from the groupby operation is corrupt. It has the codes:
[[0, 0], [0, 1]]
This should be
[[0, 0], [0, -1]]
The nans are not identified as missing values, hence the incorrect behavior in subsequent operations. We have to fix this in the groupby. I think this is coming from the na_sentinel change in factorize
doc/source/whatsnew/v1.3.4.rst
Outdated
@@ -26,6 +26,7 @@ Fixed regressions | |||
- Fixed regression in :meth:`Series.aggregate` attempting to pass ``args`` and ``kwargs`` multiple times to the user supplied ``func`` in certain cases (:issue:`43357`) | |||
- Fixed regression when iterating over a :class:`DataFrame.groupby.rolling` object causing the resulting DataFrames to have an incorrect index if the input groupings were not sorted (:issue:`43386`) | |||
- Fixed regression in :meth:`DataFrame.groupby.rolling.cov` and :meth:`DataFrame.groupby.rolling.corr` computing incorrect results if the input groupings were not sorted (:issue:`43386`) | |||
- Fixed regression in :meth:`DataFrame.loc` when `MultiIndex` contained `np.nan` (:issue`43814`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Fixed regression in :meth:`DataFrame.loc` when `MultiIndex` contained `np.nan` (:issue`43814`) | |
- Fixed regression in :meth:`DataFrame.loc` when ``MultiIndex`` contained ``np.nan`` following a groupby (:issue`43814`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loc itself works fine, the MultiIndex from the groupby is incorrect.
The line that is responsible for setting the codes to 1 is here. But changing this so that we have the codes |
I tried another approach: I save the Let me know what you think about that. |
@github-actions pre-commit |
If this is not a regression #43814 (comment), I would prefer to not do this for 1.3.x |
added the tests. Please note one thing: >>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(
... {
... "animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
... "type": [np.nan, np.nan, np.nan, np.nan],
... "speed": [380.0, 370.0, 24.0, 26.0],
... }
... )
>>> speed = df.groupby(["animal", "type"], dropna=False)["speed"].first()
>>> # Reconstruct same index to allow for multiplication.
>>> ix_wing = pd.MultiIndex.from_tuples(
... [("Falcon", np.nan), ("Parrot", np.nan)], names=["animal", "type"]
... )
>>> wing = pd.Series([42, 44], index=ix_wing)
>>>
>>> result = wing * speed
>>> wing.index.dtypes
animal object
type object
dtype: object
>>> speed.index.dtypes
animal object
type float64
dtype: object
# dtype of multiindex is dependent of the order of factors
>>> (wing * speed).index.dtypes
animal object
type object
dtype: object
>>> (speed * wing).index.dtypes
animal object
type float64
dtype: object Therefore I only multiply |
@github-actions pre-commit |
@@ -859,7 +859,21 @@ def ngroups(self) -> int: | |||
def reconstructed_codes(self) -> list[np.ndarray]: | |||
codes = self.codes | |||
ids, obs_ids, _ = self.group_info | |||
return decons_obs_group_ids(ids, obs_ids, self.shape, codes, xnull=True) | |||
reconstructed_codes = decons_obs_group_ids( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm would be better to actually do this in decons_obs_group_ids
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dunno, not wild about core.sorting needing to know anything about self._groupings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm i take this back, this now makes sorting dependent on groupby.ops which is very strange.
@CloseChoice so instead am ok to pass something do decons_obs_group_ids
but the something should not be a list of grouping, rather list of codes or similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed this to pass only a list where we need to reconstruct the codes for na
[[0, 0], ["o1", na]], names=["temp_playlist", "objId"] | ||
) | ||
result = grouped_df.index | ||
assert all((res == ex).all() for res, ex in zip(result.codes, expected.codes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comes the actual result with an expected frame and use tm.assert_frame_equals
pandas/core/groupby/grouper.py
Outdated
# pandas/core/groupby:reconstructed_codes | ||
if not self._dropna: | ||
if isna(self.grouping_vector).any(): | ||
self._na_placeholder = max(codes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm why do we need to store this? can we not just do max(codes) on reconstrutcion?
pandas/core/groupby/grouper.py
Outdated
# pandas/core/groupby:reconstructed_codes | ||
if not self._dropna: | ||
if isna(self.grouping_vector).any(): | ||
self._has_na_placeholder = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be a cache_readonly? we try to avoid statefulness
@github-actions pre-commit |
@@ -859,7 +859,21 @@ def ngroups(self) -> int: | |||
def reconstructed_codes(self) -> list[np.ndarray]: | |||
codes = self.codes | |||
ids, obs_ids, _ = self.group_info | |||
return decons_obs_group_ids(ids, obs_ids, self.shape, codes, xnull=True) | |||
reconstructed_codes = decons_obs_group_ids( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm i take this back, this now makes sorting dependent on groupby.ops which is very strange.
@CloseChoice so instead am ok to pass something do decons_obs_group_ids
but the something should not be a list of grouping, rather list of codes or similar
pandas/core/groupby/grouper.py
Outdated
@@ -34,6 +34,7 @@ | |||
Categorical, | |||
ExtensionArray, | |||
) | |||
from pandas.core.base import isna |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from pandas.core.dtypes.missing import isna
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was good last time i looked, if you can merge master
@@ -784,6 +784,7 @@ Groupby/resample/rolling | |||
- Bug in :meth:`GroupBy.nth` failing on ``axis=1`` (:issue:`43926`) | |||
- Fixed bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not respecting right bound on centered datetime-like windows, if the index contain duplicates (:issue:`3944`) | |||
- Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` when using a :class:`pandas.api.indexers.BaseIndexer` subclass that returned unequal start and end arrays would segfault instead of raising a ``ValueError`` (:issue:`44470`) | |||
- Bug in :meth:`DataFrame.groupby` when grouping on multiple columns where at least one includes ``np.nan`` which resulted in a ``KeyError`` when the ``np.nan`` containing index was selected with :meth:`Series.loc` (:issue:`43814`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you can move to 1.5
Thanks for the PR, but it appears to have gone stale. If interested in continuing please merge the main branch and move the whatsnew note to 1.5, and we can reopen. |
.loc
with MultiIndex containingnp.nan
unexpected behavior #43814