Skip to content

BUG: Fix aligning a DataFrame with a Series with MultiIndex #46001 #46058

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

johannes-mueller
Copy link
Contributor

@johannes-mueller johannes-mueller commented Feb 18, 2022

When aligning a DataFrame to a Series we are using Series.reindex() to
broadcast the Series data to the new index. That introduces NaNs when the new
index rows are not identical to the existing ones, which is not the case when
we introduce a new MultiIndex level.

In this patch we use the same approach as for aligning a Series to another
Series. That means that we have to replicate a part of Series._reindex_indexer
as DataFrame does not have it.

@johannes-mueller johannes-mueller marked this pull request as draft February 18, 2022 19:40
@johannes-mueller johannes-mueller force-pushed the bugfix/series-multiindex-align-46001 branch 6 times, most recently from b62419a to 7d1e6ad Compare February 23, 2022 13:58
@johannes-mueller johannes-mueller marked this pull request as ready for review February 23, 2022 15:12
@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Feb 26, 2022
@jreback jreback added this to the 1.5 milestone Feb 26, 2022
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @phofl @jbrockmendel if you can have a look

@johannes-mueller johannes-mueller force-pushed the bugfix/series-multiindex-align-46001 branch 2 times, most recently from 3d723ce to 1593665 Compare February 28, 2022 12:17
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment

cc @phofl @jbrockmendel if comments

@johannes-mueller johannes-mueller force-pushed the bugfix/series-multiindex-align-46001 branch from 1593665 to 233edf2 Compare February 28, 2022 14:00
@johannes-mueller johannes-mueller requested a review from jreback March 1, 2022 10:55
@johannes-mueller johannes-mueller force-pushed the bugfix/series-multiindex-align-46001 branch from 233edf2 to 029982e Compare March 1, 2022 19:21
@johannes-mueller johannes-mueller force-pushed the bugfix/series-multiindex-align-46001 branch from 029982e to ad6271d Compare March 2, 2022 07:34
@johannes-mueller johannes-mueller force-pushed the bugfix/series-multiindex-align-46001 branch from ad6271d to 1773070 Compare March 2, 2022 17:07
left = self.copy() if copy else self
else:
data = {
c: algos.take_nd(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason not to use the BlockManager method here?

Copy link
Contributor Author

@johannes-mueller johannes-mueller Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what exactly you mean by BlockManager method. The way I came to my solution was that I tried to replicate what Series._reindex_indexer does. So what I am doing here is inspired by this.

EDIT: The CI has already bitten me with an BlockManager ArrayManager issue (see for example here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what exactly you mean by BlockManager method.

I was thinking Manager.take (definitely only use public Manager methods)

EDIT: The CI has already bitten me with an BlockManager ArrayManager issue (see for example here).

yah, ArrayManager is a PITA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried

left = self.take(lidx)
left.index = join_index

but that makes frame/test_align.py::test_align_series_combinations and a lot of tests in frame/test_arithmetic.py fail. The issue is, that when other.index has elements which are not in self.index. Then lidx contains -1 and so .take() will put in the last element instead of np.nan.

Consider the following example:

import pandas as pd

df = pd.DataFrame({'x': [1., 2., 3., 44.], 'y': [5., 6., 7., 88.]}, index=pd.Index([1, 2, 8, 9]))
series = pd.Series([10., 20., 30., 40.], index=pd.Index([1, 2, 3, 4]))

join_index, lidx, rindx = df.index.join(series.index, how="outer", return_indexers=True)

with_algos_take_nd = pd.DataFrame(
    pd.core.algorithms.take_nd(df.values, lidx), columns=['x', 'y'], index=join_index
)
with_take = df.take(lidx)
with_take.index = join_index

print("with take_nd")
print(with_algos_take_nd)
print("with take")
print(with_take)

gives

with take_nd
      x     y
1   1.0   5.0
2   2.0   6.0
3   NaN   NaN
4   NaN   NaN
8   3.0   7.0
9  44.0  88.0
with take
      x     y
1   1.0   5.0
2   2.0   6.0
3  44.0  88.0
4  44.0  88.0
8   3.0   7.0
9  44.0  88.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried [...] left = self.take(lidx)

The name overlap can be confusing: by "Manager.takei meantself._mgr.take, which behaves differently from self.take`.

else:
data = {
c: algos.take_nd(
self[c]._values,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will cause a problem if columns are not unique. is that case ruled out here?

Copy link
Contributor Author

@johannes-mueller johannes-mueller Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not. Haven't thought of that. I have added a test for that case and also implemented a solution for it. However, I am not sure if the BlockManager method that you are referring to is better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some further simplifications and added some more tests

@johannes-mueller johannes-mueller force-pushed the bugfix/series-multiindex-align-46001 branch 2 times, most recently from 51a287a to cd0ab77 Compare March 4, 2022 07:48
@johannes-mueller johannes-mueller force-pushed the bugfix/series-multiindex-align-46001 branch from cd0ab77 to 5d7c05f Compare March 4, 2022 10:19
When aligning a DataFrame to a Series we are using Series.reindex() to
broadcast the Series data to the new index.  That introduces NaNs when the new
index rows are not identical to the existing ones, which is not the case when
we introduce a new MultiIndex level.

In this patch we use the same approach as for aligning a Series to another
Series. That means that we have to replicate a part of Series._reindex_indexer
as DataFrame does not have it.
@johannes-mueller johannes-mueller force-pushed the bugfix/series-multiindex-align-46001 branch from 5d7c05f to 8b57c94 Compare March 7, 2022 15:27
@jreback jreback merged commit 41d248e into pandas-dev:main Mar 7, 2022
@jreback
Copy link
Contributor

jreback commented Mar 7, 2022

thanks @johannes-mueller

left = self.copy() if copy else self
else:
data = algos.take_nd(
self.values,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.values can be expensive (and lossy) if the dataframe is not a single block.

@jreback
Copy link
Contributor

jreback commented Mar 8, 2022

@johannes-mueller sorry would u mind doing a follow up address comments? thanks

johannes-mueller added a commit to boschresearch/pandas that referenced this pull request Mar 8, 2022
johannes-mueller added a commit to boschresearch/pandas that referenced this pull request Mar 9, 2022
johannes-mueller added a commit to boschresearch/pandas that referenced this pull request Mar 9, 2022
johannes-mueller added a commit to boschresearch/pandas that referenced this pull request Mar 10, 2022
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: alignment of MultiIndexed DataFrame to a Series with a common index level
3 participants