-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix aligning a DataFrame with a Series with MultiIndex #46001 #46058
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix aligning a DataFrame with a Series with MultiIndex #46001 #46058
Conversation
b62419a
to
7d1e6ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @phofl @jbrockmendel if you can have a look
3d723ce
to
1593665
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comment
cc @phofl @jbrockmendel if comments
1593665
to
233edf2
Compare
233edf2
to
029982e
Compare
029982e
to
ad6271d
Compare
ad6271d
to
1773070
Compare
pandas/core/generic.py
Outdated
left = self.copy() if copy else self | ||
else: | ||
data = { | ||
c: algos.take_nd( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason not to use the BlockManager method here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what exactly you mean by BlockManager method.
I was thinking Manager.take
(definitely only use public Manager methods)
EDIT: The CI has already bitten me with an BlockManager ArrayManager issue (see for example here).
yah, ArrayManager is a PITA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried
left = self.take(lidx)
left.index = join_index
but that makes frame/test_align.py::test_align_series_combinations
and a lot of tests in frame/test_arithmetic.py
fail. The issue is, that when other.index
has elements which are not in self.index
. Then lidx
contains -1
and so .take()
will put in the last element instead of np.nan
.
Consider the following example:
import pandas as pd
df = pd.DataFrame({'x': [1., 2., 3., 44.], 'y': [5., 6., 7., 88.]}, index=pd.Index([1, 2, 8, 9]))
series = pd.Series([10., 20., 30., 40.], index=pd.Index([1, 2, 3, 4]))
join_index, lidx, rindx = df.index.join(series.index, how="outer", return_indexers=True)
with_algos_take_nd = pd.DataFrame(
pd.core.algorithms.take_nd(df.values, lidx), columns=['x', 'y'], index=join_index
)
with_take = df.take(lidx)
with_take.index = join_index
print("with take_nd")
print(with_algos_take_nd)
print("with take")
print(with_take)
gives
with take_nd
x y
1 1.0 5.0
2 2.0 6.0
3 NaN NaN
4 NaN NaN
8 3.0 7.0
9 44.0 88.0
with take
x y
1 1.0 5.0
2 2.0 6.0
3 44.0 88.0
4 44.0 88.0
8 3.0 7.0
9 44.0 88.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried [...]
left = self.take(lidx)
The name overlap can be confusing: by "Manager.takei meant
self._mgr.take, which behaves differently from
self.take`.
pandas/core/generic.py
Outdated
else: | ||
data = { | ||
c: algos.take_nd( | ||
self[c]._values, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will cause a problem if columns are not unique. is that case ruled out here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not. Haven't thought of that. I have added a test for that case and also implemented a solution for it. However, I am not sure if the BlockManager method that you are referring to is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some further simplifications and added some more tests
51a287a
to
cd0ab77
Compare
cd0ab77
to
5d7c05f
Compare
When aligning a DataFrame to a Series we are using Series.reindex() to broadcast the Series data to the new index. That introduces NaNs when the new index rows are not identical to the existing ones, which is not the case when we introduce a new MultiIndex level. In this patch we use the same approach as for aligning a Series to another Series. That means that we have to replicate a part of Series._reindex_indexer as DataFrame does not have it.
5d7c05f
to
8b57c94
Compare
thanks @johannes-mueller |
left = self.copy() if copy else self | ||
else: | ||
data = algos.take_nd( | ||
self.values, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.values can be expensive (and lossy) if the dataframe is not a single block.
@johannes-mueller sorry would u mind doing a follow up address comments? thanks |
When aligning a DataFrame to a Series we are using Series.reindex() to
broadcast the Series data to the new index. That introduces NaNs when the new
index rows are not identical to the existing ones, which is not the case when
we introduce a new MultiIndex level.
In this patch we use the same approach as for aligning a Series to another
Series. That means that we have to replicate a part of Series._reindex_indexer
as DataFrame does not have it.
and BUG: alignment in operations with multi-index #43321
doc/source/whatsnew/v1.5.0.rst
.