Skip to content

API: should reindex on a level introduce NaNs for missing entries per label of other levels? #12319

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jorisvandenbossche opened this issue Feb 13, 2016 · 6 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Needs Discussion Requires discussion from core team before further action

Comments

@jorisvandenbossche
Copy link
Member

Suppose the following dataframe and reindex operation:

In [65]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b'), ('B', 'a')]))

In [66]: df
Out[66]:
   A       B
   a   b   a
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

In [67]: df.reindex(columns=['a', 'b'], level=1)
Out[67]:
   A       B
   a   b   a
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

Should this give the following?

In [67]: df.reindex(columns=['a', 'b'], level=1)
Out[67]:
   A       B
   a   b   a    b
0  0   1   2  NaN
1  3   4   5  NaN
2  6   7   8  NaN
3  9  10  11  NaN

I am not sure what the exact behaviour of the level keyword should be, but eg in the following example it does the selection of columns for each of label of the other level:

In [69]: df2 = pd.DataFrame(np.arange(18).reshape(3,6), columns=pd.MultiIndex.from_product([('A', 'B'), ('a', 'b', 'c')]))
In [70]:

In [70]: df2
Out[70]:
    A           B
    a   b   c   a   b   c
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17

In [71]: df2.reindex(columns=['a', 'c'], level=1)
Out[71]:
    A       B
    a   c   a   c
0   0   2   3   5
1   6   8   9  11
2  12  14  15  17
@jreback
Copy link
Contributor

jreback commented Feb 13, 2016

xref #7895 (a lot of dupe as well), maybe close the other?
dupe of #11226, but I like your example better
xref #10549

@jreback
Copy link
Contributor

jreback commented Feb 13, 2016

If this is a leaf level then its easy to do this, but what if its not, then do you do a cartesian product of everything below that level? we could, but would want a broadcast argument I think (though this just maybe making it even more complicated)

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Feb 13, 2016
@jreback jreback added this to the Next Major Release milestone May 6, 2017
@jreback
Copy link
Contributor

jreback commented May 6, 2017

xref #7895

@jreback jreback modified the milestones: Next Major Release, Interesting Issues May 6, 2017
@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
@mroeschke mroeschke added the Needs Discussion Requires discussion from core team before further action label Apr 23, 2021
@andenter
Copy link

Curious why the expected behavior should be different?

@ybagdasa
Copy link

ybagdasa commented Jun 2, 2022

I'm spent a few hours looking into the issue, but I am not familiar enough with index joins to know what the problem is. I think it might be happening in the function Index._join_level found in pandas/core/indexes/base.py.

Within the function new_level, left_lev_indexer, right_lev_indexer = old_level.join( right, how=how, return_indexers=True )
is called. The variable how is a right join. Not sure whether the outputs are as expected here. new_level seems to be correct.

Finally near the end of the function
join_index = MultiIndex( levels=new_levels, codes=new_codes, names=left.names, verify_integrity=False, )
is called. The new_codes appear to be the same as the old multiindex codes. I wonder if this is the issue.

Hope that helps someone willing to fix it.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@FirefoxMetzger
Copy link

I ran into this yesterday and it caught me by surprise. If you do a reindex without level it does add fill_value for the missing entries in the flat index, and I expected this behavior to carry over to the particular level I specify.

So, yes, I would have very much expected it to be a cartesian product between the new index (passed into reindex) and all unique entries of the old index after the selected level got removed. For example, if I have a dataframe with a 3-level multi index:

df = pd.DataFrame(
    np.arange(10).reshape(5, 2),
    index=pd.MultiIndex.from_tuples(
        [
            ("A", "a", 2),
            ("A", "b", 0),
            ("B", "a", 1),
            ("B", "a", 3),
            ("B", "d", 1),
        ]
    ),
)
|               |   0 |   1 |
|:--------------|----:|----:|
| ('A', 'a', 2) |   0 |   1 |
| ('A', 'b', 0) |   2 |   3 |
| ('B', 'a', 1) |   4 |   5 |
| ('B', 'a', 3) |   6 |   7 |
| ('B', 'd', 1) |   8 |   9 |

and I reindex the "middle" index using new_index = pd.Index(["a", "b", "c", "d"]) then I would expect df.reindex(new_index, level=1) to produce:

|               |   0 |   1 |
|:--------------|----:|----:|
| ('A', 'a', 2) |   0 |   1 |
| ('A', 'b', 2) | nan | nan |
| ('A', 'c', 2) | nan | nan |
| ('A', 'd', 2) | nan | nan |
| ('A', 'a', 0) | nan | nan |
| ('A', 'b', 0) |   2 |   3 |
| ('A', 'c', 0) | nan | nan |
| ('A', 'd', 0) | nan | nan |
| ('B', 'a', 1) |   4 |   5 |
| ('B', 'b', 1) | nan | nan |
| ('B', 'c', 1) | nan | nan |
| ('B', 'd', 1) |   8 |   9 |
| ('B', 'a', 3) |   6 |   7 |
| ('B', 'b', 3) | nan | nan |
| ('B', 'c', 3) | nan | nan |
| ('B', 'd', 3) | nan | nan |

Which is equivalent to this (naive, but hopefully easy to read) python implementation:

level = 1
remaining_levels = df.reset_index(level=level, drop=True).index.unique()
new_index_tuples = []
for old_levels in remaining_levels:
    for new_level in new_index:
        new_index_tuples.append((*old_levels[:level], new_level, *old_levels[level:]))
target_index = pd.MultiIndex.from_tuples(new_index_tuples)
df.reindex(target_index)

(see also: https://stackoverflow.com/questions/75106282/how-to-reindex-a-datetime-based-multiindex-in-pandas/75106323)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

7 participants