limiting reindex with MultiIndex ffill/bfill within levels. #10347

Closed
bwillers opened this issue Jun 13, 2015 · 6 comments
Labels
Enhancement, Indexing, Missing-data, MultiIndex

Comments

@bwillers
Contributor

xref #7895

When reindexing on a multiindex with method='ffill' or method='bfill', it would be very useful to be able to restrict the fill to certain groups/levels of the index.

For example, consider the following:

In [1]: dates = pd.date_range(start=pd.Timestamp('20080102'), 
                              periods=3, freq='7D')
In [2]: names = ['jane', 'john']
In [3]: index = pd.MultiIndex.from_product([names, dates], 
                                           names=['name', 'date'])
In [4]: df = pd.DataFrame(index=index, 
                          data={'best_score': [1, 2, 3, None, 5, 6]})
In [5]: df
Out[5]:
                 best_score
name date
jane 2008-01-02           1
     2008-01-09           2
     2008-01-16           3
john 2008-01-02         NaN
     2008-01-09           5
     2008-01-16           6

In [7]: new_dates = [pd.Timestamp('20080101'), pd.Timestamp('20080117')]
In [8]: new_index = pd.MultiIndex.from_tuples([('jane', new_dates[1]),
                                               ('john', new_dates[0]), 
                                               ('john', new_dates[1])], 
                                              names=['name', 'date'])

In [9]: df.reindex(new_index, method='ffill')
Out[9]:
                 best_score
name date
jane 2008-01-17           3
john 2008-01-01           3
     2008-01-17           6

Clearly john's score on 2008-01-01 is not 3, it's NaN. What would be great (ignoring the awful argument name) is something like:

df.reindex(new_index, method='ffill', fill_group_level=['name'])

                 best_score
name date
jane 2008-01-17           3
john 2008-01-01           NaN
     2008-01-17           6

This generalizes to indexes with more than two levels. In effect, it amounts to being able to specify a set of boundaries for the ffill/bfill based on changes in level values. I don't think this can be done in a straightforward way with a groupby(level='name') because the values of the index in the second level are not the same for every group.
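Until something like this exists, one possible workaround is to widen the frame to the union of the old and new index, forward-fill inside each group, and then select the new rows. This is only a sketch: `reindex_ffill_within` is a hypothetical helper name, not a pandas function, and `fill_group_level` above remains a proposed argument rather than a real one.

```python
import pandas as pd

# Rebuild the example frame from this issue.
dates = pd.date_range(start=pd.Timestamp('20080102'), periods=3, freq='7D')
index = pd.MultiIndex.from_product([['jane', 'john'], dates],
                                   names=['name', 'date'])
df = pd.DataFrame(index=index, data={'best_score': [1, 2, 3, None, 5, 6]})

new_index = pd.MultiIndex.from_tuples(
    [('jane', pd.Timestamp('20080117')),
     ('john', pd.Timestamp('20080101')),
     ('john', pd.Timestamp('20080117'))],
    names=['name', 'date'])


def reindex_ffill_within(frame, target, level):
    """Hypothetical helper: forward-fill onto `target`, never across `level`."""
    # Exact-match reindex onto old + new labels; new labels start as NaN.
    widened = frame.reindex(frame.index.union(target)).sort_index()
    # Forward-fill separately inside each group of the chosen level(s).
    filled = widened.groupby(level=level).ffill()
    # Keep only the rows that were asked for, in the requested order.
    return filled.reindex(target)


result = reindex_ffill_within(df, new_index, 'name')
print(result)
```

One caveat: `groupby(...).ffill()` also fills over NaN values that were already present in the original data, whereas `reindex(..., method='ffill')` copies the value at the nearest earlier label even when that value is NaN. The two agree for the example here but can differ in general.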

@jreback
Contributor

jreback commented Jun 13, 2015

this is a dupe of #7895
You want this:

In [22]: df.groupby(level='name').apply(
             lambda x: x.reset_index(level=0, drop=True)
                        .reindex(new_dates, method='ffill'))
Out[22]: 
                 best_score
name date                  
jane 2008-01-01         NaN
     2008-01-17           3
john 2008-01-01         NaN
     2008-01-17           6

But you actually want it to work with this syntax

df.reindex(new_dates,method='ffill',level='date')
which is not supported ATM (but a good place for it TO work).

I don't think there is an easy way to support a multi-level reindex as you have indicated (and, IMHO, it would be too complicated). It is better to allow a single-level fill, which is what you ultimately want.

@jreback added the Indexing, Missing-data, Reshaping, and MultiIndex labels on Jun 13, 2015
@jreback added this to the Next Major Release milestone on Jun 13, 2015
@bwillers
Contributor Author

Thanks for the comments. The groupby approach you suggested doesn't do quite the same thing: in the example, the index being used intentionally does not have the same date values for every name, whereas the groupby approach assumes the same dates for every group. I guess you could then subset the dates based on the name, but it gets hairy pretty quickly.
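For reference, the per-group subsetting mentioned above might look roughly like the following. It is a sketch under the assumption that every name in the new index already exists in `df` (otherwise `xs` raises a `KeyError`); the variable names `wanted` and `pieces` are made up for illustration.

```python
import pandas as pd

# Rebuild the example frame and target index from this issue.
dates = pd.date_range(start=pd.Timestamp('20080102'), periods=3, freq='7D')
index = pd.MultiIndex.from_product([['jane', 'john'], dates],
                                   names=['name', 'date'])
df = pd.DataFrame(index=index, data={'best_score': [1, 2, 3, None, 5, 6]})

new_index = pd.MultiIndex.from_tuples(
    [('jane', pd.Timestamp('20080117')),
     ('john', pd.Timestamp('20080101')),
     ('john', pd.Timestamp('20080117'))],
    names=['name', 'date'])

# One row per requested (name, date) pair, as ordinary columns.
wanted = new_index.to_frame(index=False)

pieces = {}
for name, wanted_dates in wanted.groupby('name')['date']:
    # Rows for this name only, indexed by date alone.
    per_name = df.xs(name, level='name')
    # Plain single-level ffill reindex, using only this name's dates.
    pieces[name] = per_name.reindex(pd.DatetimeIndex(wanted_dates),
                                    method='ffill')

# Stitch the groups back together under a (name, date) MultiIndex.
result = pd.concat(pieces).rename_axis(['name', 'date'])
print(result)
```

The rows come back grouped by name in sorted order, which happens to match `new_index` here; an arbitrary target order would need a final exact-match `reindex(new_index)`.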

I did have a look at the issue you referenced but I'm not sure these are the same.

#7895 involves taking a multiindex A and reindexing it with a plain index (or a multiindex with fewer levels) B (i.e. len(A.names) > len(B.names)), to broadcast across the levels that are absent in B. Coming up with sane and consistent broadcasting semantics for arbitrary multiindexes seems a very complex task.

In contrast, this issue is about taking a multiindex A and reindexing it with a multiindex C with the same number/name/type of levels (i.e. A.names == C.names), so there's no broadcasting involved. The only thing that's different from a vanilla df.reindex(C, method='ffill') is how far back/forward the ffill and bfill methods look to find a value, based on the levels passed. So the end result ends up looking a lot like what you would get if you line up the frames with an ordered left merge by name (related: #1870).
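The "ordered left merge by name" idea can be approximated with `pd.merge_asof`, which postdates this comment. The following is a sketch of that approach and not the interface discussed in this thread; the `pos` column is a made-up helper used only to restore the requested row order.

```python
import pandas as pd

# Rebuild the example frame and target index from this issue.
dates = pd.date_range(start=pd.Timestamp('20080102'), periods=3, freq='7D')
index = pd.MultiIndex.from_product([['jane', 'john'], dates],
                                   names=['name', 'date'])
df = pd.DataFrame(index=index, data={'best_score': [1, 2, 3, None, 5, 6]})

new_index = pd.MultiIndex.from_tuples(
    [('jane', pd.Timestamp('20080117')),
     ('john', pd.Timestamp('20080101')),
     ('john', pd.Timestamp('20080117'))],
    names=['name', 'date'])

# Left side: the requested (name, date) pairs plus their original position.
left = new_index.to_frame(index=False)
left['pos'] = range(len(left))
# Right side: the existing observations as ordinary columns.
right = df.reset_index()

# merge_asof needs identical key dtypes and both sides sorted by the `on` key.
left['date'] = left['date'].astype('datetime64[ns]')
right['date'] = right['date'].astype('datetime64[ns]')
left = left.sort_values('date')
right = right.sort_values('date')

# For each requested row, take the latest earlier-or-equal row with the same name.
merged = pd.merge_asof(left, right, on='date', by='name',
                       direction='backward')

# Restore the requested order and the (name, date) index.
result = (merged.sort_values('pos')
                .set_index(['name', 'date'])['best_score'])
print(result)
```

Using `direction='forward'` instead gives the bfill analogue. Like `reindex(..., method='ffill')`, the match copies whatever value the matched row holds, including NaN.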

@jreback
Contributor

jreback commented Jun 13, 2015

Well, the interface is simply as I have stated above, so this needs to be addressed in the Index.reindex method (for MultiIndex). It's not implemented ATM, so feel free to have a crack at it. Forcing the user to specify a multi-level reindex with a filler is pretty complicated. This should work with a single level specified. It's possible that the multi-level reindex with fill should simply not be allowed (if it gives the 'wrong' answer).

@bwillers
Contributor Author

Will take a crack at it, seems like a good reason to figure out how all this stuff works under the covers.

@jreback
Contributor

jreback commented Jun 13, 2015

awesome!

@mroeschke added the Enhancement label and removed the Reshaping label on Apr 18, 2021
@mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022
@mroeschke
Member

Thanks for the request, but it appears there hasn't been much interest or activity in this feature for years so closing


4 participants