Merge on single level of MultiIndex #3662


Closed
hayd opened this issue May 20, 2013 · 20 comments · Fixed by #6363
Labels
Enhancement MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@hayd
Contributor

hayd commented May 20, 2013

I don't think we can do this now, but it potentially sounds useful...

http://stackoverflow.com/questions/16650945/merge-on-single-level-of-multiindex
https://groups.google.com/forum/#!topic/pydata/LBSFq6of8ao

Example:

In [10]: df = pd.DataFrame(np.random.randn(10, 5))

In [11]: df1 = df[range(0, 3)].set_index([0, 2])

In [12]: df2 = df[range(2, 5)].set_index([2, 4])

join/merge on 2.

@cpcloud
Member

cpcloud commented May 20, 2013

is df1.join(df2, how='outer') not what you're looking for?

@hayd
Contributor Author

hayd commented May 20, 2013

No, I don't think so...

In [13]: df1.join(df2, how='outer')
Out[13]:
                            1         3
-3.987006 -0.534963       NaN -0.521078
-1.808632 -0.227492       NaN -0.880865
-1.188311 -1.808632 -1.018368       NaN
-0.985607 -0.151162  0.815265       NaN
-0.528617  1.369224 -0.642621       NaN
-0.151162 -0.462676       NaN  0.388309
-0.115006 -3.987006  0.548307       NaN
 0.066250  0.590389  0.275410       NaN
 0.590389 -0.926148       NaN -0.999415
 0.763442  0.618186       NaN  1.212661
 0.930736 -0.856200       NaN -0.861088
 0.975342  1.824005 -0.162082       NaN
 1.006033  0.340764       NaN -1.809402
 1.169599  1.006033 -2.006705       NaN
 1.189407  0.930736  0.645642       NaN
 1.243792  0.763442 -0.972110       NaN
 1.369224  0.547792       NaN  1.515713
 1.375035  2.192195       NaN -0.728762
 1.824005 -1.364679       NaN -0.231887
 2.151640  1.375035 -0.656242       NaN

Hoped for:

In [21]: df3 = pd.DataFrame({1: df[1], 3: df[3]})

In [22]: df3.index = df[2]

In [23]: df3
Out[23]:
                  1         3
2
 0.930736  0.645642 -0.861088
 0.590389  0.275410 -0.999415
 1.375035 -0.656242 -0.728762
 0.763442 -0.972110  1.212661
 1.824005 -0.162082 -0.231887
 1.369224 -0.642621  1.515713
-0.151162  0.815265  0.388309
 1.006033 -2.006705 -1.809402
-3.987006  0.548307 -0.521078
-1.808632 -1.018368 -0.880865

@cpcloud
Member

cpcloud commented May 20, 2013

yes this would most def be useful IMHO

@cpcloud
Member

cpcloud commented May 24, 2013

@hayd I think that

df1.join(df2, how='inner')

actually should do what you want. Right now it doesn't, and in fact join doesn't do what it claims to do in this regard, that is, with how='inner' return the intersection of the indices. The intersection of the indices of df1 and df2 is not the empty set, yet an empty DataFrame is returned, so this might be considered a bug...

OT: should the empty ndframe be a singleton? i haven't really checked, but these objects return a different id every time they are instantiated...

@cpcloud
Member

cpcloud commented May 24, 2013

Down in the guts of this there's a function called _join_level defined on Index. Currently it throws an error saying that joining 2 MultiIndex objects on a level is ambiguous - why? If I have two MultiIndexes with two levels with the same name, shouldn't I be able to compute the intersection of those indices? It should only be empty if they have no elements in common (by definition).

@jreback
Contributor

jreback commented Feb 14, 2014

from some comments on the mailing list: https://groups.google.com/forum/#!topic/pydata/LBSFq6of8ao

What would help move this along would be:

  1. some self-contained test cases, esp. corner cases where this would fail
  2. the 'quick' and dirty soln can prob be implemented pretty easily (reset the indexes, merge, then re-set the indexes). This may actually be a pretty good soln as there is not a lot of cost (in memory or speed) for resetting / setting indexes - and if memory is an issue that's a whole other problem (http://stackoverflow.com/questions/14614512/merging-two-tables-with-millions-of-rows-in-python/14617925#14617925)
  3. an actual indexing merge (where right now it raises in core/index.py); again, test cases are the most important here.

would love for someone to attempt 2), as then we can have a speed/memory benchmark and see whether 3) is even worthwhile (I don't know how much gain this would really have - so not sure how much effort it needs).
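The 'quick and dirty' approach in 2) can be sketched like this, a minimal example mirroring the frames at the top of the thread (the integer column labels and the seeded data are just for illustration; assumes a reasonably modern pandas/numpy):

```python
import numpy as np
import pandas as pd

# Frames shaped like the original example: df1 is indexed by columns
# 0 and 2, df2 by columns 2 and 4, so the shared key is column 2.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 5)))
df1 = df[[0, 1, 2]].set_index([0, 2])
df2 = df[[2, 3, 4]].set_index([2, 4])

# Option 2: reset both indexes, merge on the shared key, re-set the index.
merged = (
    df1.reset_index()
       .merge(df2.reset_index(), on=2, how="inner")
       .set_index([0, 2, 4])
)
```

Since the values in column 2 are unique here, the inner merge lines up all 10 rows and leaves columns 1 and 3 as data columns.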

@hmgaudecker

I'm not quite sure I understand the difference between 2) and 3) - is this just the underlying implementation, or do you mean something else by "an actual indexing merge"?

As for 2), are you saying that all that needs to be done would be to hide the SO solution from the user and allow specifying the index?

Testcases: In principle this should amount to re-using test cases for joins with regular columns, no? I had a very brief look into that but couldn't find much -- any pointers?

Looking at the expected behavior above, I am not sure I'd quite agree (at least for outer joins, which seem to be the more natural use case to me) -- I would expect a MultiIndex on the resulting dataframe with the index being the union of all index constituents of the two original tables.

@jreback
Contributor

jreback commented Feb 14, 2014

  2. is like the SO question, but exactly that, hidden from the user.
  3. is potentially much faster / a better use of memory, as it's done internally in the index module (e.g. right where the raise happens now)

Maybe give me a small example of what you want to do, e.g. input and output (e.g. the normal use case)

Here's a number of issues relating to this (and why we need specific test cases):

  • normal case is number of levels match up and names on those levels match
  • what happens if the right can only match up on a certain level (e.g. right has 1 level and left has 2; match on which level?)
    • if the name matches then it's clear (I think this is handled now)
    • what to do if there are NO matching levels

@hmgaudecker

Here comes an example, see these slides for some context (p. 19+).

In [49]: import io
In [50]: household = pd.read_table(
    io.StringIO(
"""household_id,male,wealth
1,0,196087.3
2,1,316478.7
3,0,294750
"""
    ), 
    sep=',', index_col='household_id'
)

In [51]: household
Out[51]: 
              male    wealth
household_id                
1                0  196087.3
2                1  316478.7
3                0  294750.0

[3 rows x 2 columns]

In [52]: portfolio = pd.read_table(
    io.StringIO(
        """"household_id","asset_id","name","share"
"1","nl0000301109","ABN Amro","1.0"
"2","nl0000289783","Robeco","0.4"
"2","gb00b03mlx29","Royal Dutch Shell","0.6"
"3","gb00b03mlx29","Royal Dutch Shell","0.15"
"3","lu0197800237","AAB Eastern Europe Equity Fund","0.6"
"3","nl0000289965","Postbank BioTech Fonds","0.25"
"4",,,"1.0"
"""
    ),
    sep=',', index_col=['household_id', 'asset_id']
)

In [53]: portfolio
Out[53]: 
                                                     name  share
household_id asset_id                                           
1            nl0000301109                        ABN Amro   1.00
2            nl0000289783                          Robeco   0.40
             gb00b03mlx29               Royal Dutch Shell   0.60
3            gb00b03mlx29               Royal Dutch Shell   0.15
             lu0197800237  AAB Eastern Europe Equity Fund   0.60
             nl0000289965          Postbank BioTech Fonds   0.25
4            NaN                                      NaN   1.00

[7 rows x 2 columns]

In [54]: household.join(portfolio, how='inner')
Out[54]: 
Empty DataFrame
Columns: [male, wealth, name, share]
Index: []

[0 rows x 4 columns]

(So I would argue this is not handled)

I would expect:

                             male    wealth                            name  share
household_id asset_id                                                           
1            nl0000301109       0  196087.3                        ABN Amro   1.00
2            nl0000289783       1  316478.7                          Robeco   0.40
             gb00b03mlx29       1  316478.7               Royal Dutch Shell   0.60
3            gb00b03mlx29       0  294750.0               Royal Dutch Shell   0.15
             lu0197800237       0  294750.0  AAB Eastern Europe Equity Fund   0.60
             nl0000289965       0  294750.0          Postbank BioTech Fonds   0.25

What to do if there are NO matching levels?

I would be fine requiring the user to name the levels consistently. Else you can always fall back on the existing solution based on regular columns.

@jreback
Contributor

jreback commented Feb 14, 2014

@hmgaudecker give a try with #6356

I need some more test cases for example, joining on non-level 0

I also need to see if it's necessary (as I think it's pretty complicated) to join 2 multi-indexes.

This joins a single to a multi on an inferred level.
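The single-to-multi join can be sketched with cut-down versions of the household/portfolio frames above (values abbreviated; assumes a pandas recent enough to include this join-on-inferred-level support):

```python
import pandas as pd

# Household data with a plain index named "household_id".
household = pd.DataFrame(
    {"male": [0, 1, 0], "wealth": [196087.3, 316478.7, 294750.0]},
    index=pd.Index([1, 2, 3], name="household_id"),
)

# Portfolio data with a two-level MultiIndex sharing the "household_id" name.
portfolio = pd.DataFrame(
    {"share": [1.0, 0.4, 0.6, 0.15]},
    index=pd.MultiIndex.from_tuples(
        [
            (1, "nl0000301109"),
            (2, "nl0000289783"),
            (2, "gb00b03mlx29"),
            (3, "gb00b03mlx29"),
        ],
        names=["household_id", "asset_id"],
    ),
)

# join matches on the shared level name and keeps the full MultiIndex.
joined = household.join(portfolio, how="inner")
```

The result carries the two-level (household_id, asset_id) index, with the household columns broadcast across each household's assets.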

@jreback
Contributor

jreback commented Feb 14, 2014

I suppose the on keyword could be hijacked to provide a specific level if there are multiple matches

@hmgaudecker

Sweet! Exactly what I was expecting.

I need some more test cases for example, joining on non-level 0

I don't quite get what you mean here.

I also need to see if its necessary (as I think its pretty complicated) to join 2 multi-indexes.

I would think so. Testcase below.

[Sorry, gotta run now.]

In [3]: portfolio = read_table(
   ...:     StringIO(
   ...: """"household_id","asset_id","share"
   ...: "1","nl0000301109","1.0"
   ...: "2","nl0000289783","0.4"
   ...: "2","gb00b03mlx29","0.6"
   ...: "3","gb00b03mlx29","0.15"
   ...: "3","lu0197800237","0.6"
   ...: "3","nl0000289965","0.25"
   ...: "4",,"1.0"
   ...: """
   ...:     ),
   ...:     sep=",", index_col=["household_id", "asset_id"]
   ...: )

In [4]: log_return = DataFrame(dict(
   ...:         asset_id = ["gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "lu0197800237", "lu0197800237"],
   ...:         t = [233, 234, 235, 180, 181],
   ...:         log_return = [.09604978, -.06524096, .03532373, .03025441, .036997]
   ...: )).set_index(["asset_id","t"])

In [5]: portfolio.join(log_return, how='inner')
Out[5]: ValueError: cannot join between multiple multi-indexes

In [6]: inner_expected = DataFrame(dict(
   ...:         household_id = [2, 2, 2, 3, 3, 3, 3, 3],
   ...:         asset_id = ["gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "lu0197800237", "lu0197800237"],
   ...:         t = [233, 234, 235, 233, 234, 235, 180, 181],
   ...:         share = [0.6, 0.6, 0.6, 0.15, 0.15, 0.15, 0.6, 0.6],
   ...:         log_return = [.09604978, -.06524096, .03532373, .09604978, -.06524096, .03532373, .03025441, .036997]
   ...: )).set_index(["household_id", "asset_id", "t"])


In [6]: inner_expected
Out[6]: 
                               log_return  share
household_id asset_id     t                     
2            gb00b03mlx29 233    0.096050   0.60
                          234   -0.065241   0.60
                          235    0.035324   0.60
3            gb00b03mlx29 233    0.096050   0.15
                          234   -0.065241   0.15
                          235    0.035324   0.15
             lu0197800237 180    0.030254   0.60
                          181    0.036997   0.60

[8 rows x 2 columns]


In [7]: portfolio.join(log_return, how='outer')
Out[7]: ValueError: cannot join between multiple multi-indexes

In [8]: outer_expected = DataFrame(dict(
   ...:         household_id = [1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4],
   ...:         asset_id = ["nl0000301109", "nl0000289783", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "lu0197800237", "lu0197800237", "nl0000289965", None],
   ...:         t = [None, None, 233, 234, 235, 233, 234, 235, 180, 181, None, None],
   ...:         share = [1.0, 0.4, 0.6, 0.6, 0.6, 0.15, 0.15, 0.15, 0.6, 0.6, 0.25, 1.0],
   ...:         log_return = [None, None, .09604978, -.06524096, .03532373, .09604978, -.06524096, .03532373, .03025441, .036997, None, None]
   ...: )).set_index(["household_id", "asset_id", "t"])

In [9]: outer_expected
Out[9]: 
                               log_return  share
household_id asset_id     t                     
1            nl0000301109 NaN         NaN   1.00
2            nl0000289783 NaN         NaN   0.40
             gb00b03mlx29 233    0.096050   0.60
                          234   -0.065241   0.60
                          235    0.035324   0.60
3            gb00b03mlx29 233    0.096050   0.15
                          234   -0.065241   0.15
                          235    0.035324   0.15
             lu0197800237 180    0.030254   0.60
                          181    0.036997   0.60
             nl0000289965 NaN         NaN   0.25
4            NaN          NaN         NaN   1.00

[12 rows x 2 columns]

right and left joins would become meaningful test cases here as well, but I don't have time to set this up right now.

@jreback
Contributor

jreback commented Feb 15, 2014

@hmgaudecker so I merged in the single join, see #6363, along with some docs on how to do a multi-multi join. That's fairly complicated to actually implement, and IMHO not worth the effort as it really doesn't change the memory usage/speed that much at all.

That said I created #6360 to track that use case. Feel free to submit a PR if you'd like.

@jreback
Contributor

jreback commented Feb 15, 2014

Here's a partial implementation: jreback@0c38215

@jreback
Contributor

jreback commented Feb 15, 2014

@hmgaudecker

Beautiful, thanks! @PKEuS, could you have a look into #6360?

@hayd
Contributor Author

hayd commented Mar 12, 2014

@jreback The motivating / original example doesn't work atm (am I missing something?):

In [11]: df = pd.DataFrame(np.random.randn(10,5))
    ...: df1 = df[range(0, 3)].set_index([0, 2])
    ...: df2 = df[range(2, 5)].set_index([2, 4])
    ...: res = df1.join(df2, how='inner')  # empty. Do I need to specify something here?
    ...: exp = pd.DataFrame({1: df[1], 3: df[3]})
    ...: exp.index = df[2]

@jreback
Contributor

jreback commented Mar 12, 2014

that got 'moved' to #6360 (maybe put that at the top), as it was using @hmgaudecker's examples

maybe add in that example there as well

you can 'do' that with the doc example though (just not directly) ATM

@hayd
Contributor Author

hayd commented Mar 12, 2014

ah, sorry I thought that one was closed. Edited into top.

How do you do it atm (indirectly)?

@jreback
Contributor

jreback commented Mar 12, 2014

see the docs link above
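The documented indirect route for a multi-multi join is the reset/merge/re-set recipe from earlier in the thread. A minimal sketch using a trimmed version of the portfolio/log_return test case above (the data here is abbreviated to three rows per frame, so the shapes differ from the full example):

```python
import pandas as pd

# Two MultiIndexed frames that share only the "asset_id" level.
portfolio = pd.DataFrame(
    {"share": [0.6, 0.15, 0.6]},
    index=pd.MultiIndex.from_tuples(
        [(2, "gb00b03mlx29"), (3, "gb00b03mlx29"), (3, "lu0197800237")],
        names=["household_id", "asset_id"],
    ),
)
log_return = pd.DataFrame(
    {"log_return": [0.09604978, -0.06524096, 0.03025441]},
    index=pd.MultiIndex.from_tuples(
        [("gb00b03mlx29", 233), ("gb00b03mlx29", 234), ("lu0197800237", 180)],
        names=["asset_id", "t"],
    ),
)

# Move both indexes into columns, merge on the shared level,
# then restore the union of the levels as the new index.
inner = (
    portfolio.reset_index()
             .merge(log_return.reset_index(), on="asset_id", how="inner")
             .set_index(["household_id", "asset_id", "t"])
)
```

Each (household, asset) row is repeated once per matching time period, which is the cross-by-key behavior the expected inner-join output above shows.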
