Skip to content

Very unintuitive to join two dataframes with indexes that overlap partially #10621

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dlenski opened this issue Jul 19, 2015 · 4 comments
Closed

Comments

@dlenski
Copy link

dlenski commented Jul 19, 2015

I posted a slightly-ranting question about this on StackOverflow.

Basically, I want to use pandas to join two dataframes wherein the index levels of one are a subset of the index levels of the other. For example, the two DFs below have index.names of ('fruit', 'citrus') and ('citrus',) respectively:

fruit_max_by_date =

                    0
    fruit   citrus   
    apples  False   6
    bananas False   5
    oranges True    8
    lemons  True    9

    [4 rows x 1 columns]

citrus_max_by_date =

            1
    citrus   
    False   6
    True    9

    [2 rows x 1 columns]

The intuitive approach to joining them fails, because the indexes do not exactly match:

fruit_max_by_date.join(citrus_max_by_date) =>

                    0   1
    fruit   citrus       
    apples  False   6 NaN
    bananas False   5 NaN
    oranges True    8 NaN
    lemons  True    9 NaN

    [4 rows x 2 columns]

I found a variety of workarounds, none of which I consider very satisfactory. Eventually I found one that doesn't involve giving up on using an index altogether. It involves rebroadcasting the "smaller" index to match the "larger" index:

fruit_max_by_date.join( 
  citrus_max_by_date.reindex(fruit_max_by_date.index, level='citrus') ) =>

                    0  1
    fruit   citrus      
    apples  False   6  6
    bananas False   5  6
    oranges True    8  9
    lemons  True    9  9

    [4 rows x 2 columns]

Is there any reason not to make this behavior automatic?

@TomAugspurger
Copy link
Contributor

Do you want this?

In [8]: pd.merge(fruit_max_by_date, citrus_max_by_date, left_index=True, right_index=True)
Out[8]:
                0  1
fruit   citrus
apples  False   6  6
bananas False   5  6
oranges True    8  9
lemons  True    9  9

I typically use merge instead of join fwiw.

@TomAugspurger
Copy link
Contributor

Actually, join gives the same result for me.

In [11]: fruit_max_by_date.join(citrus_max_by_date)
Out[11]:
                0  1
fruit   citrus
apples  False   6  6
bananas False   5  6
oranges True    8  9
lemons  True    9  9

What version of pandas are you using?

@dlenski
Copy link
Author

dlenski commented Jul 19, 2015

Drat. I'm using v0.13. Looks like this was partially fixed in 0.14.

Joining a single index with a multi-index now "just works": http://pandas.pydata.org/pandas-docs/stable/merging.html#merging-join-on-mi

However, joining two multi-indexes still doesn't work according to the latest docs, which recommend removing the index as a workaround: http://pandas.pydata.org/pandas-docs/stable/merging.html#joining-with-two-multi-indexes

@dlenski dlenski closed this as completed Jul 19, 2015
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Jul 19, 2015
@jorisvandenbossche
Copy link
Member

As a reference, the issue for multi-index on multi-index join is here: #6360

Apparently there were some attempts but never made it in. Feel free to look at it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants