Skip to content

BUG: join on column with index #49360

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

debnathshoham
Copy link
Member

@debnathshoham debnathshoham commented Oct 27, 2022

(below can be closed if agree on expected result)
xref #28220
xref #33232

Lot's of existing tests were changed, i think there are multiple reported bugs on this.. I might be missing a few others as well.

The changed tests could be broadly classified as:

  1. The result contains value in that was not present in the original column.
  2. The index of the merged df was different. According to the docs, when merged on a index it should pass on.
  3. The data type of the expected df was object instead of say datetime, or someother
  4. In merge_asof I think since it defaults to left join, it should inherit the index from left df (these are working fine, there was some trouble with local build initially)

One particular test, which I am little puzzled about the expected output is the below. Joining on index on one df, and partial index and partial column in the other df. I was not able to figure out, so skipped for the time

@pytest.mark.parametrize("left_index", ["inner", ["inner", "outer"]])
def test_join_indexes_and_columns_on(df1, df2, left_index, join_type):
# Construct left_df
left_df = df1.set_index(left_index)
# Construct right_df
right_df = df2.set_index(["outer", "inner"])
# Result
expected = (
left_df.reset_index()
.join(
right_df, on=["outer", "inner"], how=join_type, lsuffix="_x", rsuffix="_y"
)
.set_index(left_index)
)
# Perform join
result = left_df.join(
right_df, on=["outer", "inner"], how=join_type, lsuffix="_x", rsuffix="_y"
)
tm.assert_frame_equal(result, expected, check_like=True)

@debnathshoham debnathshoham marked this pull request as draft October 27, 2022 20:44
@debnathshoham debnathshoham marked this pull request as ready for review October 28, 2022 23:00
@debnathshoham debnathshoham added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Oct 29, 2022
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool thanks for giving this a shot. Not sure about the changed tests - so you think all of these exhibit buggy behavior?

if (self.left_index and not self.right_index) or (
self.right_index and not self.left_index
):
if key_col.equals(result.index):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be misreading this but what is the point of this loop? Only ever continues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would hit continue, only when it's merging on a col and an index and index already contains correct values

@@ -415,7 +416,7 @@ def test_join_inner_multiindex(self, lexsorted_two_level_string_multiindex):
expected = expected.drop(["first", "second"], axis=1)
expected.index = joined.index

assert joined.index.is_monotonic_increasing
# assert joined.index.is_monotonic_increasing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would fail.. since to_join is being joined on the index.. the final df would contain that index

this behaviour is mentioned in the docs for merge.. but since join also uses most of the same stuff.. I think they should be consistent?!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not sure I follow - what does this produce now? Why does the monotonicity of the index change?

Copy link
Member Author

@debnathshoham debnathshoham Nov 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously joined inherited the index from data df. After the change it will inherit from tojoin.

My reason for this to be expected:

  • In merge, with column on one side and index on the other, the result inherits the index from the data frame merged on the index (edit: this behaviour is as per doc)
  • Since join uses merge, I think their output should be consistent with each other

@debnathshoham
Copy link
Member Author

debnathshoham commented Oct 29, 2022

Most of them the tests are. For the rest I'm not very sure about the expected behaviour.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all three of these need to be fixed together? I think it would be easier to move forward if we can fix one thing at a time - I'm not sure all these changes are desired

@@ -415,7 +416,7 @@ def test_join_inner_multiindex(self, lexsorted_two_level_string_multiindex):
expected = expected.drop(["first", "second"], axis=1)
expected.index = joined.index

assert joined.index.is_monotonic_increasing
# assert joined.index.is_monotonic_increasing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not sure I follow - what does this produce now? Why does the monotonicity of the index change?

@debnathshoham
Copy link
Member Author

debnathshoham commented Nov 1, 2022

All the three bugs are kind of similar.
You're joining / merging two dfs using a column and an index. These are just specific cases I believe.

EDIT: Even the two xref bugs are similar. But I am not very sure about the expected dfs from those.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 9, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 9, 2022
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Jan 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode Stale
Projects
None yet
4 participants