-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Join with a list of a single element should behave as a join with a single element (#57676) #57890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Seems like the same test is failing over and over again, however it does not make sense to me, the test is:
In this example multiple DataFrames are not passed, just a list with a single DataFrame |
I've updated the test mentioned above, before any list passed with the Dataframe.join method was assumed to have 2+ elements, by adding this edge case where a list can have a single element (behaving has a Dataframe.join with a single element) that condition in the test "test_suffix_on_list_join" no longer makes sense. |
LGTM. @mroeschke would you mind taking a look? |
Thanks for the review @Aloqeely, in the meantime I've rebased my PR to address conflicts in the "whatsnew" file. I've noticed that the PR started failing the Numpy Dev tests because it was rebased from the main Pandas branch which is failing the same tests. No other changes (except the whatsnew file) were made since the last review. If any developer familiar with this area of the code could take a look, I'd appreciate it (cc @mroeschke, @WillAyd, @jbrockmendel ) I apologize for the broader ping, but I wasn't sure who would be best suited to review this change. |
Hmm unfortunately I am -1 on making this change. I'm not sure I understand why we need to special case a single element list. Generally our requirement for joins / merges is that the key be hashable |
Sorry ignore my hashability comment - I see this is in regards to the join argument not the key elements themselves. Still, I'm not sure why we need to special case this |
There's a problem, in some rare cases, that joining a dataframe x with a list containing a single dataframe y would lose information in relation to just joining dataframe x with dataframe y (one example of this on the original issue: #57676) |
Ah OK thanks @Dacops . This only solves that then though with a single element list right? What happens when there are multiple list elements? I think there is a fix somewhere else in the code that would work generally - ideally we avoid special-casing solutions like this as they don't add up well over time |
With multiple list elements the behaviour remains the same. Basically, what the code did was if the joining element is an Object deal with it in the way A, if it's a List deal with it in the way B. I've added that if-statemen that just checks if the list has a single element. If it does use way A else keep using way B. |
I think what he's saying is that a better fix would be to not treat lists with 1 item as a special case that uses different behavior, and instead fix the merge logic such that it does not fail even if the list has 1 item only |
Hmmm, yeah I did thought of that, however the line that handles that is on frame.py/10685: |
It might very well be broken logic even if it was there for 12 years, funnily enough I just saw another comment saying that some broken logic was introduced 13 years ago |
I think I've identified the problem, that expression evaluates if all inserted indexes from all objects in the list are unique so that they can be concatenated (stacked on top of each other). However if a MultiIndex is used the entire MultiIndex is compared not separated indexes inside it. A problem can arise here, for example on the original issue a Indexed dataframe is getting joined to a MultiIndexed dataframe, the pairs of indexes of the MultiIndexed dataframe are compared to the indexes of the Indexed dataframe. There's no repetition, so they get concatenated, but here's the error. One of the MultiIndexed dataframes index is 'y' same as the index of the Indexed dataframe, and those no longer are unique, which means they need to be merged. Solution I'm thinking of, after the lambda function to check if all indexes are unique, if this evaluates to True add an extra check to look for repeated indexes and if any of these are not unique change the value to False and skip to the Merge instead of a Concat. |
Hmm I'm not sure I follow - I feel like the indexes should be the same when trying to join. Assuming some kind of matching behavior between an Index and MultiIndex might have a lot of logic pitfalls |
The problem is with the MultiIndexes, for example in the original issue, the Index is |
I believe I've got the solution for this, I'll push it so you can take a look at it |
So is this a problem with If you agree with my suggested behavior then I will open an issue to fix it, if not we should at least document how it operates on |
351f139
to
fe7336b
Compare
@Aloqeely I believe the current behaviour of |
this doc build and upload check is something I did wrong or do I just re-run the checks, I haven't touched that |
It does seem related to your changes, I've summarized the example in the doc that fails: import pandas as pd
left = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"key2": ["K0", "K1", "K0", "K1"],
},
index=pd.Index(["K0", "K0", "K1", "K2"], name="key1"),
)
right = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])
result = left.join([right]) Raises
It is surprising that only the doc build failed but not any tests, so once you fix it for this case please add tests. |
We are fixing the wrong thing from the original issue. I think specifying |
Oh thanks @Aloqeely , I figured out what it was, on this line, in case "common_indexes" was empty can_concat would not be initialized and it is called in the following if statement, so it threw an error there. I initialized it to a default value before the for loop so that doesn't occur anymore. Everything's good now, @WillAyd if you could re-review the PR I would appreciate it, thanks. |
Oh just saw your comment sorry |
Hmmm, so if |
Right if you are trying to join two index objects with non-equal levels |
Ok, so I've changed the default value of "common_indexes" to a raise if "common_indexes" is empty |
@Aloqeely how can I get better errors on the Doc build and upload test like you sent instead of just |
You can open the doc file which has the errors and go to the error line, run the examples near to that line manually until you see which example causes the error. |
I think so |
I know what's happening, in this example left and right both have an index named 'key1' while right2 has an index but it is not named. This not named index is inferred to be 'key1', is this expected behaviour? Taking in consideration your input @WillAyd ?
|
I've changed the test in the documentation that was failing, instead of joining index "key 1" on "key 1" on None now they're all "key 1" |
@WillAyd the issue should be completed now |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
join
withlist
does not behave like singleton #57676Even though it does not make much sense, joining a dataframe with a list of a single element should behave as joining the dataframe with that element.