-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: allow merging on object boolean columns #21310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REGR: allow merging on object boolean columns #21310
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21310 +/- ##
=========================================
Coverage ? 91.89%
=========================================
Files ? 153
Lines ? 49579
Branches ? 0
=========================================
Hits ? 45559
Misses ? 4020
Partials ? 0
Continue to review full report at Codecov.
|
this is not hacky it’s inline with what we normally do i would add a comment |
@@ -1526,6 +1526,17 @@ def test_merge_on_ints_floats_warning(self): | |||
result = B.merge(A, left_on='Y', right_on='X') | |||
assert_frame_equal(result, expected[['Y', 'X']]) | |||
|
|||
def test_merge_incompat_infer_object(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a great test name, it should be related to bool.
def test_merge_incompat_infer_object(self): | ||
# GH21119 | ||
df1 = DataFrame({'key': Series([True, False], dtype=object)}) | ||
df2 = DataFrame({'key': [True, False]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to have a NaN in the object dtype as well, this should test inferred and bool on both sides (with and w/o NaN).
df2 = DataFrame({'key': [True, False]}) | ||
|
||
expected = DataFrame({'key': [True, False]}, dtype=object) | ||
result = pd.merge(df1, df2, on='key') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to test the error case (e.g. bool / inferred bool) on 1 side and numeric on the other (I think this will raise), prob have an existing test, just need to add to it
expected = DataFrame({'key': [True, False]}, dtype=object) | ||
result = pd.merge(df1, df2, on='key') | ||
assert_frame_equal(result, expected) | ||
result = pd.merge(df2, df1, on='key') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank lines between cases
result = pd.merge(df1, df2, on='key') | ||
assert_frame_equal(result, expected) | ||
result = pd.merge(df2, df1, on='key') | ||
assert_frame_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally use the how fixture
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also not used for the other tests on this, so leaving this for now
What I mean is that it only solves a very specific case, while there might be other cases that pop-up later that might need similar fixes (although that the boolean case probably is the most important one). There are other reasons why you could have for some reason a column of A more 'permissive' solution would be to only error for "clear" errors (eg the int on datetime, or tz-aware with non-tz aware), and leave anything with object columns as it was before (leave it to the merge algorithm to see whether there is something to merge). |
@jorisvandenbossche I don't know about that. We are pretty strict and only error on obvious things (e.g. we allow ints to merge regardless of kinds). so ok with this for now. |
object dtyped columns are not that obvious, they can occur for many reasons. Eg in geopandas I had the problem that we created a columns of NaNs that was of object dtype in a specific case (due to a previous merge of an empty and non empty frame (and if you create an empty frame it gets object dtype if you don't specify anything specific), and then merging that again gave this error, while before it perfectly worked, as merging object NaNs on another numerical column is no problem. Typically those issues are solvable (as my example case), but can be tricky to find, and leads to annoying errors in working code in downstream users before it gets fixed. |
For example, we could also have it as warnings instead of errors, to indicate something might be going wrong with your dtypes. |
we have been erroring in this for a while |
I don't think that is correct. As far as I can see, we only started erroring in 0.23.0: #18674 |
that was in 0.22 |
That's weird. The test case works in |
@jreback on the time of merging it was 0.22, but that became 0.23 eventually :) Anyway, so it is only from the last release. |
To be more explicit about the the alternative I was thinking about, something like this (using pseudo-code):
that would keep the dubious cases (where one of both columns is object dtype) working but with a warning to indicate to users they need to check their dtypes. But also keeps the informative errors introduced in 0.23.0 for those cases where we are sure it is incompatible. |
I don't think warnings are a good idea. if you want to merge this, pls make the changes. |
Can you explain why? |
because we don't do this for anything else, it is simply out of place here. If you start warning for this, then you need to warn for practiclaly any non-matched but ok to merge errors. that would be so much noise it would be incomprehensible. Either a merge is good or its an error. |
The thing is that a merge is often good (meaning "working as expected", eg this broke a case in geopandas that was perfectly working before), and that it is very difficult for pandas to judge wether a merge is good or not. |
how is object type with not object type not an error? note i wouldn’t object to allowing extension types to be a separate clause here (which may solve the issue for geopandas) |
Because object can be by definition "anything". Whether we regard that as an error or not is something that we can discuss, but it is not that it "should" be an error (it can work, it worked before for certain cases).
merging on two extension types will already work if they are equal dtypes. |
Could we agree on shipping a minimal solution that fixes the described bug and discuss a broader / more general solution in another issue? I could take over this PR if @jorisvandenbossche wants. |
Yes, this will certainly be merged for 0.23.1 so at least the boolean case will be fixed. Updated the PR now (was just waiting with it in case we could get agreement on a more thorough solution). |
lgtm @jorisvandenbossche if you want to create an issue to show a warning as suggested above then ok, otherwise merge on pass. |
(travis passed) |
(cherry picked from commit 8d5032a)
(cherry picked from commit 8d5032a)
Fixes #21119
This is a quick/hacky fix for the specific case in that issue, i.e. merging boolean / object boolean columns.
However, I have the feeling we should do a more general solution for cases where object dtyped columns might be actually mergeable. The problem is that inferring the type can be costly? (and eg when having actual strings object columns, this inferring was also unnecessary and thus will unneeded slow down
merge
?)