Skip to content

REGR: allow merging on object boolean columns #21310

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jorisvandenbossche
Copy link
Member

Fixes #21119

This is a quick/hacky fix for the specific case in that issue, i.e. merging boolean / object boolean columns.

However, I have the feeling we should do a more general solution for cases where object dtyped columns might be actually mergeable. The problem is that inferring the type can be costly? (and eg when having actual strings object columns, this inferring was also unnecessary and thus will unneeded slow down merge?)

@jorisvandenbossche jorisvandenbossche added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Regression Functionality that used to work in a prior pandas version labels Jun 4, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.23.1 milestone Jun 4, 2018
@codecov
Copy link

codecov bot commented Jun 4, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@05e55aa). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #21310   +/-   ##
=========================================
  Coverage          ?   91.89%           
=========================================
  Files             ?      153           
  Lines             ?    49579           
  Branches          ?        0           
=========================================
  Hits              ?    45559           
  Misses            ?     4020           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.29% <100%> (?)
#single 41.85% <0%> (?)
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 94.25% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05e55aa...ad845d7. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Jun 4, 2018

this is not hacky it’s inline with what we normally do

i would add a comment

@@ -1526,6 +1526,17 @@ def test_merge_on_ints_floats_warning(self):
result = B.merge(A, left_on='Y', right_on='X')
assert_frame_equal(result, expected[['Y', 'X']])

def test_merge_incompat_infer_object(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a great test name, it should be related to bool.

def test_merge_incompat_infer_object(self):
# GH21119
df1 = DataFrame({'key': Series([True, False], dtype=object)})
df2 = DataFrame({'key': [True, False]})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to have a NaN in the object dtype as well, this should test inferred and bool on both sides (with and w/o NaN).

df2 = DataFrame({'key': [True, False]})

expected = DataFrame({'key': [True, False]}, dtype=object)
result = pd.merge(df1, df2, on='key')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to test the error case (e.g. bool / inferred bool) on 1 side and numeric on the other (I think this will raise), prob have an existing test, just need to add to it

expected = DataFrame({'key': [True, False]}, dtype=object)
result = pd.merge(df1, df2, on='key')
assert_frame_equal(result, expected)
result = pd.merge(df2, df1, on='key')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank lines between cases

result = pd.merge(df1, df2, on='key')
assert_frame_equal(result, expected)
result = pd.merge(df2, df1, on='key')
assert_frame_equal(result, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally use the how fixture

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also not used for the other tests on this, so leaving this for now

@jorisvandenbossche
Copy link
Member Author

this is not hacky it’s inline with what we normally do

What I mean is that it only solves a very specific case, while there might be other cases that pop-up later that might need similar fixes (although that the boolean case probably is the most important one).

There are other reasons why you could have for some reason a column of object dtype, where you could perfectly merge it with another column of properly typed data, was it not for us prematurely erroring on it.

A more 'permissive' solution would be to only error for "clear" errors (eg the int on datetime, or tz-aware with non-tz aware), and leave anything with object columns as it was before (leave it to the merge algorithm to see whether there is something to merge).

@jreback
Copy link
Contributor

jreback commented Jun 4, 2018

@jorisvandenbossche I don't know about that. We are pretty strict and only error on obvious things (e.g. we allow ints to merge regardless of kinds). so ok with this for now.

@jorisvandenbossche
Copy link
Member Author

only error on obvious things

object dtyped columns are not that obvious, they can occur for many reasons.

Eg in geopandas I had the problem that we created a columns of NaNs that was of object dtype in a specific case (due to a previous merge of an empty and non empty frame (and if you create an empty frame it gets object dtype if you don't specify anything specific), and then merging that again gave this error, while before it perfectly worked, as merging object NaNs on another numerical column is no problem.

Typically those issues are solvable (as my example case), but can be tricky to find, and leads to annoying errors in working code in downstream users before it gets fixed.

@jorisvandenbossche
Copy link
Member Author

For example, we could also have it as warnings instead of errors, to indicate something might be going wrong with your dtypes.

@jreback
Copy link
Contributor

jreback commented Jun 4, 2018

we have been erroring in this for a while
warnings are a backwards step and imho generally ignored anyhow

@jorisvandenbossche
Copy link
Member Author

we have been erroring in this for a while

I don't think that is correct. As far as I can see, we only started erroring in 0.23.0: #18674

@jreback
Copy link
Contributor

jreback commented Jun 5, 2018

that was in 0.22

@crepererum
Copy link
Contributor

that was in 0.22

That's weird. The test case works in 0.22.0 but fails in 0.23.0.

@jorisvandenbossche
Copy link
Member Author

@jreback on the time of merging it was 0.22, but that became 0.23 eventually :) Anyway, so it is only from the last release.

@jorisvandenbossche
Copy link
Member Author

To be more explicit about the the alternative I was thinking about, something like this (using pseudo-code):

if (is_object_dtype(left) and not is_object_dtype(right)) or (the other way around):
    warn("You are trying to merge columns of object and {other.dtype} dtype.
          This might indicate an error ...")
    coerce non-object to object (which is what was happening before)
    and continue merging
else:
    ...
    the existing more specific checks for dtypes that are *certainly* incompatible that raise errors (eg in case of merging int on datetime64)

that would keep the dubious cases (where one of both columns is object dtype) working but with a warning to indicate to users they need to check their dtypes. But also keeps the informative errors introduced in 0.23.0 for those cases where we are sure it is incompatible.

@jreback
Copy link
Contributor

jreback commented Jun 7, 2018

I don't think warnings are a good idea. if you want to merge this, pls make the changes.

@jorisvandenbossche
Copy link
Member Author

I don't think warnings are a good idea.

Can you explain why?
A warning instead of error can give a similar indication of "there is something wrong with your dtypes", but without breaking working code.

@jreback
Copy link
Contributor

jreback commented Jun 7, 2018

because we don't do this for anything else, it is simply out of place here. If you start warning for this, then you need to warn for practiclaly any non-matched but ok to merge errors. that would be so much noise it would be incomprehensible. Either a merge is good or its an error.

@jorisvandenbossche
Copy link
Member Author

Either a merge is good or its an error.

The thing is that a merge is often good (meaning "working as expected", eg this broke a case in geopandas that was perfectly working before), and that it is very difficult for pandas to judge wether a merge is good or not.
I proposed a warning to at least be somewhat informative that there is something fishy about the types.

@jreback
Copy link
Contributor

jreback commented Jun 7, 2018

how is object type with not object type not an error?

note i wouldn’t object to allowing extension types to be a separate clause here (which may solve the issue for geopandas)

@jorisvandenbossche
Copy link
Member Author

how is object type with not object type not an error?

Because object can be by definition "anything". Whether we regard that as an error or not is something that we can discuss, but it is not that it "should" be an error (it can work, it worked before for certain cases).

note i wouldn’t object to allowing extension types to be a separate clause here (which may solve the issue for geopandas)

merging on two extension types will already work if they are equal dtypes.
But do you mean with the above that you would not error on merging "object + extension type" ?

@crepererum
Copy link
Contributor

Could we agree on shipping a minimal solution that fixes the described bug and discuss a broader / more general solution in another issue? I could take over this PR if @jorisvandenbossche wants.

@jorisvandenbossche
Copy link
Member Author

Yes, this will certainly be merged for 0.23.1 so at least the boolean case will be fixed. Updated the PR now (was just waiting with it in case we could get agreement on a more thorough solution).

@jreback
Copy link
Contributor

jreback commented Jun 8, 2018

lgtm @jorisvandenbossche if you want to create an issue to show a warning as suggested above then ok, otherwise merge on pass.

@jorisvandenbossche
Copy link
Member Author

(travis passed)

@jorisvandenbossche jorisvandenbossche merged commit 8d5032a into pandas-dev:master Jun 8, 2018
@jorisvandenbossche jorisvandenbossche deleted the merge-dtype-warning branch June 8, 2018 17:44
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 12, 2018
TomAugspurger pushed a commit that referenced this pull request Jun 12, 2018
david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pandas 0.23 fails on column<->index boolean merge
4 participants