REGR: allow merging on object boolean columns #21310

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

jorisvandenbossche merged 6 commits into pandas-dev:master from jorisvandenbossche:merge-dtype-warning

Jun 8, 2018

Member

jorisvandenbossche commented Jun 4, 2018

Fixes #21119

This is a quick/hacky fix for the specific case in that issue, i.e. merging boolean / object boolean columns.

However, I have the feeling we should do a more general solution for cases where object dtyped columns might be actually mergeable. The problem is that inferring the type can be costly? (and eg when having actual strings object columns, this inferring was also unnecessary and thus will unneeded slow down merge?)


          REGR: allow merging on object boolean columns

0dfd01e

jorisvandenbossche added Reshaping Regression labels

jorisvandenbossche added this to the 0.23.1 milestone

jorisvandenbossche mentioned this pull request

pandas 0.23 fails on column<->index boolean merge #21119

Closed

codecov bot commented Jun 4, 2018 •

edited

Loading

Codecov Report

❗ No coverage uploaded for pull request base (master@05e55aa). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #21310   +/-   ##
=========================================
  Coverage          ?   91.89%           
=========================================
  Files             ?      153           
  Lines             ?    49579           
  Branches          ?        0           
=========================================
  Hits              ?    45559           
  Misses            ?     4020           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.29% <100%> (?)`
#single	`41.85% <0%> (?)`

Impacted Files	Coverage Δ
pandas/core/reshape/merge.py	`94.25% <100%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05e55aa...ad845d7. Read the comment docs.

Contributor

jreback commented Jun 4, 2018

this is not hacky it’s inline with what we normally do

i would add a comment

jreback requested changes

View reviewed changes

pandas/tests/reshape/merge/test_merge.py Outdated

@@ @@ -1526,6 +1526,17 @@ def test_merge_on_ints_floats_warning(self): @@
                           result = B.merge(A, left_on='Y', right_on='X')
                           assert_frame_equal(result, expected[['Y', 'X']])
+                  def test_merge_incompat_infer_object(self):

Contributor

jreback Jun 4, 2018

this is not a great test name, it should be related to bool.

pandas/tests/reshape/merge/test_merge.py Outdated

+                  def test_merge_incompat_infer_object(self):
+                      # GH21119
+                      df1 = DataFrame({'key': Series([True, False], dtype=object)})
+                      df2 = DataFrame({'key': [True, False]})

Contributor

jreback Jun 4, 2018

you need to have a NaN in the object dtype as well, this should test inferred and bool on both sides (with and w/o NaN).

pandas/tests/reshape/merge/test_merge.py

+                      df2 = DataFrame({'key': [True, False]})
+                      expected = DataFrame({'key': [True, False]}, dtype=object)
+                      result = pd.merge(df1, df2, on='key')

Contributor

jreback Jun 4, 2018

need to test the error case (e.g. bool / inferred bool) on 1 side and numeric on the other (I think this will raise), prob have an existing test, just need to add to it

pandas/tests/reshape/merge/test_merge.py

+                      expected = DataFrame({'key': [True, False]}, dtype=object)
+                      result = pd.merge(df1, df2, on='key')
+                      assert_frame_equal(result, expected)
+                      result = pd.merge(df2, df1, on='key')

Contributor

jreback Jun 4, 2018

blank lines between cases

pandas/tests/reshape/merge/test_merge.py

+                      result = pd.merge(df1, df2, on='key')
+                      assert_frame_equal(result, expected)
+                      result = pd.merge(df2, df1, on='key')
+                      assert_frame_equal(result, expected)

Contributor

jreback Jun 4, 2018

ideally use the how fixture

Member Author

jorisvandenbossche Jun 8, 2018

This is also not used for the other tests on this, so leaving this for now

Member Author

jorisvandenbossche commented Jun 4, 2018

this is not hacky it’s inline with what we normally do

What I mean is that it only solves a very specific case, while there might be other cases that pop-up later that might need similar fixes (although that the boolean case probably is the most important one).

There are other reasons why you could have for some reason a column of object dtype, where you could perfectly merge it with another column of properly typed data, was it not for us prematurely erroring on it.

A more 'permissive' solution would be to only error for "clear" errors (eg the int on datetime, or tz-aware with non-tz aware), and leave anything with object columns as it was before (leave it to the merge algorithm to see whether there is something to merge).

Contributor

jreback commented Jun 4, 2018

@jorisvandenbossche I don't know about that. We are pretty strict and only error on obvious things (e.g. we allow ints to merge regardless of kinds). so ok with this for now.

Member Author

jorisvandenbossche commented Jun 4, 2018

only error on obvious things

object dtyped columns are not that obvious, they can occur for many reasons.

Eg in geopandas I had the problem that we created a columns of NaNs that was of object dtype in a specific case (due to a previous merge of an empty and non empty frame (and if you create an empty frame it gets object dtype if you don't specify anything specific), and then merging that again gave this error, while before it perfectly worked, as merging object NaNs on another numerical column is no problem.

Typically those issues are solvable (as my example case), but can be tricky to find, and leads to annoying errors in working code in downstream users before it gets fixed.

Member Author

jorisvandenbossche commented Jun 4, 2018

For example, we could also have it as warnings instead of errors, to indicate something might be going wrong with your dtypes.

Contributor

jreback commented Jun 4, 2018

we have been erroring in this for a while
warnings are a backwards step and imho generally ignored anyhow

Member Author

jorisvandenbossche commented Jun 5, 2018

we have been erroring in this for a while

I don't think that is correct. As far as I can see, we only started erroring in 0.23.0: #18674

Contributor

jreback commented Jun 5, 2018

that was in 0.22

Contributor

crepererum commented Jun 5, 2018

that was in 0.22

That's weird. The test case works in 0.22.0 but fails in 0.23.0.

Member Author

jorisvandenbossche commented Jun 5, 2018

@jreback on the time of merging it was 0.22, but that became 0.23 eventually :) Anyway, so it is only from the last release.

Member Author

jorisvandenbossche commented Jun 5, 2018

To be more explicit about the the alternative I was thinking about, something like this (using pseudo-code):

if (is_object_dtype(left) and not is_object_dtype(right)) or (the other way around):
    warn("You are trying to merge columns of object and {other.dtype} dtype.
          This might indicate an error ...")
    coerce non-object to object (which is what was happening before)
    and continue merging
else:
    ...
    the existing more specific checks for dtypes that are *certainly* incompatible that raise errors (eg in case of merging int on datetime64)

that would keep the dubious cases (where one of both columns is object dtype) working but with a warning to indicate to users they need to check their dtypes. But also keeps the informative errors introduced in 0.23.0 for those cases where we are sure it is incompatible.

Contributor

jreback commented Jun 7, 2018

I don't think warnings are a good idea. if you want to merge this, pls make the changes.

Member Author

jorisvandenbossche commented Jun 7, 2018

I don't think warnings are a good idea.

Can you explain why?
A warning instead of error can give a similar indication of "there is something wrong with your dtypes", but without breaking working code.

Contributor

jreback commented Jun 7, 2018

because we don't do this for anything else, it is simply out of place here. If you start warning for this, then you need to warn for practiclaly any non-matched but ok to merge errors. that would be so much noise it would be incomprehensible. Either a merge is good or its an error.

Member Author

jorisvandenbossche commented Jun 7, 2018

Either a merge is good or its an error.

The thing is that a merge is often good (meaning "working as expected", eg this broke a case in geopandas that was perfectly working before), and that it is very difficult for pandas to judge wether a merge is good or not.
I proposed a warning to at least be somewhat informative that there is something fishy about the types.

Contributor

jreback commented Jun 7, 2018

how is object type with not object type not an error?

note i wouldn’t object to allowing extension types to be a separate clause here (which may solve the issue for geopandas)

Member Author

jorisvandenbossche commented Jun 7, 2018

how is object type with not object type not an error?

Because object can be by definition "anything". Whether we regard that as an error or not is something that we can discuss, but it is not that it "should" be an error (it can work, it worked before for certain cases).

note i wouldn’t object to allowing extension types to be a separate clause here (which may solve the issue for geopandas)

merging on two extension types will already work if they are equal dtypes.
But do you mean with the above that you would not error on merging "object + extension type" ?

Contributor

crepererum commented Jun 8, 2018

Could we agree on shipping a minimal solution that fixes the described bug and discuss a broader / more general solution in another issue? I could take over this PR if @jorisvandenbossche wants.

jorisvandenbossche added 4 commits

June 8, 2018 18:12


          add more test cases


          add comment

fe45b89


          Merge remote-tracking branch 'upstream/master' into merge-dtype-warning

1bd8661


          add whatsnew

58b72c5

Member Author

jorisvandenbossche commented Jun 8, 2018

Yes, this will certainly be merged for 0.23.1 so at least the boolean case will be fixed. Updated the PR now (was just waiting with it in case we could get agreement on a more thorough solution).

jreback approved these changes

View reviewed changes

Contributor

jreback commented Jun 8, 2018

lgtm @jorisvandenbossche if you want to create an issue to show a warning as suggested above then ok, otherwise merge on pass.

Member Author

jorisvandenbossche commented Jun 8, 2018

(travis passed)


          Merge branch 'master' into merge-dtype-warning

ad845d7

jorisvandenbossche merged commit 8d5032a into pandas-dev:master

jorisvandenbossche deleted the merge-dtype-warning branch

June 8, 2018 17:44

jorisvandenbossche added the Needs Backport label

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request


          REGR: allow merging on object boolean columns (pandas-dev#21310)

64409d9

(cherry picked from commit 8d5032a)

TomAugspurger pushed a commit that referenced this pull request


          REGR: allow merging on object boolean columns (#21310)

9440efe

(cherry picked from commit 8d5032a)

TomAugspurger removed the Needs Backport label

david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request


          REGR: allow merging on object boolean columns (pandas-dev#21310)

941cc8e

jorisvandenbossche mentioned this pull request

Allow merging on object / non-object column #21681

Merged

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request


          REGR: allow merging on object boolean columns (pandas-dev#21310)

7e36e4f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Regression Reshaping