-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Allow merging on object / non-object column #21681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow merging on object / non-object column #21681
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21681 +/- ##
=======================================
Coverage 92.31% 92.31%
=======================================
Files 166 166
Lines 52391 52404 +13
=======================================
+ Hits 48363 48375 +12
- Misses 4028 4029 +1
Continue to review full report at Codecov.
|
-1 in this - if u want to merge an object column with a non object column then coerce to object this will propagate objects columns very easily - against the principle of strongly dtyping things |
The concrete case was a regression in the released version of geopandas because of the change in pandas to start raising errors on merging with unequal dtypes. This actual case was caused because we first merge an empty frame with another one, which creates object columns of NaN, and then merging the result yet with another, where the object column of NaN is merged on a float column (integers with NaNs), raising the error. But my main reason is what I said in the top post: object dtype can be anything => pandas should not make assumptions on what it thinks is valid/invalid for operating on such data. That's up to the user. It also boils down a bit to that the original change to start erroring should ideally have been done with a deprecation warning instead of directly erroring, so it would not cause regression directly. |
There was actually a question on SO today reporting a problem in geopandas caused by this: https://stackoverflow.com/questions/51138156/how-can-i-use-sjoin-to-merge-two-geodataframes-without-an-error. In a certain way, you could also see this PR as a regression fix. We probably should have made such a change with a deprecation warning first (if we want the original change). |
I disagree here. merging on non-alike dtypes is simply wrong. objects cannot merge with non-object, full stop. Sure we have a special case for boolean, but allowing object with integer is just a disaster. ' If you want to try a bit more inference I am ok with that. e.g. say you have an object column, then am ok with running |
They can perfectly well, depending on what is inside the object dtype column. It's not about "cannot", it is about "do we allow it or not". And for the record, until one month ago pandas has always allowed it.
I have thought about that as well, but inference can be costly, so I would rather not do it. Are you OK with raising an informative warning instead? Where we clearly indicate to the user what is happening and this typically points to a problem with their types? (but at least it allow people to ignore the warning, or to fix it only later) (I thought you hinted on being OK with that in the previous PR, but not sure) |
BTW, input from some other devs is certainly welcome, Jeff and I clearly disagree on this .. :-) |
I guess that is ok. The original reason that this was changed is to avoid the all-NaN problem (e.g. you merge with something that has no matches because of the dtype mismatch). I don't want to go back to that as its worse from a user POV. open to compromise but would rather be strict here. (and inference is not generally expensive compared to a borked merge :>) |
I'm +1 on being strict. At the very least if the end user is explicit about types they stand to benefit performance- and resource-wise instead of pandas falling back to |
yeah, that is true .. :)
I just realized that I actually think a warning can be better from a user POV. Instead of the error, you would then see the NaNs, together with the informative warning explaining why the NaNs are there and how to solve it. I think this could actually be better to get what is happening than only the error. And this also alleviates the direct code breakages caused by the change in 0.23.0 to be more strict.
Yes, but note that there can be good reasons to want or to end up with object data, and not all use cases of pandas are performance sensitive (eg small data). |
@jorisvandenbossche can you rebase this |
pandas/core/reshape/merge.py
Outdated
@@ -946,11 +947,14 @@ def _maybe_coerce_merge_keys(self): | |||
"you should use pd.concat".format(lk_dtype=lk.dtype, | |||
rk_dtype=rk.dtype)) | |||
|
|||
coerce_to_object = False | |||
if is_object_dtype(lk) or is_object_dtype(rk): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this actually necessary here? e.g. it would hit the else clause if this is true as well.
Then can just do the conversions in the else. (e.g. stuff on line 1011)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this we hit
956 elif ((is_numeric_dtype(lk) and not is_bool_dtype(lk))
957 and not is_numeric_dtype(rk)):
--> 958 raise ValueError(msg)
and raise.
@jorisvandenbossche can you merge master |
I've had colleages come and complain to me about this recently - especially since this used to work. I would advocate to be strict, but based on the inferred type. The case where it's an all-integer index hiding behind an object dtype would be easily solved through that, and the other cases would continue to emit the warning. |
@jorisvandenbossche can you merge master |
…e-merge-object-columns
Hello @jorisvandenbossche! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on December 27, 2018 at 12:56 Hours UTC |
Fixed the merge conflicts. Haven't taken a look at the changes yet. |
I pushed a commit which fixes this up in a reasonably clean way. if folks can look at the cases and see if anything glaring pops out. We have a few odd cases already, e.g. bools are inferred if possible and we actually allow stringlike non-unicode strings to infer, but these worked before. cc @pandas-dev/pandas-core |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
May leave this open for a day in case @jorisvandenbossche is around to look tomorrow, but I think this is good to go.
@jreback So basically you reverted to allow again all combinations of an object dtype with a non-object dtype? |
yes anything with object just forces conversion to object we |
added a commit to be more restrictive. |
@jorisvandenbossche if you have a chance |
@TomAugspurger can you look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
thanks @jorisvandenbossche ! |
@jreback and thanks for finishing the PR! One thing I note now, after your last comment to be a bit more strict on merging object and numeric columns, the whatsnew note is not fully correct any more. I am also wondering a bit how costly this inferring of the type for object columns is. |
yeah ok we are a bit more strict so whatsnew could be sligthly better. inference is << cost of actually merging things, its a no-brainer and we do it all over the place. |
I am running into this issue and found this thread. Was this change pushed to pandas or can we not merge object/non-object type. Here is my code:
Error traceback:
|
Follow-up on discussion in #21310 (cc @jreback )
What this does: allow
merge
on an object and a non-object column (eg object and int, object and categorical, ..), by coercing the non-object column to object dtype.Reasoning:
The main disadvantage is that we no longer detect the case of merging object strings with eg integer column, but IMO we cannot avoid such cases as long as we don't have a proper string dtype but use object dtype for this (you can also mix string and ints in a column, we also don't raise for that for a similar reason).
Additional option would be to issue a warning in this case.
Closes #23733