-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Raise ValueError for non numerical join columns in merge_asof #34488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Raise ValueError for non numerical join columns in merge_asof #34488
Conversation
pandas/core/reshape/merge.py
Outdated
if is_object_dtype(right_values): | ||
raise ValueError("Right input is not from numerical dtype") | ||
|
||
if is_object_dtype(left_values.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be loosened to include other non-numeric dtypes for which this should raise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There exist tests for Timedelta and Datetime DataFrames, so this is expected to work I think?. Is there anything you would like to exclude?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be right, was just curious if this was indeed capturing all the "non-numeric" cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope so
I think I found a solution. I moved it as you requested to the overwritten _get_merge_keys function. We can only check the on columns, not the by columns for dtype object. So I take the length of the by columns and check all following columns. The check for equal dtypes is performed a few lines before that. |
pandas/core/reshape/merge.py
Outdated
@@ -1661,6 +1663,13 @@ def _get_merge_keys(self): | |||
) | |||
raise MergeError(msg) | |||
|
|||
if i >= len_by_cols: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we don't want to do this.
put your changes in the super call to _get_merge_keys. just do it upfront (that is what was not clear before).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean _maybe_coerce_merge_keys then? The dtype validation for the regular merge is done there.
The _get_merge_keys function I changed is already from _AsOfMerge, so I could not overwrite this again. This function overwrites the original _get_merge_keys function and calls the super() function before checking if both join cols have the same dtype
@phofl Can you merge master to fix conflicts? |
@phofl can you merge master and move the note to 1.3 |
� Conflicts: � doc/source/whatsnew/v1.2.0.rst
Done |
pandas/core/reshape/merge.py
Outdated
@@ -1798,6 +1799,24 @@ def _get_merge_keys(self): | |||
|
|||
return left_join_keys, right_join_keys, join_names | |||
|
|||
def _maybe_coerce_merge_keys(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not just a part of _validate_specification? seems like re-inventing the wheel here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is way easier after knowing the join keys than before. But I think we could probably move this into validate_specification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah would really like it to be part of the main check
� Conflicts: � doc/source/whatsnew/v1.3.0.rst
pandas/core/reshape/merge.py
Outdated
@@ -1798,6 +1799,24 @@ def _get_merge_keys(self): | |||
|
|||
return left_join_keys, right_join_keys, join_names | |||
|
|||
def _maybe_coerce_merge_keys(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah would really like it to be part of the main check
� Conflicts: � doc/source/whatsnew/v1.3.0.rst
pandas/core/reshape/merge.py
Outdated
else self.right.index.dtype | ||
) | ||
if is_object_dtype(lo_dtype) or is_object_dtype(ro_dtype): | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this a MergeError (as everthing else is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, issue mentioned ValueError, but MergeError is probably better
|
||
left = pd.DataFrame({"left_val": [1, 5, 10], "a": ["a", "b", "c"]}) | ||
right = pd.DataFrame({"right_val": [1, 2, 3, 6, 7], "a": [1, 2, 3, 6, 7]}) | ||
|
||
with pytest.raises(MergeError, match=msg): | ||
with pytest.raises(ValueError, match=msg): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert (see above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
thanks @phofl |
Hello, thought I'd add here a bug(?) that happened with the above change that was just released with pandas 1.3.0. Seems the "GH#29130 Check that merge keys do not have dtype object" makes an assumption that a merge key is not in the index, so the "self.right[self.right_on[0]].dtype" fails when self.right_on[0] is not a column of the dataframe (and equivalent for the left_on side). Perhaps those fields should never be in the index and this just exposes that? |
Could you provide a short example please? |
Sure, here is an example:
`~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/reshape/merge.py in merge_asof(left, right, on, left_on, right_on, left_index, right_index, by, left_by, right_by, suffixes, tolerance, allow_exact_matches, direction) ~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/reshape/merge.py in init(self, left, right, on, left_on, right_on, left_index, right_index, by, left_by, right_by, axis, suffixes, copy, fill_method, how, tolerance, allow_exact_matches, direction) ~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/reshape/merge.py in init(self, left, right, on, left_on, right_on, left_index, right_index, axis, suffixes, copy, fill_method, how) ~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/reshape/merge.py in init(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate) ~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/reshape/merge.py in _validate_specification(self) ~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/frame.py in getitem(self, key) ~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) KeyError: 'date' |
I have fixed this already, will be in 1.3.1 |
lol, wow, that was a fast reply, thank you! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
I checked for dtype object. Normally, I would try to check for allowed dtypes, but that would require a lot of dtype checks instead of only one. I hope object dtype is sufficient to exclude