Skip to content

BUG: Raise ValueError for non numerical join columns in merge_asof #34488

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

phofl
Copy link
Member

@phofl phofl commented May 30, 2020

I checked for dtype object. Normally, I would try to check for allowed dtypes, but that would require a lot of dtype checks instead of only one. I hope object dtype is sufficient to exclude

if is_object_dtype(right_values):
raise ValueError("Right input is not from numerical dtype")

if is_object_dtype(left_values.dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be loosened to include other non-numeric dtypes for which this should raise?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There exist tests for Timedelta and Datetime DataFrames, so this is expected to work I think?. Is there anything you would like to exclude?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be right, was just curious if this was indeed capturing all the "non-numeric" cases

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope so

@dsaxton dsaxton added Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 30, 2020
@pep8speaks
Copy link

pep8speaks commented May 30, 2020

Hello @phofl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-05 18:33:30 UTC

@phofl phofl changed the title BUG: Raise ValueError for non numerical join columns in merge_asof BUG: Raise TypeError for non numerical join columns in merge_asof May 30, 2020
@phofl phofl changed the title BUG: Raise TypeError for non numerical join columns in merge_asof BUG: Raise ValueError for non numerical join columns in merge_asof May 31, 2020
@phofl
Copy link
Member Author

phofl commented Jun 1, 2020

@jreback

I think I found a solution. I moved it as you requested to the overwritten _get_merge_keys function. We can only check the on columns, not the by columns for dtype object. So I take the length of the by columns and check all following columns. The check for equal dtypes is performed a few lines before that.

@@ -1661,6 +1663,13 @@ def _get_merge_keys(self):
)
raise MergeError(msg)

if i >= len_by_cols:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we don't want to do this.

put your changes in the super call to _get_merge_keys. just do it upfront (that is what was not clear before).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean _maybe_coerce_merge_keys then? The dtype validation for the regular merge is done there.

The _get_merge_keys function I changed is already from _AsOfMerge, so I could not overwrite this again. This function overwrites the original _get_merge_keys function and calls the super() function before checking if both join cols have the same dtype

@dsaxton
Copy link
Member

dsaxton commented Sep 16, 2020

@phofl Can you merge master to fix conflicts?

@github-actions github-actions bot added the Stale label Nov 5, 2020
@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

@phofl can you merge master and move the note to 1.3

@phofl
Copy link
Member Author

phofl commented Dec 29, 2020

Done

@jreback jreback removed the Stale label Dec 29, 2020
@@ -1798,6 +1799,24 @@ def _get_merge_keys(self):

return left_join_keys, right_join_keys, join_names

def _maybe_coerce_merge_keys(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not just a part of _validate_specification? seems like re-inventing the wheel here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is way easier after knowing the join keys than before. But I think we could probably move this into validate_specification

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah would really like it to be part of the main check

@@ -1798,6 +1799,24 @@ def _get_merge_keys(self):

return left_join_keys, right_join_keys, join_names

def _maybe_coerce_merge_keys(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah would really like it to be part of the main check

else self.right.index.dtype
)
if is_object_dtype(lo_dtype) or is_object_dtype(ro_dtype):
raise ValueError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this a MergeError (as everthing else is)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, issue mentioned ValueError, but MergeError is probably better


left = pd.DataFrame({"left_val": [1, 5, 10], "a": ["a", "b", "c"]})
right = pd.DataFrame({"right_val": [1, 2, 3, 6, 7], "a": [1, 2, 3, 6, 7]})

with pytest.raises(MergeError, match=msg):
with pytest.raises(ValueError, match=msg):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert (see above)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jreback jreback added this to the 1.3 milestone Jan 6, 2021
@jreback jreback merged commit cd0224d into pandas-dev:master Jan 6, 2021
@jreback
Copy link
Contributor

jreback commented Jan 6, 2021

thanks @phofl

@phofl phofl deleted the 29130_raise_value_error_non_numerical branch January 6, 2021 00:17
@paulg-quantinno
Copy link

Hello, thought I'd add here a bug(?) that happened with the above change that was just released with pandas 1.3.0. Seems the "GH#29130 Check that merge keys do not have dtype object" makes an assumption that a merge key is not in the index, so the "self.right[self.right_on[0]].dtype" fails when self.right_on[0] is not a column of the dataframe (and equivalent for the left_on side). Perhaps those fields should never be in the index and this just exposes that?

@phofl
Copy link
Member Author

phofl commented Jul 23, 2021

Could you provide a short example please?

@paulg-quantinno
Copy link

Sure, here is an example:

data1 = pd.DataFrame({
            'name': ['mary', 'randy'],
            'age': [33, 23],
            'city': ['nyc', 'chi'],
            'date': [pd.to_datetime('20210303'), pd.to_datetime('20210303')],
        })

data2 = pd.DataFrame({
            'name': ['pete', 'mary'],
            'age': [22, 44],
            'city': ['nyc', 'phi'],
            'date': [pd.to_datetime('20210303'), pd.to_datetime('20210505')],
        })

data2= data2.set_index('date')

together = pd.merge_asof(data1.sort_values('date'), data2.sort_values('date'), left_on='date', right_on='date', left_by='city', right_by='city', tolerance=pd.Timedelta('2D'))

`~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/reshape/merge.py in merge_asof(left, right, on, left_on, right_on, left_index, right_index, by, left_by, right_by, suffixes, tolerance, allow_exact_matches, direction)
579 4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
580 """
--> 581 op = _AsOfMerge(
582 left,
583 right,

~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/reshape/merge.py in init(self, left, right, on, left_on, right_on, left_index, right_index, by, left_by, right_by, axis, suffixes, copy, fill_method, how, tolerance, allow_exact_matches, direction)
1734 self.direction = direction
1735
-> 1736 _OrderedMerge.init(
1737 self,
1738 left,

~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/reshape/merge.py in init(self, left, right, on, left_on, right_on, left_index, right_index, axis, suffixes, copy, fill_method, how)
1617
1618 self.fill_method = fill_method
-> 1619 _MergeOperation.init(
1620 self,
1621 left,

~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/reshape/merge.py in init(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
680 warnings.warn(msg, FutureWarning, stacklevel=3)
681
--> 682 self._validate_specification()
683
684 cross_col = None

~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/reshape/merge.py in _validate_specification(self)
1782 )
1783 ro_dtype = (
-> 1784 self.right[self.right_on[0]].dtype
1785 if not self.right_index
1786 else self.right.index.dtype

~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/frame.py in getitem(self, key)
3453 if self.columns.nlevels > 1:
3454 return self._getitem_multilevel(key)
-> 3455 indexer = self.columns.get_loc(key)
3456 if is_integer(indexer):
3457 indexer = [indexer]

~/anaconda3/envs/dailytestenv/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'date'
`

@phofl
Copy link
Member Author

phofl commented Jul 25, 2021

I have fixed this already, will be in 1.3.1

@paulg-quantinno
Copy link

lol, wow, that was a fast reply, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

merge_asof() returns cryptic error with bad argument types
6 participants