Commit 496e915

author
Nick Eubank
committed
add validate argument to merge
1 parent ba60321 commit 496e915

6 files changed

+236
-16
lines changed

doc/source/merging.rst

+49-3
Original file line numberDiff line numberDiff line change
@@ -513,7 +513,8 @@ standard database join operations between DataFrame objects:
513513

514514
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
515515
left_index=False, right_index=False, sort=True,
516-
suffixes=('_x', '_y'), copy=True, indicator=False)
516+
suffixes=('_x', '_y'), copy=True, indicator=False,
517+
validate=None)
517518

518519
- ``left``: A DataFrame object
519520
- ``right``: Another DataFrame object
@@ -551,6 +552,18 @@ standard database join operations between DataFrame objects:
551552

552553
.. versionadded:: 0.17.0
553554

555+
- ``validate`` : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many", "many_to_one", "many_to_many"}, default None
556+
If specified, checks if merge is of specified type.
557+
* "one_to_one" or "1:1": check if merge keys are unique in both
558+
left and right dataset.
559+
* "one_to_many" or "1:m": check if merge keys are unique in left
560+
dataset.
561+
* "many_to_one" or "m:1": check if merge keys are unique in right
562+
dataset.
563+
564+
.. versionadded:: 0.21.0
565+
566+
554567
The return type will be the same as ``left``. If ``left`` is a ``DataFrame``
555568
and ``right`` is a subclass of DataFrame, the return type will still be
556569
``DataFrame``.
@@ -711,10 +724,43 @@ Here is another example with duplicate join keys in DataFrames:
711724
labels=['left', 'right'], vertical=False);
712725
plt.close('all');
713726
727+
714728
.. warning::
715729

716-
Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions,
717-
may result in memory overflow. It is the user' s responsibility to manage duplicate values in keys before joining large DataFrames.
730+
Joining / merging on duplicate keys can return a frame whose row count is the product of the row dimensions, which may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
731+
732+
.. _merging.validation:
733+
734+
Checking for duplicate keys
735+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
736+
737+
.. versionadded:: 0.21.0
738+
739+
Users can use the ``validate`` argument to automatically check whether there are unexpected duplicates in their merge keys. Key uniqueness is checked before merge operations and so should protect against memory overflows. Checking key uniqueness is also a good way to ensure user data structures are as expected.
740+
741+
In the following example, there are duplicate values of ``B`` in the right DataFrame. Because the ``validate`` argument requests a one-to-one merge and the keys are not unique on both sides, an exception will be raised.
742+
743+
.. code-block:: python
744+
745+
left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
746+
right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
747+
result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
748+
749+
ValueError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
750+
751+
752+
If the user is aware of the duplicates in the right ``DataFrame`` but wants to ensure there are no duplicates in the left ``DataFrame``, one can pass ``validate="one_to_many"`` instead, which will not raise an exception.
753+
754+
.. ipython:: python
755+
:suppress:
756+
757+
left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
758+
right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
759+
760+
.. ipython:: python
761+
762+
pd.merge(left, right, on='B', how='outer', validate="one_to_many")
763+
718764
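The two documentation examples above can be combined into one runnable sketch. The helper name `try_merge` is hypothetical and exists only for illustration. The sketch catches `ValueError`, which is what this commit raises; it assumes any later, more specific merge exception still subclasses `ValueError`.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})


def try_merge(validate):
    """Return the merged frame, or the error message if validation fails.

    Hypothetical helper, not part of pandas.
    """
    try:
        return pd.merge(left, right, on="B", how="outer", validate=validate)
    except ValueError as err:
        return str(err)


# Keys in ``right`` repeat, so a one-to-one check fails before merging.
one_to_one = try_merge("one_to_one")

# Only the left keys must be unique here, so the merge goes through.
one_to_many = try_merge("one_to_many")

print(one_to_one)
print(one_to_many)
```

Returning the message instead of re-raising keeps the example linear; in real code it is usually better to let the exception propagate.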
719765
.. _merging.indicator:
720766

doc/source/whatsnew/v0.21.0.txt

+9-2
@@ -20,13 +20,20 @@ Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations
2020
New features
2121
~~~~~~~~~~~~
2222

23-
24-
2523
.. _whatsnew_0210.enhancements.other:
2624

2725
Other Enhancements
2826
^^^^^^^^^^^^^^^^^^
2927

28+
.. _whatsnew_0210.enhancements.other.merge_validate:
29+
30+
``validate`` argument checks merge key uniqueness
31+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
32+
33+
The new ``validate`` argument of :func:`merge` checks whether a merge is
34+
one-to-one, one-to-many, many-to-one, or many-to-many. If the merge keys do not
35+
match the specified merge type, an exception is raised. (:issue:`16270`)
36+
3037

3138

3239
.. _whatsnew_0210.api_breaking:

pandas/core/frame.py

+15-2
@@ -174,6 +174,18 @@
174174
175175
.. versionadded:: 0.17.0
176176
177+
validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
178+
"many_to_one", "many_to_many"}, default None
179+
If specified, checks if merge is of specified type.
180+
* "one_to_one" or "1:1": check if merge keys are unique in both
181+
left and right dataset.
182+
* "one_to_many" or "1:m": check if merge keys are unique in left
183+
dataset.
184+
* "many_to_one" or "m:1": check if merge keys are unique in right
185+
dataset.
186+
187+
.. versionadded:: 0.21.0
188+
177189
Examples
178190
--------
179191
@@ -4812,12 +4824,13 @@ def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',
48124824
@Appender(_merge_doc, indents=2)
48134825
def merge(self, right, how='inner', on=None, left_on=None, right_on=None,
48144826
left_index=False, right_index=False, sort=False,
4815-
suffixes=('_x', '_y'), copy=True, indicator=False):
4827+
suffixes=('_x', '_y'), copy=True, indicator=False,
4828+
validate=None):
48164829
from pandas.core.reshape.merge import merge
48174830
return merge(self, right, how=how, on=on, left_on=left_on,
48184831
right_on=right_on, left_index=left_index,
48194832
right_index=right_index, sort=sort, suffixes=suffixes,
4820-
copy=copy, indicator=indicator)
4833+
copy=copy, indicator=indicator, validate=validate)
48214834

48224835
def round(self, decimals=0, *args, **kwargs):
48234836
"""

pandas/core/reshape/merge.py

+80-9
@@ -46,11 +46,13 @@
4646
@Appender(_merge_doc, indents=0)
4747
def merge(left, right, how='inner', on=None, left_on=None, right_on=None,
4848
left_index=False, right_index=False, sort=False,
49-
suffixes=('_x', '_y'), copy=True, indicator=False):
49+
suffixes=('_x', '_y'), copy=True, indicator=False,
50+
validate=None):
5051
op = _MergeOperation(left, right, how=how, on=on, left_on=left_on,
5152
right_on=right_on, left_index=left_index,
5253
right_index=right_index, sort=sort, suffixes=suffixes,
53-
copy=copy, indicator=indicator)
54+
copy=copy, indicator=indicator,
55+
validate=validate)
5456
return op.get_result()
5557

5658

@@ -263,7 +265,8 @@ def merge_asof(left, right, on=None,
263265
suffixes=('_x', '_y'),
264266
tolerance=None,
265267
allow_exact_matches=True,
266-
direction='backward'):
268+
direction='backward',
269+
validate=None):
267270
"""Perform an asof merge. This is similar to a left-join except that we
268271
match on nearest key rather than equal keys.
269272
@@ -341,6 +344,19 @@ def merge_asof(left, right, on=None,
341344
342345
.. versionadded:: 0.20.0
343346
347+
validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
348+
"many_to_one", "many_to_many"}, default None
349+
If specified, checks if merge is of specified type.
350+
* "one_to_one" or "1:1": check if merge keys are unique in both
351+
left and right dataset.
352+
* "one_to_many" or "1:m": check if merge keys are unique in left
353+
dataset.
354+
* "many_to_one" or "m:1": check if merge keys are unique in right
355+
dataset.
356+
357+
.. versionadded:: 0.21.0
358+
359+
344360
Returns
345361
-------
346362
merged : DataFrame
@@ -482,7 +498,7 @@ def merge_asof(left, right, on=None,
482498
suffixes=suffixes,
483499
how='asof', tolerance=tolerance,
484500
allow_exact_matches=allow_exact_matches,
485-
direction=direction)
501+
direction=direction, validate=validate)
486502
return op.get_result()
487503

488504

@@ -498,7 +514,8 @@ class _MergeOperation(object):
498514
def __init__(self, left, right, how='inner', on=None,
499515
left_on=None, right_on=None, axis=1,
500516
left_index=False, right_index=False, sort=True,
501-
suffixes=('_x', '_y'), copy=True, indicator=False):
517+
suffixes=('_x', '_y'), copy=True, indicator=False,
518+
validate=None):
502519
self.left = self.orig_left = left
503520
self.right = self.orig_right = right
504521
self.how = how
@@ -561,6 +578,12 @@ def __init__(self, left, right, how='inner', on=None,
561578
# to avoid incompat dtypes
562579
self._maybe_coerce_merge_keys()
563580

581+
# If argument passed to validate,
582+
# check if columns specified as unique
583+
# are in fact unique.
584+
if validate is not None:
585+
self._validate(validate)
586+
564587
def get_result(self):
565588
if self.indicator:
566589
self.left, self.right = self._indicator_pre_merge(
@@ -952,6 +975,51 @@ def _validate_specification(self):
952975
if len(self.right_on) != len(self.left_on):
953976
raise ValueError("len(right_on) must equal len(left_on)")
954977

978+
def _validate(self, validate):
979+
980+
# Check uniqueness of each
981+
if self.left_index:
982+
left_unique = not (self.orig_left.index.duplicated()).any()
983+
else:
984+
left_unique = MultiIndex.from_arrays(self.left_join_keys
985+
).is_unique
986+
987+
if self.right_index:
988+
right_unique = not (self.orig_right.index.duplicated()).any()
989+
else:
990+
right_unique = MultiIndex.from_arrays(self.right_join_keys
991+
).is_unique
992+
993+
# Check valid arg
994+
if validate not in ['one_to_one', '1:1',
995+
'one_to_many', '1:m',
996+
'many_to_one', 'm:1',
997+
'many_to_many', 'm:m']:
998+
999+
raise ValueError("Not a valid argument for validate")
1000+
1001+
# Check data integrity
1002+
if validate in ["one_to_one", "1:1"]:
1003+
if not left_unique or not right_unique:
1004+
raise ValueError("Merge keys are not unique in either left"
1005+
" or right dataset; not a one-to-one merge")
1006+
if not left_unique:
1007+
raise ValueError("Merge keys are not unique in left dataset;"
1008+
" not a one-to-one merge")
1009+
if not right_unique:
1010+
raise ValueError("Merge keys are not unique in right dataset;"
1011+
" not a one-to-one merge")
1012+
1013+
if validate in ["one_to_many", "1:m"]:
1014+
if not left_unique:
1015+
raise ValueError("Merge keys are not unique in left dataset;"
1016+
" not a one-to-many merge")
1017+
1018+
if validate in ["many_to_one", "m:1"]:
1019+
if not right_unique:
1020+
raise ValueError("Merge keys are not unique in right dataset;"
1021+
" not a many-to-one merge")
1022+
9551023

9561024
def _get_join_indexers(left_keys, right_keys, sort=False, how='inner',
9571025
**kwargs):
@@ -1004,15 +1072,17 @@ class _OrderedMerge(_MergeOperation):
10041072
def __init__(self, left, right, on=None, left_on=None, right_on=None,
10051073
left_index=False, right_index=False, axis=1,
10061074
suffixes=('_x', '_y'), copy=True,
1007-
fill_method=None, how='outer'):
1075+
fill_method=None, how='outer',
1076+
validate=None):
10081077

10091078
self.fill_method = fill_method
10101079
_MergeOperation.__init__(self, left, right, on=on, left_on=left_on,
10111080
left_index=left_index,
10121081
right_index=right_index,
10131082
right_on=right_on, axis=axis,
10141083
how=how, suffixes=suffixes,
1015-
sort=True # factorize sorts
1084+
sort=True, # factorize sorts
1085+
validate=validate
10161086
)
10171087

10181088
def get_result(self):
@@ -1109,7 +1179,7 @@ def __init__(self, left, right, on=None, left_on=None, right_on=None,
11091179
fill_method=None,
11101180
how='asof', tolerance=None,
11111181
allow_exact_matches=True,
1112-
direction='backward'):
1182+
direction='backward', validate=None):
11131183

11141184
self.by = by
11151185
self.left_by = left_by
@@ -1122,7 +1192,8 @@ def __init__(self, left, right, on=None, left_on=None, right_on=None,
11221192
right_on=right_on, left_index=left_index,
11231193
right_index=right_index, axis=axis,
11241194
how=how, suffixes=suffixes,
1125-
fill_method=fill_method)
1195+
fill_method=fill_method,
1196+
validate=validate)
11261197

11271198
def _validate_specification(self):
11281199
super(_AsOfMerge, self)._validate_specification()
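The uniqueness check added in `_validate` can be read in isolation as the sketch below. This is a simplified standalone version, not the code from the commit: the names `check_merge_keys` and `_ALIASES` are hypothetical, and it checks the argument value before computing uniqueness, whereas the diff does it the other way round.

```python
import pandas as pd

# Hypothetical alias table mapping short forms to long forms.
_ALIASES = {
    "1:1": "one_to_one",
    "1:m": "one_to_many",
    "m:1": "many_to_one",
    "m:m": "many_to_many",
}


def check_merge_keys(left_keys, right_keys, validate):
    """Raise ValueError if the key arrays violate the requested merge type.

    ``left_keys`` and ``right_keys`` are lists of equal-length arrays,
    one array per key column.
    """
    kind = _ALIASES.get(validate, validate)
    if kind not in _ALIASES.values():
        raise ValueError("Not a valid argument for validate")

    # Rows of key columns are unique exactly when the MultiIndex
    # built from those columns is unique.
    left_unique = pd.MultiIndex.from_arrays(left_keys).is_unique
    right_unique = pd.MultiIndex.from_arrays(right_keys).is_unique

    label = kind.replace("_", "-")
    if kind in ("one_to_one", "one_to_many") and not left_unique:
        raise ValueError(
            "Merge keys are not unique in left dataset; not a %s merge" % label)
    if kind in ("one_to_one", "many_to_one") and not right_unique:
        raise ValueError(
            "Merge keys are not unique in right dataset; not a %s merge" % label)
    # "many_to_many" is accepted but triggers no uniqueness check.


# Left keys are unique, right keys repeat: valid as one-to-many.
check_merge_keys([[1, 2]], [[2, 2, 2]], "1:m")
```

Normalizing the aliases first means each side is tested once, rather than repeating the membership test for every merge type.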

pandas/tests/reshape/test_merge.py

+57
@@ -724,6 +724,63 @@ def test_indicator(self):
724724
how='outer', indicator=True)
725725
assert_frame_equal(test5, hand_coded_result)
726726

727+
def test_validation(self):
728+
left = DataFrame({'a': ['a', 'b', 'c', 'd'],
729+
'b': ['cat', 'dog', 'weasel', 'horse']},
730+
index=range(4))
731+
732+
right = DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
733+
'c': ['meow', 'bark', 'um... weasel noise?',
734+
'nay', 'chirp']},
735+
index=range(5))
736+
737+
merge(left, right, left_index=True, right_index=True, validate='1:1')
738+
merge(left, right, left_index=True, right_index=True,
739+
validate='one_to_one')
740+
merge(left, right, on='a', validate='1:1')
741+
merge(left, right, on='a', validate='one_to_one')
742+
743+
# Dups on right
744+
right_w_dups = right.append(pd.DataFrame({'a': ['e'], 'c': ['moo']},
745+
index=[4]))
746+
merge(left, right_w_dups, left_index=True, right_index=True,
747+
validate='one_to_many')
748+
749+
with pytest.raises(ValueError):
750+
merge(left, right_w_dups, left_index=True, right_index=True,
751+
validate='one_to_one')
752+
753+
with pytest.raises(ValueError):
754+
merge(left, right_w_dups, on='a', validate='one_to_one')
755+
756+
# Dups on left
757+
left_w_dups = left.append(pd.DataFrame({'a': ['a'], 'c': ['cow']},
758+
index=[3]))
759+
merge(left_w_dups, right, left_index=True, right_index=True,
760+
validate='many_to_one')
761+
762+
with pytest.raises(ValueError):
763+
merge(left_w_dups, right, left_index=True, right_index=True,
764+
validate='one_to_one')
765+
766+
with pytest.raises(ValueError):
767+
merge(left_w_dups, right, on='a', validate='one_to_one')
768+
769+
# Dups on both
770+
merge(left_w_dups, right_w_dups, on='a', validate='many_to_many')
771+
772+
with pytest.raises(ValueError):
773+
merge(left_w_dups, right_w_dups, left_index=True,
774+
right_index=True, validate='many_to_one')
775+
776+
with pytest.raises(ValueError):
777+
merge(left_w_dups, right_w_dups, on='a',
778+
validate='one_to_many')
779+
780+
# Check invalid arguments
781+
with pytest.raises(ValueError):
782+
merge(left, right, on='a', validate='jibberish')
783+
727784

728785
def _check_merge(x, y):
729786
for how in ['inner', 'left', 'outer']:
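The alias behaviour exercised by `test_validation` can also be checked outside the test suite, as in the sketch below. It assumes that each short form (`'1:1'`, `'1:m'`, ...) behaves identically to its long form, as the documentation in this commit states. The helper `raises_value_error` is hypothetical, and `pd.concat` stands in for `DataFrame.append`, which newer pandas releases no longer provide.

```python
import pandas as pd

left = pd.DataFrame({"a": ["a", "b", "c", "d"],
                     "b": ["cat", "dog", "weasel", "horse"]})
right = pd.DataFrame({"a": ["a", "b", "c", "d", "e"],
                      "c": ["meow", "bark", "um... weasel noise?",
                            "nay", "chirp"]})

# Duplicate the key "e" on the right side only.
right_w_dups = pd.concat(
    [right, pd.DataFrame({"a": ["e"], "c": ["moo"]})], ignore_index=True)


def raises_value_error(validate, left_df, right_df):
    """Return True if merging on column ``a`` fails validation."""
    try:
        pd.merge(left_df, right_df, on="a", validate=validate)
    except ValueError:
        return True
    return False


alias_pairs = [("1:1", "one_to_one"), ("1:m", "one_to_many"),
               ("m:1", "many_to_one"), ("m:m", "many_to_many")]

# Record, for every spelling, whether right-side duplicates are rejected.
results = {name: raises_value_error(name, left, right_w_dups)
           for pair in alias_pairs for name in pair}
print(results)
```

With duplicates only on the right, the merge types that require unique right keys (one-to-one and many-to-one) are the ones expected to fail.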

pandas/tests/reshape/test_merge_asof.py

+26
@@ -973,3 +973,29 @@ def test_on_float_by_int(self):
973973
columns=['symbol', 'exch', 'price', 'mpv'])
974974

975975
assert_frame_equal(result, expected)
976+
977+
def test_validate(self):
978+
979+
left = pd.DataFrame({'a': [1, 5, 10],
980+
'left_val': ['a', 'b', 'c']})
981+
right = pd.DataFrame({'a': [1, 2, 3, 6, 7],
982+
'right_val': [1, 2, 3, 6, 7]})
983+
# Simple run 1:1
984+
pd.merge_asof(left, right, on='a', validate="1:1")
985+
986+
# Dups on right
987+
right_w_dups = right.append(pd.DataFrame({'a': [7],
988+
'right_val': [-2]}))
989+
right_w_dups = right_w_dups.sort_values('a')
990+
991+
pd.merge_asof(left, right_w_dups, on='a', validate="1:m")
992+
with pytest.raises(ValueError):
993+
pd.merge_asof(left, right_w_dups, on='a', validate="1:1")
994+
995+
# Dups on left
996+
left_w_dups = left.append(pd.DataFrame({'a': [1],
997+
'left_val': [-2]}))
998+
left_w_dups = left_w_dups.sort_values('a')
999+
pd.merge_asof(left_w_dups, right, on='a', validate="m:1")
1000+
with pytest.raises(ValueError):
1001+
pd.merge_asof(left_w_dups, right_w_dups, on='a', validate="1:1")
