Commit b2e87ea

author
Nick Eubank
committed
add validate argument to merge
1 parent 0f55de1 commit b2e87ea

5 files changed

+250
-12
lines changed

doc/source/merging.rst

+51-3
Original file line number | Diff line number | Diff line change
@@ -513,7 +513,8 @@ standard database join operations between DataFrame objects:
513513

514514
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
515515
left_index=False, right_index=False, sort=True,
516-
suffixes=('_x', '_y'), copy=True, indicator=False)
516+
suffixes=('_x', '_y'), copy=True, indicator=False,
517+
validate=None)
517518

518519
- ``left``: A DataFrame object
519520
- ``right``: Another DataFrame object
@@ -551,6 +552,21 @@ standard database join operations between DataFrame objects:
551552

552553
.. versionadded:: 0.17.0
553554

555+
- ``validate`` : string, default None
556+
If specified, checks if merge is of specified type.
557+
558+
* "one_to_one" or "1:1": checks if merge keys are unique in both
559+
left and right datasets.
560+
* "one_to_many" or "1:m": checks if merge keys are unique in left
561+
dataset.
562+
* "many_to_one" or "m:1": checks if merge keys are unique in right
563+
dataset.
564+
* "many_to_many" or "m:m": allowed, but does not result in checks.
565+
566+
567+
.. versionadded:: 0.21.0
568+
569+
554570
The return type will be the same as ``left``. If ``left`` is a ``DataFrame``
555571
and ``right`` is a subclass of DataFrame, the return type will still be
556572
``DataFrame``.
@@ -711,10 +727,42 @@ Here is another example with duplicate join keys in DataFrames:
711727
labels=['left', 'right'], vertical=False);
712728
plt.close('all');
713729
730+
714731
.. warning::
715732

716-
Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions,
717-
may result in memory overflow. It is the user' s responsibility to manage duplicate values in keys before joining large DataFrames.
733+
Joining / merging on duplicate keys can return a frame whose row count is the product of the number of matching rows on each side, which may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
734+
735+
.. _merging.validation:
736+
737+
Checking for duplicate keys
738+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
739+
740+
.. versionadded:: 0.21.0
741+
742+
The ``validate`` argument automatically checks whether there are unexpected duplicates in the merge keys. Key uniqueness is checked before the merge is performed, so the check should protect against memory overflows caused by unintended duplicate keys. Checking key uniqueness is also a good way to confirm that the data is structured as expected.
743+
744+
In the following example, there are duplicate values of ``B`` in the right DataFrame. Because the ``validate`` argument requests a one-to-one merge and the right keys are not unique, an exception will be raised.
745+
746+
747+
.. ipython:: python
748+
749+
left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
750+
right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
751+
752+
.. code-block:: python
753+
754+
In [53]: result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
755+
Out [53]:
756+
---------------------------------------------------------------------------
757+
758+
ValueError: Merge keys are not unique in right dataset; not a one-to-one merge
759+
760+
If the user is aware of the duplicates in the right ``DataFrame`` but wants to ensure there are no duplicates in the left ``DataFrame``, they can pass ``validate='one_to_many'`` instead, which will not raise an exception.
761+
762+
.. ipython:: python
763+
764+
pd.merge(left, right, on='B', how='outer', validate="one_to_many")
765+
718766
719767
.. _merging.indicator:
720768
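Reviewer note on the warning in this hunk: a minimal, pandas-free sketch of why duplicate keys multiply rows. The helper name `inner_join_row_count` is hypothetical, and it only counts the rows of an inner join on a single key column; it is not how pandas performs the merge.

```python
from collections import Counter


def inner_join_row_count(left_keys, right_keys):
    """Count the rows an inner join on one key column would produce."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (occurrences on left) * (occurrences on right) rows.
    return sum(n * right_counts[key] for key, n in left_counts.items())


# Keys from the documentation example: B is [1, 2] on the left, [2, 2, 2] on the right.
print(inner_join_row_count([1, 2], [2, 2, 2]))  # 3

# 1,000 duplicated keys on each side already yield a million rows.
print(inner_join_row_count([7] * 1000, [7] * 1000))  # 1000000
```

Checking uniqueness up front, as ``validate`` does, costs far less than materialising such a result.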

doc/source/whatsnew/v0.21.0.txt

+3-4
@@ -25,16 +25,15 @@ New features
2525
- Added ``__fspath__`` method to :class:`pandas.HDFStore`, :class:`pandas.ExcelFile`,
2626
and :class:`pandas.ExcelWriter` to work properly with the file system path protocol (:issue:`13823`)
2727

28-
2928
.. _whatsnew_0210.enhancements.other:
3029

3130
Other Enhancements
3231
^^^^^^^^^^^^^^^^^^
32+
33+
- :func:`merge` has gained a ``validate`` argument that checks whether a merge is one-to-one, one-to-many, many-to-one, or many-to-many. If the merge is not of the specified type, an exception is raised. For more, see :ref:`here <merging.validation>` (:issue:`16270`)
3334
- ``Series.to_dict()`` and ``DataFrame.to_dict()`` now support an ``into`` keyword which allows you to specify the ``collections.Mapping`` subclass that you would like returned. The default is ``dict``, which is backwards compatible. (:issue:`16122`)
3435
- ``RangeIndex.append`` now returns a ``RangeIndex`` object when possible (:issue:`16212`)
35-
36-
- :func:`to_pickle` has gained a protocol parameter (:issue:`16252`). By default,
37-
this parameter is set to `HIGHEST_PROTOCOL <https://docs.python.org/3/library/pickle.html#data-stream-format>`__
36+
- :func:`to_pickle` has gained a protocol parameter (:issue:`16252`). By default, this parameter is set to `HIGHEST_PROTOCOL <https://docs.python.org/3/library/pickle.html#data-stream-format>`__
3837

3938
.. _whatsnew_0210.api_breaking:
4039

pandas/core/frame.py

+16-2
@@ -175,6 +175,19 @@
175175
176176
.. versionadded:: 0.17.0
177177
178+
validate : string, default None
179+
If specified, checks if merge is of specified type.
180+
181+
* "one_to_one" or "1:1": check if merge keys are unique in both
182+
left and right datasets.
183+
* "one_to_many" or "1:m": check if merge keys are unique in left
184+
dataset.
185+
* "many_to_one" or "m:1": check if merge keys are unique in right
186+
dataset.
187+
* "many_to_many" or "m:m": allowed, but does not result in checks.
188+
189+
.. versionadded:: 0.21.0
190+
178191
Examples
179192
--------
180193
@@ -4868,12 +4881,13 @@ def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',
48684881
@Appender(_merge_doc, indents=2)
48694882
def merge(self, right, how='inner', on=None, left_on=None, right_on=None,
48704883
left_index=False, right_index=False, sort=False,
4871-
suffixes=('_x', '_y'), copy=True, indicator=False):
4884+
suffixes=('_x', '_y'), copy=True, indicator=False,
4885+
validate=None):
48724886
from pandas.core.reshape.merge import merge
48734887
return merge(self, right, how=how, on=on, left_on=left_on,
48744888
right_on=right_on, left_index=left_index,
48754889
right_index=right_index, sort=sort, suffixes=suffixes,
4876-
copy=copy, indicator=indicator)
4890+
copy=copy, indicator=indicator, validate=validate)
48774891

48784892
def round(self, decimals=0, *args, **kwargs):
48794893
"""

pandas/core/reshape/merge.py

+56-3
@@ -46,11 +46,13 @@
4646
@Appender(_merge_doc, indents=0)
4747
def merge(left, right, how='inner', on=None, left_on=None, right_on=None,
4848
left_index=False, right_index=False, sort=False,
49-
suffixes=('_x', '_y'), copy=True, indicator=False):
49+
suffixes=('_x', '_y'), copy=True, indicator=False,
50+
validate=None):
5051
op = _MergeOperation(left, right, how=how, on=on, left_on=left_on,
5152
right_on=right_on, left_index=left_index,
5253
right_index=right_index, sort=sort, suffixes=suffixes,
53-
copy=copy, indicator=indicator)
54+
copy=copy, indicator=indicator,
55+
validate=validate)
5456
return op.get_result()
5557

5658

@@ -341,6 +343,7 @@ def merge_asof(left, right, on=None,
341343
342344
.. versionadded:: 0.20.0
343345
346+
344347
Returns
345348
-------
346349
merged : DataFrame
@@ -504,7 +507,8 @@ class _MergeOperation(object):
504507
def __init__(self, left, right, how='inner', on=None,
505508
left_on=None, right_on=None, axis=1,
506509
left_index=False, right_index=False, sort=True,
507-
suffixes=('_x', '_y'), copy=True, indicator=False):
510+
suffixes=('_x', '_y'), copy=True, indicator=False,
511+
validate=None):
508512
self.left = self.orig_left = left
509513
self.right = self.orig_right = right
510514
self.how = how
@@ -567,6 +571,12 @@ def __init__(self, left, right, how='inner', on=None,
567571
# to avoid incompat dtypes
568572
self._maybe_coerce_merge_keys()
569573

574+
# If argument passed to validate,
575+
# check if columns specified as unique
576+
# are in fact unique.
577+
if validate is not None:
578+
self._validate(validate)
579+
570580
def get_result(self):
571581
if self.indicator:
572582
self.left, self.right = self._indicator_pre_merge(
@@ -958,6 +968,49 @@ def _validate_specification(self):
958968
if len(self.right_on) != len(self.left_on):
959969
raise ValueError("len(right_on) must equal len(left_on)")
960970

971+
def _validate(self, validate):
972+
973+
# Check uniqueness of the merge keys on each side
974+
if self.left_index:
975+
left_unique = self.orig_left.index.is_unique
976+
else:
977+
left_unique = MultiIndex.from_arrays(self.left_join_keys
978+
).is_unique
979+
980+
if self.right_index:
981+
right_unique = self.orig_right.index.is_unique
982+
else:
983+
right_unique = MultiIndex.from_arrays(self.right_join_keys
984+
).is_unique
985+
986+
# Check data integrity
987+
if validate in ["one_to_one", "1:1"]:
988+
if not left_unique and not right_unique:
989+
raise ValueError("Merge keys are not unique in either left"
990+
" or right dataset; not a one-to-one merge")
991+
elif not left_unique:
992+
raise ValueError("Merge keys are not unique in left dataset;"
993+
" not a one-to-one merge")
994+
elif not right_unique:
995+
raise ValueError("Merge keys are not unique in right dataset;"
996+
" not a one-to-one merge")
997+
998+
elif validate in ["one_to_many", "1:m"]:
999+
if not left_unique:
1000+
raise ValueError("Merge keys are not unique in left dataset;"
1001+
" not a one-to-many merge")
1002+
1003+
elif validate in ["many_to_one", "m:1"]:
1004+
if not right_unique:
1005+
raise ValueError("Merge keys are not unique in right dataset;"
1006+
" not a many-to-one merge")
1007+
1008+
elif validate in ['many_to_many', 'm:m']:
1009+
pass
1010+
1011+
else:
1012+
raise ValueError("Not a valid argument for validate")
1013+
9611014

9621015
def _get_join_indexers(left_keys, right_keys, sort=False, how='inner',
9631016
**kwargs):
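
Reviewer note: the branching in `_validate` above can be mirrored without pandas. This is a sketch under stated assumptions: `check_merge_type` and `_is_unique` are hypothetical names, keys are supplied as one tuple per row, and the error messages are shortened paraphrases of the ones in the diff.

```python
def _is_unique(keys):
    """Return True if no key tuple occurs more than once."""
    return len(set(keys)) == len(keys)


def check_merge_type(left_keys, right_keys, validate):
    """Raise ValueError if the keys do not fit the requested merge type."""
    left_unique = _is_unique(left_keys)
    right_unique = _is_unique(right_keys)

    if validate in ("one_to_one", "1:1"):
        if not (left_unique and right_unique):
            raise ValueError("not a one-to-one merge")
    elif validate in ("one_to_many", "1:m"):
        if not left_unique:
            raise ValueError("not a one-to-many merge")
    elif validate in ("many_to_one", "m:1"):
        if not right_unique:
            raise ValueError("not a many-to-one merge")
    elif validate in ("many_to_many", "m:m"):
        pass  # accepted, nothing to check
    else:
        raise ValueError("Not a valid argument for validate")


left = [(1,), (2,)]
right = [(2,), (2,), (2,)]

check_merge_type(left, right, "one_to_many")  # passes: left keys are unique
try:
    check_merge_type(left, right, "1:1")
except ValueError as err:
    print(err)  # not a one-to-one merge
```

Unlike the diff, the sketch collapses the three one-to-one error messages into one; the accepted and rejected cases are the same.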

pandas/tests/reshape/test_merge.py

+124
@@ -724,6 +724,130 @@ def test_indicator(self):
724724
how='outer', indicator=True)
725725
assert_frame_equal(test5, hand_coded_result)
726726

727+
def test_validation(self):
728+
left = DataFrame({'a': ['a', 'b', 'c', 'd'],
729+
'b': ['cat', 'dog', 'weasel', 'horse']},
730+
index=range(4))
731+
732+
right = DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
733+
'c': ['meow', 'bark', 'um... weasel noise?',
734+
'nay', 'chirp']},
735+
index=range(5))
736+
737+
# Make sure no side effects.
738+
left_copy = left.copy()
739+
right_copy = right.copy()
740+
741+
result = merge(left, right, left_index=True, right_index=True,
742+
validate='1:1')
743+
assert_frame_equal(left, left_copy)
744+
assert_frame_equal(right, right_copy)
745+
746+
# make sure merge still correct
747+
expected = DataFrame({'a_x': ['a', 'b', 'c', 'd'],
748+
'b': ['cat', 'dog', 'weasel', 'horse'],
749+
'a_y': ['a', 'b', 'c', 'd'],
750+
'c': ['meow', 'bark', 'um... weasel noise?',
751+
'nay']},
752+
index=range(4),
753+
columns=['a_x', 'b', 'a_y', 'c'])
754+
755+
result = merge(left, right, left_index=True, right_index=True,
756+
validate='one_to_one')
757+
assert_frame_equal(result, expected)
758+
759+
expected_2 = DataFrame({'a': ['a', 'b', 'c', 'd'],
760+
'b': ['cat', 'dog', 'weasel', 'horse'],
761+
'c': ['meow', 'bark', 'um... weasel noise?',
762+
'nay']},
763+
index=range(4))
764+
765+
result = merge(left, right, on='a', validate='1:1')
766+
assert_frame_equal(left, left_copy)
767+
assert_frame_equal(right, right_copy)
768+
assert_frame_equal(result, expected_2)
769+
770+
result = merge(left, right, on='a', validate='one_to_one')
771+
assert_frame_equal(result, expected_2)
772+
773+
# One index, one column
774+
expected_3 = DataFrame({'b': ['cat', 'dog', 'weasel', 'horse'],
775+
'a': ['a', 'b', 'c', 'd'],
776+
'c': ['meow', 'bark', 'um... weasel noise?',
777+
'nay']},
778+
columns=['b', 'a', 'c'],
779+
index=range(4))
780+
781+
left_index_reset = left.set_index('a')
782+
result = merge(left_index_reset, right, left_index=True,
783+
right_on='a', validate='one_to_one')
784+
assert_frame_equal(result, expected_3)
785+
786+
# Dups on right
787+
right_w_dups = right.append(pd.DataFrame({'a': ['e'], 'c': ['moo']},
788+
index=[4]))
789+
merge(left, right_w_dups, left_index=True, right_index=True,
790+
validate='one_to_many')
791+
792+
with pytest.raises(ValueError):
793+
merge(left, right_w_dups, left_index=True, right_index=True,
794+
validate='one_to_one')
795+
796+
with pytest.raises(ValueError):
797+
merge(left, right_w_dups, on='a', validate='one_to_one')
798+
799+
# Dups on left
800+
left_w_dups = left.append(pd.DataFrame({'a': ['a'], 'c': ['cow']},
801+
index=[3]))
802+
merge(left_w_dups, right, left_index=True, right_index=True,
803+
validate='many_to_one')
804+
805+
with pytest.raises(ValueError):
806+
merge(left_w_dups, right, left_index=True, right_index=True,
807+
validate='one_to_one')
808+
809+
with pytest.raises(ValueError):
810+
merge(left_w_dups, right, on='a', validate='one_to_one')
811+
812+
# Dups on both
813+
merge(left_w_dups, right_w_dups, on='a', validate='many_to_many')
814+
815+
with pytest.raises(ValueError):
816+
merge(left_w_dups, right_w_dups, left_index=True,
817+
right_index=True, validate='many_to_one')
818+
819+
with pytest.raises(ValueError):
820+
merge(left_w_dups, right_w_dups, on='a',
821+
validate='one_to_many')
822+
823+
# Check invalid arguments
824+
with pytest.raises(ValueError):
825+
merge(left, right, on='a', validate='jibberish')
826+
827+
# Two column merge, dups in both, but jointly no dups.
828+
left = DataFrame({'a': ['a', 'a', 'b', 'b'],
829+
'b': [0, 1, 0, 1],
830+
'c': ['cat', 'dog', 'weasel', 'horse']},
831+
index=range(4))
832+
833+
right = DataFrame({'a': ['a', 'a', 'b'],
834+
'b': [0, 1, 0],
835+
'd': ['meow', 'bark', 'um... weasel noise?']},
836+
index=range(3))
837+
838+
expected_multi = DataFrame({'a': ['a', 'a', 'b'],
839+
'b': [0, 1, 0],
840+
'c': ['cat', 'dog', 'weasel'],
841+
'd': ['meow', 'bark',
842+
'um... weasel noise?']},
843+
index=range(3))
844+
845+
with pytest.raises(ValueError):
846+
merge(left, right, on='a', validate='1:1')
847+
848+
result = merge(left, right, on=['a', 'b'], validate='1:1')
849+
assert_frame_equal(result, expected_multi)
850+
727851

728852
def _check_merge(x, y):
729853
for how in ['inner', 'left', 'outer']:
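
Reviewer note on the final case in `test_validation`: column `a` alone contains duplicates, but the pairs `(a, b)` are jointly unique, so a 1:1 merge on both columns passes. A stdlib-only sketch of that idea, using the hypothetical helper `keys_are_unique` and the left-frame values from the test:

```python
def keys_are_unique(*columns):
    """Check uniqueness of the row-wise key tuples built from the columns."""
    rows = list(zip(*columns))
    return len(set(rows)) == len(rows)


a = ['a', 'a', 'b', 'b']
b = [0, 1, 0, 1]

print(keys_are_unique(a))     # False
print(keys_are_unique(a, b))  # True
```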
