Commit f3e565d

Author: Nick Eubank
Commit message: tweak left_join_key and docs
1 parent 081bd87, commit f3e565d

2 files changed: +21 -12 lines changed

doc/source/merging.rst

+17 -6

@@ -729,26 +729,37 @@ Here is another example with duplicate join keys in DataFrames:

 Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, which may result in memory overflow. It is the user' s responsibility to manage duplicate values in keys before joining large DataFrames.

+.. _merging.validation:
+
 Checking for duplicate keys
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

+.. versionadded:: 0.21.0
+
 Users can use the ``validate`` argument to automatically check whether there are unexpected duplicates in their merge keys. Key uniqueness is checked before merge operations and so should protect against memory overflows. Checking key uniqueness is also a good way to ensure user data structures are as expected.

 In the following example, there are duplicate values of ``B`` in the right DataFrame. As this is not a one-to-one merge -- as specified in the ``validate`` argument -- an exception will be raised.

-.. ipython:: python
-
-  left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
+.. code-block:: python

-  right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
+  left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
+  right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
+  result = pd.merge(left, right, on='B', how='outer', validate="one_to_one");
+
+  ValueError: Merge keys are not unique in either left or right dataset; not a one-to-one merge

-  result = pd.merge(left, right, on='B', how='outer', validate="one_to_one");

 If the user is aware of the duplicates in the right `DataFrame` but wants to ensure there are no duplicates in the left DataFrame, one can use the `one_to_many` argument instead, which will not raise an exception.

+.. ipython:: python
+  :suppress:
+
+  left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
+  right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
+
 .. ipython:: python

-  result = pd.merge(left, right, on='B', how='outer', validate="one_to_many")
+  pd.merge(left, right, on='B', how='outer', validate="one_to_many")


 .. _merging.indicator:
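
Reviewer note (not part of the commit): the documented behavior can be tried outside the Sphinx build with a short script. This is a minimal sketch that assumes a pandas version with the ``validate`` argument (0.21.0 or later, per the ``versionadded`` note above). It catches ``ValueError`` as the documentation shows; in released pandas the exception is ``MergeError``, a subclass of ``ValueError``, and the exact message text may differ from the one quoted in the docs.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

# one_to_one requires unique keys on both sides; right has B == 2 three times.
try:
    pd.merge(left, right, on="B", how="outer", validate="one_to_one")
    one_to_one_failed = False
except ValueError as exc:  # MergeError subclasses ValueError in released pandas
    one_to_one_failed = True
    print(f"validation failed: {exc}")

# one_to_many only requires unique keys on the left side, so this succeeds.
merged = pd.merge(left, right, on="B", how="outer", validate="one_to_many")
print(merged)
```

The second merge returns four rows: one for ``B == 1``, which has no match on the right, and three for ``B == 2``.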

pandas/core/reshape/merge.py

+4 -6

@@ -977,20 +977,18 @@ def _validate_specification(self):

     def _validate(self, validate):

-        # Get axes
-        left_key = self.left_on if self.left_on is not None else self.on
-        right_key = self.right_on if self.right_on is not None else self.on
-
         # Check uniqueness of each
         if self.left_index:
             left_unique = not (self.orig_left.index.duplicated()).any()
         else:
-            left_unique = not (self.orig_left[left_key].duplicated()).any()
+            left_unique = MultiIndex.from_arrays(self.left_join_keys
+                                                 ).is_unique

         if self.right_index:
             right_unique = not (self.orig_right.index.duplicated()).any()
         else:
-            right_unique = not (self.orig_right[right_key].duplicated()).any()
+            right_unique = MultiIndex.from_arrays(self.right_join_keys
+                                                  ).is_unique

         # Check valid arg
         if validate not in ['one_to_one', '1:1',
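
Reviewer note (not part of the commit): the new uniqueness check can be illustrated in isolation. The sketch below assumes that ``left_join_keys`` and ``right_join_keys`` hold one array per join key, which the diff suggests but does not show. ``keys_are_unique`` is a hypothetical helper named here for illustration only. The commit does not state its motivation; one plausible benefit is that the check operates on the key arrays directly rather than looking up column labels in the original frames.

```python
import pandas as pd

frame = pd.DataFrame({"k1": [1, 1, 2], "k2": ["a", "b", "a"]})


def keys_are_unique(key_arrays):
    """Hypothetical helper: True if no combination of key values repeats."""
    # Each array becomes one level; uniqueness is judged on the row tuples.
    return pd.MultiIndex.from_arrays(key_arrays).is_unique


# A single key with a repeated value is not unique.
single = keys_are_unique([frame["k1"].to_numpy()])

# The (k1, k2) pairs are all distinct, so the composite key is unique.
composite = keys_are_unique([frame["k1"].to_numpy(), frame["k2"].to_numpy()])

# Label-based check in the style of the removed lines, for comparison.
label_based = not frame[["k1", "k2"]].duplicated().any()

print(single, composite, label_based)
```

For keys that are plain columns, the array-based and label-based checks agree, as the last two values show.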
