doc/source/merging.rst (+17 -6)
@@ -729,26 +729,37 @@ Here is another example with duplicate join keys in DataFrames:
729
729
730
730
Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, which may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
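A minimal sketch of the row multiplication described above, using small hypothetical frames (the column names ``key``, ``x`` and ``y`` are illustrative and do not come from the documentation being changed). It also shows one possible way to spot duplicate keys by hand before merging:

```python
import pandas as pd

# Hypothetical frames: the key "a" appears 3 times on the left and 4 on the right.
left = pd.DataFrame({"key": ["a"] * 3, "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a"] * 4, "y": [10, 20, 30, 40]})

# Every left row pairs with every right row that shares its key,
# so the merged frame has 3 * 4 rows.
merged = pd.merge(left, right, on="key")

# A manual pre-check: does the left key column contain any duplicates?
left_has_duplicates = bool(left["key"].duplicated().any())
```

With larger inputs the same effect can make the result far bigger than either input, which is why checking key uniqueness first is worthwhile.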
731
731
732
+
.. _merging.validation:
733
+
732
734
Checking for duplicate keys
733
735
~~~~~~~~~~~~~~~~~~~~~~~~~~~
734
736
737
+
.. versionadded:: 0.21.0
738
+
735
739
Users can use the ``validate`` argument to automatically check whether there are unexpected duplicates in their merge keys. Key uniqueness is checked before merge operations and so should protect against memory overflows. Checking key uniqueness is also a good way to ensure user data structures are as expected.
736
740
737
741
In the following example, there are duplicate values of ``B`` in the right DataFrame. As this is not a one-to-one merge -- as specified in the ``validate`` argument -- an exception will be raised.
738
742
739
-
.. ipython:: python
740
-
741
-
left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
743
+
.. code-block:: python
742
744
743
-
right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
745
+
left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
746
+
right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
747
+
result = pd.merge(left, right, on='B', how='outer', validate="one_to_one");
748
+
749
+
ValueError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
744
750
745
-
result = pd.merge(left, right, on='B', how='outer', validate="one_to_one");
746
751
747
752
If the user is aware of the duplicates in the right ``DataFrame`` but wants to ensure there are no duplicates in the left ``DataFrame``, one can pass ``validate="one_to_many"`` instead, which will not raise an exception.
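The two ``validate`` modes discussed here can be compared side by side. The sketch below reuses the ``left`` and ``right`` frames from the example; catching ``ValueError`` follows the exception type shown in this diff, and the exact message text may differ between pandas versions:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

# "one_to_one" requires unique keys on both sides; ``right`` repeats B == 2,
# so the check fails before any merge is performed.
one_to_one_error = None
try:
    pd.merge(left, right, on="B", how="outer", validate="one_to_one")
except ValueError as err:
    one_to_one_error = err

# "one_to_many" only requires unique keys on the left, so this succeeds.
result = pd.merge(left, right, on="B", how="outer", validate="one_to_many")
```

Here ``result`` holds one row for ``B == 1`` (with no match on the right) and three rows for ``B == 2``.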
748
753
754
+
.. ipython:: python
755
+
:suppress:
756
+
757
+
left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
758
+
right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
759
+
749
760
.. ipython:: python
750
761
751
-
result =pd.merge(left, right, on='B', how='outer', validate="one_to_many")