+  If specified, checks if merge is of specified type.
+    * "one_to_one" or "1:1": check if merge keys are unique in both
+      left and right dataset.
+    * "one_to_many" or "1:m": check if merge keys are unique in left
+      dataset.
+    * "many_to_one" or "m:1": check if merge keys are unique in right
+      dataset.
+
+  .. versionadded:: 0.21.0
+
+
 The return type will be the same as ``left``. If ``left`` is a ``DataFrame``
 and ``right`` is a subclass of DataFrame, the return type will still be
 ``DataFrame``.
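
The three modes documented in this hunk each reduce to a uniqueness check on the merge keys of one or both sides. A minimal plain-Python sketch of that idea follows, assuming the keys are given as simple lists. The names `keys_unique` and `check_merge_type` are hypothetical and are not part of pandas or of this patch.

```python
def keys_unique(keys):
    """Return True when no key value appears more than once."""
    return len(set(keys)) == len(keys)


def check_merge_type(left_keys, right_keys, validate):
    """Hypothetical illustration of the three documented modes."""
    if validate in ("one_to_one", "1:1"):
        # Keys must be unique on both sides.
        return keys_unique(left_keys) and keys_unique(right_keys)
    if validate in ("one_to_many", "1:m"):
        # Only the left keys must be unique.
        return keys_unique(left_keys)
    if validate in ("many_to_one", "m:1"):
        # Only the right keys must be unique.
        return keys_unique(right_keys)
    raise ValueError(f"unknown validate value: {validate!r}")


left_keys = [1, 2]       # unique
right_keys = [2, 2, 2]   # duplicated

results = {
    mode: check_merge_type(left_keys, right_keys, mode)
    for mode in ("one_to_one", "one_to_many", "many_to_one")
}
print(results)
```

With these keys only the `one_to_many` check passes, because the duplicates are confined to the right side.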
@@ -711,10 +724,43 @@ Here is another example with duplicate join keys in DataFrames:
           labels=['left', 'right'], vertical=False);
    plt.close('all');

+
 .. warning::

-   Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions,
-   may result in memory overflow. It is the user' s responsibility to manage duplicate values in keys before joining large DataFrames.
+   Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, which may result in memory overflow. It is the user' s responsibility to manage duplicate values in keys before joining large DataFrames.
+
+.. _merging.validation:
+
+Checking for duplicate keys
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 0.21.0
+
+Users can use the ``validate`` argument to automatically check whether there are unexpected duplicates in their merge keys. Key uniqueness is checked before merge operations and so should protect against memory overflows. Checking key uniqueness is also a good way to ensure user data structures are as expected.
+
+In the following example, there are duplicate values of ``B`` in the right DataFrame. As this is not a one-to-one merge -- as specified in the ``validate`` argument -- an exception will be raised.
+
+.. code-block:: python
+
+   left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
+   right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
+   result = pd.merge(left, right, on='B', how='outer', validate="one_to_one");
+
+   ValueError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
+
+
+If the user is aware of the duplicates in the right `DataFrame` but wants to ensure there are no duplicates in the left DataFrame, one can use the `one_to_many` argument instead, which will not raise an exception.
+
+.. ipython:: python
+   :suppress:
+
+   left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
+   right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
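
To try the behaviour described in this hunk outside the docs build, a short sketch along these lines may help. It assumes a pandas version that includes the `validate` argument (0.21.0 or later, per the `versionadded` note) and catches `ValueError` as shown in the patch. The exact exception message may differ between versions, so it is printed and not compared.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

# 'B' is duplicated in the right frame, so a one-to-one check should fail.
try:
    pd.merge(left, right, on="B", how="outer", validate="one_to_one")
    raised = False
except ValueError as err:
    raised = True
    print(f"validation failed: {err}")

# Only the left keys need to be unique here, so this merge goes through.
result = pd.merge(left, right, on="B", how="outer", validate="one_to_many")
print(result)
```

Note that the mode is passed as the value of `validate` (for example `validate="one_to_many"`); it is not a separate argument.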