@@ -551,6 +552,21 @@ standard database join operations between DataFrame objects:

   .. versionadded:: 0.17.0

+- ``validate`` : String, default None
+  If specified, checks if merge is of specified type.
+
+    * "one_to_one" or "1:1": checks if merge keys are unique in both
+      left and right datasets.
+    * "one_to_many" or "1:m": checks if merge keys are unique in left
+      dataset.
+    * "many_to_one" or "m:1": checks if merge keys are unique in right
+      dataset.
+    * "many_to_many" or "m:m": allowed, but does not result in checks.
+
+
+  .. versionadded:: 0.21.0
+
+
 The return type will be the same as ``left``. If ``left`` is a ``DataFrame``
 and ``right`` is a subclass of DataFrame, the return type will still be
 ``DataFrame``.
@@ -711,10 +727,42 @@ Here is another example with duplicate join keys in DataFrames:
          labels=['left', 'right'], vertical=False);
   plt.close('all');

+
 .. warning::

-  Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions,
-  may result in memory overflow. It is the user' s responsibility to manage duplicate values in keys before joining large DataFrames.
+  Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, which may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
+
+.. _merging.validation:
+
+Checking for duplicate keys
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 0.21.0
+
+Users can use the ``validate`` argument to automatically check whether there are unexpected duplicates in their merge keys. Key uniqueness is checked before merge operations and so should protect against memory overflows. Checking key uniqueness is also a good way to ensure user data structures are as expected.
+
+In the following example, there are duplicate values of ``B`` in the right DataFrame. Because the ``validate`` argument requests a one-to-one merge and this merge is not one-to-one, an exception will be raised.
+
+
+.. ipython:: python
+
+   left = pd.DataFrame({'A': [1, 2], 'B': [1, 2]})
+   right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})
+
+.. code-block:: python
+
+   result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
+   ValueError: Merge keys are not unique in right dataset; not a one-to-one merge
+
+If the user is aware of the duplicates in the right ``DataFrame`` but wants to ensure there are no duplicates in the left ``DataFrame``, one can pass ``validate="one_to_many"`` instead, which will not raise an exception.
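
The two cases described above can be sketched side by side. This is a minimal illustration, assuming a pandas version that supports the ``validate`` argument (0.21.0 or later). It reuses the ``left`` and ``right`` frames from the example. The flag name ``one_to_one_failed`` is hypothetical and exists only for this sketch.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [1, 2]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

# "one_to_one" requires unique keys on both sides; ``right`` repeats B == 2,
# so the check fails before the merge is performed.
try:
    pd.merge(left, right, on='B', how='outer', validate='one_to_one')
    one_to_one_failed = False
except ValueError:
    # Catching ValueError covers the exception shown in the example above.
    one_to_one_failed = True

# "one_to_many" only requires unique keys in ``left``, so this succeeds.
result = pd.merge(left, right, on='B', how='outer', validate='one_to_many')
print(one_to_one_failed)
print(result)
```

With these inputs the outer merge produces four rows: one for ``B == 1``, which has no match on the right, and three for ``B == 2``.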
doc/source/whatsnew/v0.21.0.txt
@@ -25,14 +25,16 @@ New features
 - Added `__fspath__` method to :class:`pandas.HDFStore`, :class:`pandas.ExcelFile`,
   and :class:`pandas.ExcelWriter` to work properly with the file system path protocol (:issue:`13823`)

-
 .. _whatsnew_0210.enhancements.other:

 Other Enhancements
 ^^^^^^^^^^^^^^^^^^
+
+- The ``validate`` argument of :func:`merge` now checks whether a merge is
+  one-to-one, one-to-many, many-to-one, or many-to-many. If the merge is not
+  of the specified type, an exception will be raised. (:issue:`16270`)
 - ``Series.to_dict()`` and ``DataFrame.to_dict()`` now support an ``into`` keyword which allows you to specify the ``collections.Mapping`` subclass that you would like returned. The default is ``dict``, which is backwards compatible. (:issue:`16122`)
 - ``RangeIndex.append`` now returns a ``RangeIndex`` object when possible (:issue:`16212`)
-
 - :func:`to_pickle` has gained a protocol parameter (:issue:`16252`). By default,
   this parameter is set to `HIGHEST_PROTOCOL <https://docs.python.org/3/library/pickle.html#data-stream-format>`__