@@ -551,6 +552,20 @@ standard database join operations between DataFrame objects:
 
   .. versionadded:: 0.17.0
 
+- ``validate`` : string, default None.
+  If specified, checks if merge is of specified type.
+
+    * "one_to_one" or "1:1": checks if merge keys are unique in both
+      left and right datasets.
+    * "one_to_many" or "1:m": checks if merge keys are unique in left
+      dataset.
+    * "many_to_one" or "m:1": checks if merge keys are unique in right
+      dataset.
+    * "many_to_many" or "m:m": allowed, but does not result in checks.
+
+  .. versionadded:: 0.21.0
+
 The return type will be the same as ``left``. If ``left`` is a ``DataFrame``
 and ``right`` is a subclass of DataFrame, the return type will still be
 ``DataFrame``.
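Review note: the four ``validate`` modes differ only in which side must have unique keys. The following is a minimal pure-Python sketch of that rule, not the pandas implementation; the helper names ``has_duplicates`` and ``check_merge_keys`` and the error messages are hypothetical and exist only for illustration.

```python
def has_duplicates(keys):
    """Return True if any key appears more than once."""
    keys = list(keys)
    return len(set(keys)) != len(keys)


def check_merge_keys(left_keys, right_keys, validate):
    """Hypothetical helper mirroring the documented checks (not pandas code)."""
    if validate in ("one_to_one", "1:1"):
        need_left, need_right = True, True
    elif validate in ("one_to_many", "1:m"):
        need_left, need_right = True, False
    elif validate in ("many_to_one", "m:1"):
        need_left, need_right = False, True
    elif validate in ("many_to_many", "m:m"):
        # Accepted, but nothing is checked.
        need_left, need_right = False, False
    else:
        raise ValueError("not a valid value for validate")

    if need_left and has_duplicates(left_keys):
        raise ValueError("merge keys are not unique in left dataset")
    if need_right and has_duplicates(right_keys):
        raise ValueError("merge keys are not unique in right dataset")


# Duplicates on the right are fine for one-to-many, but not for one-to-one.
check_merge_keys([1, 2], [2, 2, 2], "one_to_many")
```

The sketch assumes hashable keys. It shows that the long and short spellings of each mode are interchangeable and that ``"many_to_many"`` is accepted without performing any check.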
@@ -711,10 +726,40 @@ Here is another example with duplicate join keys in DataFrames:
           labels=['left', 'right'], vertical=False);
    plt.close('all');
 
+
 .. warning::
 
-  Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions,
-  may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
+  Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, which may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
+
+.. _merging.validation:
+
+Checking for duplicate keys
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 0.21.0
+
+Users can use the ``validate`` argument to automatically check whether there are unexpected duplicates in their merge keys. Key uniqueness is checked before merge operations and so should protect against memory overflows. Checking key uniqueness is also a good way to ensure user data structures are as expected.
+
+In the following example, there are duplicate values of ``B`` in the right DataFrame. As this is not a one-to-one merge -- as specified in the ``validate`` argument -- an exception will be raised.
+
+
+.. ipython:: python
+
+  left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
+  right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
+
+.. code-block:: ipython
+
+  In [53]: result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
+  ...
+  MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
+
+If the user is aware of the duplicates in the right ``DataFrame`` but wants to ensure there are no duplicates in the left DataFrame, one can use the ``validate='one_to_many'`` argument instead, which will not raise an exception.
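Review note: the contrast between the two modes can be run end to end. The sketch below reuses the ``left`` and ``right`` frames from the example and assumes a pandas version that exposes ``MergeError`` under ``pandas.errors``; in other versions the import path may differ.

```python
import pandas as pd
from pandas.errors import MergeError  # assumption: available at this path

left = pd.DataFrame({'A': [1, 2], 'B': [1, 2]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

# The right frame repeats B == 2, so a one-to-one check fails
# before any merged frame is produced.
try:
    pd.merge(left, right, on='B', how='outer', validate='one_to_one')
    one_to_one_ok = True
except MergeError:
    one_to_one_ok = False

# Only the left keys must be unique here, so the merge goes through.
result = pd.merge(left, right, on='B', how='outer', validate='one_to_many')
```

With these inputs the one-to-many merge returns one row for ``B == 1`` and three rows for ``B == 2``.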
doc/source/whatsnew/v0.21.0.txt
@@ -25,16 +25,15 @@ New features
 - Added `__fspath__` method to :class:`pandas.HDFStore`, :class:`pandas.ExcelFile`,
   and :class:`pandas.ExcelWriter` to work properly with the file system path protocol (:issue:`13823`)
 
-
 .. _whatsnew_0210.enhancements.other:
 
 Other Enhancements
 ^^^^^^^^^^^^^^^^^^
+
+- The ``validate`` argument for the :func:`merge` function now checks whether a merge is one-to-one, one-to-many, many-to-one, or many-to-many. If a merge is found to not be an example of the specified merge type, an exception will be raised. For more, see :ref:`here <merging.validation>` (:issue:`16270`)
 - ``Series.to_dict()`` and ``DataFrame.to_dict()`` now support an ``into`` keyword which allows you to specify the ``collections.Mapping`` subclass that you would like returned. The default is ``dict``, which is backwards compatible. (:issue:`16122`)
 - ``RangeIndex.append`` now returns a ``RangeIndex`` object when possible (:issue:`16212`)
-
-- :func:`to_pickle` has gained a protocol parameter (:issue:`16252`). By default,
-this parameter is set to `HIGHEST_PROTOCOL <https://docs.python.org/3/library/pickle.html#data-stream-format>`__
+- :func:`to_pickle` has gained a protocol parameter (:issue:`16252`). By default, this parameter is set to `HIGHEST_PROTOCOL <https://docs.python.org/3/library/pickle.html#data-stream-format>`__