Commit b975455

KalyanGokhale authored and jorisvandenbossche committed
ENH: Merge DataFrame and Series using on (GH21220) (#21223)
1 parent 716efd3 commit b975455

6 files changed: +82 -40 lines changed

doc/source/merging.rst

+20 -20
@@ -506,8 +506,8 @@ You can also pass a list of dicts or Series:
 
 .. _merging.join:
 
-Database-style DataFrame joining/merging
-----------------------------------------
+Database-style DataFrame or named Series joining/merging
+--------------------------------------------------------
 
 pandas has full-featured, **high performance** in-memory join operations
 idiomatically very similar to relational databases like SQL. These methods
@@ -522,7 +522,7 @@ Users who are familiar with SQL but new to pandas might be interested in a
 :ref:`comparison with SQL<compare_with_sql.join>`.
 
 pandas provides a single function, :func:`~pandas.merge`, as the entry point for
-all standard database join operations between ``DataFrame`` objects:
+all standard database join operations between ``DataFrame`` or named ``Series`` objects:
 
 ::
 
@@ -531,40 +531,40 @@ all standard database join operations between ``DataFrame`` objects:
           suffixes=('_x', '_y'), copy=True, indicator=False,
           validate=None)
 
-* ``left``: A DataFrame object.
-* ``right``: Another DataFrame object.
+* ``left``: A DataFrame or named Series object.
+* ``right``: Another DataFrame or named Series object.
 * ``on``: Column or index level names to join on. Must be found in both the left
-  and right DataFrame objects. If not passed and ``left_index`` and
+  and right DataFrame and/or Series objects. If not passed and ``left_index`` and
   ``right_index`` are ``False``, the intersection of the columns in the
-  DataFrames will be inferred to be the join keys.
-* ``left_on``: Columns or index levels from the left DataFrame to use as
+  DataFrames and/or Series will be inferred to be the join keys.
+* ``left_on``: Columns or index levels from the left DataFrame or Series to use as
   keys. Can either be column names, index level names, or arrays with length
-  equal to the length of the DataFrame.
-* ``right_on``: Columns or index levels from the right DataFrame to use as
+  equal to the length of the DataFrame or Series.
+* ``right_on``: Columns or index levels from the right DataFrame or Series to use as
   keys. Can either be column names, index level names, or arrays with length
-  equal to the length of the DataFrame.
+  equal to the length of the DataFrame or Series.
 * ``left_index``: If ``True``, use the index (row labels) from the left
-  DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex
+  DataFrame or Series as its join key(s). In the case of a DataFrame or Series with a MultiIndex
   (hierarchical), the number of levels must match the number of join keys
-  from the right DataFrame.
-* ``right_index``: Same usage as ``left_index`` for the right DataFrame
+  from the right DataFrame or Series.
+* ``right_index``: Same usage as ``left_index`` for the right DataFrame or Series
 * ``how``: One of ``'left'``, ``'right'``, ``'outer'``, ``'inner'``. Defaults
   to ``inner``. See below for more detailed description of each method.
 * ``sort``: Sort the result DataFrame by the join keys in lexicographical
   order. Defaults to ``True``, setting to ``False`` will improve performance
   substantially in many cases.
 * ``suffixes``: A tuple of string suffixes to apply to overlapping
   columns. Defaults to ``('_x', '_y')``.
-* ``copy``: Always copy data (default ``True``) from the passed DataFrame
+* ``copy``: Always copy data (default ``True``) from the passed DataFrame or named Series
   objects, even when reindexing is not necessary. Cannot be avoided in many
   cases but may improve performance / memory usage. The cases where copying
   can be avoided are somewhat pathological but this option is provided
   nonetheless.
 * ``indicator``: Add a column to the output DataFrame called ``_merge``
   with information on the source of each row. ``_merge`` is Categorical-type
   and takes on a value of ``left_only`` for observations whose merge key
-  only appears in ``'left'`` DataFrame, ``right_only`` for observations whose
-  merge key only appears in ``'right'`` DataFrame, and ``both`` if the
+  only appears in ``'left'`` DataFrame or Series, ``right_only`` for observations whose
+  merge key only appears in ``'right'`` DataFrame or Series, and ``both`` if the
   observation's merge key is found in both.
 
 * ``validate`` : string, default None.
@@ -584,10 +584,10 @@ all standard database join operations between ``DataFrame`` objects:
 
    Support for specifying index levels as the ``on``, ``left_on``, and
    ``right_on`` parameters was added in version 0.23.0.
+   Support for merging named ``Series`` objects was added in version 0.24.0.
 
-The return type will be the same as ``left``. If ``left`` is a ``DataFrame``
-and ``right`` is a subclass of DataFrame, the return type will still be
-``DataFrame``.
+The return type will be the same as ``left``. If ``left`` is a ``DataFrame`` or named ``Series``
+and ``right`` is a subclass of ``DataFrame``, the return type will still be ``DataFrame``.
 
 ``merge`` is a function in the pandas namespace, and it is also available as a
 ``DataFrame`` instance method :meth:`~DataFrame.merge`, with the calling
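
The documented behaviour can be illustrated with a minimal sketch, assuming a pandas version that includes this change (0.24.0 or later). The frame, the Series, and the names `left`, `right`, `key`, `A`, and `B` are made up for illustration and do not come from the commit.

```python
import pandas as pd

# Hypothetical example data: a DataFrame with a "key" column ...
left = pd.DataFrame({"key": ["a", "b", "c"], "A": [1, 2, 3]})

# ... and a *named* Series whose index holds the matching keys.
right = pd.Series(
    [10, 20, 40],
    index=pd.Index(["a", "b", "d"], name="key"),
    name="B",
)

# Merge the Series directly; its name becomes the column name.
result = pd.merge(left, right, left_on="key", right_index=True)

# The pre-change workaround: convert the Series to a DataFrame first.
explicit = pd.merge(left, right.to_frame(), left_on="key", right_index=True)

print(result)
```

Both calls should produce the same inner join on the keys `a` and `b`; the only difference is that the caller no longer has to call `to_frame()` on the Series first.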

doc/source/whatsnew/v0.24.0.txt

+1
@@ -12,6 +12,7 @@ v0.24.0 (Month XX, 2018)
 
 New features
 ~~~~~~~~~~~~
+- :func:`merge` now directly allows merge between objects of type ``DataFrame`` and named ``Series``, without the need to convert the ``Series`` object into a ``DataFrame`` beforehand (:issue:`21220`)
 
 
 - ``ExcelWriter`` now accepts ``mode`` as a keyword argument, enabling append to existing workbooks when using the ``openpyxl`` engine (:issue:`3441`)

pandas/core/frame.py

+4 -3
@@ -137,16 +137,16 @@
 """
 
 _merge_doc = """
-Merge DataFrame objects by performing a database-style join operation by
-columns or indexes.
+Merge DataFrame or named Series objects by performing a database-style join
+operation by columns or indexes.
 
 If joining columns on columns, the DataFrame indexes *will be
 ignored*. Otherwise if joining indexes on indexes or indexes on a column or
 columns, the index will be passed on.
 
 Parameters
 ----------%s
-right : DataFrame, Series or dict
+right : DataFrame or named Series
     Object to merge with.
 how : {'left', 'right', 'outer', 'inner'}, default 'inner'
     Type of merge to be performed.
@@ -217,6 +217,7 @@
 -----
 Support for specifying index levels as the `on`, `left_on`, and
 `right_on` parameters was added in version 0.23.0
+Support for merging named Series objects was added in version 0.24.0
 
 See Also
 --------

pandas/core/reshape/merge.py

+16 -8
@@ -11,7 +11,7 @@
 import pandas.compat as compat
 
 from pandas import (Categorical, DataFrame,
-                    Index, MultiIndex, Timedelta)
+                    Index, MultiIndex, Timedelta, Series)
 from pandas.core.arrays.categorical import _recode_for_categories
 from pandas.core.frame import _merge_doc
 from pandas.core.dtypes.common import (
@@ -493,6 +493,8 @@ def __init__(self, left, right, how='inner', on=None,
                  left_index=False, right_index=False, sort=True,
                  suffixes=('_x', '_y'), copy=True, indicator=False,
                  validate=None):
+        left = validate_operand(left)
+        right = validate_operand(right)
         self.left = self.orig_left = left
         self.right = self.orig_right = right
         self.how = how
@@ -519,13 +521,6 @@ def __init__(self, left, right, how='inner', on=None,
             raise ValueError(
                 'indicator option can only accept boolean or string arguments')
 
-        if not isinstance(left, DataFrame):
-            raise ValueError('can not merge DataFrame with instance of '
-                             'type {left}'.format(left=type(left)))
-        if not isinstance(right, DataFrame):
-            raise ValueError('can not merge DataFrame with instance of '
-                             'type {right}'.format(right=type(right)))
-
         if not is_bool(left_index):
             raise ValueError(
                 'left_index parameter must be of type bool, not '
@@ -1645,3 +1640,16 @@ def _should_fill(lname, rname):
 
 def _any(x):
     return x is not None and com._any_not_none(*x)
+
+
+def validate_operand(obj):
+    if isinstance(obj, DataFrame):
+        return obj
+    elif isinstance(obj, Series):
+        if obj.name is None:
+            raise ValueError('Cannot merge a Series without a name')
+        else:
+            return obj.to_frame()
+    else:
+        raise TypeError('Can only merge Series or DataFrame objects, '
+                        'a {obj} was passed'.format(obj=type(obj)))
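
The two failure modes introduced by `validate_operand` can be observed from the public `pd.merge` entry point. The following is a rough sketch, assuming a pandas version that contains this change; the variable names (`df`, `unnamed`, `unnamed_error`, `type_error`) are illustrative and not part of the commit.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1]})

# A Series with no name cannot be turned into a single named column,
# so the merge is expected to fail with a ValueError.
unnamed = pd.Series([0, 1])
unnamed_error = None
try:
    pd.merge(df, unnamed, left_index=True, right_index=True)
except ValueError as exc:
    unnamed_error = str(exc)

# Anything that is neither a DataFrame nor a Series is expected to
# raise a TypeError (before this commit it was a ValueError).
type_error = None
try:
    pd.merge(df, 2, left_on="a", right_on="a")
except TypeError as exc:
    type_error = str(exc)

print(unnamed_error)
print(type_error)
```

Note that the switch from `ValueError` to `TypeError` for unsupported operand types is a behaviour change: code that caught `ValueError` around `merge` for this case would need updating.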

pandas/tests/reshape/merge/test_join.py

+11 -9
@@ -228,16 +228,18 @@ def test_join_on_fails_with_different_column_counts(self):
                             index=tm.makeCustomIndex(10, 2))
             merge(df, df2, right_on='a', left_on=['a', 'b'])
 
-    def test_join_on_fails_with_wrong_object_type(self):
-        # GH12081
-        wrongly_typed = [Series([0, 1]), 2, 'str', None, np.array([0, 1])]
-        df = DataFrame({'a': [1, 1]})
+    @pytest.mark.parametrize("wrong_type", [2, 'str', None, np.array([0, 1])])
+    def test_join_on_fails_with_wrong_object_type(self, wrong_type):
+        # GH12081 - original issue
+
+        # GH21220 - merging of Series and DataFrame is now allowed
+        # Edited test to remove the Series object from test parameters
 
-        for obj in wrongly_typed:
-            with tm.assert_raises_regex(ValueError, str(type(obj))):
-                merge(obj, df, left_on='a', right_on='a')
-            with tm.assert_raises_regex(ValueError, str(type(obj))):
-                merge(df, obj, left_on='a', right_on='a')
+        df = DataFrame({'a': [1, 1]})
+        with tm.assert_raises_regex(TypeError, str(type(wrong_type))):
+            merge(wrong_type, df, left_on='a', right_on='a')
+        with tm.assert_raises_regex(TypeError, str(type(wrong_type))):
+            merge(df, wrong_type, left_on='a', right_on='a')
 
     def test_join_on_pass_vector(self):
         expected = self.target.join(self.source, on='C')

pandas/tests/reshape/merge/test_merge.py

+30
@@ -1887,3 +1887,33 @@ def test_merge_index_types(index):
         OrderedDict([('left_data', [1, 2]), ('right_data', [1.0, 2.0])]),
         index=index)
     assert_frame_equal(result, expected)
+
+
+@pytest.mark.parametrize("on,left_on,right_on,left_index,right_index,nms,nm", [
+    (['outer', 'inner'], None, None, False, False, ['outer', 'inner'], 'B'),
+    (None, None, None, True, True, ['outer', 'inner'], 'B'),
+    (None, ['outer', 'inner'], None, False, True, None, 'B'),
+    (None, None, ['outer', 'inner'], True, False, None, 'B'),
+    (['outer', 'inner'], None, None, False, False, ['outer', 'inner'], None),
+    (None, None, None, True, True, ['outer', 'inner'], None),
+    (None, ['outer', 'inner'], None, False, True, None, None),
+    (None, None, ['outer', 'inner'], True, False, None, None)])
+def test_merge_series(on, left_on, right_on, left_index, right_index, nms, nm):
+    # GH 21220
+    a = pd.DataFrame({"A": [1, 2, 3, 4]},
+                     index=pd.MultiIndex.from_product([['a', 'b'], [0, 1]],
+                                                      names=['outer', 'inner']))
+    b = pd.Series([1, 2, 3, 4],
+                  index=pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
+                                                   names=['outer', 'inner']), name=nm)
+    expected = pd.DataFrame({"A": [2, 4], "B": [1, 3]},
+                            index=pd.MultiIndex.from_product([['a', 'b'], [1]],
+                                                             names=nms))
+    if nm is not None:
+        result = pd.merge(a, b, on=on, left_on=left_on, right_on=right_on,
+                          left_index=left_index, right_index=right_index)
+        tm.assert_frame_equal(result, expected)
+    else:
+        with tm.assert_raises_regex(ValueError, 'a Series without a name'):
+            result = pd.merge(a, b, on=on, left_on=left_on, right_on=right_on,
+                              left_index=left_index, right_index=right_index)
