Skip to content

ENH: Merge DataFrame and Series using on (GH21220) #21223

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Jul 23, 2018
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion pandas/core/reshape/merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
import pandas.compat as compat

from pandas import (Categorical, DataFrame,
Index, MultiIndex, Timedelta)
Index, MultiIndex, Timedelta, Series)
from pandas.core.arrays.categorical import _recode_for_categories
from pandas.core.frame import _merge_doc
from pandas.core.dtypes.common import (
Expand Down Expand Up @@ -492,6 +492,10 @@ def __init__(self, left, right, how='inner', on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None):
if isinstance(left, Series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if u are actually going to do this
then you need to strip a ton of logic from below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify this comment? What logic would need to be stripped?

Currently, this just raises when you pass a Series (see line 525 a bit below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both - will make edits to this based on @jreback 's clarification

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm on the code removal, we do extension promotion in concat, but nothing here.

left = left.to_frame()
if isinstance(right, Series):
right = right.to_frame()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to validate that the input is a Series or Dataframe and nothing else

Copy link
Contributor Author

@KalyanGokhale KalyanGokhale May 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - Had initially addeded these validations at 525 and 528 and also before 541 (if memory serves right) on my local machine - but then it fails at 571...

Thus after trying to fix the same issue (checking if input is Series or DataFrame) at multiple places, the current change seemed like the most logical fix as we are changing the type of incoming Series object to DataFrame and then avoiding any susequent changes to the code....

However open to suggestions...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is ok, but what I mean is validation that you have a Series/DataFrame and nothing else. (and a test to go with that), something like

left = validate_operand(left)
right = validate_operand(right)


......

def validate_operand(obj):
     if isinstance(obj, DataFrame): 
         return obj
     elif isinstance(obj, Series):
        return obj.to_frame()
     raise ValueError(.....)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also need to test this with left_index=True, right_index=True (which should be fine). And a test that a Series w/o a name raises (it will convert just fine, but then you would have to specify the on in a funny way)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review - done.
also raises a ValueError for Series w/o a name

self.left = self.orig_left = left
self.right = self.orig_right = right
self.how = how
Expand Down
5 changes: 4 additions & 1 deletion pandas/tests/reshape/merge/test_join.py
Original file line number Diff line number Diff line change
Expand Up @@ -230,7 +230,10 @@ def test_join_on_fails_with_different_column_counts(self):

def test_join_on_fails_with_wrong_object_type(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this test

# GH12081
wrongly_typed = [Series([0, 1]), 2, 'str', None, np.array([0, 1])]
""" GH21220 - merging of Series and DataFrame is now allowed
Edited the test to remove the Series object from 'wrongly_typed'
"""
wrongly_typed = [2, 'str', None, np.array([0, 1])]
df = DataFrame({'a': [1, 1]})

for obj in wrongly_typed:
Expand Down
21 changes: 21 additions & 0 deletions pandas/tests/reshape/merge/test_merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -1864,3 +1864,24 @@ def test_merge_index_types(index):
OrderedDict([('left_data', [1, 2]), ('right_data', [1.0, 2.0])]),
index=index)
assert_frame_equal(result, expected)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also parameterize this as much as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def test_merge_series():
# GH 21220
a = pd.DataFrame({"A": [1, 2, 3, 4]},
index=pd.MultiIndex.from_product([['a', 'b'], [0, 1]],
names=['outer', 'inner']))
b = pd.Series([1, 2, 3, 4],
index=pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
names=['outer', 'inner']), name='B')
expected = pd.DataFrame({"A": [2, 4], "B": [1, 3]},
index=pd.MultiIndex.from_product([['a', 'b'], [1]],
names=['outer', 'inner']))

# Testing current merge behvaior is as before
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these comments are not meaningful pls change them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

result = pd.merge(a, b.to_frame(), on=['outer', 'inner'])
tm.assert_frame_equal(result, expected)

# Testing changed merge behvaior is as expected
result = pd.merge(a, b, on=['outer', 'inner'])
tm.assert_frame_equal(result, expected)