Skip to content

ENH: Merge DataFrame and Series using on (GH21220) #21223

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Jul 23, 2018
Merged

ENH: Merge DataFrame and Series using on (GH21220) #21223

merged 19 commits into from
Jul 23, 2018

Conversation

KalyanGokhale
Copy link
Contributor

@KalyanGokhale KalyanGokhale commented May 27, 2018

@KalyanGokhale
Copy link
Contributor Author

Not sure if any another changes / functionality changes are required

Also, not sure whether this PR title should go under 'CLN' or 'ENH'?

@TomAugspurger
Copy link
Contributor

Looks like we had some tests that asserted this did not work: https://travis-ci.org/pandas-dev/pandas/jobs/384393959#L2444. Can you go through those and see what the original motivation was?

Will also need a release note in 0.24.0

@TomAugspurger TomAugspurger added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 27, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone May 27, 2018
@TomAugspurger
Copy link
Contributor

cc @jmmease

@TomAugspurger
Copy link
Contributor

On thing to nail down: what's the type of merge(Series, Series)? Do we want merge to always return a DataFrame?

@KalyanGokhale
Copy link
Contributor Author

Thanks @TomAugspurger

Looks like we had some tests that asserted this did not work: ... Can you go through those and see what the original motivation was?

test_join_on_fails_with_wrong_object_type inpandas/tests/reshape/merge/test_join.py was failing as it was checking that a ValueError is raised on merging Series with DataFrame - have now edited out the portion for Series from that test

Will also need a release note in 0.24.0

will make changes to release note once the tests run successfully and also there is consensus on whether merge should always return a DataFrame

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 28, 2018

On thing to nail down: what's the type of merge(Series, Series)? Do we want merge to always return a DataFrame?

That would only be a question when merging on the values and not the index I think?
But personally I would go for consistent output of always DataFrame.

We will need to add tests for this as well.

@codecov
Copy link

codecov bot commented May 28, 2018

Codecov Report

Merging #21223 into master will decrease coverage by <.01%.
The diff coverage is 75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21223      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         153      153              
  Lines       49505    49509       +4     
==========================================
+ Hits        45466    45469       +3     
- Misses       4039     4040       +1
Flag Coverage Δ
#multiple 90.23% <75%> (-0.01%) ⬇️
#single 41.87% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 94.14% <75%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c2844a...b09ffb7. Read the comment docs.

@codecov
Copy link

codecov bot commented May 28, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@27ebb3e). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #21223   +/-   ##
=========================================
  Coverage          ?      92%           
=========================================
  Files             ?      168           
  Lines             ?    50603           
  Branches          ?        0           
=========================================
  Hits              ?    46555           
  Misses            ?     4048           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.4% <100%> (?)
#single 42.19% <10%> (?)
Impacted Files Coverage Δ
pandas/core/frame.py 97.19% <ø> (ø)
pandas/core/reshape/merge.py 94.16% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27ebb3e...d099fd6. Read the comment docs.

@@ -492,6 +492,10 @@ def __init__(self, left, right, how='inner', on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None):
if isinstance(left, Series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if u are actually going to do this
then you need to strip a ton of logic from below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify this comment? What logic would need to be stripped?

Currently, this just raises when you pass a Series (see line 525 a bit below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both - will make edits to this based on @jreback 's clarification

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm on the code removal, we do extension promotion in concat, but nothing here.

index=pd.MultiIndex.from_product([['a', 'b'], [1]],
names=['outer', 'inner']))

# Testing current merge behvaior is as before
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these comments are not meaningful pls change them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if isinstance(left, Series):
left = left.to_frame()
if isinstance(right, Series):
right = right.to_frame()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to validate that the input is a Series or Dataframe and nothing else

Copy link
Contributor Author

@KalyanGokhale KalyanGokhale May 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - Had initially addeded these validations at 525 and 528 and also before 541 (if memory serves right) on my local machine - but then it fails at 571...

Thus after trying to fix the same issue (checking if input is Series or DataFrame) at multiple places, the current change seemed like the most logical fix as we are changing the type of incoming Series object to DataFrame and then avoiding any susequent changes to the code....

However open to suggestions...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is ok, but what I mean is validation that you have a Series/DataFrame and nothing else. (and a test to go with that), something like

left = validate_operand(left)
right = validate_operand(right)


......

def validate_operand(obj):
     if isinstance(obj, DataFrame): 
         return obj
     elif isinstance(obj, Series):
        return obj.to_frame()
     raise ValueError(.....)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also need to test this with left_index=True, right_index=True (which should be fine). And a test that a Series w/o a name raises (it will convert just fine, but then you would have to specify the on in a funny way)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review - done.
also raises a ValueError for Series w/o a name

@@ -1864,3 +1864,24 @@ def test_merge_index_types(index):
OrderedDict([('left_data', [1, 2]), ('right_data', [1.0, 2.0])]),
index=index)
assert_frame_equal(result, expected)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also parameterize this as much as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jreback
Copy link
Contributor

jreback commented May 29, 2018

also need a whatsnew note

@jreback jreback removed this from the 0.24.0 milestone May 29, 2018
@KalyanGokhale KalyanGokhale changed the title CLN: Merge DataFrame and Series using on (GH21220) ENH: Merge DataFrame and Series using on (GH21220) May 29, 2018
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but the release note should be in 0.24.0 I think.

@KalyanGokhale
Copy link
Contributor Author

KalyanGokhale commented Jun 7, 2018

Thanks @TomAugspurger
Done - moved the whatsnew note to 0.24.0
As this PR was removed from 0.24.0 milestone had earlier added whatsnew note to 0.23.1

@KalyanGokhale
Copy link
Contributor Author

Are any additional changes needed on this PR? Thanks

@@ -7,6 +7,7 @@ v0.24.0

New features
~~~~~~~~~~~~
- :func: merge now allows a ``DataFrame`` and a ``Series`` with a name as inputs (:issue:`21220`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't render, use :func:`merge` . also not clear what the change is from a user POV here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - updated the comment as:

:func:`merge` now directly allows merge between objects of type ``DataFrame`` and ``Series`` with a name, without need to convert the ``Series`` object into a ``DataFrame`` beforehand (:issue:`21220`)

raise ValueError('Cannot merge a Series without a name')
else:
return obj.to_frame()
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just raise, and more like

Can only merge Series or DataFrame objects, a {obj} was passed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - done - updated the statement and made into a TypeError

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you changed this to TypeError?
I would personally keep it as ValueError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Thanks - I thought TypeError is more relevant for the statement below, as the message is telling the user "only objects of type Series or DataFrame can be merged"
Should I revert it back to ValueError?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the one is necessarily much better than the other in this case, so I would personally rather keep it as it was to be conservative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although eg concat raises TypeError if the input is not series or dataframe, so maybe TypeError is more consistent with pandas.

Copy link
Contributor Author

@KalyanGokhale KalyanGokhale Jul 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - not changing then and keeping it (i.e. TypeError)

@@ -230,7 +230,9 @@ def test_join_on_fails_with_different_column_counts(self):

def test_join_on_fails_with_wrong_object_type(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this test

@@ -1864,3 +1864,42 @@ def test_merge_index_types(index):
OrderedDict([('left_data', [1, 2]), ('right_data', [1.0, 2.0])]),
index=index)
assert_frame_equal(result, expected)


@pytest.mark.parametrize("on,left_on,right_on,left_index,right_index,nms,nm", [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you format this so it lines up, its pretty unreadable now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - done

@KalyanGokhale
Copy link
Contributor Author

@jreback please let me know if any other edits are needed on this one. Thanks.

@TomAugspurger
Copy link
Contributor

Could you update the docstring _merge_doc in pandas/core/frame.py to reflect the ability to merge DataFrames and named series? merging.rst also has a few references to joining DataFrame objects, that should be updated to DataFrame or named series.

@TomAugspurger
Copy link
Contributor

Looks good otherwise.

@KalyanGokhale
Copy link
Contributor Author

@TomAugspurger Thanks - made the updates. Any other edits needed?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 20, 2018 via email

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Jun 20, 2018
@KalyanGokhale
Copy link
Contributor Author

Any additional work needed on this? Thanks

@TomAugspurger
Copy link
Contributor

There's a merge conflict in core/frame.py now. Can you merge in master and resolve that, and let us know if you see that all the CI pass?

If you can't get to fixing the merge conflict, I'll push a fix when I get a chance.


If joining columns on columns, the DataFrame indexes *will be
ignored*. Otherwise if joining indexes on indexes or indexes on a column or
columns, the index will be passed on.

Parameters
----------%s
right : DataFrame, Series or dict
Copy link
Contributor Author

@KalyanGokhale KalyanGokhale Jul 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of my prior commit 69d8bda initially this only said right : DataFrame, not sure when the part around dict was added. Have removed mention of dict as this PR only ensures that DataFrame and Series can be merged

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have not tested merge with dict - as recently changed local machine - is this needed in this PR?

@KalyanGokhale
Copy link
Contributor Author

@TomAugspurger resolved the merge conflict - all CI tests passed. Any other changes needed? Thanks

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now in general, added 2 minor comments

raise ValueError('Cannot merge a Series without a name')
else:
return obj.to_frame()
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you changed this to TypeError?
I would personally keep it as ValueError


# GH21220 - merging of Series and DataFrame is now allowed
# Edited the test to remove the Series object from test parameters
# Also, parameterized the original test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this last line of the comment is not really needed (in the sense of: it is not really useful for future readers of this code, they will see that the test is parametrized)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - makes sense, will remove the redundant comment - will make the change together with any changes needed for your ValueError/TypeError comment above.

@KalyanGokhale
Copy link
Contributor Author

@jorisvandenbossche made the suggested change - all CI tests passed. Thanks

@jorisvandenbossche jorisvandenbossche merged commit b975455 into pandas-dev:master Jul 23, 2018
@jorisvandenbossche
Copy link
Member

@KalyanGokhale Thanks a lot for this PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merge DataFrame and Series using on
4 participants