-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Merge DataFrame and Series using on
(GH21220)
#21223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
GH21220
Not sure if any another changes / functionality changes are required Also, not sure whether this PR title should go under 'CLN' or 'ENH'? |
Looks like we had some tests that asserted this did not work: https://travis-ci.org/pandas-dev/pandas/jobs/384393959#L2444. Can you go through those and see what the original motivation was? Will also need a release note in 0.24.0 |
cc @jmmease |
On thing to nail down: what's the type of |
GH21220
Thanks @TomAugspurger
will make changes to release note once the tests run successfully and also there is consensus on whether merge should always return a DataFrame |
That would only be a question when merging on the values and not the index I think? We will need to add tests for this as well. |
Codecov Report
@@ Coverage Diff @@
## master #21223 +/- ##
==========================================
- Coverage 91.84% 91.83% -0.01%
==========================================
Files 153 153
Lines 49505 49509 +4
==========================================
+ Hits 45466 45469 +3
- Misses 4039 4040 +1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #21223 +/- ##
=========================================
Coverage ? 92%
=========================================
Files ? 168
Lines ? 50603
Branches ? 0
=========================================
Hits ? 46555
Misses ? 4048
Partials ? 0
Continue to review full report at Codecov.
|
pandas/core/reshape/merge.py
Outdated
@@ -492,6 +492,10 @@ def __init__(self, left, right, how='inner', on=None, | |||
left_index=False, right_index=False, sort=True, | |||
suffixes=('_x', '_y'), copy=True, indicator=False, | |||
validate=None): | |||
if isinstance(left, Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if u are actually going to do this
then you need to strip a ton of logic from below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify this comment? What logic would need to be stripped?
Currently, this just raises when you pass a Series (see line 525 a bit below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks both - will make edits to this based on @jreback 's clarification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm on the code removal, we do extension promotion in concat
, but nothing here.
index=pd.MultiIndex.from_product([['a', 'b'], [1]], | ||
names=['outer', 'inner'])) | ||
|
||
# Testing current merge behvaior is as before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these comments are not meaningful pls change them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
GH21220
pandas/core/reshape/merge.py
Outdated
if isinstance(left, Series): | ||
left = left.to_frame() | ||
if isinstance(right, Series): | ||
right = right.to_frame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to validate that the input is a Series or Dataframe and nothing else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - Had initially addeded these validations at 525 and 528 and also before 541 (if memory serves right) on my local machine - but then it fails at 571...
Thus after trying to fix the same issue (checking if input is Series or DataFrame) at multiple places, the current change seemed like the most logical fix as we are changing the type of incoming Series object to DataFrame and then avoiding any susequent changes to the code....
However open to suggestions...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is ok, but what I mean is validation that you have a Series/DataFrame and nothing else. (and a test to go with that), something like
left = validate_operand(left)
right = validate_operand(right)
......
def validate_operand(obj):
if isinstance(obj, DataFrame):
return obj
elif isinstance(obj, Series):
return obj.to_frame()
raise ValueError(.....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also need to test this with left_index=True, right_index=True (which should be fine). And a test that a Series w/o a name raises (it will convert just fine, but then you would have to specify the on in a funny way)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review - done.
also raises a ValueError for Series w/o a name
@@ -1864,3 +1864,24 @@ def test_merge_index_types(index): | |||
OrderedDict([('left_data', [1, 2]), ('right_data', [1.0, 2.0])]), | |||
index=index) | |||
assert_frame_equal(result, expected) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also parameterize this as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
also need a whatsnew note |
on
(GH21220)on
(GH21220)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, but the release note should be in 0.24.0 I think.
GH21220
GH21220
Thanks @TomAugspurger |
GH21220
Are any additional changes needed on this PR? Thanks |
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -7,6 +7,7 @@ v0.24.0 | |||
|
|||
New features | |||
~~~~~~~~~~~~ | |||
- :func: merge now allows a ``DataFrame`` and a ``Series`` with a name as inputs (:issue:`21220`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this won't render, use :func:`merge`
. also not clear what the change is from a user POV here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - updated the comment as:
:func:`merge` now directly allows merge between objects of type ``DataFrame`` and ``Series`` with a name, without need to convert the ``Series`` object into a ``DataFrame`` beforehand (:issue:`21220`)
raise ValueError('Cannot merge a Series without a name') | ||
else: | ||
return obj.to_frame() | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just raise, and more like
Can only merge Series or DataFrame objects, a {obj} was passed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - done - updated the statement and made into a TypeError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you changed this to TypeError?
I would personally keep it as ValueError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche Thanks - I thought TypeError is more relevant for the statement below, as the message is telling the user "only objects of type Series or DataFrame can be merged"
Should I revert it back to ValueError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the one is necessarily much better than the other in this case, so I would personally rather keep it as it was to be conservative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although eg concat
raises TypeError if the input is not series or dataframe, so maybe TypeError is more consistent with pandas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - not changing then and keeping it (i.e. TypeError)
@@ -230,7 +230,9 @@ def test_join_on_fails_with_different_column_counts(self): | |||
|
|||
def test_join_on_fails_with_wrong_object_type(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize this test
@@ -1864,3 +1864,42 @@ def test_merge_index_types(index): | |||
OrderedDict([('left_data', [1, 2]), ('right_data', [1.0, 2.0])]), | |||
index=index) | |||
assert_frame_equal(result, expected) | |||
|
|||
|
|||
@pytest.mark.parametrize("on,left_on,right_on,left_index,right_index,nms,nm", [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you format this so it lines up, its pretty unreadable now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - done
@jreback please let me know if any other edits are needed on this one. Thanks. |
Could you update the docstring |
Looks good otherwise. |
GH21220
GH21220
@TomAugspurger Thanks - made the updates. Any other edits needed? |
Waiting for CI to finish.
…On Wed, Jun 20, 2018 at 10:29 AM, Kalyan Gokhale ***@***.***> wrote:
@TomAugspurger <https://github.com/TomAugspurger> Thanks - made the
updates. Any other edits needed?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#21223 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIq9_4MXB4Hm-4Ppp7n5zLUmuNx_Jks5t-mpFgaJpZM4UPUDU>
.
|
Any additional work needed on this? Thanks |
There's a merge conflict in If you can't get to fixing the merge conflict, I'll push a fix when I get a chance. |
|
||
If joining columns on columns, the DataFrame indexes *will be | ||
ignored*. Otherwise if joining indexes on indexes or indexes on a column or | ||
columns, the index will be passed on. | ||
|
||
Parameters | ||
----------%s | ||
right : DataFrame, Series or dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of my prior commit 69d8bda initially this only said right : DataFrame
, not sure when the part around dict
was added. Have removed mention of dict
as this PR only ensures that DataFrame
and Series
can be merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have not tested merge with dict - as recently changed local machine - is this needed in this PR?
@TomAugspurger resolved the merge conflict - all CI tests passed. Any other changes needed? Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me now in general, added 2 minor comments
raise ValueError('Cannot merge a Series without a name') | ||
else: | ||
return obj.to_frame() | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you changed this to TypeError?
I would personally keep it as ValueError
|
||
# GH21220 - merging of Series and DataFrame is now allowed | ||
# Edited the test to remove the Series object from test parameters | ||
# Also, parameterized the original test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this last line of the comment is not really needed (in the sense of: it is not really useful for future readers of this code, they will see that the test is parametrized)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - makes sense, will remove the redundant comment - will make the change together with any changes needed for your ValueError/TypeError comment above.
@jorisvandenbossche made the suggested change - all CI tests passed. Thanks |
@KalyanGokhale Thanks a lot for this PR! |
on
#21220git diff upstream/master -u -- "*.py" | flake8 --diff