-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ERR automatic broadcast for merging different levels, #9455 #12219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
its a correct result, but not very useful. merging between different levels cannot automatically broadcast. |
I would like to see this raise a |
I thought it was related to the multi-index not being lexsorted, but it is not:
I'm updating the name of the pull request. |
@jreback I'm a bit confused about what should or should not be allowed. How is this different from #2024, which was closed by this commit? |
oh, I think we out to change that behavior and raise. I think this should be a hard error, or at the very least a useful Warning (I agree its not PerformanceWarning which is prob just triggered). Are there any situations where this is NOT an error? |
@jreback What do you mean by "not an error"? |
I mean, can you come up with a usecase where this is actually useful? e.g. merging something with multi-levels with a single level frame? (so you have mixed tuples and column labels). |
Wanting to merge a single-level dataframe with a multi-level one seems natural to me (and, apparently, to other users, since this functionality was explicitly implemented). If this is to be prevented with a
which is not really difficult to do (or to guess?) On the other hand, what are the drawbacks to keeping the feature? As far as I'm concerned, I was a bit surprised with the result, because I would naively have expected to get a multilevel dataframe, like
But maybe that's because I'm still not familiar enough with the spirit of the API. Anyway, raising an error here will break things. Following the principle "be liberal in what you accept, be conservative in what you do", I would suggest to keep the API as it is. |
@nbonnotte you are totally missing my point. I am not asking about merging a single level to a SINGLE level of a multi-level at all. from the example above
I can't imagine this is useful in any way and would always be an error. However if someone speaks up and says, hey this might be useful? I want to see that case. |
@jreback I feel like none of us understand what the other is saying. I am not talking either about actually merging a single level to a single level of a multi-level. What I am saying is: what are the drawbacks to letting the feature? From my (still naive) point of view, I would find it difficult to guess the result of the operation: here, the multi-level is "flattened" (not sure the term is correct), but it could have been possible to instead merge the single-level index to a single level of the multi-level, right? So I'm not saying we should actually merge a single level to a multi-level, just that we could, and it would seem to me at least as natural as the current result... which therefore might appear as unpredictable. That's a drawback, for me. Are there any other drawbacks? Why should it be an error? As for a use case, what about #2024 ? Otherwise, I have an example. I often use pandas to work on the features I'm going to feed to scikit-learn, and I have features from many different origins: for instance, for each day weather data (temperature, pressure, etc.), calendar data (is it a bank holiday? a school holiday?) and let's say the target value of the previous day. At some point, I thus have multiple multi-level dataframes on the one hand, e.g. with Sure, I can think of some ways to arrive to the same result without merging a single-level dataframe to a multi-level one. But it is simpler if it's directly possible. Otherwise, we can raise an error, and wait to see if anyone complains. You tell me. |
my point is that the default of this operation is not generally wanted (IOW a user almost certainly would be suprised that this DOES not merge on a particular level). So I would like to see this raise a helpful error message and force the user to be proactive (e.g. pass an option) to actually merge a combined multi-level with a single level. The default is too flexible here. We could simply show a warning, but warnings are often ignored. I don't have a concrete idea of how to do this ATM. |
@jreback Here we go. Is that ok? Hum, I should add a note in a what's new with the API change |
@@ -193,6 +193,10 @@ def __init__(self, left, right, how='inner', on=None, | |||
'can not merge DataFrame with instance of ' | |||
'type {0}'.format(type(right))) | |||
|
|||
# prevent merging between different levels | |||
if left.columns.nlevels != right.columns.nlevels: | |||
raise ValueError('can not merge between different levels') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this even more expressive, e.g. indicate which/how many on left & right.
comments on this |
I find this a difficult one (in the sense of "there is not clear 'best' solution" IMO). |
I would be ok with a warning This is really a bit unexpected result I think in almost all cases and not an intended result. (and if it is, then the user could explicity construct an |
@jreback like that? |
I like that. |
I'm still trying to wrap my head around the issue, and what the ideal outcome is, but yeah I think that looks right. |
I've just detected an inconsistency with
This is just wrong. But I guess it can be treated as a separate issue. |
@nbonnotte that's the same exact issue. Note that I doubt this is tested very well. So might be good to add all of these examples as tests. |
Ah, but I was wrong, the result is totally correct! My bad, I'm adding some tests |
@jreback Tests added, all green. |
df = DataFrame([(1, 2, 3), (4, 5, 6)], columns=['a', 'b', 'c']) | ||
new_df = df.groupby(['a']).agg({'b': [np.mean, np.sum]}) | ||
other_df = DataFrame( | ||
[(1, 2, 3), (7, 10, 6)], columns=['a', 'b', 'd']) | ||
other_df.set_index('a', inplace=True) | ||
|
||
result = merge(new_df, other_df, left_index=True, right_index=True) | ||
# GH 9455, 12219 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the only time this warning is shown in the entire codebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean?
The message is shown when pandas.tools.merge:merge
is used with different levels, that is in DataFrame.merge
and DataFrame.join
, and I think that's pretty much it.
ok, let's add a whats note for this API changes. |
add #12219 (comment) as tests as well. |
I don't understand what the tests just failed with Python 3.4:
|
@jreback All green |
can you rebase/update |
thanks! |
@@ -193,6 +195,13 @@ def __init__(self, left, right, how='inner', on=None, | |||
'can not merge DataFrame with instance of ' | |||
'type {0}'.format(type(right))) | |||
|
|||
# warn user when merging between different levels | |||
if left.columns.nlevels != right.columns.nlevels: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh didn't even notice, this is ONLY checking the levels
, it should be doing something like:
if left._get_axis(axis).nlevels != right._get_axis(axis).nlevels:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. I've been away for some time, but I just went back. I can do a PR for that.
I'm adding tests to close #9455 after pull request #12158 solved the issue
But I confess I'm a bit surprised by the result:
Is it all right ?