BUG: concat of MultiIndex with names passed #15787
Comments
not sure what you mean
can you update your example to show exactly what the problem is. you seem to be doing lots of stuff, but it's not clear what the issue is (if any). and show a constructed frame that matches your expectations. IOW if this is a bug, then ultimately we need an example that we can do something like
The label of the second column of the index is being changed to None. Run the example, uncomment the line that says "Uncomment for example of correct behavior" to see what is expected. Yes, the example is doing lots of "stuff", but that's what it takes to reproduce the bug.
@seth-a can you post an expected result and then run your example and post the output. the more information you show and the more obvious it is, the quicker this will get looked at.
I've updated the formatting in the expected results section, are you looking for something beyond what is there?
slightly modified.
The issue is that you are concatting 2 MultiIndexes with names of
so you are expecting the passed in names to 'automatically' figure out that the new names are the existing first level name plus the passed in names? how does that make sense? I suppose that it's actually an error to pass in any names if you have
FYI here is the code for creating the concatted index. It is slightly non-trivial!
Hmm, clearly you've missed the point. In fact pandas does automatically figure out that the new names should be the existing index plus the passed in names, as can be shown by un-commenting the line in my example saying "Uncomment for example of correct behavior". Here's an example that may make it a little easier to see:

    import numpy as np
    import pandas as pd

    def concat_multiple(bug=False):
        """
        Make a mess with DataFrames, if bug is True change behavior so that bug shows
        """
        res = []
        for _ in range(2):
            res1 = []
            # Only occurs when dataframe is used with measure
            data = np.zeros((30, 21))
            idx = np.random.randint(0, 5, 30)
            df = pd.DataFrame(data, index=idx)
            # Bug is right here. Same dataframe, just different method of indexing
            # This is the only thing that changes
            if bug:
                df = df.loc[3]
            res1.append(pd.DataFrame(sum(data.dot(df.T))))
            tmp = pd.concat(res1, keys=[1], names=['level1'])
            res.append(tmp)
        return pd.concat(res, keys=[i for i in range(2)], names=['level2'])

    res_bad = concat_multiple(bug=True)    # Results when bug is present
    res_good = concat_multiple(bug=False)  # Results when bug is not present
    print("These indexes should be the same")
    print(res_bad.index.names)
    print(res_good.index.names)
    if res_bad.index.names != res_good.index.names:
        print("But they aren't")
    # It gets worse, let's run it several times and see how often the bug is present
    good = [concat_multiple(bug=True).index.names == res_good.index.names for _ in range(20)]
    print("pandas is broken {}% of the time".format((1 - sum(good) / len(good)) * 100))

And here's the output:
And when a bug like this exists you don't need to link to the source code to convince me it's a mess.
this is a really really odd thing to do: of course this doesn't work
you are creating a non-unique index. sure it might be a bug, but you are by definition making this non-deterministic.
these have different shapes so different things happen.
It may be a bug, likely around concatting non-unique MI's. But please produce an example that is deterministic within these parameters (or at the very least use a random seed).
What do you mean it doesn't work? Is it throwing an error?
please read my responses.
Ah, so you mean the code works fine, it just isn't a typical use case. Ok, here's the code using a seed so the behavior is 100% deterministic, aside from whatever pandas is doing:
Output:
Much easier reproducible example:
@seth-a your example is really complex, which makes it much harder to see what exactly is going on, or what you think is wrong.
thanks @jorisvandenbossche yeah the name handling is prob not very robust.
@jorisvandenbossche, It's a much-simplified version of a script I was working on. I'm glad you found an easier case where it appears.
Additional observation: it only occurs when the name of the other level is None, otherwise it works fine:
and (as already seen in the example above) only when the index values are not exactly the same (different length in the example above, or different values of same length also triggers it)
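The conditions described in the last few comments can be checked with a small script. This is only a sketch, not the example posted in the thread: the helper `concat_names`, the column name `col`, the level name `inner`, and the frame sizes are made up for illustration. It prints the resulting index names for the three cases discussed (same index values, different index values with an unnamed level, different index values with every level named) and makes no claim about what a given pandas version prints for the middle case.

```python
import pandas as pd


def concat_names(left_len, right_len, inner_names):
    """Return the index names after concatenating two MultiIndexed frames.

    Hypothetical helper for illustration; not part of pandas.
    """
    def make(n):
        # Two-level index whose level names are given by inner_names.
        index = pd.MultiIndex.from_product([[1], range(n)], names=inner_names)
        return pd.DataFrame({"col": range(n)}, index=index)

    frames = [make(left_len), make(right_len)]
    result = pd.concat(frames, keys=[0, 1], names=["level2"])
    return list(result.index.names)


# Identical index values, second inner level unnamed.
same_values = concat_names(3, 3, ["level1", None])
# Different lengths, second inner level unnamed: the case reported as failing.
different_values = concat_names(3, 4, ["level1", None])
# Different lengths, both inner levels named: the case reported as working.
all_named = concat_names(3, 4, ["level1", "inner"])

print(same_values)
print(different_values)
print(all_named)
```

If the second printed list loses `level1` while the other two keep it, the installed pandas version still shows the behavior reported here.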
This is a fix attempt for issue pandas-dev#15787. The discrepancy between definition and corresponding implementation of so-called non-none names in function _get_consensus_names leads to this bug.
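To make the "definition versus implementation" point concrete, here is a pure-Python sketch of a consensus-names helper. It is a simplified stand-in and not the pandas source: the function `consensus_names` and the two predicates are hypothetical, and the idea that the mismatch comes down to "all names are set" versus "any name is set" is an assumption based on the symptoms reported in this thread (the bug only shows up when one level name is None).

```python
def consensus_names(index_names, keep):
    """Pick the one shared tuple of level names, or fall back to all None.

    index_names: one tuple of level names per index being concatenated.
    keep: predicate deciding whether a tuple counts as having non-None names.
    Hypothetical helper for illustration; not the pandas implementation.
    """
    candidates = {tuple(names) for names in index_names if keep(names)}
    if len(candidates) == 1:
        return list(next(iter(candidates)))
    return [None] * len(index_names[0])


def every_name_set(names):
    # Strict reading: a tuple qualifies only if no level name is None.
    return all(name is not None for name in names)


def some_name_set(names):
    # Lenient reading: a tuple qualifies if at least one level name is set.
    return any(name is not None for name in names)


partly_named = [("level1", None), ("level1", None)]
strict = consensus_names(partly_named, every_name_set)
lenient = consensus_names(partly_named, some_name_set)

print(strict)   # [None, None]
print(lenient)  # ['level1', None]
```

Under the strict predicate a partly named index is filtered out before the comparison, so `level1` is dropped; under the lenient one it survives.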
Code Sample, a copy-pastable example if possible
Problem description
In Python, datatypes generally don't matter. A dataframe is a dataframe, but as shown in the example code, concatenating dataframes with an index does not behave the same as concatenating dataframes without an index. The label for a level of the index is dropped. This is a small bug. Run it several times (10-12 seems to do it) and you will see a much more worrisome issue: on occasion, the label is not dropped. Yes, the output of concat is random.
Expected Output
Output of
pd.show_versions()