
BUG: concat of MultiIndex with names passed #15787


Closed
seth-a opened this issue Mar 23, 2017 · 18 comments
Labels: Bug, MultiIndex, Reshaping

@seth-a

seth-a commented Mar 23, 2017

Code Sample, a copy-pastable example if possible

    import numpy as np
    import pandas as pd

    res = []
    for _ in range(2):
        res1 = []
        # Only occurs when dataframe is used with measure
        data = np.zeros((30, 21))
        idx = np.random.randint(0, 5, 30)
        df = pd.DataFrame(data, index=idx).loc[3]
        #df = pd.DataFrame(data[::5, :])  # Uncomment for example of correct behavior

        res1.append(pd.DataFrame(sum(data.dot(df.T))))
        tmp = pd.concat(res1, keys=[1], names=['level1'])

        res.append(tmp)
    final = pd.concat(res, keys=[i for i in range(2)], names=['level2'])
    print(final)

Problem description

In python, datatypes generally don't matter. A dataframe is a dataframe, but as shown in the example code concat'ing dataframes with an index does not have the same behavior as dataframes without an index. The label for a level of the index is dropped. This is a small bug. Run it several times (10-12 seems to do it) and you will see a much more worrisome issue: on occasion, the label is not dropped. Yes, the output of concat is random.

Expected Output

level2 level1       
0      1      0  0.0
              1  0.0
              2  0.0
              3  0.0
              4  0.0
              5  0.0
1      1      0  0.0
              1  0.0
              2  0.0
              3  0.0
              4  0.0
              5  0.0

Actual Output

level2         
0      1 0  0.0
         1  0.0
         2  0.0
         3  0.0
         4  0.0
         5  0.0
1      1 0  0.0
         1  0.0
         2  0.0
         3  0.0

@jreback
Contributor

jreback commented Mar 23, 2017

In python, datatypes generally don't matter. A dataframe is a dataframe, but as shown in the example

not sure what you mean

@jreback
Contributor

jreback commented Mar 23, 2017

can you update your example to show exactly what the problem is? you seem to be doing lots of stuff, but it's not clear what the issue is (if any).

and show a constructed frame that matches your expectations. IOW if this is a bug, then ultimately we need an example that we can do something like

from pandas.util import testing as tm
result = .....
expected = DataFrame(....)
tm.assert_frame_equal(result, expected)
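
For reference, a filled-in version of that pattern might look like the sketch below. It assumes a pandas release that ships the public `pandas.testing` module (the `pandas.util.testing` import above is the older location), and the small two-level frame is a made-up stand-in, not the reporter's data.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical stand-in frame: two index levels, only the first one named.
index = pd.MultiIndex.from_product([[1], range(3)], names=["level1", None])
df = pd.DataFrame({"col": [0.0, 0.0, 0.0]}, index=index)

# Concatenate pieces of different length under a new, named outer level.
result = pd.concat([df, df[:2]], keys=[0, 1], names=["level2"])

# The frame we expect back: existing names kept, the new name prepended.
expected_index = pd.MultiIndex.from_tuples(
    [(0, 1, 0), (0, 1, 1), (0, 1, 2), (1, 1, 0), (1, 1, 1)],
    names=["level2", "level1", None],
)
expected = pd.DataFrame({"col": [0.0] * 5}, index=expected_index)

assert_frame_equal(result, expected)
```

On a version affected by the behavior reported here, the assertion would be expected to fail on the index names rather than on the values.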

@seth-a
Author

seth-a commented Mar 23, 2017

The label of the second column of the index is being changed to None. Run the example, uncomment the line that says "Uncomment for example of correct behavior" to see what is expected. Yes, the example is doing lots of "stuff", but that's what it takes to reproduce the bug.

@jreback
Contributor

jreback commented Mar 23, 2017

@seth-a can you post an expected result and then run your example and post. the more information you show and the more obvious the quicker this will get looked at.

@seth-a
Author

seth-a commented Mar 23, 2017

I've updated the formatting in the expected results section, are you looking for something beyond what is there?

@jreback
Contributor

jreback commented Mar 23, 2017

slightly modified.

In [1]: import numpy as np
   ...: import pandas as pd
   ...: 
   ...: res = []
   ...: for _ in range(2):
   ...:     res1 = []
   ...:     # Only occurs when dataframe is used with measure
   ...:     data = np.zeros((30, 21))
   ...:     idx = np.random.randint(0, 5, 30)
   ...:     df = pd.DataFrame(data, index=idx).loc[3]
   ...:     #df = pd.DataFrame(data[::5, :])  # Uncomment for example of correct behavior
   ...: 
   ...:     df2 = pd.DataFrame(sum(data.dot(df.T)))
   ...:     df2.index.name = 'level2'
   ...:     res1.append(df2)
   ...:     tmp = pd.concat(res1, keys=[1], names=['level1'])
   ...: 
   ...:     res.append(tmp)
   ...: final = pd.concat(res, keys=[i for i in range(2)])
   ...: 

In [2]: final
Out[2]: 
                   0
  level1 level2     
0 1      0       0.0
         1       0.0
         2       0.0
         3       0.0
         4       0.0
1 1      0       0.0
         1       0.0
         2       0.0

The issue is that you are concatting 2 MultiIndexes with names of ['level1', None].
your original res is this:

In [2]: res
Out[2]: 
[            0
 level1       
 1      0  0.0
        1  0.0
        2  0.0
        3  0.0,             0
 level1       
 1      0  0.0
        1  0.0
        2  0.0
        3  0.0
        4  0.0
        5  0.0
        6  0.0
        7  0.0]

so you are expecting the passed-in names to 'automatically' figure out that the new names are the existing first level name plus the passed-in names?

how does that make sense?

I suppose that it's actually an error to pass in any names if you have a MultiIndex that you are concatting in the first place. So would take a PR for that.
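
The names in question can be inspected directly. A minimal sketch (the one-column frame is a made-up stand-in for one of the per-iteration frames) showing that the inner concat already yields a two-level index whose second level is unnamed:

```python
import pandas as pd

# Stand-in for one of the per-iteration frames: a plain, unnamed index.
piece = pd.DataFrame({"col": [0.0, 0.0, 0.0]})

# Same call shape as in the report: one key, one name for the new level.
tmp = pd.concat([piece], keys=[1], names=["level1"])

names = list(tmp.index.names)
print(tmp.index.nlevels)  # 2
print(names)              # ['level1', None]
```

The outer concat therefore receives frames that each carry a partly named MultiIndex, which is the situation described above.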

jreback added the Error Reporting, Reshaping, and Difficulty Intermediate labels Mar 23, 2017
jreback added this to the Next Major Release milestone Mar 23, 2017

@jreback
Contributor

jreback commented Mar 23, 2017

FYI here is the code for creating the concatted index. It is slightly non-trivial!

jreback changed the title from "Index names from concat have non-deterministic behavior" to "ERR: concat of MultiIndex with names passed" Mar 23, 2017

@seth-a
Author

seth-a commented Mar 23, 2017

Hmm, clearly you've missed the point. In fact pandas does automatically figure out that the new names should be the existing index plus the passed in names, as can be shown by un-commenting the line in my example saying "Uncomment for example of correct behavior". Here's an example that may make it a little easier to see,

    import numpy as np
    import pandas as pd

    def concat_multiple(bug=False):
        """
        Make a mess with DataFrames, if bug is True change behavior so that bug shows
        """
        res = []
        for _ in range(2):
            res1 = []
            # Only occurs when dataframe is used with measure
            data = np.zeros((30, 21))
            idx = np.random.randint(0, 5, 30)

            df = pd.DataFrame(data, index=idx)
            # Bug is right here.  Same dataframe, just different method of indexing
            # This is the only thing that changes
            if bug:
                df = df.loc[3]

            res1.append(pd.DataFrame(sum(data.dot(df.T))))
            tmp = pd.concat(res1, keys=[1], names=['level1'])

            res.append(tmp)
        return pd.concat(res, keys=[i for i in range(2)], names=['level2'])

    res_bad = concat_multiple(bug=True)   # Results when bug is present
    res_good = concat_multiple(bug=False) # Results when bug is not present

    print("These indexes should be the same")
    print(res_bad.index.names)
    print(res_good.index.names)
    if res_bad.index.names != res_good.index.names:
        print("But they aren't")

    # It gets worse, let's run it several times and see how often the bug is present
    good = [concat_multiple(bug=True).index.names == res_good.index.names for _ in range(20)]
    print("pandas is broken  {}% of the time".format((1 - sum(good) / len(good)) * 100))

And here's the output:

These indexes should be the same
['level2', None, None]
['level2', 'level1', None]
But they aren't
pandas is broken  95.0% of the time

And when a bug like this exists you don't need to link to the source code to convince me it's a mess.

@jreback
Contributor

jreback commented Mar 23, 2017

@seth-a

this is a really, really odd thing to do; of course this doesn't work:

    idx = np.random.randint(0, 5, 30)

@jreback
Contributor

jreback commented Mar 23, 2017

you are creating a non-unique index. sure, it might be a bug, but you are by definition making this non-deterministic: what df.loc[3] returns depends on whether the label 3 is unique in the index.

In [18]: df = pd.DataFrame({'A': [1,2,3]},index=[1,2,3])

In [19]: df.loc[3]
Out[19]: 
A    3
Name: 3, dtype: int64

In [20]: df = pd.DataFrame({'A': [1,2,3]},index=[1,3,3])

In [21]: df.loc[3]
Out[21]: 
   A
3  2
3  3

these have different shapes so different things happen.

@jreback
Contributor

jreback commented Mar 23, 2017

It may be a bug, likely around concatting non-unique MI's. But please produce an example that is deterministic within these parameters. (or at the very least use a random seed).
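
To make the dependence on the random draw concrete, here is a small sketch. The helper name `rows_with_label` and the two-column frame are made up for illustration; the helper counts how many rows a list-style `.loc` lookup for label 3 returns for a given seed, which in the reported example appears to decide the length of each piece being concatenated.

```python
import numpy as np
import pandas as pd


def rows_with_label(seed, label=3, n_rows=30):
    """Return how many rows of a randomly indexed frame carry `label`."""
    rng = np.random.RandomState(seed)  # seeded, so each call is repeatable
    idx = rng.randint(0, 5, n_rows)
    df = pd.DataFrame(np.zeros((n_rows, 2)), index=idx)
    if label not in df.index:
        return 0
    # A list selector always returns a DataFrame, even for a single match.
    return len(df.loc[[label]])


counts = {seed: rows_with_label(seed) for seed in range(5)}
print(counts)
```

With a fixed seed the count is repeatable, so any remaining difference in the concat result cannot be blamed on the random index.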

@seth-a
Author

seth-a commented Mar 23, 2017

What do you mean it doesn't work? Is it throwing an error?

@jreback
Contributor

jreback commented Mar 23, 2017

please read my responses.

@seth-a
Author

seth-a commented Mar 23, 2017

Ah, so you mean the code works fine, it just isn't a typical use case. Ok, here's the code using a seed so the behavior is 100% deterministic, aside from whatever pandas is doing:

    import numpy as np
    import pandas as pd

    def concat_multiple(bug=False):
        """
        Make a mess with DataFrames, if bug is True change behavior so that bug shows
        """
        np.random.seed(5)
        res = []
        for _ in range(2):
            res1 = []
            # Only occurs when dataframe is used with measure
            data = np.zeros((30, 21))
            idx = np.random.randint(0, 5, 30)

            df = pd.DataFrame(data, index=idx)
            # Bug is right here.  Same dataframe, just different method of indexing
            if bug:
                df = df.loc[3]

            res1.append(pd.DataFrame(sum(data.dot(df.T))))
            tmp = pd.concat(res1, keys=[1], names=['level1'])

            res.append(tmp)
        return pd.concat(res, keys=[i for i in range(2)], names=['level2'])

    res_bad = concat_multiple(bug=True)   # Results when bug is present
    res_good = concat_multiple(bug=False) # Results when bug is not present

    print("These indexes should be the same")
    print(res_bad.index.names)
    print(res_good.index.names)
    if res_bad.index.names != res_good.index.names:
        print("But they aren't")

    # It gets worse, let's run it several times and see how often the bug is present
    good = [concat_multiple(bug=True).index.names == res_good.index.names for _ in range(20)]
    print("pandas is broken  {}% of the time".format((1 - sum(good) / len(good)) * 100))

Output:

These indexes should be the same
['level2', None, None]
['level2', 'level1', None]
But they aren't
pandas is broken  100.0% of the time

@jorisvandenbossche
Member

A much simpler reproducible example:

In [32]: df = pd.DataFrame({'col': range(5)}, index=pd.MultiIndex.from_product([[1], range(5)], names=['level1', None]
    ...: ))

In [33]: df
Out[33]:
          col
level1
1      0    0
       1    1
       2    2
       3    3
       4    4

In [34]: pd.concat([df, df], keys=[1, 2], names=['level2'])
Out[34]:
                 col
level2 level1
1      1      0    0
              1    1
              2    2
              3    3
              4    4
2      1      0    0
              1    1
              2    2
              3    3
              4    4

In [35]: pd.concat([df, df[:3]], keys=[1, 2], names=['level2'])
Out[35]:
            col
level2
1      1 0    0
         1    1
         2    2
         3    3
         4    4
2      1 0    0
         1    1
         2    2      

@seth-a your example is really complex, which makes it much harder to see exactly what is going on, or what you think is wrong.
But in any case, the above clearly shows a bug.

@jreback
Contributor

jreback commented Mar 23, 2017

thanks @jorisvandenbossche

yeah the name handling is prob not very robust.

jreback added the Bug and MultiIndex labels and removed the Error Reporting label Mar 23, 2017
jreback changed the title from "ERR: concat of MultiIndex with names passed" to "BUG: concat of MultiIndex with names passed" Mar 23, 2017

@seth-a
Author

seth-a commented Mar 23, 2017

@jorisvandenbossche, It's a much-simplified version of a script I was working on. I'm glad you found an easier case where it appears.

@jorisvandenbossche
Member

Additional observation: it only occurs when the name of the other level is None, otherwise it works fine:

In [40]: df = pd.DataFrame({'col': range(5)}, index=pd.MultiIndex.from_product([[1], range(5)],
    ...:                   names=['level1', 'level3']))

In [41]: pd.concat([df, df[:3]], keys=[1, 2], names=['level2'])
Out[41]:
                      col
level2 level1 level3
1      1      0         0
              1         1
              2         2
              3         3
              4         4
2      1      0         0
              1         1
              2         2

and (as already seen in the example above) only when the index values are not exactly the same (a different length, as in the example above, or different values of the same length, also triggers it)
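
On a version affected by this, one possible workaround (a sketch, not an officially recommended approach) is to assign the level names explicitly after concatenating, since `MultiIndex.set_names` does not depend on how concat inferred them:

```python
import pandas as pd

index = pd.MultiIndex.from_product([[1], range(5)], names=["level1", None])
df = pd.DataFrame({"col": range(5)}, index=index)

# Different-length pieces with a partly named index: the triggering case.
result = pd.concat([df, df[:3]], keys=[1, 2], names=["level2"])

# Overwrite whatever names concat produced with the intended ones.
result.index = result.index.set_names(["level2", "level1", None])

print(list(result.index.names))  # ['level2', 'level1', None]
```

This only repairs the labels after the fact; the values and the level structure are left exactly as concat produced them.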

funnycrab added a commit to funnycrab/pandas that referenced this issue Apr 9, 2017
This is a fix attempt for issue pandas-dev#15787.

The discrepancy between definition and corresponding implementation of so-called non-none names in function _get_consensus_names leads to this bug.
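
The commit message does not spell out the discrepancy. One plausible reading, stated here as an assumption rather than taken from the pandas source, is that "non-None names" could mean either "every level is named" or "at least one level is named". The standalone sketch below (the function `consensus_names` and its `require_all_named` flag are hypothetical, not pandas' `_get_consensus_names`) shows how the two readings differ for indexes whose names are `['level1', None]`:

```python
def consensus_names(name_lists, require_all_named):
    """Return the names shared by several indexes, or all-None if unclear.

    Hypothetical illustration only; this is not pandas' implementation.
    """
    if require_all_named:
        # Strict reading: only count indexes where every level is named.
        candidates = {
            tuple(names) for names in name_lists
            if all(n is not None for n in names)
        }
    else:
        # Lenient reading: count indexes where at least one level is named.
        candidates = {
            tuple(names) for names in name_lists
            if any(n is not None for n in names)
        }
    if len(candidates) == 1:
        return list(candidates.pop())
    return [None] * len(name_lists[0])


partly_named = [("level1", None), ("level1", None)]
print(consensus_names(partly_named, require_all_named=True))   # [None, None]
print(consensus_names(partly_named, require_all_named=False))  # ['level1', None]
```

Under the strict reading a partly named index loses all of its names, which matches the symptom in this thread; whether that is exactly what the fix changed is not confirmed here.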
jreback modified the milestones: 0.20.0, Next Major Release Apr 10, 2017