Skip to content

QST: Any support for df["x"] = y where y.columns are MultiIndexed? #35727

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
henrifroese opened this issue Aug 14, 2020 · 9 comments
Closed

QST: Any support for df["x"] = y where y.columns are MultiIndexed? #35727

henrifroese opened this issue Aug 14, 2020 · 9 comments
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@henrifroese
Copy link

We're working with DataFrames where the columns are MultiIndexed, so e.g. ones that look like this:

        pca          
       pca1      pca2
0  0.754675  1.868685
1 -1.861651 -0.048236
2 -0.797750  0.388400

which one can get through pd.DataFrame(np.random.normal(size=(6,)).reshape((3,2)), columns=pd.MultiIndex.from_product([['pca'], ["pca1", "pca2"]])).

We now want to combine several of those to e.g. get this:

        pca                 nmf          
       pca1      pca2      nmf1      nmf2
0  1.671707  0.452155  1.671707  0.452155
1  0.861315 -0.100849  0.861315 -0.100849
2  1.056616 -0.852532  1.056616 -0.852532

We know that we can do this through e.g. pd.concat([df_pca, df_nmf], axis=1). Is there any support for doing the same like this: df["pca"] = df_pca for some df? We get ValueError: Wrong number of items passed 4, placement implies 1.

It's really important for us to allow usage like this: df["pca"] = df_pca and not just through concat.

@henrifroese henrifroese added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Aug 14, 2020
@rhshadrach
Copy link
Member

Thanks for asking this. Can you explain why concat is not sufficient here?

@henrifroese
Copy link
Author

henrifroese commented Aug 14, 2020

Sure, see here (probably should have linked this in the first place :octocat: ). You can already see our plans to change some pandas code, we were hoping for a nicer solution. From a user perspective working with these "Subcolumns" (i.e. MultiIndex in the columns), it feels like df["pca"] = df_pca should work.

EDIT: We just found a better solution; it still requires changing pandas code but it's much cleaner:

import pandas as pd
import numpy as np

_pd_original__setitem__ = pd.DataFrame.__setitem__
pd.DataFrame._original__setitem__ = _pd_original__setitem__

def _hero__setitem__(self, key, value):

    # called when doing self["key"] = value
    # e.g. df["pca"] = hero.pca(df["texts"]) is
    # doing pd.DataFrame.__setitem__(self=df, key="pca", value=hero.pca(df["texts"])
    #
    # what we want to do:
    # 1. user calls df["pca"] = hero.pca(df["texts"]), so __setitem__(self=df, key="pca", value=hero.pca(df["texts"])
    # 2. we make self multiindexed if it isn't already
    # 3. we change value columns so the first level is named `key`
    # 4. we do self[value.columns] = value

    # 1.
    if isinstance(value, pd.DataFrame) and isinstance(value.columns, pd.MultiIndex) and isinstance(key, str):

        # 2.
        if not isinstance(self.columns, pd.MultiIndex):
            self.columns = pd.MultiIndex.from_tuples([(col_name, "") for col_name in self.columns.values])

        # 3.
        value.columns = pd.MultiIndex.from_tuples([(key, subcol_name) for _, subcol_name in value.columns.values])

        # 4.
        self[value.columns] = value

    else:

        self._original__setitem__(key, value)


pd.DataFrame.__setitem__ = _hero__setitem__

Example:

df = pd.DataFrame(np.random.normal(size=(6,)).reshape(
    (3, 2)), columns=pd.MultiIndex.from_product([['pca'], ["pca1", "pca2"]]))
df2 = pd.DataFrame(np.random.normal(size=(6,)).reshape(
    (3, 2)), columns=pd.MultiIndex.from_product([['pipapo'], ["nmf1", "nmf2"]]))


df["nmf"] = df2
print(df)

@rhshadrach
Copy link
Member

Is something of the form:

df[df_pca.columns] = df_pca

acceptable? Here is a small example:

df = pd.DataFrame(
    {
        ('a', '1'): [1, 2, 3],
    }        
)

new_vals = pd.DataFrame(
    {
        ('b', '1'): [4, 5, 6],
        ('b', '2'): [7, 8, 9],
    }
)
df[new_val.columns] = new_vals
print(df)

Output:

   a  b   
   1  1  2
0  1  4  7
1  2  5  8
2  3  6  9

Note: in the link, you mention combining a DataFrame with an Index of columns with one that has a MultiIndex. Even with pd.concat, you will not get a MultiIndex:

df = pd.DataFrame(
    {
        'a': [1, 2, 3],
    }        
)

new_val = pd.DataFrame(
    {
        ('b', '1'): [4, 5, 6],
        ('b', '2'): [7, 8, 9],
    }
)
df[new_val.columns] = new_val
print(df)

Output:

   a  (b, 1)  (b, 2)
0  1       4       7
1  2       5       8
2  3       6       9

@henrifroese
Copy link
Author

Thanks, see the edit above 🤖 😬 . It makes our implementation way easier so that's great, still we have to change some Pandas just for our library so that's suboptimal.

@rhshadrach
Copy link
Member

Glad you found a workable solution. As for implementing something like this in pandas itself, I think the added complexity may not be worth it. Would like to hear others thoughts on this though.

@henrifroese
Copy link
Author

henrifroese commented Aug 16, 2020

Glad you found a workable solution

Right, we have sadly now noticed this (so it isn't a viable solution after all 😕):

So our main issue is that we want to

  • store a matrix in a DataFrame that looks nice, so not just one row per cell but rather one entry per cell (which we can achieve through the approach above with "Subcolumns")
  • and allow users to place this in their DataFrame with df["pca"] = ....

The problem we're now facing with our implementation:

Internally, pandas will at some point loop over all "subcolumns" in pd.DataFrame._setitem_array and do

for k1, k2 in zip(key, value.columns):
    self[k1] = value[k2]

which is of course extremely slow when working with a few hundred dimensions / subcolumns.

So seems like we're actually back at square one to find a performant and good-looking implementation of matrices in DataFrames.

@iamlemec
Copy link
Contributor

Speaking mostly as a pretty heavy pandas user here, this has always seemed like the biggest piece of missing functionality to me. I feel like a lot of what I do (as an economist and otherwise) is looking at various data series over a panel (say US counties over time). If you have GDP and population, it would be amazing to just be able to do the intuitive thing to calculate and assign GDP per capita. I know how to use concat or stack/unstack, but I've seen less experienced users get tripped up by this.

I'm not super well-versed on MultiIndex internals, but I am curious, what are the major hurdles preventing this? Is it issues with determining whether the self and other indices are compatible or more a matter of possible unintended consequences on the user side?

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Enhancement and removed Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Aug 23, 2020
@jreback jreback added this to the Contributions Welcome milestone Aug 23, 2020
@jreback
Copy link
Contributor

jreback commented Aug 23, 2020

this is not that hard though there might be some edge cases; i am pretty sure this has come up before if u would search for similar issues

it would take a community pull request to implement

@rhshadrach
Copy link
Member

Closing as a duplicate of #7475.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

No branches or pull requests

4 participants