QST: Any support for df["x"] = y where y.columns are MultiIndexed? #35727

henrifroese · 2020-08-14T16:47:21Z

We're working with DataFrames where the columns are MultiIndexed, so e.g. ones that look like this:

        pca          
       pca1      pca2
0  0.754675  1.868685
1 -1.861651 -0.048236
2 -0.797750  0.388400

which one can get through pd.DataFrame(np.random.normal(size=(6,)).reshape((3,2)), columns=pd.MultiIndex.from_product([['pca'], ["pca1", "pca2"]])).

We now want to combine several of those to e.g. get this:

        pca                 nmf          
       pca1      pca2      nmf1      nmf2
0  1.671707  0.452155  1.671707  0.452155
1  0.861315 -0.100849  0.861315 -0.100849
2  1.056616 -0.852532  1.056616 -0.852532

We know that we can do this through e.g. pd.concat([df_pca, df_nmf], axis=1). Is there any support for doing the same like this: df["pca"] = df_pca for some df? We get ValueError: Wrong number of items passed 4, placement implies 1.

It's really important for us to allow usage like this: df["pca"] = df_pca and not just through concat.
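For reference, the concat route mentioned above can be sketched as follows. The frames df_pca and df_nmf are built here from seeded random values purely for illustration; only pd.concat and pd.MultiIndex.from_product are used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Two frames whose columns are MultiIndexed, as in the tables above.
df_pca = pd.DataFrame(
    rng.normal(size=(3, 2)),
    columns=pd.MultiIndex.from_product([["pca"], ["pca1", "pca2"]]),
)
df_nmf = pd.DataFrame(
    rng.normal(size=(3, 2)),
    columns=pd.MultiIndex.from_product([["nmf"], ["nmf1", "nmf2"]]),
)

# Side-by-side concatenation keeps both column levels.
combined = pd.concat([df_pca, df_nmf], axis=1)
print(combined.columns.tolist())
```

Selecting combined["pca"] then returns the two pca subcolumns as a DataFrame.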


rhshadrach · 2020-08-14T18:35:59Z

Thanks for asking this. Can you explain why concat is not sufficient here?

henrifroese · 2020-08-14T18:43:42Z

Sure, see here (probably should have linked this in the first place). You can already see our plans to change some pandas code there; we were hoping for a nicer solution. From a user perspective, when working with these "Subcolumns" (i.e. a MultiIndex in the columns), it feels like df["pca"] = df_pca should work.

EDIT: We just found a better solution; it still requires changing pandas code but it's much cleaner:

import pandas as pd
import numpy as np

_pd_original__setitem__ = pd.DataFrame.__setitem__
pd.DataFrame._original__setitem__ = _pd_original__setitem__

def _hero__setitem__(self, key, value):

    # called when doing self["key"] = value
    # e.g. df["pca"] = hero.pca(df["texts"]) is
    # doing pd.DataFrame.__setitem__(self=df, key="pca", value=hero.pca(df["texts"]))
    #
    # what we want to do:
    # 1. user calls df["pca"] = hero.pca(df["texts"]), so __setitem__(self=df, key="pca", value=hero.pca(df["texts"]))
    # 2. we make self multiindexed if it isn't already
    # 3. we change value columns so the first level is named `key`
    # 4. we do self[value.columns] = value

    # 1.
    if isinstance(value, pd.DataFrame) and isinstance(value.columns, pd.MultiIndex) and isinstance(key, str):

        # 2.
        if not isinstance(self.columns, pd.MultiIndex):
            self.columns = pd.MultiIndex.from_tuples([(col_name, "") for col_name in self.columns.values])

        # 3. (work on a copy so the caller's DataFrame keeps its own column labels)
        value = value.copy()
        value.columns = pd.MultiIndex.from_tuples([(key, subcol_name) for _, subcol_name in value.columns.values])

        # 4.
        self[value.columns] = value

    else:

        self._original__setitem__(key, value)


pd.DataFrame.__setitem__ = _hero__setitem__

Example:

df = pd.DataFrame(np.random.normal(size=(6,)).reshape(
    (3, 2)), columns=pd.MultiIndex.from_product([['pca'], ["pca1", "pca2"]]))
df2 = pd.DataFrame(np.random.normal(size=(6,)).reshape(
    (3, 2)), columns=pd.MultiIndex.from_product([['pipapo'], ["nmf1", "nmf2"]]))


df["nmf"] = df2
print(df)
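The same four steps can also be sketched without patching pd.DataFrame.__setitem__. The helper name set_subcolumns below is hypothetical (it is not part of pandas or texthero), it returns a new frame rather than assigning in place, and it assumes two column levels throughout.

```python
import numpy as np
import pandas as pd


def set_subcolumns(df, key, value):
    """Hypothetical helper: return a copy of `df` with `value` stored under top-level label `key`."""
    out = df.copy()

    # Promote flat column labels to two levels (step 2 above).
    if not isinstance(out.columns, pd.MultiIndex):
        out.columns = pd.MultiIndex.from_tuples([(col, "") for col in out.columns])

    # Relabel the first level on a copy so the caller's `value` is untouched (step 3).
    block = value.copy()
    block.columns = pd.MultiIndex.from_tuples([(key, sub) for _, sub in value.columns])

    # Drop any existing columns under `key`, then attach the whole block in one concat.
    keep = out.columns.get_level_values(0) != key
    return pd.concat([out.loc[:, keep], block], axis=1)


df = pd.DataFrame({"text": ["a", "b", "c"]})
df_nmf = pd.DataFrame(
    np.arange(6.0).reshape(3, 2),
    columns=pd.MultiIndex.from_product([["pipapo"], ["nmf1", "nmf2"]]),
)
result = set_subcolumns(df, "nmf", df_nmf)
print(result.columns.tolist())
```

This trades the df["nmf"] = ... syntax for an explicit function call, which is exactly the usage the thread is trying to avoid, so it is only a fallback. Note that pd.concat aligns on the row index, so both frames are assumed to share one.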

rhshadrach · 2020-08-14T19:03:47Z

Is something of the form:

df[df_pca.columns] = df_pca

acceptable? Here is a small example:

df = pd.DataFrame(
    {
        ('a', '1'): [1, 2, 3],
    }        
)

new_vals = pd.DataFrame(
    {
        ('b', '1'): [4, 5, 6],
        ('b', '2'): [7, 8, 9],
    }
)
df[new_vals.columns] = new_vals
print(df)

Output:

Note: in the link, you mention combining a DataFrame with an Index of columns with one that has a MultiIndex. Even with pd.concat, you will not get a MultiIndex:

df = pd.DataFrame(
    {
        'a': [1, 2, 3],
    }        
)

new_val = pd.DataFrame(
    {
        ('b', '1'): [4, 5, 6],
        ('b', '2'): [7, 8, 9],
    }
)
df[new_val.columns] = new_val
print(df)

Output:

   a  (b, 1)  (b, 2)
0  1       4       7
1  2       5       8
2  3       6       9
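The pd.concat case from the note can be checked with a small sketch like the one below. It reuses the frames from the example; promoting the flat labels with an empty second level is one possible workaround, assumed here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
new_val = pd.DataFrame({("b", "1"): [4, 5, 6], ("b", "2"): [7, 8, 9]})

# Flat Index + MultiIndex: the result keeps a flat Index whose labels include tuples.
flat = pd.concat([df, new_val], axis=1)
print(type(flat.columns).__name__, flat.columns.nlevels)

# Promoting the flat labels to two levels first keeps the MultiIndex.
promoted = df.copy()
promoted.columns = pd.MultiIndex.from_tuples([(col, "") for col in promoted.columns])
multi = pd.concat([promoted, new_val], axis=1)
print(type(multi.columns).__name__, multi.columns.nlevels)
```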

henrifroese · 2020-08-14T19:08:12Z

Thanks, see the edit above 🤖 😬. It makes our implementation way easier, so that's great; still, we have to change some pandas code just for our library, so that's suboptimal.

rhshadrach · 2020-08-16T04:30:59Z

Glad you found a workable solution. As for implementing something like this in pandas itself, I think the added complexity may not be worth it. Would like to hear others' thoughts on this though.

henrifroese · 2020-08-16T10:39:14Z

> Glad you found a workable solution

Right, we have sadly now noticed this (so it isn't a viable solution after all 😕):

So our main issue is that we want to

1. store a matrix in a DataFrame that looks nice, so not just one row per cell but rather one entry per cell (which we can achieve through the approach above with "Subcolumns"), and
2. allow users to place this in their DataFrame with df["pca"] = ....

The problem we're now facing with our implementation:

Internally, pandas will at some point loop over all "subcolumns" in pd.DataFrame._setitem_array and do

for k1, k2 in zip(key, value.columns):
    self[k1] = value[k2]

which is of course extremely slow when working with a few hundred dimensions / subcolumns.

So seems like we're actually back at square one to find a performant and good-looking implementation of matrices in DataFrames.
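A rough way to compare the per-column loop with a single concat is sketched below. The sizes (100 rows, 200 subcolumns) and the frame names are made up for illustration, and the timings depend on the machine and pandas version, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

n_rows, n_sub = 100, 200  # illustrative sizes only

base = pd.DataFrame({("text", ""): ["doc"] * n_rows})
values = pd.DataFrame(
    np.zeros((n_rows, n_sub)),
    columns=pd.MultiIndex.from_product([["pca"], [f"pca{i}" for i in range(n_sub)]]),
)

# Column-by-column assignment, similar in spirit to the loop quoted above.
# pandas may emit a PerformanceWarning about fragmentation here.
start = time.perf_counter()
looped = base.copy()
for col in values.columns:
    looped[col] = values[col]
loop_seconds = time.perf_counter() - start

# One concat call attaches all subcolumns at once.
start = time.perf_counter()
joined = pd.concat([base, values], axis=1)
concat_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, concat: {concat_seconds:.4f}s")
```

Both routes end up with the same columns and values; only the cost of getting there differs.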

iamlemec · 2020-08-22T05:01:16Z

Speaking mostly as a pretty heavy pandas user here, this has always seemed like the biggest piece of missing functionality to me. I feel like a lot of what I do (as an economist and otherwise) is looking at various data series over a panel (say US counties over time). If you have GDP and population, it would be amazing to just be able to do the intuitive thing to calculate and assign GDP per capita. I know how to use concat or stack/unstack, but I've seen less experienced users get tripped up by this.

I'm not super well-versed on MultiIndex internals, but I am curious, what are the major hurdles preventing this? Is it issues with determining whether the self and other indices are compatible or more a matter of possible unintended consequences on the user side?
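As a sketch of the panel use case with today's tools: the toy panel below (two counties "A" and "B", two years, invented numbers) computes GDP per capita and attaches it under a new top-level label via pd.concat, since a direct panel["gdp_pc"] = ... assignment is the feature being asked for.

```python
import pandas as pd

# Toy panel: rows are years, columns are (variable, county). All values are invented.
cols = pd.MultiIndex.from_product([["gdp", "pop"], ["A", "B"]])
panel = pd.DataFrame(
    [[100.0, 200.0, 10.0, 20.0], [120.0, 210.0, 10.0, 30.0]],
    index=[2019, 2020],
    columns=cols,
)

# Selecting a top-level label drops that level, so the division aligns on county.
gdp_pc = panel["gdp"] / panel["pop"]

# Passing a dict to concat adds its key as the outer column level.
panel = pd.concat([panel, pd.concat({"gdp_pc": gdp_pc}, axis=1)], axis=1)
print(panel["gdp_pc"])
```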

jreback · 2020-08-23T00:28:30Z

this is not that hard though there might be some edge cases; i am pretty sure this has come up before if u would search for similar issues

it would take a community pull request to implement

rhshadrach · 2020-08-23T13:57:47Z

Closing as a duplicate of #7475.

henrifroese added the Needs Triage and Usage Question labels Aug 14, 2020

henrifroese mentioned this issue Aug 14, 2020

Can we avoid having a cell with a list? jbesomi/texthero#151

Closed

jreback added the Indexing, MultiIndex and Enhancement labels and removed the Needs Triage and Usage Question labels Aug 23, 2020

jreback added this to the Contributions Welcome milestone Aug 23, 2020

rhshadrach closed this as completed Aug 23, 2020

mk2510 mentioned this issue Sep 9, 2020

Add pandas __setitem__ support for DocumentTermDF jbesomi/texthero#158

Closed

iamlemec mentioned this issue Sep 30, 2020

ENH: allow for top and mid-level assignment to DataFrames with MultIndex columns #7475 #36755

Closed
