QST: Any support for df["x"] = y where y.columns are MultiIndexed? #35727
Comments
Thanks for asking this. Can you explain why concat is not sufficient here? |
Sure, see here (probably should have linked this in the first place). EDIT: We just found a better solution; it still requires changing pandas code, but it's much cleaner:
Example:
|
Is something of the form:
acceptable? Here is a small example:
Output:
Note: in the link, you mention combining a DataFrame with an Index of columns with one that has a MultiIndex. Even with pd.concat, you will not get a MultiIndex:
Output:
|
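The note above, about combining a frame that has a flat column Index with one that has MultiIndex columns, can be checked with a minimal sketch. The frame names `flat` and `multi` and the top-level label `"raw"` are made up for illustration and do not come from the thread. The second half shows one possible workaround under that assumption: lift the flat columns to two levels before concatenating.

```python
import numpy as np
import pandas as pd

# A frame with an ordinary (flat) column Index.
flat = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

# A frame with two-level MultiIndex columns, as in the issue.
multi = pd.DataFrame(
    np.arange(6.0).reshape(3, 2),
    columns=pd.MultiIndex.from_product([["pca"], ["pca1", "pca2"]]),
)

# Concatenating the two directly mixes a plain label with tuples,
# so the resulting columns are not a MultiIndex.
mixed = pd.concat([flat, multi], axis=1)
print(type(mixed.columns).__name__)

# Possible workaround: give the flat frame a (hypothetical) top-level
# label first, so both inputs have two column levels.
flat_lifted = flat.copy()
flat_lifted.columns = pd.MultiIndex.from_product([["raw"], flat.columns])
aligned = pd.concat([flat_lifted, multi], axis=1)
print(type(aligned.columns).__name__)
```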
Thanks, see the edit above 🤖 😬. It makes our implementation much easier, which is great, but we still have to change some pandas code just for our library, which is suboptimal. |
Glad you found a workable solution. As for implementing something like this in pandas itself, I think the added complexity may not be worth it. I would like to hear others' thoughts on this, though. |
Right, we have sadly now noticed this (so it isn't a viable solution after all 😕): So our main issue is that we want to
The problem we're now facing with our implementation: Internally, pandas will at some point loop over all "subcolumns" in
which is of course extremely slow when working with a few hundred dimensions / subcolumns. So it seems like we're actually back at square one in finding a performant and good-looking implementation of matrices in DataFrames. |
Speaking mostly as a pretty heavy pandas user here, this has always seemed like the biggest piece of missing functionality to me. I feel like a lot of what I do (as an economist and otherwise) is looking at various data series over a panel (say US counties over time). If you have GDP and population, it would be amazing to just be able to do the intuitive thing to calculate and assign GDP per capita. I know how to use concat or stack/unstack, but I've seen less experienced users get tripped up by this. I'm not super well-versed on MultiIndex internals, but I am curious, what are the major hurdles preventing this? Is it issues with determining whether the self and other indices are compatible or more a matter of possible unintended consequences on the user side? |
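For readers following along, the GDP-per-capita case can be sketched with the `concat` route mentioned above. The panel layout, county names, and numbers are invented for illustration; this is only one way to do it, not something proposed in the thread.

```python
import pandas as pd

# Hypothetical panel: top column level is the variable, second is the county.
cols = pd.MultiIndex.from_product([["gdp", "pop"], ["county_a", "county_b"]])
panel = pd.DataFrame(
    [[100.0, 200.0, 10.0, 20.0],
     [110.0, 220.0, 10.0, 20.0]],
    index=[2019, 2020],
    columns=cols,
)

# Selecting a top-level key drops that level, so the two blocks align by county.
gdp_pc = panel["gdp"] / panel["pop"]

# The intuitive `panel["gdp_pc"] = gdp_pc` is what the issue asks for.
# With concat, the block is first wrapped under a new top-level key.
gdp_pc_block = pd.concat({"gdp_pc": gdp_pc}, axis=1)
result = pd.concat([panel, gdp_pc_block], axis=1)
print(result)
```

Passing a dict to `pd.concat` uses its keys as the new outermost column level, which is why `gdp_pc_block` lines up with the existing two-level columns.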
This is not that hard, though there might be some edge cases. I am pretty sure this has come up before, if you search for similar issues. It would take a community pull request to implement. |
Closing as a duplicate of #7475. |
We're working with DataFrames where the columns are MultiIndexed, so e.g. ones that look like this:
which one can get through
pd.DataFrame(np.random.normal(size=(6,)).reshape((3,2)), columns=pd.MultiIndex.from_product([['pca'], ["pca1", "pca2"]]))
We now want to combine several of those to e.g. get this:
We know that we can do this through e.g.
pd.concat([df_pca, df_nmf], axis=1)
Is there any support for doing the same like this:
df["pca"] = df_pca
for some df? We get
ValueError: Wrong number of items passed 4, placement implies 1
It's really important for us to allow usage like this:
df["pca"] = df_pca
and not just through concat.
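A minimal, self-contained sketch of the `concat` route described above. The helper `make_block`, the `"nmf1"`/`"nmf2"` sub-column names, and the seeded random generator are assumptions added for illustration; only the `pca` frame construction and the `pd.concat` call come from the issue text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable


def make_block(name, n_rows=3, n_cols=2):
    """Build a frame whose columns are (name, name1), (name, name2), ..."""
    sub = [f"{name}{i + 1}" for i in range(n_cols)]
    cols = pd.MultiIndex.from_product([[name], sub])
    return pd.DataFrame(rng.normal(size=(n_rows, n_cols)), columns=cols)


df_pca = make_block("pca")
df_nmf = make_block("nmf")

# Side-by-side concatenation keeps the two-level columns intact.
combined = pd.concat([df_pca, df_nmf], axis=1)

# Selecting the top-level key returns the whole "pca" block again.
print(combined["pca"])
```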