What to do with future Ambiguity Error about overlapping names when sorting? #21080
I think the original idea was that this would indeed not be allowed any more, so that you cannot sort using a name that is both an index level name and a column name. |
This is also consistent with the idea of seeing an index as a "special column", and thus you have two "columns" with the same name, which currently also raises an error if you try to sort on that. |
My recollection aligns with @jorisvandenbossche's, that we were working towards treating index levels as 'special columns'. Perhaps we should start collecting some common use cases where the ambiguity arises. @TomAugspurger could you post an example of the |
In [5]: def func(df):
   ...:     return df.assign(b=df.b - df.b.mean())
   ...:

In [6]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

In [7]: df.groupby("a").apply(func)
Out[7]:
     a    b
a
1 0  1  0.0
2 1  2  0.0 |
To be clear, I'm OK with going down this road of "indexes are a special type of column, so don't have overlapping names". Perhaps we'll use this issue to collect use-cases where they arise, and document how to avoid them? e.g. the last one would be something like "use |
Closing this, and we can improve docs as we discover problems. |
It is not ambiguous when I specify axis=0:
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([1, 2], name='a'))
df.sort_values(['a', 'b'], axis=0)
But the warning is still raised. |
@xieyuheng I agree with you that there should not be a warning if you specify axis=0, but pandas owners often have a special way to treat bugs, warnings |
@VelizarVESSELINOV what do you mean? @xieyuheng I think that is ambiguous. It's not clear whether the |
Another example of the error:
pd.DataFrame(
    [1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 7, 8, 9], columns=['a']
).groupby(
    'a',
    # as_index=False
).agg({
    'a': 'count'
}).sort_values(
    'a'
)
Traceback:
|
This negates the usefulness of the entire groupby. I would prefer if this resulted in the same groupby values being shown without the ambiguous name (maybe with the word 'values' for the index column?). |
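One way to get per-group counts without ending up with a column and an index level that are both named 'a' is to give the count its own name. This is only a sketch of a workaround, not a maintainer recommendation; the column name "count" is an assumption chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 7, 8, 9], columns=["a"])

# size() returns a Series indexed by 'a'; reset_index(name=...) turns the
# group key back into a column and stores the counts under a separate,
# non-colliding name ("count" is a hypothetical choice).
counts = df.groupby("a").size().reset_index(name="count")

# Sorting by "count" is now unambiguous; "a" is added as a tie-breaker so the
# row order is deterministic.
result = counts.sort_values(["count", "a"])
```

Because the result has a plain default index and two distinctly named columns, later calls such as sort_values never have to guess between an index level and a column.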
Renaming the index to a different name and using the new name as the index resolves a similar issue that arises with the use of pivot_table. For example, for a DataFrame df that contains 'A', 'age', and 'gender', if you first groupby 'A' and then want to create a pivot_table using 'A' as the index, pandas raises TypeError: 'A' used as index and column. The following worked for me: create a new variable and use it as the index. df["a"] = df.A |
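The workaround described in this comment could look roughly like the following. The data and the way the name collision is produced (set_index with drop=False) are assumptions made for illustration; the original comment does not show how its frame was built.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["x", "x", "y", "y"],
        "age": [20, 30, 40, 50],
        "gender": ["f", "m", "f", "m"],
    }
)

# Simulate the collision: keep column 'A' and also use it as the index,
# so the index level and the column share the same name.
df = df.set_index("A", drop=False)

# Copy the values into a new, unambiguous column (as in df["a"] = df.A).
# to_numpy() sidesteps any index alignment during the assignment.
df["a"] = df["A"].to_numpy()

# Pivot on the new column instead of the ambiguous name 'A'.
table = df.pivot_table(index="a", columns="gender", values="age", aggfunc="mean")
```

The pivot only ever refers to "a", "gender", and "age", so the duplicated name 'A' is never looked up.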
xref #17361
What should the user do in this situation? Should we provide a keyword to disambiguate? A literal like pd.ColumnLabel('a') or pd.IndexName('a')? Or do we require that they rename an index or column?
Right now, they're essentially stuck with the last one. If we want to discourage that, then I suppose that's OK. But it's somewhat common to end up with overlapping names, from e.g. a groupby.
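For reference, the "rename an index or column" option mentioned above might look like this in practice. It is a minimal sketch; the replacement name "a_idx" is a hypothetical choice, and the frame is a small made-up example with the same name on the index and a column.

```python
import pandas as pd

# 'a' is both the index name and a column label.
df = pd.DataFrame({"a": [2, 1], "b": [3, 4]}, index=pd.Index([1, 2], name="a"))

# Rename the index level so 'a' refers only to the column, then sort by it.
by_column = df.rename_axis("a_idx").sort_values(["a", "b"])

# To sort by the index instead, sort_index never consults column labels.
by_index = df.sort_index()
```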
cc @jmmease