What to do with future Ambiguity Error about overlapping names when sorting? #21080

Closed
TomAugspurger opened this issue May 16, 2018 · 12 comments
@TomAugspurger
Contributor

TomAugspurger commented May 16, 2018

xref #17361

In [24]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([1, 2], name='a'))

In [25]: df.sort_values(['a', 'b'])
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: 'a' is both an index level and a column label.
Defaulting to column, but this will raise an ambiguity error in a future version
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3
Out[25]:
   a  b
a
1  1  3
2  2  4

What should the user do in this situation? Should we provide a keyword to disambiguate? A literal like pd.ColumnLabel('a') or pd.IndexName('a')? Or do we require that they rename an index or column?
Right now, they're essentially stuck with the last one. If we want to discourage that, then I suppose that's OK. But it's somewhat common to end up with overlapping names, from e.g. a groupby.
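
As a minimal sketch of the "rename an index or column" option, assuming only existing pandas calls (`rename_axis`, `sort_values`, `sort_index`): the name `a_idx` and the sample values are illustrative, not something proposed in this thread.

```python
import pandas as pd

# "a" is both the index name and a column label, as in the example above.
df = pd.DataFrame({"a": [2, 1], "b": [3, 4]}, index=pd.Index([10, 20], name="a"))

# Rename the index level (hypothetical name "a_idx") so that "a" can only
# refer to the column, then sort by the columns.
by_column = df.rename_axis("a_idx").sort_values(["a", "b"])

# Sorting by the index itself never goes through label lookup,
# so it is unaffected by the overlapping name.
by_index = df.sort_index()
```

Either way the caller states which "a" is meant, at the cost of one extra call.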

cc @jmmease

@jorisvandenbossche
Member

I think the original idea was that this would indeed not be allowed any more, so that you cannot sort using a name that is both an index level name and a column name.

@jorisvandenbossche
Member

This is also consistent with the idea of seeing an index as a "special column", and thus you have two "columns" with the same name, which currently also raises an error if you try to sort on that.
(But of course, that "idea" is currently not fully in practice, so it is not necessarily a good argument)

@jorisvandenbossche changed the title from "What to do with Ambiguity Error when sorting" to "What to do with future Ambiguity Error about overlapping names when sorting?" on May 16, 2018

@jonmmease
Contributor

My recollection aligns with @jorisvandenbossche's, that we were working towards treating index levels as 'special columns'. Perhaps we should start collecting some common use cases where the ambiguity arises. @TomAugspurger could you post an example of the groupby situation you referred to?

@TomAugspurger
Contributor Author

In [5]: def func(df):
   ...:     return df.assign(b=df.b - df.b.mean())
   ...:
   ...:

In [6]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

In [7]: df.groupby("a").apply(func)
Out[7]:
     a    b
a
1 0  1  0.0
2 1  2  0.0

@TomAugspurger
Contributor Author

TomAugspurger commented May 16, 2018

To be clear, I'm OK with going down this road of "indexes are a special type of column, so don't have overlapping names". Perhaps we'll use this issue to collect use-cases where they arise, and document how to avoid them?

e.g. the last one would be something like "use groupby('a', as_index=False)".
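
For the demeaning example specifically, one way to avoid the overlap altogether is `groupby(...).transform`, which returns a result aligned to the original rows instead of adding the group key to the index. This is a sketch of an alternative, not the approach suggested above; the sample data is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3.0, 5.0, 4.0]})

# transform("mean") returns one value per original row, so "a" stays a plain
# column and the index keeps its default (unnamed) form.
demeaned = df.assign(b=df["b"] - df.groupby("a")["b"].transform("mean"))

# No index level is called "a", so sorting by the column is unambiguous.
result = demeaned.sort_values(["a", "b"])
```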

@TomAugspurger
Contributor Author

Closing this, and we can improve docs as we discover problems.

@xieyuheng

It is not ambiguous when I specify axis=0.

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([1, 2], name='a'))

df.sort_values(['a', 'b'], axis=0)

But the warning still appears.

@VelizarVESSELINOV

@xieyuheng I agree with you that there should not be a warning if you specify axis=0, but pandas owners often have a special way of treating bugs and warnings.

@TomAugspurger
Contributor Author

@VelizarVESSELINOV what do you mean?

@xieyuheng I think that is ambiguous. It's not clear whether the a is referring to the index or the column.
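
A small sketch of why `axis=0` does not settle the question: with `axis=0`, the `by` argument of `sort_values` may name index levels as well as columns. The names `idx` and `b` are illustrative.

```python
import pandas as pd

# The index is named "idx"; there is no column with that name.
df = pd.DataFrame({"b": [3, 4]}, index=pd.Index([2, 1], name="idx"))

# With axis=0, `by` accepts an index level name, so the rows are sorted
# by the index values here.
sorted_by_level = df.sort_values("idx", axis=0)
```

Since both kinds of label are valid with `axis=0`, a name shared by an index level and a column stays ambiguous.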

@Vetrintsev

Another example of the error:

pd.DataFrame(
    [1,2,3,3,3,4,5,5,6,7,7,7,8,9], columns = ['a']
).groupby(
    'a',
#   as_index=False
).agg({
    'a':'count'
}).sort_values(
    'a'
)

Traceback:

ValueError: 'a' is both an index level and a column label, which is ambiguous.
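
One possible way around this particular case, sketched with `groupby(...).size()` in place of `agg({'a': 'count'})`: the result is a Series whose values carry a different name from the index, so nothing overlaps. The name `count` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 7, 8, 9]})

# size() counts rows per group; the index is named "a" and the values are
# given the (illustrative) name "count".
counts = df.groupby("a").size().rename("count").sort_values()

# reset_index() turns the group key back into an ordinary column if needed.
counts_frame = counts.reset_index()
```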

@ivoska

ivoska commented Feb 5, 2020

> To be clear, I'm OK with going down this road of "indexes are a special type of column, so don't have overlapping names". Perhaps we'll use this issue to collect use-cases where they arise, and document how to avoid them?
>
> e.g. the last one would be something like "use groupby('a', as_index=False)".

This negates the usefulness of the entire groupby. I would prefer if this resulted in the same groupby values being shown without the ambiguous name (maybe with the word 'values' for the index column?).

@dtdibaba

Renaming the index to a different name and using the new name as the index resolves a similar issue that arises with the use of pivot_table.

For example, for a DataFrame df that contains 'A', 'age', and 'gender', if you first group by 'A' and then want to create a pivot_table using 'A' as the index, Python raises TypeError: 'A' used as index and column.

The following worked for me.

Create a new variable and use it as index.

df["a"] = df.A
df.index.name = "a"
my_sum = df.pivot_table('age', index="a", columns='gender',
                        aggfunc="sum")
