-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Rename index when using DataFrame.reset_index #42346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gcaria thank you for the PR!
Please take a look at the review comments below.
|
||
names = ["first", "second"] | ||
stacked.index.names = names | ||
deleveled = stacked.reset_index() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose that you do not even need deleveled
here. Indeed, you do not use names
kwarg here. As for assertions, then you probably can do assertions against stacked
instead of deleveled
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I was trying to test the new renaming feature, by taking advantage of the standard (before this PR) implementation of reset_index
, where the level names are simply copied over. So stacked
is the starting point from which reset_index
creates two DataFrames, which should have exactly the same added columns, except for the names.
I notice now that setting stacked.index.names
is not relevant, and could/should be skipped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ivanovmg for your feedback!
I've added some comments that I hope help clarifying the intent of these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello! A couple of follow-up comments.
93163f0
to
2fca991
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I see the problem with that else
statement, you're absolutely right!
By the way, could you point me to where the whatsnew
file is so that I can edit it?
Whatnew file is here: |
dd538e4
to
8b41541
Compare
3897758
to
aa7cb2a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jreback for your feedback. I've added a few checks where errors are raised.
For now I've set that names
has to be a tuple or list for a MultiIndex
, but technically any sequence can be accepted. Accepting a dictionary like {'old' : 'new'}
would be useful to rename the levels without worrying about their order.
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
1705967
to
fcc5db2
Compare
Sorry for the long pause, I've finally added the |
de49b85
to
cb5249c
Compare
@gcaria - this looks almost there Please can you get the code-checks passing - see the contributing guide
Looks like these changes are also causing some test failures - see Azure Pipelines.
Can you also merge master and address any remaining comments above Thanks! |
928393d
to
fc1ee7a
Compare
fc1ee7a
to
6f4cd27
Compare
6f4cd27
to
437a9b2
Compare
437a9b2
to
96f52fb
Compare
def get_default_index_names(self, df: DataFrame, names: str = None): | ||
|
||
if names is not None and not isinstance(names, str): | ||
raise ValueError("Names must be a string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or list/tuple of strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah use is_hashable
here to be consistent with other cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we likley have tests that hit this..but
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few mypy failures - otherwise looks almost there!
pandas/core/indexes/base.py:1562: error: Variable "pandas.core.indexes.base.Index.str" is not valid as a type [valid-type]
pandas/core/indexes/base.py:1562: note: See https://mypy.readthedocs.io/en/stable/common_issues.html#variables-vs-type-aliases
pandas/core/frame.py:5803: error: Value of type "Union[Hashable, Sequence[Hashable]]" is not indexable [index]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments from my side.
pandas/core/indexes/base.py
Outdated
@@ -1559,6 +1559,18 @@ def _validate_names( | |||
|
|||
return new_names | |||
|
|||
def get_default_index_names(self, df: DataFrame, names: str = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you type the output?
pandas/core/indexes/multi.py
Outdated
@@ -1391,6 +1391,25 @@ def format( | |||
# -------------------------------------------------------------------- | |||
# Names Methods | |||
|
|||
def get_default_index_names(self, names=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you type?
pandas/core/indexes/multi.py
Outdated
if not names: | ||
names = [ | ||
(n if n is not None else f"level_{i}") for i, n in enumerate(self.names) | ||
] | ||
else: | ||
if len(names) != self.nlevels: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can consider returning right away and getting rid of else statement, which will simplify this code:
if not names:
return [...]
if len(names) != self.nlevels:
raise ValueError
return names
def get_default_index_names(self, df: DataFrame, names: str = None): | ||
|
||
if names is not None and not isinstance(names, str): | ||
raise ValueError("Names must be a string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah use is_hashable
here to be consistent with other cases
def get_default_index_names(self, df: DataFrame, names: str = None): | ||
|
||
if names is not None and not isinstance(names, str): | ||
raise ValueError("Names must be a string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we likley have tests that hit this..but
(n if n is not None else f"level_{i}") | ||
for i, n in enumerate(self.index.names) | ||
] | ||
names = self.index.get_default_index_names(names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this must not be hit in tests (as you are incorrectly passing self)
if names is not None and not isinstance(names, str): | ||
raise ValueError("Names must be a string") | ||
|
||
default = "index" if "index" not in df else "level_0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead test if .index._is_multi or not
@@ -1559,6 +1559,21 @@ def _validate_names( | |||
|
|||
return new_names | |||
|
|||
def get_default_index_names( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this must be the same signature as the multi-index one
@@ -1559,6 +1559,21 @@ def _validate_names( | |||
|
|||
return new_names | |||
|
|||
def get_default_index_names( | |||
self, df: DataFrame, names: str = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shouldn't be passed the dataframe at all
@@ -1391,6 +1391,26 @@ def format( | |||
# -------------------------------------------------------------------- | |||
# Names Methods | |||
|
|||
def get_default_index_names( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a private method
this was looking good, if you'd merge master and address comments |
@gcaria - do you still want to work on this? See comment above from Jeff |
Unfortunately I don't have time to work on this anymore, but all the work is almost done. |
Thanks for the update. If you happen to find time to work on this PR in the future, or if anyone else wants to finish up this feature, we can reopen. |
Probably the hardest part of this PR is choosing a good name for the new argument. As mentioned in #6878 the
name
keyword is already used inSeries.reset_index
, although to rename the data column. It could be argued that usingname
also in this PR would not cause confusion, since it's hard to assume one can rename a column's name inDataFrame.reset_index
(which column would it be?).Anyway, I've opted for
names
. I've also thought ofrename
but that sounds like asking for a boolean.new_name
is an option too, although a bit ugly.