-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add allow_duplicates to MultiIndex.to_frame #45318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/indexes/multi.py
Outdated
@@ -1774,14 +1784,20 @@ def to_frame(self, index: bool = True, name=lib.no_default) -> DataFrame: | |||
else: | |||
idx_names = self.names | |||
|
|||
idx_names = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm why are you repeating L1785?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is doing a transform: filling in None names with the level number.
Whether that is the right thing to do is another issue. I am just preserving the existing behavior,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then this needs another argument similar to how this is done in .reset_index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not changing anything here. The old code is https://github.com/johnzangwill/pandas/blob/6cc5584bba59ef8f06d4dc901dc39ddd08d1519f/pandas/core/indexes/multi.py#L1780:
(level if lvlname is None else lvlname): self._get_level_values(level)
and I have just moved that logic earlier, since I need unique dictionary indexes.
In any case, insert and reset_index do this differently, replacing None level labels with level_n. As I say, that is a separate issue and I have raised it elsewhere (#45245), but is is not the subject of this PR,
I don't think that this is conditional in reset_index or that there is an argument for it. Which argument are you referring to?
This is the code in reset_index:
if isinstance(self.index, MultiIndex):
names = com.fill_missing_names(self.index.names)
to_insert = zip(self.index.levels, self.index.codes)
else:
default = "index" if "index" not in self else "level_0"
names = [default] if self.index.name is None else [self.index.name]
to_insert = ((self.index, None),)
that puts in "level_n" for multi-index and "index" or "level_0" for simple index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok can you just make a method on Index then to do this, repeating this code is not great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have searched Pandas and I cannot find any other instance of this. The nearest is
pandas/pandas/core/indexes/base.py
Line 1631 in 4e034ec
name = self.name or 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i think a common method on index is worth it here (to share here & reset_index)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but I have already explained that reset_index
does not do this.
reset_index
, to_records
and many other methods all fill the None entries with "level_n", not with "n". As you know, I factored those out into a common method (com.fill_missing_names
#44878) which is invoked in 6 different places.
These MI/Index.to_frame
methods are the only ones which do it differently, filling the gaps with the column number. This difference could be discussed, and I have made an issue (#45245), but I don't suggest changing it without a lot of thought. Changing to_frame
would break virtually all its tests.
pandas/core/indexes/multi.py
Outdated
@@ -1774,14 +1784,20 @@ def to_frame(self, index: bool = True, name=lib.no_default) -> DataFrame: | |||
else: | |||
idx_names = self.names | |||
|
|||
idx_names = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i think a common method on index is worth it here (to share here & reset_index)
doc/source/whatsnew/v1.4.0.rst
Outdated
@@ -227,6 +227,7 @@ Other enhancements | |||
- Add support for `Zstandard <http://facebook.github.io/zstd/>`_ compression to :meth:`DataFrame.to_pickle`/:meth:`read_pickle` and friends (:issue:`43925`) | |||
- :meth:`DataFrame.to_sql` now returns an ``int`` of the number of written rows (:issue:`23998`) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-01-20 11:35:58 UTC |
I have factored out a private Index._make_labels. It is just used in Index and MI to_frame (see my note above #45318 (comment)) A tiny change of behavior would be e.g. Index([0], name="") which would have been changed to 0 before. Now it is left as "" because I am looking for explicit None, rather than just Falsey. The tests all seem to pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small change, ping on green.
pandas/core/indexes/base.py
Outdated
@@ -1353,6 +1353,18 @@ def _format_attrs(self) -> list[tuple[str_t, str_t | int | bool | None]]: | |||
attrs.append(("length", len(self))) | |||
return attrs | |||
|
|||
@final | |||
def _make_labels(self) -> Hashable | Sequence[Hashable]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change to _get_level_names
to be consistent with naming.
This is as Green as CI allows... |
thanks @johnzangwill |
Part of issue #45245 for multi-indexes with repeated level names.
Old behavior was to remove columns with repeated level names:
New behavior is to raise, unless
allow_duplicates
isTrue
:Note: The other issues in #45245