Skip to content

ENH: Add allow_duplicates to MultiIndex.to_frame #45318

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jan 22, 2022

Conversation

johnzangwill
Copy link
Contributor

@johnzangwill johnzangwill commented Jan 11, 2022

  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

Part of issue #45245 for multi-indexes with repeated level names.

Old behavior was to remove columns with repeated level names:

import pandas as pd
pd.MultiIndex.from_tuples([[1, 2]], names=["a", "a"]).to_frame()
     a
a a   
1 2  2

New behavior is to raise, unless allow_duplicates is True:

pd.MultiIndex.from_tuples([[1, 2]], names=["a", "a"]).to_frame(allow_duplicates=True)
     a  a
a a      
1 2  1  2

Note: The other issues in #45245

  1. Missing column names. This has been left as it was: the name is filled with the level number, n. This is different from many other methods that fill with level_n, but obviously needs discussing.
  2. bool values: might be fixed by ENH: Index[bool] #45061

@@ -1774,14 +1784,20 @@ def to_frame(self, index: bool = True, name=lib.no_default) -> DataFrame:
else:
idx_names = self.names

idx_names = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm why are you repeating L1785?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is doing a transform: filling in None names with the level number.

Whether that is the right thing to do is another issue. I am just preserving the existing behavior,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then this needs another argument similar to how this is done in .reset_index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not changing anything here. The old code is https://github.com/johnzangwill/pandas/blob/6cc5584bba59ef8f06d4dc901dc39ddd08d1519f/pandas/core/indexes/multi.py#L1780:

(level if lvlname is None else lvlname): self._get_level_values(level)

and I have just moved that logic earlier, since I need unique dictionary indexes.

In any case, insert and reset_index do this differently, replacing None level labels with level_n. As I say, that is a separate issue and I have raised it elsewhere (#45245), but is is not the subject of this PR,

I don't think that this is conditional in reset_index or that there is an argument for it. Which argument are you referring to?

This is the code in reset_index:

            if isinstance(self.index, MultiIndex):
                names = com.fill_missing_names(self.index.names)
                to_insert = zip(self.index.levels, self.index.codes)
            else:
                default = "index" if "index" not in self else "level_0"
                names = [default] if self.index.name is None else [self.index.name]
                to_insert = ((self.index, None),)

that puts in "level_n" for multi-index and "index" or "level_0" for simple index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok can you just make a method on Index then to do this, repeating this code is not great

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have searched Pandas and I cannot find any other instance of this. The nearest is

name = self.name or 0
which does implement the policy (on self.name, not self.names). I can factor that down if you think that it is worth it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i think a common method on index is worth it here (to share here & reset_index)

Copy link
Contributor Author

@johnzangwill johnzangwill Jan 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but I have already explained that reset_index does not do this.

reset_index, to_records and many other methods all fill the None entries with "level_n", not with "n". As you know, I factored those out into a common method (com.fill_missing_names #44878) which is invoked in 6 different places.

These MI/Index.to_frame methods are the only ones which do it differently, filling the gaps with the column number. This difference could be discussed, and I have made an issue (#45245), but I don't suggest changing it without a lot of thought. Changing to_frame would break virtually all its tests.

@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 16, 2022
@johnzangwill johnzangwill requested a review from jreback January 16, 2022 19:46
@@ -1774,14 +1784,20 @@ def to_frame(self, index: bool = True, name=lib.no_default) -> DataFrame:
else:
idx_names = self.names

idx_names = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i think a common method on index is worth it here (to share here & reset_index)

@@ -227,6 +227,7 @@ Other enhancements
- Add support for `Zstandard <http://facebook.github.io/zstd/>`_ compression to :meth:`DataFrame.to_pickle`/:meth:`read_pickle` and friends (:issue:`43925`)
- :meth:`DataFrame.to_sql` now returns an ``int`` of the number of written rows (:issue:`23998`)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@johnzangwill johnzangwill requested a review from jreback January 17, 2022 15:31
@pep8speaks
Copy link

pep8speaks commented Jan 17, 2022

Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-01-20 11:35:58 UTC

@johnzangwill
Copy link
Contributor Author

johnzangwill commented Jan 17, 2022

I have factored out a private Index._make_labels. It is just used in Index and MI to_frame (see my note above #45318 (comment))

A tiny change of behavior would be e.g. Index([0], name="") which would have been changed to 0 before. Now it is left as "" because I am looking for explicit None, rather than just Falsey.

The tests all seem to pass.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small change, ping on green.

@@ -1353,6 +1353,18 @@ def _format_attrs(self) -> list[tuple[str_t, str_t | int | bool | None]]:
attrs.append(("length", len(self)))
return attrs

@final
def _make_labels(self) -> Hashable | Sequence[Hashable]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change to _get_level_names to be consistent with naming.

@johnzangwill
Copy link
Contributor Author

This is as Green as CI allows...
@jreback Please see my note (#44755 (comment)) . I think that you will find that my solution there is not so bad as you first thought.

@johnzangwill johnzangwill requested a review from jreback January 20, 2022 14:24
@jreback jreback added this to the 1.5 milestone Jan 22, 2022
@jreback jreback merged commit 7dbfe9f into pandas-dev:main Jan 22, 2022
@jreback
Copy link
Contributor

jreback commented Jan 22, 2022

thanks @johnzangwill

@johnzangwill johnzangwill deleted the MultiIndex-to_frame branch January 22, 2022 10:30
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants