Skip to content

Feature/groupby repr ellipses 1135 #24853

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

benjaminarjun
Copy link
Contributor

Currently one test is failing:

    def test_groups(self, df):
        grouped = df.groupby(['A'])
        groups = grouped.groups
        assert groups is grouped.groups  # caching works

I'm not sure what exactly this test is checking for. Is this a behavior that needs to be kept?

@codecov
Copy link

codecov bot commented Jan 21, 2019

Codecov Report

Merging #24853 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24853      +/-   ##
==========================================
- Coverage   91.98%   91.97%   -0.01%     
==========================================
  Files         175      175              
  Lines       52372    52375       +3     
==========================================
- Hits        48172    48171       -1     
- Misses       4200     4204       +4
Flag Coverage Δ
#multiple 90.52% <100%> (ø) ⬆️
#single 40.7% <75%> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 97.23% <ø> (ø) ⬆️
pandas/io/formats/printing.py 86.09% <ø> (+0.53%) ⬆️
pandas/core/frame.py 96.9% <ø> (-0.12%) ⬇️
pandas/core/groupby/grouper.py 98.18% <ø> (ø) ⬆️
pandas/core/groupby/ops.py 95.97% <ø> (ø) ⬆️
pandas/core/indexes/base.py 96.94% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/util/testing.py 90.61% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 48ea04f...9742473. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 21, 2019

Codecov Report

Merging #24853 into master will decrease coverage by 49.49%.
The diff coverage is 23.07%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24853      +/-   ##
==========================================
- Coverage   92.39%   42.89%   -49.5%     
==========================================
  Files         166      166              
  Lines       52391    52398       +7     
==========================================
- Hits        48407    22477   -25930     
- Misses       3984    29921   +25937
Flag Coverage Δ
#multiple ?
#single 42.89% <23.07%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 35.84% <ø> (-61.08%) ⬇️
pandas/io/formats/printing.py 65.15% <10%> (-20.27%) ⬇️
pandas/core/groupby/groupby.py 24.62% <66.66%> (-72.19%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
... and 126 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 01e7872...d6b310a. Read the comment docs.

@WillAyd WillAyd added Groupby Output-Formatting __repr__ of pandas objects, to_string labels Jan 21, 2019
@WillAyd
Copy link
Member

WillAyd commented Feb 6, 2019

@benarthur91 can you check failures on this one?

@benjaminarjun
Copy link
Contributor Author

benjaminarjun commented Feb 6, 2019

Pushing w/ failing test commented out - I don't understand the issue well, and want to confirm this test is the reason for red from Codecov before addressing.

def test_groups(self, df):
grouped = df.groupby(['A'])
groups = grouped.groups
assert groups is grouped.groups # caching works
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing on this line - I'm wondering what the value of this behavior is and/or whether there's interest in retaining it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I would think so; so the current approach is instantiating a new class on every access of .groups? That seems potentially expensive and counter-intuitive.

Is there a way to get the intended behavior without a new class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I'm aware of. groupsis currently a standard dict, whose __repr__ isn't abbreviated, even for large instances. Seems you'd have to override the __repr__ to get this behavior, and to do that you'd have to subclass dict. Maybe there's a better way I haven't thought of.

In response to instantiating a new class on every access, I could look into storing groups on the GroupBy object as an instance of the new class rather than a plain dict. Then .groups would just get the attribute rather than creating a new object every time it's called. I think that would resolve this case.

@WillAyd
Copy link
Member

WillAyd commented Feb 12, 2019

@benarthur91 any chance you can merge master and refactor to go another route? I'm personally -1 on current implementation due to instantiation of new object on property access as I think that's just a confusing API

@benjaminarjun
Copy link
Contributor Author

@WillAyd Definitely. Did you have a particular alternative in mind? I was thinking to make the underlying object an instance of the class, but if you have a thought I'd love to hear it.

@WillAyd
Copy link
Member

WillAyd commented Feb 12, 2019 via email

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add a whatsnew note (other api changes is ok) in 0.25

@@ -5290,6 +5291,27 @@ def _add_logical_methods_disabled(cls):
Index._add_comparison_methods()


class IndexGroupbyGroups(dict):
def __repr__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this just need to call pandas.io.formats.printing.pprint_thing (with the option passed)

@@ -5290,6 +5291,27 @@ def _add_logical_methods_disabled(cls):
Index._add_comparison_methods()


class IndexGroupbyGroups(dict):
def __repr__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

give a class doc-string here

@benjaminarjun
Copy link
Contributor Author

@jreback I've updated to use pprint_thing. I believe the desired output is something like

{1: Int64Index([0, 1, 2], dtype='int64'), ... , 3: Int64Index([5], dtype='int64')}

but with the use of this function I get:

{1: [0, 1, ...], 2: [3, 4], ...}

However this abbreviates at the end of the object rather than the middle (not consistent with DataFrame's repr) and the pprint behavior is applied recursively, so reprs of the object's values are oversimplified. I attempted to address these issues on commit 19bb9bf, however this required changes to the pprint_thing definition.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

can you merge master and see if you can get this passing

@@ -5274,6 +5276,14 @@ def _add_logical_methods_disabled(cls):
Index._add_comparison_methods()


class IndexGroupbyGroups(dict):
"""Dict extension to support abbreviated __repr__"""
from pandas.io.formats.printing import pprint_thing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be imported at the top?

@@ -5274,6 +5276,14 @@ def _add_logical_methods_disabled(cls):
Index._add_comparison_methods()


class IndexGroupbyGroups(dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be defined in pandas/core/groupby/groupby.py and used on the dict returning functions, mainly groups and indices

this is not user facing function and kind of obscures that this is from groupby

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, maybe best to actually put this in pandas.io.format.printing and name it PrettyDict(dict), see my other notes for where to actually use this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll move this. To your first point, are you saying a groups property should do something like this: return PrettyDict(self.grouper.groups)? @WillAyd mentioned had noted that he'd prefer to put this in a different place, as multiple calls to the method will create a new object every time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes; thiat would be a centralized place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The topic initially came up because instantiating in the groups method was failing this test:

def test_groups(self, df):
        grouped = df.groupby(['A'])
        groups = grouped.groups
        assert groups is grouped.groups  # caching works

@WillAyd Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @jreback says here makes a lot of sense - go ahead with the move!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll remove this test as part of the next commit.

'b': [1, 2, 3, 4, 5, 6]
})

with option_context('display.max_rows', 2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also try a grouper like np.array(df.a) which hits a different path

@@ -1761,6 +1761,22 @@ def test_period(self):
assert str(df) == exp


class TestDataFrameGroupByFormatting(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this goes in pandas/tests/groupby/test_grouping.py near the other repr tests

- The ``arg`` argument in :meth:`pandas.core.groupby.DataFrameGroupBy.agg` has been renamed to ``func`` (:issue:`26089`)
- :meth:`Index.groupby` and dependent methods (notably :attr:`GroupBy.groups`) now return object with abbreviated repr (:issue:`1135`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a user facing message; Index.groupby is not really a public method, while Groupby.groups is the user facing; pls make this a bit more clear.

@@ -5274,6 +5276,14 @@ def _add_logical_methods_disabled(cls):
Index._add_comparison_methods()


class IndexGroupbyGroups(dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, maybe best to actually put this in pandas.io.format.printing and name it PrettyDict(dict), see my other notes for where to actually use this.

@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

can you merge master and update

@jreback
Copy link
Contributor

jreback commented Jul 11, 2019

closing as stale, but this is pretty close if you or @pandas-dev/pandas-core would like to finish up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

df.groupby('key').groups printed all: problem with large arrays
3 participants