Skip to content

REF: NDFrame describe #36833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 12 commits into from
Closed

Conversation

ivanovmg
Copy link
Member

@ivanovmg ivanovmg commented Oct 3, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Refactor NDFrame.describe method.

  • Create module pandas/io/formats/describe.py
  • Delegate NDFrame.describe to function describe_ndframe in the new module
  • Implement polymorphism for describing series and dataframe
  • Implement strategy pattern for describing series of different kinds (categorical, numeric, timestamp)

Benefits:

  • Reduce complexity in pandas/core/generic.py
  • Straightforward logic how to treat each datatype (reduced if/elif/else workflow)
  • Enable potential for further extension

@ivanovmg ivanovmg added Output-Formatting __repr__ of pandas objects, to_string Refactor Internal refactoring of code labels Oct 3, 2020


def describe_ndframe(
*,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the * needed here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to enforce keyword-only arguments, to avoid any confusion with positional arguments.

is_timedelta64_dtype,
)

import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid importing pd?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made concrete imports for Series and concat. Unfortunately, for Series I had to use local import right in place, where Series is needed, because of the issue with circular imports.

@ivanovmg
Copy link
Member Author

ivanovmg commented Oct 6, 2020

Also, I added test for checking that if include is 'all' then setting exclude will raise ValueError. This case was not covered.

@ivanovmg ivanovmg requested a review from jbrockmendel October 7, 2020 11:14
@jbrockmendel
Copy link
Member

I like the test, otherwise want to encourage you to look at non-refactor PRs.

A few ideas of things that would be really helpful that I don't think anyone is actively working on:

  1. In tests I see a bunch of warnings from matplotlib: figure out if these mean we're doing something wrong, or if we can suppress these.
    1.5) Same thing with closed node warnings in the pytables tests
  2. There are a bunch of methods like Series.shift that just call the NDFrame method but are put in Series so that we can tinker with the docstring. Can that be avoided? (Same for DataFrame, though i think there are fewer of those)
  3. Which _shared_docs actually need to be shared? Are there any that aren't shared that can be de-duplicated?
  4. try to implement ExtensionArray._value_counts in terms of factorize (?)

Come to think of it, a while back there was a push to refactor some of the tests to method-specific files (tests.frame.methods, tests.series.methods). We got a lot of that done, but there's always more to do there.

For that matter, in tests.arithmetic there are a ton of near-duplicate tests that need to be de-duplicated/parametrized (search for TODO comments)

@ivanovmg
Copy link
Member Author

ivanovmg commented Oct 8, 2020

I like the test, otherwise want to encourage you to look at non-refactor PRs.

A few ideas of things that would be really helpful that I don't think anyone is actively working on:

1. In tests I see a bunch of warnings from matplotlib: figure out if these mean we're doing something wrong, or if we can suppress these.
   1.5) Same thing with closed node warnings in the pytables tests

2. There are a bunch of methods like `Series.shift` that just call the NDFrame method but are put in Series so that we can tinker with the docstring.  Can that be avoided?  (Same for DataFrame, though i think there are fewer of those)

3. Which _shared_docs actually need to be shared?  Are there any that aren't shared that can be de-duplicated?

4. try to implement ExtensionArray._value_counts in terms of factorize (?)

Come to think of it, a while back there was a push to refactor some of the tests to method-specific files (tests.frame.methods, tests.series.methods). We got a lot of that done, but there's always more to do there.

For that matter, in tests.arithmetic there are a ton of near-duplicate tests that need to be de-duplicated/parametrized (search for TODO comments)

The reason I refactor is that first I would like to figure out what the code does. It hasn't been long since I first opened pandas source code. I guess that if it is readable it will help the team maintain the code and implement new features.

Since you mentioned the possible areas where the contribution is welcome, I will certainly look at that. Thank you!

I am also looking forward to other comments regarding the present PR.

@pep8speaks
Copy link

pep8speaks commented Nov 12, 2020

Hello @ivanovmg! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-12 17:24:43 UTC

@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

not averse to simplifying this, but can you instead do a pre-cursor PR which moves code with NO changes. then little by little make changes. otherwise these are very hard to review.

closing this PR. pls issue a new one.

@jreback jreback closed this Dec 29, 2020
@ivanovmg ivanovmg deleted the refactor/describe branch January 11, 2021 12:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants