REF: pandas/io/formats/format.py #36434


Merged: 75 commits merged into pandas-dev:master on Sep 19, 2020

Conversation

ivanovmg (Member):

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Partially addresses #36407

Before splitting DataFrameFormatter into more dedicated classes, I decided to refactor the class itself, to make the outstanding refactoring more manageable.

As suggested by @jreback, I kept this refactor small, trying to split big functions into smaller ones and find better naming.
So far only fairly small changes have been made, to keep the diffs readable.
Once approved, I will move on to further refactoring.

pep8speaks commented Sep 17, 2020:

Hello @ivanovmg! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-19 18:09:57 UTC

return strcols

if is_list_like(self.header):
assert isinstance(self.header, list)
Member:

is_list_like is not the same as isinstance(..., list). can you revert
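To illustrate the reviewer's point, here is a minimal sketch (the helper below is an invented, simplified stand-in for pandas.api.types.is_list_like, not pandas' actual logic) showing why the two checks are not interchangeable:

```python
from collections.abc import Iterable

def is_list_like_simplified(obj) -> bool:
    # Simplified stand-in (an assumption, not pandas' exact logic) for
    # pandas.api.types.is_list_like: any iterable except strings/bytes.
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))

# A tuple is list-like but is not a list, so asserting
# isinstance(..., list) after an is_list_like check can fail:
assert is_list_like_simplified((1, 2, 3))
assert not isinstance((1, 2, 3), list)
assert not is_list_like_simplified("abc")
```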

ivanovmg (Member Author):

Good catch. Thank you


self.bold_rows = bold_rows
self.escape = escape
if hasattr(self, "_max_cols_fitted"):
simonjayhawkins (Member), Sep 19, 2020:

hasattr checks don't play well with mypy.

max_cols_fitted is called from is_truncated_horizontally, which is quite prevalent.

Optimisation here may be premature, but I think avoiding hasattr checks is generally preferable when the attribute is accessed often.

class MyClass:
    pass

cls = MyClass()

%timeit hasattr(cls, "_max_cols_fitted")
# 120 ns ± 8.08 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

class MyClass:
    _max_cols_fitted = None

cls = MyClass()

%timeit cls._max_cols_fitted is not None
# 85.5 ns ± 1.94 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

what is the reason for changing max_rows_adj to a property?

ivanovmg (Member Author):

The reason for turning max_rows_adj/max_rows_fitted into a property was to place the computation inside a single function. Before that, it was scattered across other methods, and it was difficult to understand how the calculation was actually carried out. The present implementation makes it clearer.
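The consolidation described above can be sketched roughly like this (a hypothetical, minimal illustration with invented names, not the actual pandas code):

```python
class FrameFormatterSketch:
    # Hypothetical sketch of the idea above: the whole adjustment
    # lives in one property instead of being scattered across methods.
    def __init__(self, max_rows, n_rows):
        self.max_rows = max_rows  # user setting, may be None
        self.n_rows = n_rows      # actual number of rows in the frame

    @property
    def max_rows_fitted(self):
        # All of the calculation is in this one place.
        if self.max_rows is None:
            return self.n_rows
        return min(self.max_rows, self.n_rows)

fmt = FrameFormatterSketch(max_rows=5, n_rows=100)
assert fmt.max_rows_fitted == 5
```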

Before the hasattr check, I used a try/except AttributeError check. But if I understood correctly, @jreback suggested reverting to the hasattr check, which was there originally.

Member:

having the separate function is good. maybe _calc_max_cols_fitted and just call it once and assign to the attribute like before.

> Before the hasattr check, I used a try/except AttributeError check. But if I understood correctly, @jreback suggested reverting to the hasattr check, which was there originally.

it was only called once before.

ivanovmg (Member Author):

If performance is a concern, we can consider the cached_property decorator or its equivalent written for Python 3.7. I can do this if required.

ivanovmg (Member Author):

> having the separate function is good. maybe _calc_max_cols_fitted and just call it once and assign to the attribute like before.
>
> it was only called once before.

This is fine with me. I will change as you suggest.

Member:

> If performance is a concern

it's not. just suggesting avoiding certain patterns.

From python/mypy#1424 (comment):

> In general i despise hasattr() checks; it's usually much better to add a class level initialization to None and check for that.
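The None-sentinel pattern that quote recommends looks roughly like this (names are illustrative, not the actual pandas attributes; the placeholder return value is invented):

```python
from typing import Optional

class Formatter:
    # Class-level None sentinel instead of a hasattr check; mypy can
    # declare the attribute as Optional[int] and narrow it after the
    # `is None` test, which hasattr does not allow.
    _max_cols_fitted: Optional[int] = None

    def max_cols_fitted(self) -> int:
        if self._max_cols_fitted is None:
            self._max_cols_fitted = self._calc_max_cols_fitted()
        return self._max_cols_fitted

    def _calc_max_cols_fitted(self) -> int:
        return 20  # placeholder for the real computation

f = Formatter()
assert f.max_cols_fitted() == 20
```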

Member:

> we can consider the cached_property decorator or its equivalent written for Python 3.7. I can do this if required.

This is not in the standard library until 3.8?

ivanovmg (Member Author):

It is in the standard library starting from 3.8; before that there wasn't one. So I was thinking it could be reasonable to implement the cached property ourselves where the Python version is < 3.8.
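A backport like the one discussed can be sketched as a non-data descriptor (a minimal, assumed implementation for illustration; functools.cached_property in 3.8+ also handles locking and __set_name__, which this sketch omits):

```python
class cached_property:
    # Minimal sketch of a functools.cached_property backport for
    # Python 3.7. Being a non-data descriptor (no __set_* methods),
    # the cached value in the instance __dict__ shadows it after
    # the first access, so the function runs only once.
    def __init__(self, func):
        self.func = func
        self.attrname = func.__name__

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        value = self.func(instance)
        instance.__dict__[self.attrname] = value
        return value

class Demo:
    calls = 0

    @cached_property
    def expensive(self):
        Demo.calls += 1
        return 42

d = Demo()
assert d.expensive == 42
assert d.expensive == 42
assert Demo.calls == 1  # computed only once
```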

@ivanovmg ivanovmg requested a review from jreback September 19, 2020 18:14
This allows one to avoid the multiple property calls that were there originally.
jreback (Contributor) commented Sep 19, 2020:

we already have a cached property decorator

cache_readonly

though only use it if we are actually doing something non trivial

simonjayhawkins (Member):

yeah, cache_readonly is in cython without a stub file, so using it here would undo all the typing benefits since the properties would resolve to Any.

jreback (Contributor) left a comment:

lgtm. ping on green. nice improvement. happy to have the additional split into separate classes as you mentioned.

"""
Render a DataFrame to a list of columns (as lists of strings).
def _truncate_horizontally(self) -> None:
"""Remove columns, which are not to be displayed and adjust formatters.
Contributor:

wouldn't be against making this a non-mutating function and instead just returning the new values (and setting them where this is called), but you did explain, so ok.

@jreback jreback added this to the 1.2 milestone Sep 19, 2020
@jreback jreback merged commit 2705dd6 into pandas-dev:master Sep 19, 2020
jreback (Contributor) commented Sep 19, 2020:

thanks @ivanovmg keep em coming!

x
for x in range(self.frame.shape[0])
if x < row_num or x >= len(self.frame) - row_num
]
Member:

I suppose this is the reason for the slowdown (#36636). In the original code, we do 2 slices + a concat. What was the motivation for this (slow) for loop? To select the subset in a single go?

ivanovmg (Member Author):

Right, it was supposed to be a more readable way to extract the necessary rows. But apparently I never considered cases with a large number of rows. Will fix it.

- tr_row_num
"""
assert self.max_rows_fitted is not None
row_num = self.max_rows_fitted // 2
Member:

BTW, I find the max_rows_fitted naming (instead of max_rows_adjusted, as it was before) a bit confusing: the "fitting" seems to come from "fit on the screen", but the eventual max_rows is not necessarily what fits on the screen; that depends on your settings.

ivanovmg (Member Author):

OK. Will change in the upcoming PRs.

Labels
Output-Formatting __repr__ of pandas objects, to_string Refactor Internal refactoring of code

6 participants