-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add custom descriptors (such as dtype, nunique, etc.) to Styler output #43894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ndex_hiding # Conflicts: # pandas/tests/io/formats/style/test_to_latex.py
…ding # Conflicts: # pandas/io/formats/style_render.py # pandas/tests/io/formats/style/test_html.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't use styler much, but if the community has an interest in this feature I think it's worthwhile
@jreback what you think here? OK to merge here on green? I can extend the docs as an add on? |
no this still has questions |
@bashtage I believe the questions jeff is referring to are yours, do you have time to review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a lot more useful now. The biggest question in this review is what the more API should look like. It seems to be that allowing either Sequence[str | Callable]
or dict[str, str | Callable]
is more consistent with other parts of pandas that are flexible. I can't think of any cases where tuple(str, str | Callable)
would be used to get a string name and it's associated argument.
pandas/io/formats/style_render.py
Outdated
@@ -124,6 +128,7 @@ def __init__( | |||
self.hide_columns_: list = [False] * self.columns.nlevels | |||
self.hidden_rows: Sequence[int] = [] # sequence for specific hidden rows/cols | |||
self.hidden_columns: Sequence[int] = [] | |||
self.descriptors: list[str | Callable | tuple[str, Callable]] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a common API? I think most places where alternative input are allowed either use a Sequence of some sort or dict[str, str | Callable]
. Would be be cleaner if the allowed inputs were Sequence[str | Callable] | dict[str | Callable]
.
Also, can't the Callable be better described? Isn't is
Callable[[Series], float]? It seems that the
Callablemust take a
Seriesand return some sort of scalar value. Thie question is whether the scalar value can be non-numeric. If it could be any numeric value, than you could you use something like
Callable[[Series], int | float]. So more generally, what are the requirements for the
Callable`?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can't the
Callable be better described? Isn't is
Callable[[Series], float]? It seems that the
Callablemust take a
Seriesand return some sort of scalar value. Thie question is whether the scalar value can be non-numeric. If it could be any numeric value, than you could you use something like
Callable[[Series], int | float]. So more generally, what are the requirements for the
Callable`?
The Callable could techincally return anything so long as the returned object has a __str__
representation to populate in the HTML element. Common values might be int, float, string, but yes it does accept only series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the sig should then be Callable[[Series],Any]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- changed typing. But did not add the dict.
List[Tuple[str, str]]
is present for some of theStyler
functions
pandas/io/formats/style_render.py
Outdated
elif isinstance(descriptor, tuple): | ||
name, func = descriptor[0], descriptor[1] | ||
else: | ||
name, func = None, descriptor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the descriptor is a Callable
I think it would be best to try getattr(descriptor, "name", None)
for the name
rather than committing to a blank name. For example pd.Series.mean.__name__
is mean
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- chged. note the case of
"<lambda>"
i thought was worth blanking
pandas/io/formats/style_render.py
Outdated
@@ -124,6 +128,7 @@ def __init__( | |||
self.hide_columns_: list = [False] * self.columns.nlevels | |||
self.hidden_rows: Sequence[int] = [] # sequence for specific hidden rows/cols | |||
self.hidden_columns: Sequence[int] = [] | |||
self.descriptors: list[str | Callable | tuple[str, Callable]] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do stick with the tuple, it should be tuple(str, str | Callable)
which would allow someone to pass a custom name for the output along with a common function, e.g., ("average", "mean"). If moving to a dict, then both
{"average":"mean"}and
{"average":Series.mean}` should be allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, agreed, having a think if this can be improved.
I am still -1 here. This is basically completely changing the way things work. We already have .describe() to do exactly this. If you want to add I don't have a good recommendation for how to reconcile this. I support if allow a 'header' table would be more ameanable. This just feels tacked on with a lot of bespoke things going on. |
I think I get what you are saying. I think generalizing What I don't think should been done is have these generalized functionalities moved to |
@bashtage @jreback @mroeschke, this PR stalled due to difficulty in finding a way forward. I have tried to refactor it addressing some concerns above in an alternate #46105. If you have time please comment on the alternative. |
I'd be interested in hearing a more detailed response to the opposition of this. The styler is an end use feature, so I'm not sure I understand the issue with there being data descriptors in the header or in a separate footer. I understand and agree to the issue with duplicating computation logic but not with adding extra rows to the header. Again it's end use, where further computation isnt the primary purpose. |
api considerations are always paramount - the goal is consistency and simplicity will look soon |
There is a Separation of Concerns argument here: 1) the part to generate the descriptor data & combine with the original data, 2) the part to stylize the result. Without this feature today it's maybe roughly equivalent to (?):
Maybe there should be a better way to do 1 (that's why |
@mroeschke solution, with the concatenation, might work but it then becomes harder to index the original data, export the styler for more reuseable styling, and apply the builtin formatters since you then have to exclude the concatenated rows from your subset. I have used his solution before for a "Total" row and it is cumbersome but worked. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small refactorings to make it simpler, and perhaps more palatable in the long run.
Parameters | ||
---------- | ||
descriptors : list of str, callables or 2-tuples of str and callable | ||
If a string is given must be a valid Series method, e.g. "mean" invokes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you some additional details about the callable. Specifically it must (or should) return a scalar quantity. Is the callable allowed to throw and exception? If so, what happens? I would be good to clarify this.
... return s.mean() | ||
>>> styler = df.style.set_descriptors([ | ||
... "mean", | ||
... Series.mean, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The screen show shows no lable on this line? Was it not possible to have .__name__
as a default test?
@@ -124,6 +133,7 @@ def __init__( | |||
self.hide_columns_: list = [False] * self.columns.nlevels | |||
self.hidden_rows: Sequence[int] = [] # sequence for specific hidden rows/cols | |||
self.hidden_columns: Sequence[int] = [] | |||
self.descriptors: list[Descriptor | tuple[str, Descriptor]] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not make simplify this to self.descriptors: list[tuple[str, Descriptor]]
? Also, should it immutable, so that it would be self.descriptors: tuple[tuple[str, Descriptor], ...]
?
Essentially if the user doesn't provide a name, then the name used internally, even if blank, is appended so that the tuple form is the only one that would ever need to be used.
@@ -477,6 +494,108 @@ def _generate_col_header_row(self, iter: tuple, max_cols: int, col_lengths: dict | |||
|
|||
return index_blanks + column_name + column_headers | |||
|
|||
def _generate_descriptor_row(self, iter: tuple, max_cols: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of tuples are allowed? Would be better to be as specific as possible to reduce any future refactor risks.
else: | ||
func = descriptor[1] | ||
else: | ||
name, func = getattr(descriptor, "__name__", None), descriptor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should name be None
of just ""
? Is there any utility is tracking None
? If name is ""
then you can simplify the type above to be str
rather than str | None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps just use self.css["blank_value"]
when missing and simplify below.
func = descriptor[1] | ||
else: | ||
name, func = getattr(descriptor, "__name__", None), descriptor | ||
name = None if name == "<lambda>" else name # blank nameless functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Replace nameless -> anonymous in comment
) | ||
|
||
base_css = f"{self.css['descriptor_name']} {self.css['descriptor']}{r}" | ||
if name is not None and not self.hide_column_names: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above refactor removes this if
then
since name
is always assigned.
I've some to the point where the LOC is now small enough that it is worth including for the flexibility it brings. The hard part is always expanding the API as pandas as become more mature, as this requires some effort (although I would rate this as highly stable, and so low maintenance). Ideally, there would be a way to find out if this feature was in fact widely desirable before committing to it for a fairly long period of time. |
Thanks @bashtage for the review and I agree with all your points, but I think Jeff was very keen on completely separating the calculation and thereby reducing the LOC much more, which I also agree with, but I was struggling to think of a way. Finally, I recently had another thought and managed to completely reengineer this with the execution into less than 20 LOC, with better formatting and HTML styling built in, so I'm closing to close this and leave the superior, #46105, open. (thanks also @mroeschke for review) |
Former example...