Skip to content

ENH: Add custom descriptors (such as dtype, nunique, etc.) to Styler output #43894

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 96 commits into from

Conversation

attack68
Copy link
Contributor

@attack68 attack68 commented Oct 5, 2021

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't use styler much, but if the community has an interest in this feature I think it's worthwhile

@attack68
Copy link
Contributor Author

@jreback what you think here? OK to merge here on green? I can extend the docs as an add on?

@jreback
Copy link
Contributor

jreback commented Jan 27, 2022

no this still has questions

@attack68
Copy link
Contributor Author

@bashtage I believe the questions jeff is referring to are yours, do you have time to review?

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a lot more useful now. The biggest question in this review is what the more API should look like. It seems to be that allowing either Sequence[str | Callable] or dict[str, str | Callable] is more consistent with other parts of pandas that are flexible. I can't think of any cases where tuple(str, str | Callable) would be used to get a string name and it's associated argument.

@@ -124,6 +128,7 @@ def __init__(
self.hide_columns_: list = [False] * self.columns.nlevels
self.hidden_rows: Sequence[int] = [] # sequence for specific hidden rows/cols
self.hidden_columns: Sequence[int] = []
self.descriptors: list[str | Callable | tuple[str, Callable]] = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a common API? I think most places where alternative input are allowed either use a Sequence of some sort or dict[str, str | Callable]. Would be be cleaner if the allowed inputs were Sequence[str | Callable] | dict[str | Callable].

Also, can't the Callable be better described? Isn't is Callable[[Series], float]? It seems that the Callablemust take aSeriesand return some sort of scalar value. Thie question is whether the scalar value can be non-numeric. If it could be any numeric value, than you could you use something likeCallable[[Series], int | float]. So more generally, what are the requirements for the Callable`?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can't the Callable be better described? Isn't is Callable[[Series], float]? It seems that the Callablemust take aSeriesand return some sort of scalar value. Thie question is whether the scalar value can be non-numeric. If it could be any numeric value, than you could you use something likeCallable[[Series], int | float]. So more generally, what are the requirements for the Callable`?

The Callable could techincally return anything so long as the returned object has a __str__ representation to populate in the HTML element. Common values might be int, float, string, but yes it does accept only series.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the sig should then be Callable[[Series],Any]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • changed typing. But did not add the dict. List[Tuple[str, str]] is present for some of the Styler functions

elif isinstance(descriptor, tuple):
name, func = descriptor[0], descriptor[1]
else:
name, func = None, descriptor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the descriptor is a Callable I think it would be best to try getattr(descriptor, "name", None) for the name rather than committing to a blank name. For example pd.Series.mean.__name__ is mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • chged. note the case of "<lambda>" i thought was worth blanking

@@ -124,6 +128,7 @@ def __init__(
self.hide_columns_: list = [False] * self.columns.nlevels
self.hidden_rows: Sequence[int] = [] # sequence for specific hidden rows/cols
self.hidden_columns: Sequence[int] = []
self.descriptors: list[str | Callable | tuple[str, Callable]] = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do stick with the tuple, it should be tuple(str, str | Callable) which would allow someone to pass a custom name for the output along with a common function, e.g., ("average", "mean"). If moving to a dict, then both {"average":"mean"}and{"average":Series.mean}` should be allowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, agreed, having a think if this can be improved.

@jreback
Copy link
Contributor

jreback commented Jan 30, 2022

I am still -1 here. This is basically completely changing the way things work. We already have .describe() to do exactly this. If you want to add .describe() and then allow it to be more generic ok there. BUT then this adds these as header summary rows, again completely changing things.

I don't have a good recommendation for how to reconcile this. I support if allow a 'header' table would be more ameanable. This just feels tacked on with a lot of bespoke things going on.

@Delengowski
Copy link
Contributor

I am still -1 here. This is basically completely changing the way things work. We already have .describe() to do exactly this. If you want to add .describe() and then allow it to be more generic ok there. BUT then this adds these as header summary rows, again completely changing things.

I don't have a good recommendation for how to reconcile this. I support if allow a 'header' table would be more ameanable. This just feels tacked on with a lot of bespoke things going on.

I think I get what you are saying. I think generalizing .describe() would be a smart move, but keep this api, have the actual stylization call the new .describe(). I'm of the opinion that the addition of the data descriptors to the header does belong on styler, but I get your concern of having this extra describe functionality be under styler rather than under .describe() (if that's indeed what you are talking about).

What I don't think should been done is have these generalized functionalities moved to .describe() and then have api be that the end user call .describe() and be forced to pass the resultant DataFrame/Series to some kwarg on styler.

@attack68
Copy link
Contributor Author

@bashtage @jreback @mroeschke, this PR stalled due to difficulty in finding a way forward. I have tried to refactor it addressing some concerns above in an alternate #46105. If you have time please comment on the alternative.

@Delengowski
Copy link
Contributor

I'd be interested in hearing a more detailed response to the opposition of this. The styler is an end use feature, so I'm not sure I understand the issue with there being data descriptors in the header or in a separate footer.

I understand and agree to the issue with duplicating computation logic but not with adding extra rows to the header. Again it's end use, where further computation isnt the primary purpose.

@jreback
Copy link
Contributor

jreback commented Feb 21, 2022

api considerations are always paramount - the goal is consistency and simplicity
not convering every possible case

will look soon

@mroeschke
Copy link
Member

There is a Separation of Concerns argument here: 1) the part to generate the descriptor data & combine with the original data, 2) the part to stylize the result. Without this feature today it's maybe roughly equivalent to (?):

result = pd.concat([original_data.astype(object), descriptor.astype(object)])
output = result.style.(...)

Maybe there should be a better way to do 1 (that's why describe() was brought up, pivot_table has similar functionality)

@attack68
Copy link
Contributor Author

@mroeschke solution, with the concatenation, might work but it then becomes harder to index the original data, export the styler for more reuseable styling, and apply the builtin formatters since you then have to exclude the concatenated rows from your subset.

I have used his solution before for a "Total" row and it is cumbersome but worked.

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small refactorings to make it simpler, and perhaps more palatable in the long run.

Parameters
----------
descriptors : list of str, callables or 2-tuples of str and callable
If a string is given must be a valid Series method, e.g. "mean" invokes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you some additional details about the callable. Specifically it must (or should) return a scalar quantity. Is the callable allowed to throw and exception? If so, what happens? I would be good to clarify this.

... return s.mean()
>>> styler = df.style.set_descriptors([
... "mean",
... Series.mean,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The screen show shows no lable on this line? Was it not possible to have .__name__ as a default test?

@@ -124,6 +133,7 @@ def __init__(
self.hide_columns_: list = [False] * self.columns.nlevels
self.hidden_rows: Sequence[int] = [] # sequence for specific hidden rows/cols
self.hidden_columns: Sequence[int] = []
self.descriptors: list[Descriptor | tuple[str, Descriptor]] = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make simplify this to self.descriptors: list[tuple[str, Descriptor]]? Also, should it immutable, so that it would be self.descriptors: tuple[tuple[str, Descriptor], ...]?

Essentially if the user doesn't provide a name, then the name used internally, even if blank, is appended so that the tuple form is the only one that would ever need to be used.

@@ -477,6 +494,108 @@ def _generate_col_header_row(self, iter: tuple, max_cols: int, col_lengths: dict

return index_blanks + column_name + column_headers

def _generate_descriptor_row(self, iter: tuple, max_cols: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of tuples are allowed? Would be better to be as specific as possible to reduce any future refactor risks.

else:
func = descriptor[1]
else:
name, func = getattr(descriptor, "__name__", None), descriptor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should name be None of just ""? Is there any utility is tracking None? If name is "" then you can simplify the type above to be str rather than str | None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps just use self.css["blank_value"] when missing and simplify below.

func = descriptor[1]
else:
name, func = getattr(descriptor, "__name__", None), descriptor
name = None if name == "<lambda>" else name # blank nameless functions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Replace nameless -> anonymous in comment

)

base_css = f"{self.css['descriptor_name']} {self.css['descriptor']}{r}"
if name is not None and not self.hide_column_names:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above refactor removes this if then since name is always assigned.

@bashtage
Copy link
Contributor

I've some to the point where the LOC is now small enough that it is worth including for the flexibility it brings. The hard part is always expanding the API as pandas as become more mature, as this requires some effort (although I would rate this as highly stable, and so low maintenance).

Ideally, there would be a way to find out if this feature was in fact widely desirable before committing to it for a fairly long period of time.

@attack68
Copy link
Contributor Author

I've some to the point where the LOC is now small enough that it is worth including for the flexibility it brings. The hard part is always expanding the API as pandas as become more mature, as this requires some effort (although I would rate this as highly stable, and so low maintenance).

Ideally, there would be a way to find out if this feature was in fact widely desirable before committing to it for a fairly long period of time.

Thanks @bashtage for the review and I agree with all your points, but I think Jeff was very keen on completely separating the calculation and thereby reducing the LOC much more, which I also agree with, but I was struggling to think of a way.

Finally, I recently had another thought and managed to completely reengineer this with the execution into less than 20 LOC, with better formatting and HTML styling built in, so I'm closing to close this and leave the superior, #46105, open. (thanks also @mroeschke for review)

@attack68 attack68 closed this Feb 25, 2022
@attack68 attack68 deleted the describe_styler branch March 6, 2022 07:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Styler conditional formatting using DataFrame.style
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add option to display dtypes below column headers
6 participants