Skip to content

ENH: Use a proper CSS parser for pandas Styler objects #48869

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

marknsikora
Copy link
Contributor

@marknsikora
Copy link
Contributor Author

This changes does have 1 sort of regression. The previous latex formatting code was using the CSS styling functions to pass around values, these values weren't properly formatted CSS entries though taking the form :value. I've updated these calls to pass through proper CSS entries even if the key name gets dropped. Old latex formatting code will not work with this update. But using correctly formatted CSS will work with both the new code and older versionss.

Old, only works in v1.5

    styler.set_table_styles(
        [
            {"selector": "toprule", "props": ":hline"},
            {"selector": "bottomrule", "props": ":otherline"},
        ]

New, works in both v1.5 and this update

    styler.set_table_styles(
        [
            {"selector": "toprule", "props": "value:hline"},
            {"selector": "bottomrule", "props": "value:otherline"},
        ]

@@ -49,6 +49,7 @@ dependencies:
- scipy=1.7.1
- sqlalchemy=1.4.16
- tabulate=0.8.9
- tinycss2=1.0.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actively maintained? This looks more or less abandoned

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you looking at tinycss rather than tinycss2? The later has commits from a few weeks ago on the bleeding edge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked https://github.com/Kozea/tinycss2/commits/master and there are like 25 commits in the last 2 years

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What can I say. There's not much more activity needed than that. It's a simple parser that just produces an AST from some CSS, only a few 100 lines of code if you strip out the comments and doc strings. Doesn't do any of the other fancy processing of interpreting the CSS content. That's all we need here to pull out the key-value pairs being passed in.

@marknsikora
Copy link
Contributor Author

Looking into the failing CI, it appears like conda only has version 1.0.2 of the library in question. Going to try and do some local testing before pushing another commit.

@mroeschke
Copy link
Member

cc @attack68

I have mixed feeling adding another optional dependency while it does also prevent rolling custom CSS parsing. If we end up spinning of Styler into another package I think it will be easier to curiate optional dependencies #47830 (comment)

@mroeschke mroeschke added the Styler conditional formatting using DataFrame.style label Oct 3, 2022
@attack68
Copy link
Contributor

attack68 commented Oct 3, 2022

There is a lot of stuff going on here:

  1. Requiring tinycss as a dependency to parse css to get Styler to work. There are error messages if jinja2 is not installed and users try to use Styler, we would need similar for tinycss2 and tests, if this were to go ahead.

  2. It has been discussed a number of times and Jeff is fine with Pandas 2.0 requiring jinja2 as a dependendency and if this were adopted I suspect this would be a suitable time for this PR, rather than in 1.6.0. I see this as too big a change when 2.0 is not that far away.

  3. Latex is hacky anyway. It was added after CSS styling and implemented to fit into the CSS framework to utilise all of the existing functionality and testing that was extensive. The bits of code that hack the CSS for Latex are not extensive and kept to a minimum. All changes to this would have to be checked for documentation thoroughly.

  4. The user change of the Latex set_table_styles is not the end of the world. Its not that obvious anyway and probably not extensively used, but again 2.0 and a change of docs is probably a better place.

  5. How does this parsing affect the benchmark performance with large HTML table rendering? We would need to run some of the benchmarks tests and ad hoc tests.

I have considered a PR similar to this, myself, in the past so am not averse to it, but, at first glance (and I admit I have not checked thoroughly yet) this PR in its current form seems too invasive to the code. I would consider two aspects of this PR:

  • splitting it up so that some areas of the code could operate without it in the first instance and would remain unaffected (e.g. you could target using only tinycss parser for the to_excel output if this is where this will have the biggest impact for users)
  • initially making tinycss optional so that code can fall back if it is not installed.

Whilst there is some hacky CSS in the code at present, I think we need good examples of how this PR will fundamentally enhance what the user can input and what he get out, if this is implemented. If this ends up being just code cleanup then the additional requirement to install tinycss may be an annoyance to users who simply dont use the hacky css parts anyway.

@@ -73,6 +73,7 @@ def test_to_read_gcs(gcs_buffer, format):
df1.to_csv(path, index=True)
df2 = read_csv(path, parse_dates=["dt"], index_col=0)
elif format == "excel":
pytest.importorskip("tinycss2")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems to be exporting a DataFrame to Excel, not a Styler. Do the changes implemented here require a user to have tinycss2 installed even if they are just exporting a DataFrame to Excel without even bothering with adding styles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the excel writer is implemented there really is no such thing as a truly unstyled output. The regular to_excel function adds a bunch of borders and bold headers. Since there are multiple excel output libraries supported, these borders and other things are added as CSS in the more abstract layers of code. That CSS then gets merged with the user styles when using a Styler. But there is always some CSS being processed when writing to excel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the question was meant more like: would users need to install a new dependency to use to_excel even if they don’t set user configured styles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry my answer wasn't clear, that was the question I was trying to answer.

The normal to_excel function uses CSS under the hood for styling. So yes all excel users would need this additional dependency.

Comment on lines +388 to +391
if token.type == "ident":
# The old css parser normalized all identifier values, do
# so here to keep backwards compatibility
token.value = token.lower_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See related discussion here #48660 (comment) regarding case normalization. Seems like its worthy of further investigation of whether this is truly necessary for backwards compatibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100%, I looked into this back in the 0.24 code. But I believe the reason was to allow users to set a font like Arial and have it work everywhere. The HTML output expects the upper case version, but the Excel library needs an all lower case version.

@marknsikora
Copy link
Contributor Author

There is a lot of stuff going on here: ...

Some good points raised.

  1. I'm using the same optional import from the pandas util that we use for Jinja2, so we'll get the same error messages. Admittedly it is slightly annoying working with a fresh install because you only get one import error at a time.
  2. Totally fine pushing back to 2.0. For my needs I'm using a dev build for now.
  3. Regardless of whether this specific pull gets merged, cleaning up the LaTeX is probably a good idea. These fake :value entries that are being passed around will break any future attempt to make things more CSS compliant. Agreed that docs would need to be updated and checked.
  4. 👍
  5. I'll try and come up with some quick tests for that.

For your two proposals

splitting it up so that some areas of the code could operate without it in the first instance and would remain unaffected (e.g. you could target using only tinycss parser for the to_excel output if this is where this will have the biggest impact for users)

This one seems a bit tricky to me. All of the stylers use maybe_convert_css_to_tuples to pass through the CSS before it is then consumed by the writer classes. The chain of function calls is already fairly convoluted, I'd rather not add another branching path in there somewhere.

initially making tinycss optional so that code can fall back if it is not installed.

I like this one. The only downside I see is that it's now very difficult for a user to know what to do when they have weird CSS that's not being parsed correctly. We could either try and add some smart detection in there and throw some kind of warning. Or add a hard dependency that when using the excel or latex writers you need to have tinycss2 installed. The second option seems like a fair compromise. The regular HTML users can keep using the hacky parser, but for the two output engines that are most likely to have weird CSS we require the better parser.

Whilst there is some hacky CSS in the code at present, I think we need good examples of how this PR will fundamentally enhance what the user can input and what he get out, if this is implemented. If this ends up being just code cleanup then the additional requirement to install tinycss may be an annoyance to users who simply dont use the hacky css parts anyway.

The specific case that pushed me to add this was I was trying to set a style like number-format: MM/dd hh:mm:ss;. Two different things went wrong here. First the entire string was lower cased, so the MM month became a mm minute. Second because the parser does a simple str.split(':'), the last half of the format spec got dropped. So in the end what I got was mm/dd hh which was not even close to what I asked for.

@marknsikora
Copy link
Contributor Author

Came up with a simple benchmark based on some of the existing ones. Tested two cases, a fixed style for all cells and a dynamic style based on the content of the cell.

import numpy as np
import pandas as pd


def setup_df():
    np.random.seed(1234)

    nrows = 500
    df2 = pd.DataFrame(np.random.randn(nrows, 10))
    df2[0] = pd.period_range("2000", periods=nrows)
    df2[1] = range(nrows)

    return df2


def benchmark_fixed_style(df2):
    style = df2.style.applymap(lambda v: 'value: 1;').to_html()

def benchmark_dynamic_style(df2):
    style = df2.style.applymap(lambda v: f'value: {v};').to_html()

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("benchmark_fixed_style(df2)", setup="from __main__ import benchmark_fixed_style, setup_df; df2 = setup_df()"))
    print(timeit.timeit("benchmark_dynamic_style(df2)", setup="from __main__ import benchmark_dynamic_style, setup_df; df2 = setup_df()"))

These are the results. If anyone wants to reproduce I recommend lowering the number of trials, the longest trail ran for over 100h while I was on vacation. We are seeing a slow down, but it's not terrible. The biggest risk here is having things feel slow in dynamic environments like notebooks, some quick testing had them feel about the same to me.

Benchmark 1.4 Pull Diff
Fixed 186729 343179 184%
Dynamic 246967 393767 159%

@attack68
Copy link
Contributor

I've had a think on this.

I think using a css parser for this is quite sensible. It would be very nice to fix the to_excel issues that result as a parsing problem.
The way this is implemented is also not that imposing to the existing code either.

I think the new package requirement means that we need to push it to 2.0.
It probably isnt worth providing dual functionality for the short 1.x last release before 2.0.

There is discussion regarding splitting Styler into its own pandas-styler package.
Currently it is a completely separate implementation except that it is embedded into the native df.to_excel method, as you highlighted. That needs to be a consideration for this.

Essentially it means that pandas would have to rely on pandas-styler as a dependency for to_excel.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 21, 2022
@marknsikora
Copy link
Contributor Author

I am still doing some work on this, but at a lower priority in the background since this will be pushed to 2.0.

Given the eventual goal of splitting the styling into a second package. I want to to why we need the css parser for all the regular non styled code. Back in the 0.2x days when I last looked at this we passed around css values inside the excel classes to represent the styling on things like headers. But the newest code appears to use a python dict that maps onto the attribute names used by one of the excel libraries. I'm going to poke around and see if I can find if that css dependency on read_excel and write_excel are redundant. That would allow me to avoid having to touch so many test files.

@attack68
Copy link
Contributor

2.0 will be the next release (after 1.5).

But I think what you propose is a good idea. My focus has been on the separate Stytler object to produce HTML, LaTeX (and would like to do JSON also) not Excel. Any Excel work has just been patching and it might not be clean. So appreciate you having a look and coming up with ideas. Separating DataFrame.to_excel from the Styler would also be great if we were to branch a separate package.

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Stale Styler conditional formatting using DataFrame.style
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Use a proper CSS parser for pandas Styler objects
5 participants