ENH: Use a proper CSS parser for pandas Styler objects #48869

marknsikora · 2022-09-29T18:49:31Z

closes ENH: Use a proper CSS parser for pandas Styler objects #48868
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

marknsikora · 2022-09-29T18:59:25Z

This changes does have 1 sort of regression. The previous latex formatting code was using the CSS styling functions to pass around values, these values weren't properly formatted CSS entries though taking the form :value. I've updated these calls to pass through proper CSS entries even if the key name gets dropped. Old latex formatting code will not work with this update. But using correctly formatted CSS will work with both the new code and older versionss.

Old, only works in v1.5

    styler.set_table_styles(
        [
            {"selector": "toprule", "props": ":hline"},
            {"selector": "bottomrule", "props": ":otherline"},
        ]

New, works in both v1.5 and this update

    styler.set_table_styles(
        [
            {"selector": "toprule", "props": "value:hline"},
            {"selector": "bottomrule", "props": "value:otherline"},
        ]

phofl · 2022-09-29T19:33:16Z

ci/deps/actions-38-minimum_versions.yaml

@@ -49,6 +49,7 @@ dependencies:
  - scipy=1.7.1
  - sqlalchemy=1.4.16
  - tabulate=0.8.9
+  - tinycss2=1.0.0


Is this actively maintained? This looks more or less abandoned

Are you looking at tinycss rather than tinycss2? The later has commits from a few weeks ago on the bleeding edge.

I checked https://github.com/Kozea/tinycss2/commits/master and there are like 25 commits in the last 2 years

What can I say. There's not much more activity needed than that. It's a simple parser that just produces an AST from some CSS, only a few 100 lines of code if you strip out the comments and doc strings. Doesn't do any of the other fancy processing of interpreting the CSS content. That's all we need here to pull out the key-value pairs being passed in.

marknsikora · 2022-09-29T21:11:20Z

Looking into the failing CI, it appears like conda only has version 1.0.2 of the library in question. Going to try and do some local testing before pushing another commit.

mroeschke · 2022-10-03T17:05:33Z

cc @attack68

I have mixed feeling adding another optional dependency while it does also prevent rolling custom CSS parsing. If we end up spinning of Styler into another package I think it will be easier to curiate optional dependencies #47830 (comment)

attack68 · 2022-10-03T17:51:47Z

There is a lot of stuff going on here:

Requiring tinycss as a dependency to parse css to get Styler to work. There are error messages if jinja2 is not installed and users try to use Styler, we would need similar for tinycss2 and tests, if this were to go ahead.
It has been discussed a number of times and Jeff is fine with Pandas 2.0 requiring jinja2 as a dependendency and if this were adopted I suspect this would be a suitable time for this PR, rather than in 1.6.0. I see this as too big a change when 2.0 is not that far away.
Latex is hacky anyway. It was added after CSS styling and implemented to fit into the CSS framework to utilise all of the existing functionality and testing that was extensive. The bits of code that hack the CSS for Latex are not extensive and kept to a minimum. All changes to this would have to be checked for documentation thoroughly.
The user change of the Latex set_table_styles is not the end of the world. Its not that obvious anyway and probably not extensively used, but again 2.0 and a change of docs is probably a better place.
How does this parsing affect the benchmark performance with large HTML table rendering? We would need to run some of the benchmarks tests and ad hoc tests.

I have considered a PR similar to this, myself, in the past so am not averse to it, but, at first glance (and I admit I have not checked thoroughly yet) this PR in its current form seems too invasive to the code. I would consider two aspects of this PR:

splitting it up so that some areas of the code could operate without it in the first instance and would remain unaffected (e.g. you could target using only tinycss parser for the to_excel output if this is where this will have the biggest impact for users)
initially making tinycss optional so that code can fall back if it is not installed.

Whilst there is some hacky CSS in the code at present, I think we need good examples of how this PR will fundamentally enhance what the user can input and what he get out, if this is implemented. If this ends up being just code cleanup then the additional requirement to install tinycss may be an annoyance to users who simply dont use the hacky css parts anyway.

attack68 · 2022-10-03T17:57:59Z

pandas/tests/io/test_gcs.py

@@ -73,6 +73,7 @@ def test_to_read_gcs(gcs_buffer, format):
        df1.to_csv(path, index=True)
        df2 = read_csv(path, parse_dates=["dt"], index_col=0)
    elif format == "excel":
+        pytest.importorskip("tinycss2")


This test seems to be exporting a DataFrame to Excel, not a Styler. Do the changes implemented here require a user to have tinycss2 installed even if they are just exporting a DataFrame to Excel without even bothering with adding styles?

The way the excel writer is implemented there really is no such thing as a truly unstyled output. The regular to_excel function adds a bunch of borders and bold headers. Since there are multiple excel output libraries supported, these borders and other things are added as CSS in the more abstract layers of code. That CSS then gets merged with the user styles when using a Styler. But there is always some CSS being processed when writing to excel.

I think the question was meant more like: would users need to install a new dependency to use to_excel even if they don’t set user configured styles?

Sorry my answer wasn't clear, that was the question I was trying to answer.

The normal to_excel function uses CSS under the hood for styling. So yes all excel users would need this additional dependency.

tehunter · 2022-10-04T13:22:11Z

pandas/io/formats/css.py

+                if token.type == "ident":
+                    # The old css parser normalized all identifier values, do
+                    # so here to keep backwards compatibility
+                    token.value = token.lower_value


See related discussion here #48660 (comment) regarding case normalization. Seems like its worthy of further investigation of whether this is truly necessary for backwards compatibility

I'm not 100%, I looked into this back in the 0.24 code. But I believe the reason was to allow users to set a font like Arial and have it work everywhere. The HTML output expects the upper case version, but the Excel library needs an all lower case version.

marknsikora · 2022-10-04T16:12:35Z

There is a lot of stuff going on here: ...

Some good points raised.

I'm using the same optional import from the pandas util that we use for Jinja2, so we'll get the same error messages. Admittedly it is slightly annoying working with a fresh install because you only get one import error at a time.
Totally fine pushing back to 2.0. For my needs I'm using a dev build for now.
Regardless of whether this specific pull gets merged, cleaning up the LaTeX is probably a good idea. These fake :value entries that are being passed around will break any future attempt to make things more CSS compliant. Agreed that docs would need to be updated and checked.
👍
I'll try and come up with some quick tests for that.

For your two proposals

splitting it up so that some areas of the code could operate without it in the first instance and would remain unaffected (e.g. you could target using only tinycss parser for the to_excel output if this is where this will have the biggest impact for users)

This one seems a bit tricky to me. All of the stylers use maybe_convert_css_to_tuples to pass through the CSS before it is then consumed by the writer classes. The chain of function calls is already fairly convoluted, I'd rather not add another branching path in there somewhere.

initially making tinycss optional so that code can fall back if it is not installed.

I like this one. The only downside I see is that it's now very difficult for a user to know what to do when they have weird CSS that's not being parsed correctly. We could either try and add some smart detection in there and throw some kind of warning. Or add a hard dependency that when using the excel or latex writers you need to have tinycss2 installed. The second option seems like a fair compromise. The regular HTML users can keep using the hacky parser, but for the two output engines that are most likely to have weird CSS we require the better parser.

Whilst there is some hacky CSS in the code at present, I think we need good examples of how this PR will fundamentally enhance what the user can input and what he get out, if this is implemented. If this ends up being just code cleanup then the additional requirement to install tinycss may be an annoyance to users who simply dont use the hacky css parts anyway.

The specific case that pushed me to add this was I was trying to set a style like number-format: MM/dd hh:mm:ss;. Two different things went wrong here. First the entire string was lower cased, so the MM month became a mm minute. Second because the parser does a simple str.split(':'), the last half of the format spec got dropped. So in the end what I got was mm/dd hh which was not even close to what I asked for.

marknsikora · 2022-10-19T15:47:51Z

Came up with a simple benchmark based on some of the existing ones. Tested two cases, a fixed style for all cells and a dynamic style based on the content of the cell.

import numpy as np
import pandas as pd


def setup_df():
    np.random.seed(1234)

    nrows = 500
    df2 = pd.DataFrame(np.random.randn(nrows, 10))
    df2[0] = pd.period_range("2000", periods=nrows)
    df2[1] = range(nrows)

    return df2


def benchmark_fixed_style(df2):
    style = df2.style.applymap(lambda v: 'value: 1;').to_html()

def benchmark_dynamic_style(df2):
    style = df2.style.applymap(lambda v: f'value: {v};').to_html()

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("benchmark_fixed_style(df2)", setup="from __main__ import benchmark_fixed_style, setup_df; df2 = setup_df()"))
    print(timeit.timeit("benchmark_dynamic_style(df2)", setup="from __main__ import benchmark_dynamic_style, setup_df; df2 = setup_df()"))

These are the results. If anyone wants to reproduce I recommend lowering the number of trials, the longest trail ran for over 100h while I was on vacation. We are seeing a slow down, but it's not terrible. The biggest risk here is having things feel slow in dynamic environments like notebooks, some quick testing had them feel about the same to me.

Benchmark	1.4	Pull	Diff
Fixed	186729	343179	184%
Dynamic	246967	393767	159%

attack68 · 2022-10-21T11:36:21Z

I've had a think on this.

I think using a css parser for this is quite sensible. It would be very nice to fix the to_excel issues that result as a parsing problem.
The way this is implemented is also not that imposing to the existing code either.

I think the new package requirement means that we need to push it to 2.0.
It probably isnt worth providing dual functionality for the short 1.x last release before 2.0.

There is discussion regarding splitting Styler into its own pandas-styler package.
Currently it is a completely separate implementation except that it is embedded into the native df.to_excel method, as you highlighted. That needs to be a consideration for this.

Essentially it means that pandas would have to rely on pandas-styler as a dependency for to_excel.

github-actions · 2022-11-21T00:06:07Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

marknsikora · 2022-11-21T13:25:15Z

I am still doing some work on this, but at a lower priority in the background since this will be pushed to 2.0.

Given the eventual goal of splitting the styling into a second package. I want to to why we need the css parser for all the regular non styled code. Back in the 0.2x days when I last looked at this we passed around css values inside the excel classes to represent the styling on things like headers. But the newest code appears to use a python dict that maps onto the attribute names used by one of the excel libraries. I'm going to poke around and see if I can find if that css dependency on read_excel and write_excel are redundant. That would allow me to avoid having to touch so many test files.

attack68 · 2022-11-21T13:57:53Z

2.0 will be the next release (after 1.5).

But I think what you propose is a good idea. My focus has been on the separate Stytler object to produce HTML, LaTeX (and would like to do JSON also) not Excel. Any Excel work has just been patching and it might not be clean. So appreciate you having a look and coming up with ideas. Separating DataFrame.to_excel from the Styler would also be great if we were to branch a separate package.

mroeschke · 2022-12-17T19:22:24Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Make Latex Styler use proper key:value css entries

b9b8cad

marknsikora added 2 commits September 29, 2022 15:08

Replace custom css parsing code with tinycss2

7454dc5

Add test skip for new tinycss2 dependency

e00f5a7

marknsikora force-pushed the tinycss2 branch from 6ea6d61 to e00f5a7 Compare September 29, 2022 19:08

marknsikora added 2 commits September 29, 2022 15:09

Fix doc typo

2859601

Merge branch 'main' into tinycss2

ee48a6d

phofl reviewed Sep 29, 2022

View reviewed changes

marknsikora added 2 commits September 29, 2022 17:27

Add tinycss2 to environment

f7d4832

Set tinycss2 to minimum version in conda-forge

6388662

marknsikora force-pushed the tinycss2 branch from e6bce22 to 6388662 Compare September 29, 2022 21:51

mroeschke added the Styler conditional formatting using DataFrame.style label Oct 3, 2022

attack68 reviewed Oct 3, 2022

View reviewed changes

tehunter reviewed Oct 4, 2022

View reviewed changes

github-actions bot added the Stale label Nov 21, 2022

mroeschke closed this Dec 17, 2022

attack68 mentioned this pull request Aug 31, 2024

BUG: style.map() not compatible with CSS string "url(data:..." #59623

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Use a proper CSS parser for pandas Styler objects #48869

ENH: Use a proper CSS parser for pandas Styler objects #48869

marknsikora commented Sep 29, 2022

marknsikora commented Sep 29, 2022

phofl Sep 29, 2022

marknsikora Sep 29, 2022

phofl Sep 29, 2022

marknsikora Sep 29, 2022

marknsikora commented Sep 29, 2022

mroeschke commented Oct 3, 2022

attack68 commented Oct 3, 2022

attack68 Oct 3, 2022

marknsikora Oct 3, 2022

phofl Oct 3, 2022

marknsikora Oct 4, 2022

tehunter Oct 4, 2022

marknsikora Oct 4, 2022

marknsikora commented Oct 4, 2022

marknsikora commented Oct 19, 2022

attack68 commented Oct 21, 2022

github-actions bot commented Nov 21, 2022

marknsikora commented Nov 21, 2022

attack68 commented Nov 21, 2022

mroeschke commented Dec 17, 2022

ENH: Use a proper CSS parser for pandas Styler objects #48869

ENH: Use a proper CSS parser for pandas Styler objects #48869

Conversation

marknsikora commented Sep 29, 2022

marknsikora commented Sep 29, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marknsikora commented Sep 29, 2022

mroeschke commented Oct 3, 2022

attack68 commented Oct 3, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marknsikora commented Oct 4, 2022

marknsikora commented Oct 19, 2022

attack68 commented Oct 21, 2022

github-actions bot commented Nov 21, 2022

marknsikora commented Nov 21, 2022

attack68 commented Nov 21, 2022

mroeschke commented Dec 17, 2022