Skip to content

Add Zstandard compression support #43925

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 22, 2021
Merged

Conversation

jonashaag
Copy link
Contributor

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

Haven't found any discussion about Zstandard so far but I think it would be a valuable addition to the Pickle compression options.

@pep8speaks
Copy link

pep8speaks commented Oct 8, 2021

Hello @jonashaag! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-22 09:44:38 UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably ok, but

  • need to add this to some of the builds
  • need to update the doc strings for virtually all methods as the routines you changes are for evertyhing (e.g. it should work, unless of course those formats don't support it, e.g. hdf5 likely doesn't).

@jreback jreback added the IO Data IO issues that don't fit into a more specific label label Oct 10, 2021
@jonashaag
Copy link
Contributor Author

jonashaag commented Oct 11, 2021

Thanks for the review!

  • Added Zstandard also to CSV writing and other places and updated all the docstrings
  • Refactored some tests to use _compression_to_extension dictionary
  • Fixed binary mode detection in io.common for Zstandard.

@jonashaag jonashaag changed the title Add Zstandard support to read_pickle/to_pickle Add Zstandard compression support Oct 12, 2021
@jonashaag jonashaag requested a review from jreback October 12, 2021 16:44
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way you are importing / skipping is not inline with other testing.

@jreback
Copy link
Contributor

jreback commented Oct 21, 2021

cc @twoertwein if you can have a look

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok if you can merge master and address the existing comments

Copy link
Member

@twoertwein twoertwein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, looks good from my side!

I think the only pending parts are 1) whatsnew entry for 1.4, 2) pytest.importorskip, 3) not testing zstd on all CI configurations, and 4) rebasing

@jonashaag
Copy link
Contributor Author

Will push an update soon

@jonashaag
Copy link
Contributor Author

I've consolidated all compression docstrings into two shared fragments

@jonashaag jonashaag force-pushed the pickle-zstd branch 2 times, most recently from 850be36 to 532dd8f Compare December 8, 2021 14:36
@jonashaag jonashaag force-pushed the pickle-zstd branch 2 times, most recently from 77c9022 to 760ba0a Compare December 8, 2021 14:43
@jonashaag
Copy link
Contributor Author

jonashaag commented Dec 15, 2021

If this builds OK it should be ready for re-review

@jonashaag
Copy link
Contributor Author

Test failures should go away when this PR is merged (missing .zst URL in this GitHub repo)

@twoertwein
Copy link
Member

You will need to add "zstd" here

Union[Literal["infer", "gzip", "bz2", "zip", "xz"], CompressionDict]
to make mypy happy.

@twoertwein
Copy link
Member

The only obvious part missing to me is a whatsnew entry.

@jonashaag jonashaag requested a review from jreback December 16, 2021 12:37
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really good! need to change the shared docs from generic to shared_docs and the circular import will go away.

some small other comments. pls merge master and ping on green.

@@ -1136,18 +1130,20 @@ def _is_binary_mode(handle: FilePath | BaseBuffer, mode: str) -> bool:
)


def _get_binary_io_classes() -> tuple[type, ...]:
def _get_binary_io_classes(_cache: list[type] = []) -> tuple[type, ...]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about?

@functools.cache
def _get_binary_io_classes() -> tuple[type, ...]:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python 3.9+ only

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@functools.lru_cache should work as well, or?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure if you prefer that over the custom memoization I'm happy to change it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only one "least recently used" input argument (might be a bit too much to use lru_cache) but it keeps the code slightly simpler. Will defer to @jreback

(".xz", "xz"),
(".XZ", "xz"),
pytest.param((".zst", "zstd"), marks=td.skip_if_no("zstandard")),
pytest.param((".ZST", "zstd"), marks=td.skip_if_no("zstandard")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the compression tests take too much time: I would expect that we need only one upper-case test, we have one .lower() call that affects all compression formats. @jreback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small benchmark: io/test_common.py + io/test_pickle.py with full list: 4.25 s, with small list: 4.2 s

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick benchmark! Based on that we can keep all the upper cases. Compression is also used in a lot of CSV tests.

@jonashaag
Copy link
Contributor Author

@jreback ping, should be good to go now. Test failures are because one of the test only works when this has been merged to master.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there still seems to be a few uses of generic._share_docs, can you change these to directly use the shared_docs itself (we must alias this but its not great)

@jreback jreback added this to the 1.4 milestone Dec 19, 2021
@jonashaag
Copy link
Contributor Author

Something with CI is really broken, all the PRs are red

@jonashaag jonashaag force-pushed the pickle-zstd branch 2 times, most recently from f331616 to 8bd477f Compare December 21, 2021 20:13
@jonashaag
Copy link
Contributor Author

@jreback should be good to go

@jreback
Copy link
Contributor

jreback commented Dec 22, 2021

can you merge master once more, been having a number of CI issues

@jonashaag
Copy link
Contributor Author

@jreback looks like those are only failing because those files are not yet on master

@jreback
Copy link
Contributor

jreback commented Dec 22, 2021

failing because those files are not yet on master

what files?

@jonashaag
Copy link
Contributor Author

jonashaag commented Dec 22, 2021

what files?

2021-12-22T10:21:51.1829969Z _______________ test_compressed_urls[python-explicit-zstd-.zst] _______________
...
2021-12-22T10:21:51.1885656Z E       urllib.error.HTTPError: HTTP Error 404: Not Found
2021-12-22T10:21:51.1885903Z 
2021-12-22T10:21:51.1886527Z C:\Miniconda\envs\pandas-dev\lib\urllib\request.py:649: HTTPError
2021-12-22T10:21:51.1887052Z ________________ test_compressed_urls[python-infer-zstd-.zst] _________________

https://github.com/pandas-dev/pandas/tree/master/pandas/tests/io/parser/data

base_url = (
"https://github.com/pandas-dev/pandas/raw/master/"
"pandas/tests/io/parser/data/salaries.csv"
)
url = base_url + extension

Hardcoded URL access to master branch

@jreback jreback merged commit e50d077 into pandas-dev:master Dec 22, 2021
@jreback
Copy link
Contributor

jreback commented Dec 22, 2021

thanks @jonashaag very nice!

@mroeschke
Copy link
Member

Looks like master is failing some builds are missing zstandard (or not importskiped)?

https://github.com/pandas-dev/pandas/runs/4612486852?check_suite_focus=true#step:10:255

@phofl
Copy link
Member

phofl commented Dec 23, 2021

Yep

FAILED pandas/tests/io/parser/test_network.py::test_compressed_urls[python-explicit-zstd-.zst]
FAILED pandas/tests/io/parser/test_network.py::test_compressed_urls[python-infer-zstd-.zst]
FAILED pandas/tests/io/parser/test_network.py::test_compressed_urls[c-explicit-zstd-.zst]
FAILED pandas/tests/io/parser/test_network.py::test_compressed_urls[c-infer-zstd-.zst]

@phofl phofl mentioned this pull request Dec 23, 2021
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants