Commit 850be36

Add Zstandard compression support
1 parent 41680b1 commit 850be36

38 files changed: +336 -257 lines changed

MANIFEST.in

+1
Original file line number | Diff line number | Diff line change
@@ -36,6 +36,7 @@ global-exclude *.xpt
3636
global-exclude *.cpt
3737
global-exclude *.xz
3838
global-exclude *.zip
39+
global-exclude *.zst
3940
global-exclude *~
4041
global-exclude .DS_Store
4142
global-exclude .git*

ci/deps/actions-38-slow.yaml

+1
@@ -35,3 +35,4 @@ dependencies:
3535
- xlsxwriter
3636
- xlwt
3737
- numba
38+
- zstandard

ci/deps/actions-39-slow.yaml

+1
@@ -38,6 +38,7 @@ dependencies:
3838
- xlsxwriter
3939
- xlwt
4040
- pyreadstat
41+
- zstandard
4142
- pip
4243
- pip:
4344
- pyxlsb

ci/deps/actions-39.yaml

+1
@@ -37,6 +37,7 @@ dependencies:
3737
- xlsxwriter
3838
- xlwt
3939
- pyreadstat
40+
- zstandard
4041
- pip
4142
- pip:
4243
- pyxlsb

ci/deps/azure-macos-38.yaml

+1
@@ -31,6 +31,7 @@ dependencies:
3131
- xlrd
3232
- xlsxwriter
3333
- xlwt
34+
- zstandard
3435
- pip
3536
- pip:
3637
- cython>=0.29.24

ci/deps/azure-windows-38.yaml

+1
@@ -33,3 +33,4 @@ dependencies:
3333
- xlrd
3434
- xlsxwriter
3535
- xlwt
36+
- zstandard

ci/deps/azure-windows-39.yaml

+1
@@ -37,6 +37,7 @@ dependencies:
3737
- xlsxwriter
3838
- xlwt
3939
- pyreadstat
40+
- zstandard
4041
- pip
4142
- pip:
4243
- pyxlsb

ci/deps/circle-38-arm64.yaml

+1
@@ -16,6 +16,7 @@ dependencies:
1616
- numpy
1717
- python-dateutil
1818
- pytz
19+
- zstandard
1920
- pip
2021
- flask
2122
- pip:

doc/source/getting_started/install.rst

+10
@@ -402,3 +402,13 @@ qtpy Clipboard I/O
402402
xclip                                        Clipboard I/O on linux
403403
xsel                                         Clipboard I/O on linux
404404
========================= ================== =============================================================
405+
406+
407+
Compression
408+
^^^^^^^^^^^
409+
410+
========================= ================== =============================================================
411+
Dependency                Minimum Version    Notes
412+
========================= ================== =============================================================
413+
Zstandard                                    Zstandard compression
414+
========================= ================== =============================================================

doc/source/user_guide/io.rst

+9-9
@@ -316,14 +316,14 @@ chunksize : int, default ``None``
316316
Quoting, compression, and file format
317317
+++++++++++++++++++++++++++++++++++++
318318

319-
compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``None``, ``dict``}, default ``'infer'``
319+
compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``'zstd'``, ``None``, ``dict``}, default ``'infer'``
320320
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
321-
bz2, zip, or xz if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2',
322-
'.zip', or '.xz', respectively, and no decompression otherwise. If using 'zip',
321+
bz2, zip, xz, or zstandard if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2',
322+
'.zip', '.xz', '.zst', respectively, and no decompression otherwise. If using 'zip',
323323
the ZIP file must contain only one data file to be read in.
324324
Set to ``None`` for no decompression. Can also be a dict with key ``'method'``
325-
set to one of {``'zip'``, ``'gzip'``, ``'bz2'``} and other key-value pairs are
326-
forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, or ``bz2.BZ2File``.
325+
set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``} and other key-value pairs are
326+
forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, or ``zstandard.ZstdDecompressor``.
327327
As an example, the following could be passed for faster compression and to
328328
create a reproducible gzip archive:
329329
``compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}``.
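Note on the dict form: the context lines above say that the extra keys are forwarded to the underlying opener, and that `mtime` makes the gzip output reproducible. A minimal standard-library sketch of that effect follows. `compress_reproducibly` is a hypothetical helper name used only for illustration, and pandas itself is not involved.

```python
import gzip
import io


def compress_reproducibly(data: bytes, compresslevel: int = 1, mtime: int = 1) -> bytes:
    """Gzip ``data`` with a pinned header timestamp (hypothetical helper)."""
    buffer = io.BytesIO()
    # These are the same keyword arguments that the dict form names;
    # gzip.GzipFile writes ``mtime`` into the header instead of the current time.
    with gzip.GzipFile(
        fileobj=buffer, mode="wb", compresslevel=compresslevel, mtime=mtime
    ) as handle:
        handle.write(data)
    return buffer.getvalue()


payload = b"a,b\n1,2\n3,4\n"
first = compress_reproducibly(payload)
second = compress_reproducibly(payload)
```

With the timestamp pinned, repeated runs over the same payload should produce byte-identical archives, which is what makes the output reproducible.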
@@ -4022,18 +4022,18 @@ Compressed pickle files
40224022
'''''''''''''''''''''''
40234023

40244024
:func:`read_pickle`, :meth:`DataFrame.to_pickle` and :meth:`Series.to_pickle` can read
4025-
and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz`` are supported for reading and writing.
4025+
and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz``, ``zstd`` are supported for reading and writing.
40264026
The ``zip`` file format only supports reading and must contain only one data file
40274027
to be read.
40284028

40294029
The compression type can be an explicit parameter or be inferred from the file extension.
4030-
If 'infer', then use ``gzip``, ``bz2``, ``zip``, or ``xz`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``, or
4031-
``'.xz'``, respectively.
4030+
If 'infer', then use ``gzip``, ``bz2``, ``zip``, ``xz``, ``zstd`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``,
4031+
``'.xz'``, or ``'.zst'``, respectively.
40324032

40334033
The compression parameter can also be a ``dict`` in order to pass options to the
40344034
compression protocol. It must have a ``'method'`` key set to the name
40354035
of the compression protocol, which must be one of
4036-
{``'zip'``, ``'gzip'``, ``'bz2'``}. All other key-value pairs are passed to
4036+
{``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``}. All other key-value pairs are passed to
40374037
the underlying compression library.
40384038

40394039
.. ipython:: python

pandas/_testing/_io.py

+4-1
@@ -15,6 +15,7 @@
1515
ReadPickleBuffer,
1616
)
1717
from pandas.compat import get_lzma_file
18+
from pandas.compat._optional import import_optional_dependency
1819

1920
import pandas as pd
2021
from pandas._testing._random import rands
@@ -364,7 +365,7 @@ def write_to_compressed(compression, path, data, dest="test"):
364365
365366
Parameters
366367
----------
367-
compression : {'gzip', 'bz2', 'zip', 'xz'}
368+
compression : {'gzip', 'bz2', 'zip', 'xz', 'zstd'}
368369
The compression type to use.
369370
path : str
370371
The file path to write the data.
@@ -391,6 +392,8 @@ def write_to_compressed(compression, path, data, dest="test"):
391392
compress_method = gzip.GzipFile
392393
elif compression == "bz2":
393394
compress_method = bz2.BZ2File
395+
elif compression == "zstd":
396+
compress_method = import_optional_dependency("zstandard").open
394397
elif compression == "xz":
395398
compress_method = get_lzma_file()
396399
else:
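Note on the new `zstd` branch: the hunk above resolves the opener through an optional import, so the other branches keep working when `zstandard` is absent. The sketch below shows the same dispatch idea with plain `importlib`. `pick_opener` and `roundtrip` are hypothetical names, and this is not the pandas helper itself.

```python
import bz2
import gzip
import importlib
import lzma
import os
import tempfile


def pick_opener(compression: str):
    """Return a file-opening callable for ``compression`` (hypothetical helper)."""
    if compression == "gzip":
        return gzip.open
    if compression == "bz2":
        return bz2.open
    if compression == "xz":
        return lzma.open
    if compression == "zstd":
        # Imported lazily so the branches above work without the package.
        # Assumes the third-party ``zstandard`` package, whose ``open`` the diff uses.
        return importlib.import_module("zstandard").open
    raise ValueError(f"Unrecognized compression type: {compression}")


def roundtrip(compression: str, data: bytes) -> bytes:
    """Write ``data`` to a temporary compressed file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "payload.bin")
        opener = pick_opener(compression)
        with opener(path, "wb") as handle:
            handle.write(data)
        with opener(path, "rb") as handle:
            return handle.read()
```

Calling `roundtrip("zstd", ...)` only works when `zstandard` is installed; the other three methods rely on the standard library alone.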

pandas/_testing/contexts.py

+1-1
@@ -29,7 +29,7 @@ def decompress_file(path, compression):
2929
path : str
3030
The path where the file is read from.
3131
32-
compression : {'gzip', 'bz2', 'zip', 'xz', None}
32+
compression : {'gzip', 'bz2', 'zip', 'xz', 'zstd', None}
3333
Name of the decompression to use
3434
3535
Returns

pandas/compat/_optional.py

+1
@@ -34,6 +34,7 @@
3434
"xlwt": "1.3.0",
3535
"xlsxwriter": "1.2.2",
3636
"numba": "0.50.1",
37+
"zstandard": "0.15.2",
3738
}
3839

3940
# A mapping from import name to package name (on PyPI) for packages where

pandas/conftest.py

+19-2
@@ -267,15 +267,32 @@ def other_closed(request):
267267
return request.param
268268

269269

270-
@pytest.fixture(params=[None, "gzip", "bz2", "zip", "xz"])
270+
@pytest.fixture(
271+
params=[
272+
None,
273+
"gzip",
274+
"bz2",
275+
"zip",
276+
"xz",
277+
pytest.param("zstd", marks=td.skip_if_no("zstandard")),
278+
]
279+
)
271280
def compression(request):
272281
"""
273282
Fixture for trying common compression types in compression tests.
274283
"""
275284
return request.param
276285

277286

278-
@pytest.fixture(params=["gzip", "bz2", "zip", "xz"])
287+
@pytest.fixture(
288+
params=[
289+
"gzip",
290+
"bz2",
291+
"zip",
292+
"xz",
293+
pytest.param("zstd", marks=td.skip_if_no("zstandard")),
294+
]
295+
)
279296
def compression_only(request):
280297
"""
281298
Fixture for trying common compression types in compression tests excluding

pandas/core/describe.py

+2-2
@@ -35,8 +35,6 @@
3535

3636
from pandas.core.reshape.concat import concat
3737

38-
from pandas.io.formats.format import format_percentiles
39-
4038
if TYPE_CHECKING:
4139
from pandas import (
4240
DataFrame,
@@ -230,6 +228,8 @@ def describe_numeric_1d(series: Series, percentiles: Sequence[float]) -> Series:
230228
"""
231229
from pandas import Series
232230

231+
from pandas.io.formats.format import format_percentiles
232+
233233
# error: Argument 1 to "format_percentiles" has incompatible type "Sequence[float]";
234234
# expected "Union[ndarray, List[Union[int, float]], List[float], List[Union[str,
235235
# float]]]"

pandas/core/frame.py

+11-18
@@ -2486,7 +2486,10 @@ def _from_arrays(
24862486
)
24872487
return cls(mgr)
24882488

2489-
@doc(storage_options=generic._shared_docs["storage_options"])
2489+
@doc(
2490+
storage_options=generic._shared_docs["storage_options"],
2491+
compression_options=generic._shared_docs["compression_options"] % "path",
2492+
)
24902493
@deprecate_kwarg(old_arg_name="fname", new_arg_name="path")
24912494
def to_stata(
24922495
self,
@@ -2565,16 +2568,7 @@ def to_stata(
25652568
format. Only available if version is 117. Storing strings in the
25662569
StrL format can produce smaller dta files if strings have more than
25672570
8 characters and values are repeated.
2568-
compression : str or dict, default 'infer'
2569-
For on-the-fly compression of the output dta. If string, specifies
2570-
compression mode. If dict, value at key 'method' specifies
2571-
compression mode. Compression mode must be one of {{'infer', 'gzip',
2572-
'bz2', 'zip', 'xz', None}}. If compression mode is 'infer' and
2573-
`fname` is path-like, then detect compression from the following
2574-
extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no
2575-
compression). If dict and compression mode is one of {{'zip',
2576-
'gzip', 'bz2'}}, or inferred as one of the above, other entries
2577-
passed as additional compression options.
2571+
{compression_options}
25782572
25792573
.. versionadded:: 1.1.0
25802574
@@ -2943,7 +2937,11 @@ def to_html(
29432937
render_links=render_links,
29442938
)
29452939

2946-
@doc(storage_options=generic._shared_docs["storage_options"])
2940+
@doc(
2941+
storage_options=generic._shared_docs["storage_options"],
2942+
compression_options=generic._shared_docs["compression_options"]
2943+
% "path_or_buffer",
2944+
)
29472945
def to_xml(
29482946
self,
29492947
path_or_buffer: FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None = None,
@@ -3020,12 +3018,7 @@ def to_xml(
30203018
layout of elements and attributes from original output. This
30213019
argument requires ``lxml`` to be installed. Only XSLT 1.0
30223020
scripts and not later versions is currently supported.
3023-
compression : {{'infer', 'gzip', 'bz2', 'zip', 'xz', None}}, default 'infer'
3024-
For on-the-fly decompression of on-disk data. If 'infer', then use
3025-
gzip, bz2, zip or xz if path_or_buffer is a string ending in
3026-
'.gz', '.bz2', '.zip', or 'xz', respectively, and no decompression
3027-
otherwise. If using 'zip', the ZIP file must contain only one data
3028-
file to be read in. Set to None for no decompression.
3021+
{compression_options}
30293022
{storage_options}
30303023
30313024
Returns

pandas/core/generic.py

+11-12
@@ -2406,7 +2406,7 @@ def to_json(
24062406
throw ValueError if incorrect 'orient' since others are not
24072407
list-like.
24082408
2409-
compression : {{'infer', 'gzip', 'bz2', 'zip', 'xz', None}}
2409+
compression : {{'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', None}}
24102410
24112411
A string representing the compression to use in the output file,
24122412
only used when the first argument is a filename. By default, the
@@ -2933,16 +2933,16 @@ def to_pickle(
29332933
----------
29342934
path : str
29352935
File path where the pickled object will be stored.
2936-
compression : {{'infer', 'gzip', 'bz2', 'zip', 'xz', None}}, \
2936+
compression : {{'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', None}}, \
29372937
default 'infer'
29382938
A string representing the compression to use in the output file. By
29392939
default, infers from the file extension in specified path.
29402940
Compression mode may be any of the following possible
2941-
values: {{infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}}. If compression
2942-
mode is infer and path_or_buf is path-like, then detect
2941+
values: {{'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', None}}.
2942+
If compression mode is 'infer' and path_or_buf is path-like, then detect
29432943
compression mode from the following extensions:
2944-
.gz’, ‘.bz2’, ‘.zip’ or ‘.xz. (otherwise no compression).
2945-
If dict given and mode is zip or inferred as zip, other entries
2944+
'.gz', '.bz2', '.zip', '.xz', '.zst'. (otherwise no compression).
2945+
If dict given and mode is 'zip' or inferred as 'zip', other entries
29462946
passed as additional compression options.
29472947
protocol : int
29482948
Int which indicates which protocol should be used by the pickler,
@@ -3406,11 +3406,11 @@ def to_csv(
34063406
compression : str or dict, default 'infer'
34073407
If str, represents compression mode. If dict, value at 'method' is
34083408
the compression mode. Compression mode may be any of the following
3409-
possible values: {{'infer', 'gzip', 'bz2', 'zip', 'xz', None}}. If
3410-
compression mode is 'infer' and `path_or_buf` is path-like, then
3409+
possible values: {{'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', None}}.
3410+
If compression mode is 'infer' and `path_or_buf` is path-like, then
34113411
detect compression mode from the following extensions: '.gz',
3412-
'.bz2', '.zip' or '.xz'. (otherwise no compression). If dict given
3413-
and mode is one of {{'zip', 'gzip', 'bz2'}}, or inferred as
3412+
'.bz2', '.zip', '.xz', '.zst'. (otherwise no compression). If dict given
3413+
and mode is one of {{'zip', 'gzip', 'bz2', 'zstd'}}, or inferred as
34143414
one of the above, other entries passed as
34153415
additional compression options.
34163416
If `path_or_buf` is omitted or `None` or is a file opened in text
@@ -3426,8 +3426,7 @@ def to_csv(
34263426
.. versionchanged:: 1.1.0
34273427
34283428
Passing compression options as keys in dict is
3429-
supported for compression modes 'gzip' and 'bz2'
3430-
as well as 'zip'.
3429+
supported for compression modes 'gzip', 'bz2', 'zstd', and 'zip'.
34313430
34323431
.. versionchanged:: 1.2.0
34333432
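Note on the docstring rules above: they describe two steps, splitting a dict into its `'method'` key and the remaining options, then inferring the method from the file extension when it is `'infer'`. A minimal sketch of that logic follows, assuming the extension table given in the docstrings. `resolve_compression` and `EXTENSION_TO_METHOD` are hypothetical names, not pandas internals.

```python
import os

# Extension-to-method table as listed in the updated docstrings.
EXTENSION_TO_METHOD = {
    ".gz": "gzip",
    ".bz2": "bz2",
    ".zip": "zip",
    ".xz": "xz",
    ".zst": "zstd",
}


def resolve_compression(path, compression="infer"):
    """Return ``(method, options)`` for a path and a compression argument."""
    if isinstance(compression, dict):
        # Copy so the caller's dict is not mutated when 'method' is removed.
        options = dict(compression)
        method = options.pop("method", None)
    else:
        method, options = compression, {}

    if method == "infer":
        lowered = os.fspath(path).lower()
        # No matching extension means no compression.
        method = next(
            (name for ext, name in EXTENSION_TO_METHOD.items() if lowered.endswith(ext)),
            None,
        )
    elif method is not None and method not in EXTENSION_TO_METHOD.values():
        raise ValueError(f"Unrecognized compression type: {method}")

    return method, options
```

The remaining options are what would be handed to the underlying compression library, for example `compresslevel` for gzip.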

pandas/core/shared_docs.py

+29
@@ -402,6 +402,35 @@
402402
starting with "s3://", and "gcs://") the key-value pairs are forwarded to
403403
``fsspec``. Please see ``fsspec`` and ``urllib`` for more details."""
404404

405+
_shared_docs[
406+
"compression_options"
407+
] = """compression : str or dict, default 'infer'
408+
For on-the-fly compression of the output data. If 'infer' and '%s' is
409+
path-like, then detect compression from the following extensions: '.gz',
410+
'.bz2', '.zip', '.xz', or '.zst' (otherwise no compression). Set to
411+
``None`` for no compression. Can also be a dict with key ``'method'`` set
412+
to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``} and other
413+
key-value pairs are forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``,
414+
``bz2.BZ2File``, or ``zstandard.ZstdCompressor``, respectively. As an
415+
example, the following could be passed for faster compression and to create
416+
a reproducible gzip archive: ``compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}``.
417+
"""
418+
419+
_shared_docs[
420+
"decompression_options"
421+
] = """compression : str or dict, default 'infer'
422+
For on-the-fly decompression of on-disk data. If 'infer' and '%s' is
423+
path-like, then detect compression from the following extensions: '.gz',
424+
'.bz2', '.zip', '.xz', or '.zst' (otherwise no compression). If using
425+
'zip', the ZIP file must contain only one data file to be read in. Set to
426+
``None`` for no decompression. Can also be a dict with key ``'method'`` set
427+
to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``} and other
428+
key-value pairs are forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``,
429+
``bz2.BZ2File``, or ``zstandard.ZstdDecompressor``, respectively. As an
430+
example, the following could be passed for Zstandard decompression using a
431+
custom compression dictionary: ``compression={'method': 'zstd', 'dict_data': my_compression_dict}``.
432+
"""
433+
405434
_shared_docs[
406435
"replace"
407436
] = """
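Note on the `%s` placeholder: the `frame.py` hunks fill the shared template with `% "path"` for `to_stata` and `% "path_or_buffer"` for `to_xml`. The sketch below shows that substitution on a shortened stand-in template. `COMPRESSION_TEMPLATE` and `render_compression_doc` are hypothetical names, and the text is not the full pandas docstring.

```python
# Shortened stand-in for the shared docstring template; not the full pandas text.
COMPRESSION_TEMPLATE = """compression : str or dict, default 'infer'
    For on-the-fly compression of the output data. If 'infer' and '%s' is
    path-like, then detect compression from the file extension."""


def render_compression_doc(argument_name: str) -> str:
    """Substitute the name of the path argument into the template."""
    return COMPRESSION_TEMPLATE % argument_name


to_stata_doc = render_compression_doc("path")
to_xml_doc = render_compression_doc("path_or_buffer")
```

Keeping one template and substituting the argument name is what lets each writer method share the same parameter description while still naming its own path argument.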
