Commit 8bd477f

Add Zstandard compression support
1 parent d64df84 commit 8bd477f

44 files changed, +379 -311 lines changed

MANIFEST.in

+1
Original file line number | Diff line number | Diff line change
@@ -36,6 +36,7 @@ global-exclude *.xpt
3636
global-exclude *.cpt
3737
global-exclude *.xz
3838
global-exclude *.zip
39+
global-exclude *.zst
3940
global-exclude *~
4041
global-exclude .DS_Store
4142
global-exclude .git*

ci/deps/actions-38-slow.yaml

+1
@@ -34,3 +34,4 @@ dependencies:
3434
- xlsxwriter
3535
- xlwt
3636
- numba
37+
- zstandard

ci/deps/actions-39-slow.yaml

+1
@@ -37,6 +37,7 @@ dependencies:
3737
- xlsxwriter
3838
- xlwt
3939
- pyreadstat
40+
- zstandard
4041
- pip
4142
- pip:
4243
- pyxlsb

ci/deps/actions-39.yaml

+1
@@ -36,6 +36,7 @@ dependencies:
3636
- xlsxwriter
3737
- xlwt
3838
- pyreadstat
39+
- zstandard
3940
- pip
4041
- pip:
4142
- pyxlsb

ci/deps/azure-macos-38.yaml

+1
@@ -30,6 +30,7 @@ dependencies:
3030
- xlrd
3131
- xlsxwriter
3232
- xlwt
33+
- zstandard
3334
- pip
3435
- pip:
3536
- cython>=0.29.24

ci/deps/azure-windows-38.yaml

+1
@@ -32,3 +32,4 @@ dependencies:
3232
- xlrd
3333
- xlsxwriter
3434
- xlwt
35+
- zstandard

ci/deps/azure-windows-39.yaml

+1
@@ -36,6 +36,7 @@ dependencies:
3636
- xlsxwriter
3737
- xlwt
3838
- pyreadstat
39+
- zstandard
3940
- pip
4041
- pip:
4142
- pyxlsb

ci/deps/circle-38-arm64.yaml

+1
@@ -15,6 +15,7 @@ dependencies:
1515
- numpy
1616
- python-dateutil
1717
- pytz
18+
- zstandard
1819
- pip
1920
- flask
2021
- pip:

doc/source/getting_started/install.rst

+10
@@ -402,3 +402,13 @@ qtpy Clipboard I/O
402402
xclip                                        Clipboard I/O on linux
403403
xsel                                         Clipboard I/O on linux
404404
========================= ================== =============================================================
405+
406+
407+
Compression
408+
^^^^^^^^^^^
409+
410+
========================= ================== =============================================================
411+
Dependency                Minimum Version    Notes
412+
========================= ================== =============================================================
413+
Zstandard                                    Zstandard compression
414+
========================= ================== =============================================================

doc/source/user_guide/io.rst

+9-9
@@ -316,14 +316,14 @@ chunksize : int, default ``None``
316316
Quoting, compression, and file format
317317
+++++++++++++++++++++++++++++++++++++
318318

319-
compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``None``, ``dict``}, default ``'infer'``
319+
compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``'zstd'``, ``None``, ``dict``}, default ``'infer'``
320320
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
321-
bz2, zip, or xz if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2',
322-
'.zip', or '.xz', respectively, and no decompression otherwise. If using 'zip',
321+
bz2, zip, xz, or zstandard if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2',
322+
'.zip', '.xz', or '.zst', respectively, and no decompression otherwise. If using 'zip',
323323
the ZIP file must contain only one data file to be read in.
324324
Set to ``None`` for no decompression. Can also be a dict with key ``'method'``
325-
set to one of {``'zip'``, ``'gzip'``, ``'bz2'``} and other key-value pairs are
326-
forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, or ``bz2.BZ2File``.
325+
set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``} and other key-value pairs are
326+
forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, or ``zstandard.ZstdDecompressor``.
327327
As an example, the following could be passed for faster compression and to
328328
create a reproducible gzip archive:
329329
``compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}``.
@@ -4032,18 +4032,18 @@ Compressed pickle files
40324032
'''''''''''''''''''''''
40334033

40344034
:func:`read_pickle`, :meth:`DataFrame.to_pickle` and :meth:`Series.to_pickle` can read
4035-
and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz`` are supported for reading and writing.
4035+
and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz``, ``zstd`` are supported for reading and writing.
40364036
The ``zip`` file format only supports reading and must contain only one data file
40374037
to be read.
40384038

40394039
The compression type can be an explicit parameter or be inferred from the file extension.
4040-
If 'infer', then use ``gzip``, ``bz2``, ``zip``, or ``xz`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``, or
4041-
``'.xz'``, respectively.
4040+
If 'infer', then use ``gzip``, ``bz2``, ``zip``, ``xz``, or ``zstd`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``,
4041+
``'.xz'``, or ``'.zst'``, respectively.
40424042

40434043
The compression parameter can also be a ``dict`` in order to pass options to the
40444044
compression protocol. It must have a ``'method'`` key set to the name
40454045
of the compression protocol, which must be one of
4046-
{``'zip'``, ``'gzip'``, ``'bz2'``}. All other key-value pairs are passed to
4046+
{``'zip'``, ``'gzip'``, ``'bz2'``, ``'xz'``, ``'zstd'``}. All other key-value pairs are passed to
40474047
the underlying compression library.
40484048

40494049
.. ipython:: python
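
Editor's note on the io.rst change above: the extension-based inference it describes can be sketched with the standard library alone. This is a minimal illustration, not pandas' implementation. The names `EXTENSION_TO_METHOD` and `infer_method` are hypothetical, and matching the extension case-insensitively is an assumption.

```python
from __future__ import annotations

import os

# Hypothetical mapping for illustration; the extensions and method names are
# the ones listed in the documentation text above.
EXTENSION_TO_METHOD = {
    ".gz": "gzip",
    ".bz2": "bz2",
    ".zip": "zip",
    ".xz": "xz",
    ".zst": "zstd",
}


def infer_method(path: str, compression: str | dict | None = "infer") -> str | None:
    """Return the compression method name, or None for no compression."""
    # A dict carries the method under the 'method' key; other keys are options.
    if isinstance(compression, dict):
        compression = compression.get("method")
    # An explicit method (or None) is returned unchanged.
    if compression != "infer":
        return compression
    # Only the final extension is compared (case-insensitively, by assumption).
    _, ext = os.path.splitext(path)
    return EXTENSION_TO_METHOD.get(ext.lower())
```

With this sketch, `infer_method("data.csv.zst")` resolves to `"zstd"`, while a path with an unlisted extension resolves to `None`, meaning no decompression.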

doc/source/whatsnew/v1.4.0.rst

+1
@@ -234,6 +234,7 @@ Other enhancements
234234
- :meth:`DataFrame.take` now raises a ``TypeError`` when passed a scalar for the indexer (:issue:`42875`)
235235
- :meth:`is_list_like` now identifies duck-arrays as list-like unless ``.ndim == 0`` (:issue:`35131`)
236236
- :class:`ExtensionDtype` and :class:`ExtensionArray` are now (de)serialized when exporting a :class:`DataFrame` with :meth:`DataFrame.to_json` using ``orient='table'`` (:issue:`20612`, :issue:`44705`).
237+
- Add support for `Zstandard <http://facebook.github.io/zstd/>`_ compression to :meth:`DataFrame.to_pickle`/:meth:`read_pickle` and friends (:issue:`43925`)
237238
-
238239

239240

pandas/_testing/_io.py

+4-1
@@ -15,6 +15,7 @@
1515
ReadPickleBuffer,
1616
)
1717
from pandas.compat import get_lzma_file
18+
from pandas.compat._optional import import_optional_dependency
1819

1920
import pandas as pd
2021
from pandas._testing._random import rands
@@ -364,7 +365,7 @@ def write_to_compressed(compression, path, data, dest="test"):
364365
365366
Parameters
366367
----------
367-
compression : {'gzip', 'bz2', 'zip', 'xz'}
368+
compression : {'gzip', 'bz2', 'zip', 'xz', 'zstd'}
368369
The compression type to use.
369370
path : str
370371
The file path to write the data.
@@ -391,6 +392,8 @@ def write_to_compressed(compression, path, data, dest="test"):
391392
compress_method = gzip.GzipFile
392393
elif compression == "bz2":
393394
compress_method = bz2.BZ2File
395+
elif compression == "zstd":
396+
compress_method = import_optional_dependency("zstandard").open
394397
elif compression == "xz":
395398
compress_method = get_lzma_file()
396399
else:
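
Editor's note on the hunk above: the dispatch it extends maps a compression name to a file-opening callable, and the new `zstd` branch resolves `zstandard.open` only when requested. A rough standalone sketch follows, assuming plain `lzma.LZMAFile` in place of `get_lzma_file()` and a bare `import` in place of `import_optional_dependency`. The function names `get_compress_method` and `write_compressed` are hypothetical.

```python
import bz2
import gzip
import lzma


def get_compress_method(compression: str):
    """Return a callable that opens a compressed file for the given method."""
    if compression == "gzip":
        return gzip.GzipFile
    elif compression == "bz2":
        return bz2.BZ2File
    elif compression == "xz":
        return lzma.LZMAFile
    elif compression == "zstd":
        # Optional third-party dependency: import lazily so the other
        # methods keep working when it is not installed.
        try:
            import zstandard
        except ImportError as err:
            raise ImportError(
                "zstd compression requires the 'zstandard' package"
            ) from err
        return zstandard.open
    raise ValueError(f"Unrecognized compression type: {compression}")


def write_compressed(compression: str, path: str, data: bytes) -> None:
    """Write ``data`` to ``path`` using the selected compression method."""
    with get_compress_method(compression)(path, mode="wb") as f:
        f.write(data)
```

Importing inside the branch keeps `zstandard` optional: callers that only use gzip, bz2, or xz never trigger the import.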

pandas/_testing/contexts.py

+1-1
@@ -29,7 +29,7 @@ def decompress_file(path, compression):
2929
path : str
3030
The path where the file is read from.
3131
32-
compression : {'gzip', 'bz2', 'zip', 'xz', None}
32+
compression : {'gzip', 'bz2', 'zip', 'xz', 'zstd', None}
3333
Name of the decompression to use
3434
3535
Returns

pandas/_typing.py

+1-1
@@ -243,7 +243,7 @@ def closed(self) -> bool:
243243
# compression keywords and compression
244244
CompressionDict = Dict[str, Any]
245245
CompressionOptions = Optional[
246-
Union[Literal["infer", "gzip", "bz2", "zip", "xz"], CompressionDict]
246+
Union[Literal["infer", "gzip", "bz2", "zip", "xz", "zstd"], CompressionDict]
247247
]
248248

249249

pandas/compat/_optional.py

+1
@@ -34,6 +34,7 @@
3434
"xlwt": "1.3.0",
3535
"xlsxwriter": "1.2.2",
3636
"numba": "0.50.1",
37+
"zstandard": "0.15.2",
3738
}
3839

3940
# A mapping from import name to package name (on PyPI) for packages where

pandas/conftest.py

+19-2
@@ -267,15 +267,32 @@ def other_closed(request):
267267
return request.param
268268

269269

270-
@pytest.fixture(params=[None, "gzip", "bz2", "zip", "xz"])
270+
@pytest.fixture(
271+
params=[
272+
None,
273+
"gzip",
274+
"bz2",
275+
"zip",
276+
"xz",
277+
pytest.param("zstd", marks=td.skip_if_no("zstandard")),
278+
]
279+
)
271280
def compression(request):
272281
"""
273282
Fixture for trying common compression types in compression tests.
274283
"""
275284
return request.param
276285

277286

278-
@pytest.fixture(params=["gzip", "bz2", "zip", "xz"])
287+
@pytest.fixture(
288+
params=[
289+
"gzip",
290+
"bz2",
291+
"zip",
292+
"xz",
293+
pytest.param("zstd", marks=td.skip_if_no("zstandard")),
294+
]
295+
)
279296
def compression_only(request):
280297
"""
281298
Fixture for trying common compression types in compression tests excluding

pandas/core/frame.py

+18-24
@@ -133,7 +133,6 @@
133133
from pandas.core import (
134134
algorithms,
135135
common as com,
136-
generic,
137136
nanops,
138137
ops,
139138
)
@@ -155,10 +154,7 @@
155154
sanitize_array,
156155
sanitize_masked_array,
157156
)
158-
from pandas.core.generic import (
159-
NDFrame,
160-
_shared_docs,
161-
)
157+
from pandas.core.generic import NDFrame
162158
from pandas.core.indexers import check_key_length
163159
from pandas.core.indexes.api import (
164160
DatetimeIndex,
@@ -194,6 +190,7 @@
194190
)
195191
from pandas.core.reshape.melt import melt
196192
from pandas.core.series import Series
193+
from pandas.core.shared_docs import _shared_docs
197194
from pandas.core.sorting import (
198195
get_group_index,
199196
lexsort_indexer,
@@ -2482,7 +2479,10 @@ def _from_arrays(
24822479
)
24832480
return cls(mgr)
24842481

2485-
@doc(storage_options=generic._shared_docs["storage_options"])
2482+
@doc(
2483+
storage_options=_shared_docs["storage_options"],
2484+
compression_options=_shared_docs["compression_options"] % "path",
2485+
)
24862486
@deprecate_kwarg(old_arg_name="fname", new_arg_name="path")
24872487
def to_stata(
24882488
self,
@@ -2561,19 +2561,12 @@ def to_stata(
25612561
format. Only available if version is 117. Storing strings in the
25622562
StrL format can produce smaller dta files if strings have more than
25632563
8 characters and values are repeated.
2564-
compression : str or dict, default 'infer'
2565-
For on-the-fly compression of the output dta. If string, specifies
2566-
compression mode. If dict, value at key 'method' specifies
2567-
compression mode. Compression mode must be one of {{'infer', 'gzip',
2568-
'bz2', 'zip', 'xz', None}}. If compression mode is 'infer' and
2569-
`fname` is path-like, then detect compression from the following
2570-
extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no
2571-
compression). If dict and compression mode is one of {{'zip',
2572-
'gzip', 'bz2'}}, or inferred as one of the above, other entries
2573-
passed as additional compression options.
2564+
{compression_options}
25742565
25752566
.. versionadded:: 1.1.0
25762567
2568+
.. versionchanged:: 1.4.0 Zstandard support.
2569+
25772570
{storage_options}
25782571
25792572
.. versionadded:: 1.2.0
@@ -2734,7 +2727,7 @@ def to_markdown(
27342727
handles.handle.write(result)
27352728
return None
27362729

2737-
@doc(storage_options=generic._shared_docs["storage_options"])
2730+
@doc(storage_options=_shared_docs["storage_options"])
27382731
@deprecate_kwarg(old_arg_name="fname", new_arg_name="path")
27392732
def to_parquet(
27402733
self,
@@ -2939,7 +2932,10 @@ def to_html(
29392932
render_links=render_links,
29402933
)
29412934

2942-
@doc(storage_options=generic._shared_docs["storage_options"])
2935+
@doc(
2936+
storage_options=_shared_docs["storage_options"],
2937+
compression_options=_shared_docs["compression_options"] % "path_or_buffer",
2938+
)
29432939
def to_xml(
29442940
self,
29452941
path_or_buffer: FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None = None,
@@ -3016,12 +3012,10 @@ def to_xml(
30163012
layout of elements and attributes from original output. This
30173013
argument requires ``lxml`` to be installed. Only XSLT 1.0
30183014
scripts and not later versions is currently supported.
3019-
compression : {{'infer', 'gzip', 'bz2', 'zip', 'xz', None}}, default 'infer'
3020-
For on-the-fly decompression of on-disk data. If 'infer', then use
3021-
gzip, bz2, zip or xz if path_or_buffer is a string ending in
3022-
'.gz', '.bz2', '.zip', or 'xz', respectively, and no decompression
3023-
otherwise. If using 'zip', the ZIP file must contain only one data
3024-
file to be read in. Set to None for no decompression.
3015+
{compression_options}
3016+
3017+
.. versionchanged:: 1.4.0 Zstandard support.
3018+
30253019
{storage_options}
30263020
30273021
Returns
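
Editor's note on the frame.py changes above: the hand-written `compression` paragraphs are replaced by a shared template that is filled in two steps. First `%` substitutes the parameter name (`"path"` or `"path_or_buffer"`), then the `@doc` decorator fills the `{compression_options}` placeholder. A minimal sketch of that mechanism follows, assuming a shortened, hypothetical template text and a simplified stand-in for the `doc` decorator. The real shared docstring is not shown in this diff.

```python
# Hypothetical, shortened template; the real shared text lives in
# pandas.core.shared_docs and is not part of this diff.
SHARED_DOCS = {
    "compression_options": (
        "compression : str or dict, default 'infer'\n"
        "    If 'infer' and '%s' is path-like, detect compression from the extension."
    )
}


def doc(**params):
    """Simplified stand-in for a decorator that fills {placeholders} in a docstring."""

    def decorator(func):
        func.__doc__ = (func.__doc__ or "").format(**params)
        return func

    return decorator


# Step 1: '%' inserts the parameter name; step 2: @doc fills the placeholder.
@doc(compression_options=SHARED_DOCS["compression_options"] % "path")
def to_stata(path, compression="infer"):
    """
    Write to a file.

    Parameters
    ----------
    {compression_options}
    """
```

Because the docstring passes through `str.format`, any literal braces in it must be doubled, which is why the removed text in the diff wrote `{{'infer', 'gzip', ...}}`.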
