Commit 2fbabb1

abeltavares and Abel Tavares authored
DOC: Add documentation on parquet categorical data handling (#58245)
Co-authored-by: Abel Tavares <[email protected]>
1 parent e13d808 commit 2fbabb1

1 file changed: +10 -3 lines changed
pandas/core/frame.py

@@ -2877,9 +2877,16 @@ def to_parquet(
 
 Notes
 -----
-This function requires either the `fastparquet
-<https://pypi.org/project/fastparquet>`_ or `pyarrow
-<https://arrow.apache.org/docs/python/>`_ library.
+* This function requires either the `fastparquet
+  <https://pypi.org/project/fastparquet>`_ or `pyarrow
+  <https://arrow.apache.org/docs/python/>`_ library.
+* When saving a DataFrame with categorical columns to parquet,
+  the file size may increase due to the inclusion of all possible
+  categories, not just those present in the data. This behavior
+  is expected and consistent with pandas' handling of categorical data.
+  To manage file size and ensure a more predictable roundtrip process,
+  consider using :meth:`Categorical.remove_unused_categories` on the
+  DataFrame before saving.
 
 Examples
 --------
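
The workaround described in the new note could look roughly like the sketch below. It is an illustration, not part of this commit: `drop_unused_categories` is a hypothetical helper name, and the file name in the final comment is made up. Note that `remove_unused_categories` is applied per categorical column through the `.cat` accessor rather than on the DataFrame as a whole.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose categorical columns keep only observed categories.

    Hypothetical helper for illustration; it is not a pandas API.
    """
    out = df.copy()
    for name in out.select_dtypes(include="category").columns:
        out[name] = out[name].cat.remove_unused_categories()
    return out


# A categorical column that declares four categories but uses only two.
df = pd.DataFrame(
    {
        "color": pd.Categorical(
            ["red", "blue", "red"],
            categories=["red", "green", "blue", "yellow"],
        )
    }
)
trimmed = drop_unused_categories(df)

print(list(df["color"].cat.categories))       # all declared categories
print(list(trimmed["color"].cat.categories))  # only categories present in the data

# Writing the result needs pyarrow or fastparquet installed:
# trimmed.to_parquet("colors.parquet")
```

The helper returns a copy so the caller's DataFrame keeps its full category list; whether that matters depends on how the categories are used after saving.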