DOC: Add documentation on parquet categorical data handling (#58245)

abeltavares · Abel Tavares · web-flow · commit 2fbabb11db42 · 2024-04-18T17:01:54.000-07:00
Co-authored-by: Abel Tavares <abel.tavares@ctw.bmwgroup.com>
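
The note this commit adds to `to_parquet` recommends dropping unused categories before saving. A minimal sketch of that step follows, assuming a hypothetical one-column frame named `df`. The column name, its values, and the output file name are illustrative and not part of the change. The loop uses the `Series.cat` accessor, so the trimming is applied to each categorical column in turn.

```python
import pandas as pd

# Hypothetical data: three categories are declared, but only "red" occurs.
df = pd.DataFrame(
    {"color": pd.Categorical(["red", "red"], categories=["red", "green", "blue"])}
)

# Work on a copy and trim every categorical column to the categories in use.
trimmed = df.copy()
for name in trimmed.select_dtypes(include="category").columns:
    trimmed[name] = trimmed[name].cat.remove_unused_categories()

print(list(df["color"].cat.categories))       # ['red', 'green', 'blue']
print(list(trimmed["color"].cat.categories))  # ['red']

# Writing is left commented out because it needs pyarrow or fastparquet;
# the file name is an assumption for illustration.
# trimmed.to_parquet("colors.parquet")
```

The values in the column are unchanged by the trimming; only the declared category list shrinks, which is what would otherwise be carried into the file.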
diff --git a/pandas/core/frame.py b/pandas/core/frame.py
@@ -2877,9 +2877,16 @@ def to_parquet(
 
         Notes
         -----
-        This function requires either the `fastparquet
-        <https://pypi.org/project/fastparquet>`_ or `pyarrow
-        <https://arrow.apache.org/docs/python/>`_ library.
+        * This function requires either the `fastparquet
+          <https://pypi.org/project/fastparquet>`_ or `pyarrow
+          <https://arrow.apache.org/docs/python/>`_ library.
+        * When saving a DataFrame with categorical columns to parquet,
+          the file size may increase due to the inclusion of all possible
+          categories, not just those present in the data. This behavior
+          is expected and consistent with pandas' handling of categorical data.
+          To manage file size and ensure a more predictable roundtrip process,
+          consider calling :meth:`Categorical.remove_unused_categories` on
+          each categorical column before saving.
 
         Examples
         --------