DOC: Add documentation on parquet categorical data handling #58245

abeltavares · 2024-04-13T00:24:12Z

closes BUG: Parquet size grows exponential for categorical data #55776
All [code checks passed]

abeltavares · 2024-04-13T08:33:05Z

This PR is ready for review. Thanks.

mroeschke · 2024-04-15T16:55:56Z

/preview

github-actions · 2024-04-15T16:57:05Z

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/58245/

mroeschke · 2024-04-15T17:00:25Z

pandas/core/frame.py

+          categories, not just those present in the data. This behavior
+          is expected and consistent with pandas' handling of categorical data.
+        * To manage file size and ensure a more predictable roundtrip process,
+          consider using `remove_unused_categories` on the DataFrame before saving.


Suggested change

- consider using `remove_unused_categories` on the DataFrame before saving.
+ consider using :meth:`Categorical.remove_unused_categories` on the DataFrame before saving.
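
For readers of this thread, here is a minimal sketch of what the docstring advice could look like in practice. It is an illustration only and not part of the PR diff. The helper name `drop_unused_categories` and the sample data are hypothetical. Because `remove_unused_categories` is reached through the `.cat` accessor of a categorical Series, the sketch applies it column by column rather than to the DataFrame as a whole.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose categorical columns keep only observed categories.

    Hypothetical helper for illustration; it is not a pandas API.
    """
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].cat.remove_unused_categories()
    return out


# A categorical column that declares 1,000 categories but uses only two.
all_categories = [f"cat_{i}" for i in range(1000)]
df = pd.DataFrame(
    {"label": pd.Categorical(["cat_1", "cat_2", "cat_1"], categories=all_categories)}
)

trimmed = drop_unused_categories(df)

print(len(df["label"].cat.categories))       # 1000
print(len(trimmed["label"].cat.categories))  # 2

# Writing needs a Parquet engine such as pyarrow; uncomment to try it.
# trimmed.to_parquet("trimmed.parquet")
```

The values in the column are unchanged; only the declared category list shrinks, which is the part the docstring says is carried into the Parquet file.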

mroeschke · 2024-04-15T17:00:52Z

pandas/core/frame.py

+          is expected and consistent with pandas' handling of categorical data.
+        * To manage file size and ensure a more predictable roundtrip process,


Suggested change

- is expected and consistent with pandas' handling of categorical data.
- * To manage file size and ensure a more predictable roundtrip process,
+ is expected and consistent with pandas' handling of categorical data.
+ To manage file size and ensure a more predictable roundtrip process,

mroeschke · 2024-04-19T00:02:00Z

Thanks @abeltavares

abeltavares force-pushed the BUG/Parquet-size-grows-exponential-for-categorical-data branch from 8a5fe98 to c74b597 Compare April 13, 2024 00:24

mroeschke reviewed Apr 15, 2024

mroeschke added the "IO CSV" (read_csv, to_csv), "IO Parquet" (parquet, feather), and "Docs" labels and removed the "IO CSV" (read_csv, to_csv) label Apr 15, 2024

DOC: Add documentation on parquet categorical data handling

d5020e1

abeltavares force-pushed the BUG/Parquet-size-grows-exponential-for-categorical-data branch from c74b597 to d5020e1 Compare April 18, 2024 19:45

abeltavares requested a review from mroeschke April 18, 2024 20:35

mroeschke approved these changes Apr 19, 2024

mroeschke added this to the 3.0 milestone Apr 19, 2024

mroeschke merged commit 2fbabb1 into pandas-dev:main Apr 19, 2024
46 checks passed

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024

DOC: Add documentation on parquet categorical data handling (pandas-d…

5f00812

…ev#58245) Co-authored-by: Abel Tavares <[email protected]>
