DOC: Add documentation on parquet categorical data handling #58245
abeltavares commented Apr 13, 2024 (edited)
- closes BUG: Parquet size grows exponential for categorical data #55776
- All [code checks passed]
8a5fe98 to c74b597
This PR is ready for review. Thanks.
/preview
Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/58245/
pandas/core/frame.py (Outdated)
categories, not just those present in the data. This behavior
is expected and consistent with pandas' handling of categorical data.
* To manage file size and ensure a more predictable roundtrip process,
consider using `remove_unused_categories` on the DataFrame before saving.
Suggested change:
- consider using `remove_unused_categories` on the DataFrame before saving.
+ consider using :meth:`Categorical.remove_unused_categories` on the DataFrame before saving.
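For reference, a minimal sketch of what the suggested docstring note amounts to in practice. `DataFrame` itself has no `remove_unused_categories` method, so this sketch applies `Series.cat.remove_unused_categories` to each categorical column. The helper name `drop_unused_categories` and the sample data are hypothetical, for illustration only, and are not part of pandas or of this PR.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with unused categories removed from each categorical column.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].cat.remove_unused_categories()
    return out


# A categorical column that declares 1000 categories but uses only two of them.
all_categories = [f"cat_{i}" for i in range(1000)]
df = pd.DataFrame(
    {"label": pd.Categorical(["cat_1", "cat_2", "cat_1"], categories=all_categories)}
)

trimmed = drop_unused_categories(df)
print(len(df["label"].cat.categories))       # 1000
print(len(trimmed["label"].cat.categories))  # 2

# Writing is left commented out because it needs a parquet engine installed:
# trimmed.to_parquet("trimmed.parquet")
```

Calling the helper right before `to_parquet` keeps only the categories that actually occur in the data, which is the file-size concern raised in #55776. The values in the column are unchanged; only the category list shrinks.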
pandas/core/frame.py (Outdated)
is expected and consistent with pandas' handling of categorical data.
* To manage file size and ensure a more predictable roundtrip process,
Suggested change:
- is expected and consistent with pandas' handling of categorical data.
- * To manage file size and ensure a more predictable roundtrip process,
+ is expected and consistent with pandas' handling of categorical data.
+ To manage file size and ensure a more predictable roundtrip process,
c74b597 to d5020e1
Thanks @abeltavares
…ev#58245) Co-authored-by: Abel Tavares <[email protected]>