DOC: Add documentation on parquet categorical data handling #58245

Conversation

abeltavares
Contributor

@abeltavares abeltavares commented Apr 13, 2024

@abeltavares abeltavares force-pushed the BUG/Parquet-size-grows-exponential-for-categorical-data branch from 8a5fe98 to c74b597 Compare April 13, 2024 00:24
@abeltavares
Contributor Author

This PR is ready for review. Thanks.

@mroeschke
Member

/preview

Contributor

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/58245/

categories, not just those present in the data. This behavior
is expected and consistent with pandas' handling of categorical data.
* To manage file size and ensure a more predictable roundtrip process,
consider using `remove_unused_categories` on the DataFrame before saving.
Member

Suggested change
- consider using `remove_unused_categories` on the DataFrame before saving.
+ consider using :meth:`Categorical.remove_unused_categories` on the DataFrame before saving.
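
For readers following the thread, here is a minimal sketch of the advice in the note under review. It assumes a small made-up DataFrame (the `fruit` column and its values are hypothetical, not taken from the PR), and it relies on the fact that `remove_unused_categories` is reached through the `.cat` accessor of a categorical column rather than called on the DataFrame itself. The Parquet write is left as a comment because it needs an optional engine such as pyarrow.

```python
import pandas as pd

# Hypothetical data: a categorical column declared with four categories.
df = pd.DataFrame(
    {
        "fruit": pd.Categorical(
            ["apple", "banana", "apple"],
            categories=["apple", "banana", "cherry", "date"],
        )
    }
)

# Filtering rows does not shrink the set of declared categories.
subset = df[df["fruit"] == "apple"].copy()
categories_before = list(subset["fruit"].cat.categories)

# Drop categories that no longer occur in the column before saving.
subset["fruit"] = subset["fruit"].cat.remove_unused_categories()
categories_after = list(subset["fruit"].cat.categories)

print(categories_before)
print(categories_after)

# With a Parquet engine installed, the trimmed frame could then be written:
# subset.to_parquet("subset.parquet")
```

Per the documentation text in this PR, all declared categories are stored on write, so trimming them first is what keeps the saved file from carrying categories that never appear in the data.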

Comment on lines 2886 to 2887
is expected and consistent with pandas' handling of categorical data.
* To manage file size and ensure a more predictable roundtrip process,
Member

Suggested change
- is expected and consistent with pandas' handling of categorical data.
- * To manage file size and ensure a more predictable roundtrip process,
+ is expected and consistent with pandas' handling of categorical data.
+ To manage file size and ensure a more predictable roundtrip process,

@mroeschke mroeschke added the IO CSV (read_csv, to_csv), IO Parquet (parquet, feather), and Docs labels and removed the IO CSV (read_csv, to_csv) label Apr 15, 2024
@abeltavares abeltavares force-pushed the BUG/Parquet-size-grows-exponential-for-categorical-data branch from c74b597 to d5020e1 Compare April 18, 2024 19:45
@abeltavares abeltavares requested a review from mroeschke April 18, 2024 20:35
@mroeschke mroeschke added this to the 3.0 milestone Apr 19, 2024
@mroeschke mroeschke merged commit 2fbabb1 into pandas-dev:main Apr 19, 2024
46 checks passed
@mroeschke
Member

Thanks @abeltavares

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Labels
Docs, IO Parquet (parquet, feather)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Parquet size grows exponential for categorical data
2 participants