-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC: Update index parameter in pandas to_parquet #28217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Update index parameter in pandas to_parquet #28217
Conversation
Thanks for working on this @galuhsahid, looks great. There are couple of tests that need to be updated (see the CI). Let me know if you need help with it. @wesm does this change make sense? I think by default we're saving the index with parquet with both engines. But we've got this |
@datapythonista Okay so because of the failing tests I tried I'm not really sure if this difference has any impact to our proposal, should we take this into consideration? Otherwise I can update the test to expect
Output:
Output:
|
Thanks @galuhsahid for doing the research. I see that there is a difference between With a default index: >>> import pandas
>>> pandas.DataFrame({'foo': [1, 2]}).to_parquet('foo.parquet', engine='fastparquet', index=None)
>>> print(pandas.read_parquet('foo.parquet').index)
RangeIndex(start=0, stop=2, step=1)
>>> pandas.DataFrame({'foo': [1, 2]}).to_parquet('foo.parquet', engine='fastparquet', index=True)
>>> print(pandas.read_parquet('foo.parquet').index)
Int64Index([0, 1], dtype='int64', name='index') With a set index: >>> import pandas
>>> pandas.DataFrame({'foo': [1, 2]}, index=['a', 'b']).to_parquet('foo.parquet', engine='fastparquet', index=None)
>>> print(pandas.read_parquet('foo.parquet').index)
Index(['a', 'b'], dtype='object', name='index')
>>> pandas.DataFrame({'foo': [1, 2]}, index=['a', 'b']).to_parquet('foo.parquet', engine='fastparquet', index=True)
>>> print(pandas.read_parquet('foo.parquet').index)
Index(['a', 'b'], dtype='object', name='index') This looks like a bug to me, I don't see why we would like that behavior. @martindurant any thoughts on this. Do you think it makes sense to just have |
Can @jorisvandenbossche have a closer look at this before merging? I'm concerned this may cause a performance regression. We made changes in pyarrow to use |
Yes, @wesm is right. In the latest pyarrow release, |
Thanks for the info @jorisvandenbossche, should we just clarify in the documentation what |
Yes, that is a good idea. From their docs, it seems that fastparquet uses the same logic of None vs True/False. |
@galuhsahid do you want to update this PR with what discussed above? Restore the default parameter, and explain in the docstring what |
@datapythonista Sure thanks for notifying! Will push an update later today. |
Hello @galuhsahid! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-09-15 15:33:29 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the the update!
pandas/core/frame.py
Outdated
@@ -2148,10 +2148,10 @@ def to_parquet( | |||
'pyarrow' is unavailable. | |||
compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy' | |||
Name of the compression to use. Use ``None`` for no compression. | |||
index : bool, default None | |||
index : bool, default True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default is still None, so need to change this back I think
pandas/core/frame.py
Outdated
If ``False``, they will not be written to the file. If ``None``, | ||
the behavior depends on the chosen engine. | ||
If ``False``, they will not be written to the file. | ||
If ``None``, RangeIndex will be stored as metadata-only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mention that other indexes are included as columns in the output in this case?
pandas/core/frame.py
Outdated
If ``False``, they will not be written to the file. If ``None``, | ||
the behavior depends on the chosen engine. | ||
If ``False``, they will not be written to the file. | ||
If ``None``, RangeIndex will be stored as metadata-only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @galuhsahid for the update.
Can you change the default back to None
in the docstrings too?
And for the None
description, looks good, but I'm wondering if it'd be clearer to explain that None
is like True
and the index is stored, but that a RangeIndex (the default in pandas) is not stored as the values like True
, but as a range in the metadata, so it doesn't require so much space, and it's faster.
The same you're saying, but a bit more extended, so people without much knowledge about parquet or pandas indices can still understand whether it's a good choice for them.
Btw, minor thing. Every time you push, people who participated in the conversation of the PR receives a notification. I think it's a good practice to commit as many times as you think it's useful, but when it doesn't make a difference, it may be good to just push when your local changes are ready for another review. Not a big deal if pushing more often is useful to you, and you surely want to push too if you stop working on this for the day and want to back up your changes in your remote branch... But wanted to let you know, in case it's useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think that's a good idea - I've expanded the explanation, would appreciate any input :) also thanks for the tip, I'll be more mindful when pushing my commits!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @galuhsahid, lgtm.
@jorisvandenbossche do you want to have a look?
Thanks a lot @galuhsahid ! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The documentation (https://dev.pandas.io/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet) says that for
DataFrame.to_parquet
when the value for the index parameter isNone
, the behavior will depend on the engine. However if we try this both engines,pyarrow
andfastparquet
, will keep the index. This PR proposes to set the default value of the index parameter asindex=True
instead ofindex=None
since both engines keep the index anyway. Would appreciate any discussion/feedback :)Sample code to demonstrate how both engines treat the index:
Output:
Output:
Output: