Skip to content

DOC: Update index parameter in pandas to_parquet #28217

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Sep 16, 2019

Conversation

galuhsahid
Copy link
Contributor

The documentation (https://dev.pandas.io/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet) says that for DataFrame.to_parquet when the value for the index parameter is None, the behavior will depend on the engine. However if we try this both engines, pyarrow and fastparquet, will keep the index. This PR proposes to set the default value of the index parameter as index=True instead of index=None since both engines keep the index anyway. Would appreciate any discussion/feedback :)

Sample code to demonstrate how both engines treat the index:

import pandas as pd 
  
data = {'name':['a', 'b', 'c', 'd']} 
df = pd.DataFrame(data, index =['rank_1', 'rank_2', 'rank_3', 'rank_4'])

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, rank_1 to rank_4
Data columns (total 1 columns):
name    4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes
# pyarrow
df.to_parquet("test_pyarrow.parquet", engine="pyarrow")
df_read_pyarrow = pd.read_parquet("test_pyarrow.parquet", engine="pyarrow")
df_read_pyarrow.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, rank_1 to rank_4
Data columns (total 1 columns):
name    4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes
# fastparquet
df.to_parquet("test_fastparquet.parquet", engine="fastparquet")
df_read_fastparquet = pd.read_parquet("test_fastparquet.parquet", engine="fastparquet")
df_read_fastparquet.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, rank_1 to rank_4
Data columns (total 1 columns):
name    4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes

@datapythonista datapythonista added IO Parquet parquet, feather Clean labels Aug 29, 2019
@datapythonista
Copy link
Member

Thanks for working on this @galuhsahid, looks great. There are couple of tests that need to be updated (see the CI). Let me know if you need help with it.

@wesm does this change make sense? I think by default we're saving the index with parquet with both engines. But we've got this index=None that the documentation doesn't clarify what it does, and personally I think it doesn't add value, but make things confusing (may be I'm missing something).

@galuhsahid
Copy link
Contributor Author

galuhsahid commented Aug 29, 2019

@datapythonista Okay so because of the failing tests I tried to_parquet without defining an index, and I observed a different behavior when I used the fastparquet engine for index=True and index=None. If I use index=True it will return Int64Index, and if I use index=None it will return RangeIndex.

I'm not really sure if this difference has any impact to our proposal, should we take this into consideration? Otherwise I can update the test to expect Int64Index instead of RangeIndex.

import pandas as pd

data = {'name':['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
df.to_parquet("test_fastparquet.parquet", engine="fastparquet")
df_read_fastparquet = pd.read_parquet("test_fastparquet.parquet", engine="fastparquet")
df_read_fastparquet.index

Output:

RangeIndex(start=0, stop=4, step=1)
import pandas as pd

data = {'name':['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
df.to_parquet("test_fastparquet.parquet", engine="fastparquet")
df_read_fastparquet = pd.read_parquet("test_fastparquet.parquet", engine="fastparquet", index=True)
df_read_fastparquet.index

Output:

Int64Index([0, 1, 2, 3], dtype='int64', name='index')

@datapythonista
Copy link
Member

Thanks @galuhsahid for doing the research. I see that there is a difference between index=None and index=True with fastparquet, but only with the default RangeIndex.

With a default index:

>>> import pandas
>>> pandas.DataFrame({'foo': [1, 2]}).to_parquet('foo.parquet', engine='fastparquet', index=None)
>>> print(pandas.read_parquet('foo.parquet').index)
RangeIndex(start=0, stop=2, step=1)
>>> pandas.DataFrame({'foo': [1, 2]}).to_parquet('foo.parquet', engine='fastparquet', index=True)
>>> print(pandas.read_parquet('foo.parquet').index)
Int64Index([0, 1], dtype='int64', name='index')

With a set index:

>>> import pandas
>>> pandas.DataFrame({'foo': [1, 2]}, index=['a', 'b']).to_parquet('foo.parquet', engine='fastparquet', index=None)
>>> print(pandas.read_parquet('foo.parquet').index)
Index(['a', 'b'], dtype='object', name='index')
>>> pandas.DataFrame({'foo': [1, 2]}, index=['a', 'b']).to_parquet('foo.parquet', engine='fastparquet', index=True)
>>> print(pandas.read_parquet('foo.parquet').index)
Index(['a', 'b'], dtype='object', name='index')

This looks like a bug to me, I don't see why we would like that behavior. @martindurant any thoughts on this. Do you think it makes sense to just have index=True and index=False and avoid this difference when index=None?

@wesm
Copy link
Member

wesm commented Aug 31, 2019

Can @jorisvandenbossche have a closer look at this before merging? I'm concerned this may cause a performance regression.

We made changes in pyarrow to use preserve_index=True force serialization of the index column as data in the Parquet file, even for RangeIndex. If preserve_index=None then metadata-only RangeIndex is stored.

@jorisvandenbossche
Copy link
Member

Yes, @wesm is right. In the latest pyarrow release, preserve_index=None (the default) or preserve_index=True mean different things (the main difference is for RangeIndex being written as a column or just as metadata). So if pandas would change the default from None to True, that means it would override the default of pyarrow (leading again to writing the RangeIndex as a full column).

@datapythonista
Copy link
Member

Thanks for the info @jorisvandenbossche, should we just clarify in the documentation what None means then?

@jorisvandenbossche
Copy link
Member

Yes, that is a good idea. From their docs, it seems that fastparquet uses the same logic of None vs True/False.

@datapythonista
Copy link
Member

@galuhsahid do you want to update this PR with what discussed above? Restore the default parameter, and explain in the docstring what None means.

@galuhsahid
Copy link
Contributor Author

@datapythonista Sure thanks for notifying! Will push an update later today.

@pep8speaks
Copy link

pep8speaks commented Sep 13, 2019

Hello @galuhsahid! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-15 15:33:29 UTC

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the the update!

@@ -2148,10 +2148,10 @@ def to_parquet(
'pyarrow' is unavailable.
compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'
Name of the compression to use. Use ``None`` for no compression.
index : bool, default None
index : bool, default True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default is still None, so need to change this back I think

If ``False``, they will not be written to the file. If ``None``,
the behavior depends on the chosen engine.
If ``False``, they will not be written to the file.
If ``None``, RangeIndex will be stored as metadata-only.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mention that other indexes are included as columns in the output in this case?

If ``False``, they will not be written to the file. If ``None``,
the behavior depends on the chosen engine.
If ``False``, they will not be written to the file.
If ``None``, RangeIndex will be stored as metadata-only.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @galuhsahid for the update.

Can you change the default back to None in the docstrings too?

And for the None description, looks good, but I'm wondering if it'd be clearer to explain that None is like True and the index is stored, but that a RangeIndex (the default in pandas) is not stored as the values like True, but as a range in the metadata, so it doesn't require so much space, and it's faster.

The same you're saying, but a bit more extended, so people without much knowledge about parquet or pandas indices can still understand whether it's a good choice for them.

Btw, minor thing. Every time you push, people who participated in the conversation of the PR receives a notification. I think it's a good practice to commit as many times as you think it's useful, but when it doesn't make a difference, it may be good to just push when your local changes are ready for another review. Not a big deal if pushing more often is useful to you, and you surely want to push too if you stop working on this for the day and want to back up your changes in your remote branch... But wanted to let you know, in case it's useful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think that's a good idea - I've expanded the explanation, would appreciate any input :) also thanks for the tip, I'll be more mindful when pushing my commits!

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @galuhsahid, lgtm.

@jorisvandenbossche do you want to have a look?

@jorisvandenbossche jorisvandenbossche changed the title CLN: Update index parameter in pandas to_parquet DOC: Update index parameter in pandas to_parquet Sep 16, 2019
@jorisvandenbossche jorisvandenbossche merged commit 9ef67b1 into pandas-dev:master Sep 16, 2019
@jorisvandenbossche
Copy link
Member

Thanks a lot @galuhsahid !

@galuhsahid galuhsahid deleted the parquet-index branch September 16, 2019 12:45
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants