DOC: Update index parameter in pandas to_parquet #28217

galuhsahid · 2019-08-29T12:52:00Z

closes #xxxx (reference: Update index parameter in pandas to_parquet python-sprints/pandas-mentoring#156)
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

The documentation (https://dev.pandas.io/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet) says that for DataFrame.to_parquet when the value for the index parameter is None, the behavior will depend on the engine. However if we try this both engines, pyarrow and fastparquet, will keep the index. This PR proposes to set the default value of the index parameter as index=True instead of index=None since both engines keep the index anyway. Would appreciate any discussion/feedback :)

Sample code to demonstrate how both engines treat the index:

import pandas as pd 
  
data = {'name':['a', 'b', 'c', 'd']} 
df = pd.DataFrame(data, index =['rank_1', 'rank_2', 'rank_3', 'rank_4'])

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, rank_1 to rank_4
Data columns (total 1 columns):
name    4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes

# pyarrow
df.to_parquet("test_pyarrow.parquet", engine="pyarrow")
df_read_pyarrow = pd.read_parquet("test_pyarrow.parquet", engine="pyarrow")
df_read_pyarrow.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, rank_1 to rank_4
Data columns (total 1 columns):
name    4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes

# fastparquet
df.to_parquet("test_fastparquet.parquet", engine="fastparquet")
df_read_fastparquet = pd.read_parquet("test_fastparquet.parquet", engine="fastparquet")
df_read_fastparquet.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, rank_1 to rank_4
Data columns (total 1 columns):
name    4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes

datapythonista · 2019-08-29T13:57:32Z

Thanks for working on this @galuhsahid, looks great. There are couple of tests that need to be updated (see the CI). Let me know if you need help with it.

@wesm does this change make sense? I think by default we're saving the index with parquet with both engines. But we've got this index=None that the documentation doesn't clarify what it does, and personally I think it doesn't add value, but make things confusing (may be I'm missing something).

galuhsahid · 2019-08-29T17:06:24Z

@datapythonista Okay so because of the failing tests I tried to_parquet without defining an index, and I observed a different behavior when I used the fastparquet engine for index=True and index=None. If I use index=True it will return Int64Index, and if I use index=None it will return RangeIndex.

I'm not really sure if this difference has any impact to our proposal, should we take this into consideration? Otherwise I can update the test to expect Int64Index instead of RangeIndex.

import pandas as pd

data = {'name':['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
df.to_parquet("test_fastparquet.parquet", engine="fastparquet")
df_read_fastparquet = pd.read_parquet("test_fastparquet.parquet", engine="fastparquet")
df_read_fastparquet.index

Output:

RangeIndex(start=0, stop=4, step=1)

import pandas as pd

data = {'name':['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
df.to_parquet("test_fastparquet.parquet", engine="fastparquet")
df_read_fastparquet = pd.read_parquet("test_fastparquet.parquet", engine="fastparquet", index=True)
df_read_fastparquet.index

Output:

Int64Index([0, 1, 2, 3], dtype='int64', name='index')

datapythonista · 2019-08-31T00:06:26Z

Thanks @galuhsahid for doing the research. I see that there is a difference between index=None and index=True with fastparquet, but only with the default RangeIndex.

With a default index:

>>> import pandas
>>> pandas.DataFrame({'foo': [1, 2]}).to_parquet('foo.parquet', engine='fastparquet', index=None)
>>> print(pandas.read_parquet('foo.parquet').index)
RangeIndex(start=0, stop=2, step=1)
>>> pandas.DataFrame({'foo': [1, 2]}).to_parquet('foo.parquet', engine='fastparquet', index=True)
>>> print(pandas.read_parquet('foo.parquet').index)
Int64Index([0, 1], dtype='int64', name='index')

With a set index:

>>> import pandas
>>> pandas.DataFrame({'foo': [1, 2]}, index=['a', 'b']).to_parquet('foo.parquet', engine='fastparquet', index=None)
>>> print(pandas.read_parquet('foo.parquet').index)
Index(['a', 'b'], dtype='object', name='index')
>>> pandas.DataFrame({'foo': [1, 2]}, index=['a', 'b']).to_parquet('foo.parquet', engine='fastparquet', index=True)
>>> print(pandas.read_parquet('foo.parquet').index)
Index(['a', 'b'], dtype='object', name='index')

This looks like a bug to me, I don't see why we would like that behavior. @martindurant any thoughts on this. Do you think it makes sense to just have index=True and index=False and avoid this difference when index=None?

wesm · 2019-08-31T18:54:20Z

Can @jorisvandenbossche have a closer look at this before merging? I'm concerned this may cause a performance regression.

We made changes in pyarrow to use preserve_index=True force serialization of the index column as data in the Parquet file, even for RangeIndex. If preserve_index=None then metadata-only RangeIndex is stored.

jorisvandenbossche · 2019-09-11T08:35:39Z

Yes, @wesm is right. In the latest pyarrow release, preserve_index=None (the default) or preserve_index=True mean different things (the main difference is for RangeIndex being written as a column or just as metadata). So if pandas would change the default from None to True, that means it would override the default of pyarrow (leading again to writing the RangeIndex as a full column).

datapythonista · 2019-09-11T09:12:42Z

Thanks for the info @jorisvandenbossche, should we just clarify in the documentation what None means then?

jorisvandenbossche · 2019-09-11T09:30:52Z

Yes, that is a good idea. From their docs, it seems that fastparquet uses the same logic of None vs True/False.

datapythonista · 2019-09-11T09:32:31Z

@galuhsahid do you want to update this PR with what discussed above? Restore the default parameter, and explain in the docstring what None means.

galuhsahid · 2019-09-11T23:43:01Z

@datapythonista Sure thanks for notifying! Will push an update later today.

pep8speaks · 2019-09-13T08:51:00Z

Hello @galuhsahid! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-15 15:33:29 UTC

jorisvandenbossche

Thanks for the the update!

jorisvandenbossche · 2019-09-13T09:17:09Z

pandas/core/frame.py

@@ -2148,10 +2148,10 @@ def to_parquet(
            'pyarrow' is unavailable.
        compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'
            Name of the compression to use. Use ``None`` for no compression.
-        index : bool, default None
+        index : bool, default True


The default is still None, so need to change this back I think

jorisvandenbossche · 2019-09-13T09:17:50Z

pandas/core/frame.py

-            If ``False``, they will not be written to the file. If ``None``,
-            the behavior depends on the chosen engine.
+            If ``False``, they will not be written to the file.
+            If ``None``, RangeIndex will be stored as metadata-only.


Maybe mention that other indexes are included as columns in the output in this case?

datapythonista · 2019-09-13T09:20:44Z

pandas/core/frame.py

-            If ``False``, they will not be written to the file. If ``None``,
-            the behavior depends on the chosen engine.
+            If ``False``, they will not be written to the file.
+            If ``None``, RangeIndex will be stored as metadata-only.


Thanks @galuhsahid for the update.

Can you change the default back to None in the docstrings too?

And for the None description, looks good, but I'm wondering if it'd be clearer to explain that None is like True and the index is stored, but that a RangeIndex (the default in pandas) is not stored as the values like True, but as a range in the metadata, so it doesn't require so much space, and it's faster.

The same you're saying, but a bit more extended, so people without much knowledge about parquet or pandas indices can still understand whether it's a good choice for them.

Btw, minor thing. Every time you push, people who participated in the conversation of the PR receives a notification. I think it's a good practice to commit as many times as you think it's useful, but when it doesn't make a difference, it may be good to just push when your local changes are ready for another review. Not a big deal if pushing more often is useful to you, and you surely want to push too if you stop working on this for the day and want to back up your changes in your remote branch... But wanted to let you know, in case it's useful

Yes I think that's a good idea - I've expanded the explanation, would appreciate any input :) also thanks for the tip, I'll be more mindful when pushing my commits!

datapythonista

Thanks @galuhsahid, lgtm.

@jorisvandenbossche do you want to have a look?

jorisvandenbossche · 2019-09-16T11:50:35Z

Thanks a lot @galuhsahid !

CLN: Update index parameter in pandas to_parquet

991b8cb

datapythonista added IO Parquet parquet, feather Clean labels Aug 29, 2019

galuhsahid added 2 commits September 13, 2019 15:39

Restore default parameter

1e7c8a8

DOC: Update definition of index=None in pandas to_parquet

50cb5f0

Remove trailing whitespace

66f119e

jorisvandenbossche reviewed Sep 13, 2019

View reviewed changes

datapythonista reviewed Sep 13, 2019

View reviewed changes

DOC: Update default index in docstrings and expand explanation

e0e9fbf

datapythonista approved these changes Sep 16, 2019

View reviewed changes

jorisvandenbossche approved these changes Sep 16, 2019

View reviewed changes

jorisvandenbossche changed the title ~~CLN: Update index parameter in pandas to_parquet~~ DOC: Update index parameter in pandas to_parquet Sep 16, 2019

jorisvandenbossche merged commit 9ef67b1 into pandas-dev:master Sep 16, 2019

galuhsahid deleted the parquet-index branch September 16, 2019 12:45

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

DOC: Update index parameter in pandas to_parquet (pandas-dev#28217)

0a636d4

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

DOC: Update index parameter in pandas to_parquet (pandas-dev#28217)

91a85f0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC: Update index parameter in pandas to_parquet #28217

DOC: Update index parameter in pandas to_parquet #28217

galuhsahid commented Aug 29, 2019

datapythonista commented Aug 29, 2019

galuhsahid commented Aug 29, 2019 •

edited

Loading

datapythonista commented Aug 31, 2019

wesm commented Aug 31, 2019

jorisvandenbossche commented Sep 11, 2019

datapythonista commented Sep 11, 2019

jorisvandenbossche commented Sep 11, 2019

datapythonista commented Sep 11, 2019

galuhsahid commented Sep 11, 2019

pep8speaks commented Sep 13, 2019 •

edited

Loading

jorisvandenbossche left a comment

jorisvandenbossche Sep 13, 2019

jorisvandenbossche Sep 13, 2019

datapythonista Sep 13, 2019

galuhsahid Sep 15, 2019

datapythonista left a comment

jorisvandenbossche commented Sep 16, 2019

DOC: Update index parameter in pandas to_parquet #28217

DOC: Update index parameter in pandas to_parquet #28217

Conversation

galuhsahid commented Aug 29, 2019

datapythonista commented Aug 29, 2019

galuhsahid commented Aug 29, 2019 • edited Loading

datapythonista commented Aug 31, 2019

wesm commented Aug 31, 2019

jorisvandenbossche commented Sep 11, 2019

datapythonista commented Sep 11, 2019

jorisvandenbossche commented Sep 11, 2019

datapythonista commented Sep 11, 2019

galuhsahid commented Sep 11, 2019

pep8speaks commented Sep 13, 2019 • edited Loading

Comment last updated at 2019-09-15 15:33:29 UTC

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jorisvandenbossche Sep 13, 2019

Choose a reason for hiding this comment

jorisvandenbossche Sep 13, 2019

Choose a reason for hiding this comment

datapythonista Sep 13, 2019

Choose a reason for hiding this comment

galuhsahid Sep 15, 2019

Choose a reason for hiding this comment

datapythonista left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Sep 16, 2019

galuhsahid commented Aug 29, 2019 •

edited

Loading

pep8speaks commented Sep 13, 2019 •

edited

Loading