DOC: Add expanded index descriptors for specifying RangeIndex-as-metadata in Parquet file schema #25709

@wesm wesm commented Mar 13, 2019

Closes #25672

  • closes #xxxx
  • tests added / passed (N/A)
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff (N/A)
  • whatsnew entry (N/A)

wesm commented Mar 13, 2019

cc @cpcloud @TomAugspurger @martindurant @xhochy @jreback for review. This is pending for Apache Arrow 0.13.0 so if we want to make any changes it would be good to do it soon =)


.. code-block:: python

{'kind': range,
Contributor

Should the range here be a string literal?

Member Author

yes, will fix

codecov bot commented Mar 13, 2019

Codecov Report

Merging #25709 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #25709   +/-   ##
=======================================
  Coverage   91.25%   91.25%           
=======================================
  Files         172      172           
  Lines       52963    52963           
=======================================
  Hits        48330    48330           
  Misses       4633     4633
Flag Coverage Δ
#multiple 89.82% <ø> (ø) ⬆️
#single 41.73% <ø> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 86879ac...766aa50. Read the comment docs.

codecov bot commented Mar 13, 2019

Codecov Report

Merging #25709 into master will decrease coverage by 1.21%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25709      +/-   ##
==========================================
- Coverage   93.02%   91.81%   -1.22%     
==========================================
  Files         187      175      -12     
  Lines       50109    52581    +2472     
==========================================
+ Hits        46613    48276    +1663     
- Misses       3496     4305     +809
Flag Coverage Δ
#multiple 90.36% <ø> (-1.33%) ⬇️
#single 41.89% <ø> (-0.65%) ⬇️
Impacted Files Coverage Δ
pandas/plotting/_misc.py 38.46% <0%> (-26.41%) ⬇️
pandas/io/gbq.py 75% <0%> (-25%) ⬇️
pandas/compat/__init__.py 70.7% <0%> (-21.61%) ⬇️
pandas/io/gcs.py 80% <0%> (-20%) ⬇️
pandas/io/s3.py 89.47% <0%> (-10.53%) ⬇️
pandas/core/computation/expr.py 88.52% <0%> (-9.26%) ⬇️
pandas/core/groupby/base.py 91.83% <0%> (-8.17%) ⬇️
pandas/io/excel/_xlrd.py 93.93% <0%> (-6.07%) ⬇️
pandas/core/groupby/categorical.py 95.45% <0%> (-4.55%) ⬇️
pandas/core/indexing.py 90.88% <0%> (-4.17%) ⬇️
... and 183 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fae84ec...10d1e86. Read the comment docs.

wesm commented Mar 13, 2019

Fixed the range thing

@jreback jreback added the IO Parquet parquet, feather label Mar 14, 2019
@jreback jreback added this to the 0.25.0 milestone Mar 14, 2019
index = pd.RangeIndex(0, 10, 2)
{'kind': 'range',
'name': index.name,
'start': index._start,
Contributor

#25720 was just merged, so this could probably just be changed (though of course implementing this for pandas <0.25.x will still require `_start`)
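
For illustration, a minimal sketch of a descriptor builder that prefers public `start`/`stop`/`step` attributes and falls back to the underscore-prefixed ones on older pandas. The helper name `range_index_descriptor` is hypothetical, the `stop` and `step` keys are assumed by analogy with `start`, and the built-in `range` is used only as a fallback so the snippet runs without pandas installed.

```python
def range_index_descriptor(index):
    """Build a 'range' index descriptor from a RangeIndex-like object."""

    def bound(attr):
        # Prefer the public attribute; fall back to the private one
        # (e.g. ``_start``) on versions that do not expose it.
        if hasattr(index, attr):
            return getattr(index, attr)
        return getattr(index, "_" + attr)

    return {
        "kind": "range",
        "name": getattr(index, "name", None),
        "start": bound("start"),
        "stop": bound("stop"),
        "step": bound("step"),
    }


try:
    import pandas as pd

    index = pd.RangeIndex(0, 10, 2)
except ImportError:
    # Stand-in with the same start/stop/step attributes (assumption for the demo).
    index = range(0, 10, 2)

descriptor = range_index_descriptor(index)
print(descriptor)
```

The fallback keeps a single code path for writers that need to support both older and newer pandas releases.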

disambiguation. The ``'field_name'`` is the actual name of the column in the
serialized Parquet table. If the ``Index`` has a non-None ``name`` attribute,
then it can be found in the ``name`` field of the metadata for that serialized
data column as described below.
Member

Is this actually correct? (although it was already there in the doc before, to be clear)
This seems to indicate that an index always gets a __index_level_x__ name as the field_name, regardless of the name it has (so not only if it is None).

But this is not what I see from a quick test:

In [3]: pyarrow.__version__                                                                                                                                                     
Out[3]: '0.12.0'

In [4]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]}).set_index('a')                                                                                                      

In [5]: df                                                                                                                                                                      
Out[5]: 
   b
a   
1  3
2  4
3  5

In [6]: pyarrow.Table.from_pandas(df)                                                                                                                                           
Out[6]: 
pyarrow.Table
b: int64
a: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": ["a"], "column_indexes": [{"name": null, "'
              b'field_name": null, "pandas_type": "unicode", "numpy_type": "'
              b'object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"'
              b'name": "b", "field_name": "b", "pandas_type": "int64", "nump'
              b'y_type": "int64", "metadata": null}, {"name": "a", "field_na'
              b'me": "a", "pandas_type": "int64", "numpy_type": "int64", "me'
              b'tadata": null}], "pandas_version": "0.23.4"}')])

(I remember that we had a discussion about this before, but can't directly remember the outcome of that)
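
To make the observation easier to check, here is a small sketch that decodes the `pandas` metadata value copied from the output above and compares each column's `name` with its `field_name`. It uses only the standard library; the variable names are illustrative.

```python
import json

# The b'pandas' metadata value from the session above, joined into one bytes literal.
raw = (
    b'{"index_columns": ["a"], "column_indexes": [{"name": null, "'
    b'field_name": null, "pandas_type": "unicode", "numpy_type": "'
    b'object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"'
    b'name": "b", "field_name": "b", "pandas_type": "int64", "nump'
    b'y_type": "int64", "metadata": null}, {"name": "a", "field_na'
    b'me": "a", "pandas_type": "int64", "numpy_type": "int64", "me'
    b'tadata": null}], "pandas_version": "0.23.4"}'
)

meta = json.loads(raw.decode("utf-8"))

index_columns = meta["index_columns"]
# Map each column's pandas-level name to the field name used in the table.
field_names = {col["name"]: col["field_name"] for col in meta["columns"]}

print(index_columns)  # ['a']
print(field_names)    # {'b': 'b', 'a': 'a'}
```

The named index `a` is stored under the field name `a`, not under an `__index_level_x__` name.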

Member Author

It appears that the current behavior is to use the name of the index if it does not conflict with any of the data columns, so we should update these docs to reflect that.
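
A rough sketch of the naming rule as described here, not the actual pyarrow code: the function name `index_field_name` and its signature are hypothetical, and the `__index_level_{i}__` fallback follows the pattern mentioned earlier in this thread.

```python
def index_field_name(name, level, data_columns):
    """Pick the field name under which an index level is serialized.

    Assumed rule: keep the index's own name when it is set and does not
    collide with a data column; otherwise use a generated placeholder.
    """
    if name is not None and name not in data_columns:
        return name
    return "__index_level_{}__".format(level)


print(index_field_name("a", 0, ["b"]))   # a
print(index_field_name(None, 0, ["b"]))  # __index_level_0__
print(index_field_name("b", 0, ["b"]))   # __index_level_0__
```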

wesm commented Mar 25, 2019

To enhance backwards compatibility I'm going to change the {'kind': 'serialized', 'field_name': $COLUMN} to simply $COLUMN. So only "special" indexes will have the dict representation. Stay tuned
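
As a sketch of what this means for readers, `index_columns` would then mix plain strings (serialized columns) with dicts (metadata-only indexes). The function name `resolve_index_columns` is hypothetical, the `stop` and `step` keys are assumed by analogy with `start`, and the built-in `range` stands in for `pd.RangeIndex` so the example needs no third-party imports.

```python
def resolve_index_columns(index_columns):
    """Classify each ``index_columns`` entry as serialized or metadata-only."""
    resolved = []
    for entry in index_columns:
        if isinstance(entry, str):
            # Plain string: the index is stored as a regular column with this name.
            resolved.append(("serialized", entry))
        elif isinstance(entry, dict) and entry.get("kind") == "range":
            # Dict descriptor: rebuild the index from metadata alone.
            rebuilt = range(entry["start"], entry["stop"], entry["step"])
            resolved.append(("range", rebuilt))
        else:
            raise ValueError("unsupported index descriptor: {!r}".format(entry))
    return resolved


example = ["a", {"kind": "range", "name": None, "start": 0, "stop": 10, "step": 2}]
resolved = resolve_index_columns(example)
print(resolved)
```

Older files that list only column names keep working because the string branch is unchanged.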

@jorisvandenbossche

I think this is ready to be merged? (at least, I think it reflects the situation in the last pyarrow release).

@martindurant is this OK from the fastparquet side? I think reading such metadata is not yet supported? (dask/fastparquet#414)
I don't know if fastparquet has plans to also support it when writing parquet files in the long term. If not, we could maybe make this "optional" in the description (as pyarrow can read serialized range indexes fine).

@martindurant

No, this is not yet handled in fastparquet :(
I have found that users of fastparquet are much less likely to ask for a non-useful index to be written, and if there is none when reading, a standard `RangeIndex` is indeed created.

@jorisvandenbossche

> I have found that users of fastparquet are much less likely to ask for a non-useful index to be written

The default in fastparquet is also to not write such indexes, which was probably a good choice. But basically this is pyarrow's somewhat different way of dealing with it: the index is stored only in the metadata.

jreback commented Jun 8, 2019

@jorisvandenbossche is this ready to merge?

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.25.0, 1.0 Jul 16, 2019
@jorisvandenbossche

Yes, this can be merged now. Did a small clean-up (_start -> start etc)

@jorisvandenbossche jorisvandenbossche merged commit 78c6843 into pandas-dev:master Aug 8, 2019
quintusdias pushed a commit to quintusdias/pandas_dev that referenced this pull request Aug 16, 2019
Labels
IO Parquet parquet, feather
Development

Successfully merging this pull request may close these issues.

Expand Parquet pandas schema metadata to store RangeIndex without serialization
5 participants