DOC: Add expanded index descriptors for specifying for RangeIndex-as-metadata in Parquet file schema #25709

wesm · 2019-03-13T14:56:55Z

closes #xxxx
tests added / passed (N/A)
passes git diff upstream/master -u -- "*.py" | flake8 --diff (N/A)
whatsnew entry (N/A)

wesm · 2019-03-13T14:57:40Z

cc @cpcloud @TomAugspurger @martindurant @xhochy @jreback for review. This is pending for Apache Arrow 0.13.0 so if we want to make any changes it would be good to do it soon =)

TomAugspurger · 2019-03-13T15:01:36Z

doc/source/development/developer.rst

+
+.. code-block:: python
+
+   {'kind': range,


Should the range here be a string literal?

yes, will fix

codecov · 2019-03-13T15:38:53Z

Codecov Report

Merging #25709 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #25709   +/-   ##
=======================================
  Coverage   91.25%   91.25%           
=======================================
  Files         172      172           
  Lines       52963    52963           
=======================================
  Hits        48330    48330           
  Misses       4633     4633

Flag	Coverage Δ
#multiple	`89.82% <ø> (ø)`	⬆️
#single	`41.73% <ø> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 86879ac...766aa50.

codecov · 2019-03-13T15:38:53Z

Codecov Report

Merging #25709 into master will decrease coverage by 1.21%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25709      +/-   ##
==========================================
- Coverage   93.02%   91.81%   -1.22%     
==========================================
  Files         187      175      -12     
  Lines       50109    52581    +2472     
==========================================
+ Hits        46613    48276    +1663     
- Misses       3496     4305     +809

Flag	Coverage Δ
#multiple	`90.36% <ø> (-1.33%)`	⬇️
#single	`41.89% <ø> (-0.65%)`	⬇️

Impacted Files	Coverage Δ
pandas/plotting/_misc.py	`38.46% <0%> (-26.41%)`	⬇️
pandas/io/gbq.py	`75% <0%> (-25%)`	⬇️
pandas/compat/__init__.py	`70.7% <0%> (-21.61%)`	⬇️
pandas/io/gcs.py	`80% <0%> (-20%)`	⬇️
pandas/io/s3.py	`89.47% <0%> (-10.53%)`	⬇️
pandas/core/computation/expr.py	`88.52% <0%> (-9.26%)`	⬇️
pandas/core/groupby/base.py	`91.83% <0%> (-8.17%)`	⬇️
pandas/io/excel/_xlrd.py	`93.93% <0%> (-6.07%)`	⬇️
pandas/core/groupby/categorical.py	`95.45% <0%> (-4.55%)`	⬇️
pandas/core/indexing.py	`90.88% <0%> (-4.17%)`	⬇️
... and 183 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fae84ec...10d1e86.

wesm · 2019-03-13T17:50:32Z

Fixed the range thing

doc/source/development/developer.rst

Co-Authored-By: wesm

jreback · 2019-03-14T12:45:36Z

doc/source/development/developer.rst

+   index = pd.RangeIndex(0, 10, 2)
+   {'kind': 'range',
+    'name': index.name,
+    'start': index._start,


#25720 was just merged, so this could probably just be changed to index.start (though of course implementing this for pandas <0.25.x will still require _start)
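
For illustration, a minimal sketch of the full descriptor built from public attributes. It assumes a pandas version in which RangeIndex exposes start, stop, and step publicly; the 'stop' and 'step' keys are assumed to follow the same pattern as the 'start' key quoted in the diff above.

```python
import pandas as pd

index = pd.RangeIndex(0, 10, 2)

# 'kind' is a string literal; the other values come straight from the index.
# Assumption: 'stop' and 'step' mirror the 'start' entry shown in the diff.
descriptor = {
    "kind": "range",
    "name": index.name,    # None for an unnamed index
    "start": index.start,
    "stop": index.stop,
    "step": index.step,
}
print(descriptor)
```

On older pandas releases the same values would have to be read from the private attributes instead.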

jorisvandenbossche · 2019-03-14T19:56:39Z

doc/source/development/developer.rst

+disambiguation. The ``'field_name'`` is the actual name of the column in the
+serialized Parquet table. If the ``Index`` has a non-None ``name`` attribute,
+then it can be found in the ``name`` field of the metadata for that serialized
+data column as described below.


Is this actually correct? (although it was already there in the doc before, to be clear)
This seems to indicate that an index always gets a __index_level_x__ name as the field_name, regardless of the name it has (so not only if it is None).

But this is not what I see from a quick test:

In [3]: pyarrow.__version__
Out[3]: '0.12.0'

In [4]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]}).set_index('a')

In [5]: df
Out[5]:
   b
a
1  3
2  4
3  5

In [6]: pyarrow.Table.from_pandas(df)
Out[6]:
pyarrow.Table
b: int64
a: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": ["a"], "column_indexes": [{"name": null, "'
              b'field_name": null, "pandas_type": "unicode", "numpy_type": "'
              b'object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"'
              b'name": "b", "field_name": "b", "pandas_type": "int64", "nump'
              b'y_type": "int64", "metadata": null}, {"name": "a", "field_na'
              b'me": "a", "pandas_type": "int64", "numpy_type": "int64", "me'
              b'tadata": null}], "pandas_version": "0.23.4"}')])

(I remember that we had a discussion about this before, but can't directly remember the outcome of that)
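
To make the observation easier to check, here is a small sketch that decodes the b'pandas' metadata value copied from the session above using only the standard library. The variable names are illustrative and not part of any library.

```python
import json

# The b'pandas' metadata value from the session, as adjacent bytes literals.
raw = (
    b'{"index_columns": ["a"], "column_indexes": [{"name": null, "'
    b'field_name": null, "pandas_type": "unicode", "numpy_type": "'
    b'object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"'
    b'name": "b", "field_name": "b", "pandas_type": "int64", "nump'
    b'y_type": "int64", "metadata": null}, {"name": "a", "field_na'
    b'me": "a", "pandas_type": "int64", "numpy_type": "int64", "me'
    b'tadata": null}], "pandas_version": "0.23.4"}'
)
meta = json.loads(raw.decode("utf-8"))

# Map each column's logical name to the name used in the serialized table.
field_names = {col["name"]: col["field_name"] for col in meta["columns"]}
print(meta["index_columns"], field_names)
```

In this output the index column 'a' keeps its own name as the field name rather than an __index_level_x__ placeholder.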

It appears that the current behavior is to use the name of the index if it does not conflict with any of the data columns, so we should update these docs to reflect that.
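
A rough sketch of that naming rule as described in this thread. This is not pyarrow's actual code; index_field_name is a hypothetical helper, and the placeholder format is assumed from the __index_level_x__ form mentioned above.

```python
def index_field_name(name, level, data_columns):
    """Pick the serialized column name for one index level (illustrative)."""
    # Use the index's own name when it has one and it does not clash
    # with a data column; otherwise fall back to the placeholder form.
    if name is not None and name not in data_columns:
        return name
    return "__index_level_{}__".format(level)


named = index_field_name("a", 0, ["b"])      # no conflict with data columns
unnamed = index_field_name(None, 0, ["b"])   # index has no name
clash = index_field_name("b", 0, ["b"])      # name collides with a data column
```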

wesm · 2019-03-25T23:22:19Z

To enhance backwards compatibility I'm going to change the {'kind': 'serialized', 'field_name': $COLUMN} to simply $COLUMN. So only "special" indexes will have the dict representation. Stay tuned
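
A reader-side sketch of what this scheme implies, under stated assumptions: entries in 'index_columns' are either plain strings (serialized columns) or dicts for "special" indexes, and a 'range' dict carries 'start', 'stop', and 'step' keys. read_index_entries is a hypothetical helper, and the built-in range stands in for a RangeIndex to keep the example dependency-free.

```python
def read_index_entries(index_columns):
    """Classify each 'index_columns' entry (illustrative only)."""
    parsed = []
    for entry in index_columns:
        if isinstance(entry, str):
            # A plain string names a column stored in the table itself.
            parsed.append(("column", entry))
        elif isinstance(entry, dict) and entry.get("kind") == "range":
            # A range descriptor lives only in metadata; rebuild it lazily.
            rng = range(entry["start"], entry["stop"], entry["step"])
            parsed.append(("range", rng))
        else:
            raise ValueError("unrecognized index descriptor: {!r}".format(entry))
    return parsed


entries = ["a", {"kind": "range", "name": None, "start": 0, "stop": 10, "step": 2}]
parsed = read_index_entries(entries)
```

Because older files contain only strings, a reader written this way would still handle them unchanged.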

jorisvandenbossche · 2019-05-02T10:38:43Z

I think this is ready to be merged? (at least, I think it reflects the situation in the last pyarrow release).

@martindurant is this OK from the fastparquet side? I think reading such metadata is not yet supported? (dask/fastparquet#414)
I don't know if fastparquet has plans to also support it when writing parquet files in the long term. If not, we could maybe make this "optional" in the description (as pyarrow can read serialized range indexes fine).

martindurant · 2019-05-02T13:11:43Z

No, this is not yet handled in fastparquet :(
I have found that users of fastparquet are much less likely to ask for a non-useful index to be written, and if there is none when reading, a standard RangeIndex is indeed created.

jorisvandenbossche · 2019-05-02T13:30:05Z

I have found that users of fastparquet are much less likely to ask for a non-useful index to be written

The default in fastparquet is also not to write such indexes, which was probably a good choice. But basically this is a somewhat different way for pyarrow to deal with it, by storing it only in the metadata.

jreback · 2019-06-08T20:28:27Z

@jorisvandenbossche is this ready to merge?

jorisvandenbossche · 2019-08-08T13:38:33Z

Yes, this can be merged now. Did a small clean-up (_start -> start etc)

…metadata in Parquet file schema (pandas-dev#25709)

Add specification for RangeIndex-as-metadata in Parquet file schema c…

766aa50

…ustom metadata

TomAugspurger reviewed Mar 13, 2019

Add string quotes to range

2c8431c

TomAugspurger reviewed Mar 13, 2019

Update doc/source/development/developer.rst

931ca2c

Co-Authored-By: wesm

jreback added the IO Parquet parquet, feather label Mar 14, 2019

jreback added this to the 0.25.0 milestone Mar 14, 2019

jreback reviewed Mar 14, 2019

jorisvandenbossche reviewed Mar 14, 2019

wesm mentioned this pull request Mar 21, 2019

ARROW-4872: [Python] Keep backward compatibility for ParquetDatasetPiece apache/arrow#3988

Closed

martindurant mentioned this pull request Mar 28, 2019

column index name(s) not persisted on save or load dask/fastparquet#409

Closed

Revert to current scheme of index column names in 'index_columns' for…

d3cd904

… non-RangeIndex

jorisvandenbossche modified the milestones: 0.25.0, 1.0 Jul 16, 2019

jorisvandenbossche added 2 commits August 8, 2019 14:13

Merge remote-tracking branch 'upstream/master' into pandas-devGH-25672-…

8f6d6a7

…parquet-metadata-for-range-index

small clean-up

10d1e86

jorisvandenbossche merged commit 78c6843 into pandas-dev:master Aug 8, 2019

quintusdias pushed a commit to quintusdias/pandas_dev that referenced this pull request Aug 16, 2019

DOC: Add expanded index descriptors for specifying for RangeIndex-as-…

47eca43

…metadata in Parquet file schema (pandas-dev#25709)
