-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
DOC: Add expanded index descriptors for specifying for RangeIndex-as-metadata in Parquet file schema #25709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Add expanded index descriptors for specifying for RangeIndex-as-metadata in Parquet file schema #25709
Conversation
cc @cpcloud @TomAugspurger @martindurant @xhochy @jreback for review. This is pending for Apache Arrow 0.13.0 so if we want to make any changes it would be good to do it soon =) |
doc/source/development/developer.rst
Outdated
|
||
.. code-block:: python | ||
|
||
{'kind': range, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the range
here be s string literal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, will fix
Codecov Report
@@ Coverage Diff @@
## master #25709 +/- ##
=======================================
Coverage 91.25% 91.25%
=======================================
Files 172 172
Lines 52963 52963
=======================================
Hits 48330 48330
Misses 4633 4633
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25709 +/- ##
==========================================
- Coverage 93.02% 91.81% -1.22%
==========================================
Files 187 175 -12
Lines 50109 52581 +2472
==========================================
+ Hits 46613 48276 +1663
- Misses 3496 4305 +809
Continue to review full report at Codecov.
|
Fixed the range thing |
Co-Authored-By: wesm <[email protected]>
doc/source/development/developer.rst
Outdated
index = pd.RangeIndex(0, 10, 2) | ||
{'kind': 'range', | ||
'name': index.name, | ||
'start': index._start, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#25720 was just merged, prob could just change this (though of course implementing this for <0.25.x will still require this)
doc/source/development/developer.rst
Outdated
disambiguation. The ``'field_name'`` is the actual name of the column in the | ||
serialized Parquet table. If the ``Index`` has a non-None ``name`` attribute, | ||
then it can be found in the ``name`` field of the metadata for that serialized | ||
data column as described below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually correct? (although it was already there in the doc before, to be clear)
This seems to indicate that an index always gets a __index_level_x__
name as the field_name
, regardless of the name it has (so not only if it is None).
But this is not what I see from a quick test:
In [3]: pyarrow.__version__
Out[3]: '0.12.0'
In [4]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]}).set_index('a')
In [5]: df
Out[5]:
b
a
1 3
2 4
3 5
In [6]: pyarrow.Table.from_pandas(df)
Out[6]:
pyarrow.Table
b: int64
a: int64
metadata
--------
OrderedDict([(b'pandas',
b'{"index_columns": ["a"], "column_indexes": [{"name": null, "'
b'field_name": null, "pandas_type": "unicode", "numpy_type": "'
b'object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"'
b'name": "b", "field_name": "b", "pandas_type": "int64", "nump'
b'y_type": "int64", "metadata": null}, {"name": "a", "field_na'
b'me": "a", "pandas_type": "int64", "numpy_type": "int64", "me'
b'tadata": null}], "pandas_version": "0.23.4"}')])
(I remember that we had a discussion about this before, but can't directly remember the outcome of that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that the current behavior is to use the name of the index if it does not conflict with any of the data columns. So we should update these docs to reflect that
To enhance backwards compatibility I'm going to change the |
I think this is ready to be merged? (at least, I think it reflects the situation in the last pyarrow release). @martindurant is this OK from the fastparquet side? I think reading such metadata is not yet supported? (dask/fastparquet#414) |
No, this is not yet handled in fastparquet :( |
The default is also to not write such indexes in fastparquet, which was probably a good choice. But basically this is a somewhat different way of pyarrow to deal with it, by only storing it in the metadata. |
@jorisvandenbossche is this ready to merge? |
…parquet-metadata-for-range-index
Yes, this can be merged now. Did a small clean-up ( |
…metadata in Parquet file schema (pandas-dev#25709)
Closes #25672
git diff upstream/master -u -- "*.py" | flake8 --diff
(N/A)