Expand Parquet pandas schema metadata to store RangeIndex without serialization #25672

wesm · 2019-03-11T18:37:29Z

In https://github.com/pandas-dev/pandas/blob/master/doc/source/development/developer.rst, there is no affordance for storing RangeIndex without serializing it to a column of integers. This wastes both memory and time
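As a rough illustration of the memory cost, here is a minimal stdlib-only sketch. It assumes the index is materialized as uncompressed signed 64-bit integers, which is a simplification and not a measurement of what Arrow or Parquet actually write. The names `n`, `materialized`, and `range_params` are hypothetical.

```python
from array import array

# Hypothetical row count, chosen only for illustration.
n = 1_000_000
index = range(0, n, 1)

# Materializing the index as a column: one signed 64-bit integer per row.
materialized = array("q", index)
materialized_bytes = materialized.itemsize * len(materialized)

# Describing the same index by its parameters: three integers, regardless of n.
range_params = (index.start, index.stop, index.step)

# The parameters are enough to rebuild the original index exactly.
rebuilt = range(*range_params)
```

The materialized form grows linearly with the number of rows, while the three parameters stay constant in size.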

I'll propose an evolution of the metadata that permits "non-serialized" indexes like RangeIndex to be stored without a conversion step of some kind

This will have to mind forward compatibility, so that new readers can still read old files. It does not need to preserve backward compatibility, i.e. allowing new files to be read by old readers (see below). I would suggest changing the index_columns to include dictionaries like

{
    'kind': 'range',
    'start': 0,
    'stop': 10,
    'step': 1
}

versus

{
    'kind': 'serialized',
    'field_name': '__index_level_0__'
}

So if a string is encountered in this field (instead of a dict), we know it is "old" metadata. This will break old readers but I think that is OK
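A minimal sketch of how a reader might dispatch on the entries of index_columns under this proposal. The helper name `resolve_index_descriptor` and its return shape are hypothetical, not part of pandas or pyarrow, and a plain Python `range` stands in for a RangeIndex so the example needs no third-party imports.

```python
def resolve_index_descriptor(entry):
    """Interpret one entry of index_columns (hypothetical helper).

    Returns ("column", field_name) when the index was stored as a data
    column, or ("range", range_object) when it was stored as metadata only.
    """
    if isinstance(entry, str):
        # "Old" metadata: a bare string naming the serialized index column.
        return ("column", entry)

    kind = entry["kind"]
    if kind == "serialized":
        # New-style descriptor that still points at a data column.
        return ("column", entry["field_name"])
    if kind == "range":
        # No column was written; rebuild the index from its parameters.
        return ("range", range(entry["start"], entry["stop"], entry["step"]))
    raise ValueError(f"unknown index descriptor kind: {kind!r}")


old_style = resolve_index_descriptor("__index_level_0__")
new_serialized = resolve_index_descriptor(
    {"kind": "serialized", "field_name": "__index_level_0__"}
)
new_range = resolve_index_descriptor(
    {"kind": "range", "start": 0, "stop": 10, "step": 1}
)
```

A reader written this way accepts both old and new files. An old reader that expects only strings would fail on the dict entries, which is the breakage described above.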

Cross ref with https://issues.apache.org/jira/browse/ARROW-1639

cc @cpcloud @martindurant @kszucs @xhochy

TomAugspurger · 2019-03-11T18:55:18Z

For reference, when I benchmarked using Arrow for serializing DataFrames for dask, this seemed to be one of the few spots Arrow was "slower" than pickle.

wesm · 2019-03-11T19:20:47Z

Makes sense. I'm working on a reference implementation for Arrow and will open a PR into pandas with the proposed metadata extensions

…pandas instead of converting to a column of integers

This ended up being much more difficult than anticipated due to the spaghetti-like state (as the result of many hacks) of pyarrow/pandas_compat.py.

This is partly a performance and memory use optimization. It has consequences, though, namely tables will have some index data discarded when concatenated from multiple pandas DataFrame objects that were converted to Arrow. I think this is OK, though, since the preservation of pandas indexes is generally something that's handled at the granularity of a single DataFrame. One always has the option of calling `reset_index` to convert a RangeIndex if that's what is desired.

This patch also implements proposed extensions to the serialized pandas metadata to accommodate indexes-as-columns vs. indexes-represented-as-metadata, as described in pandas-dev/pandas#25672

Author: Wes McKinney

Closes #3868 from wesm/ARROW-1639 and squashes the following commits:

ec929ae <Wes McKinney> Add pandas_metadata attribute to pyarrow.Schema to make interactions simpler
670dc6f <Wes McKinney> Add compatibility tests for pre-0.13 metadata. Add Arrow version to pandas metadata
0ca1bfc <Wes McKinney> Add benchmark
9ba4131 <Wes McKinney> Serialize RangeIndex as metadata via Table.from_pandas instead of converting to data column. This affects serialize_pandas and writing to Parquet format

TomAugspurger added the IO Data (IO issues that don't fit into a more specific label) and IO Parquet (parquet, feather) labels Mar 11, 2019

TomAugspurger added this to the Contributions Welcome milestone Mar 11, 2019

This was referenced Mar 11, 2019

ARROW-1639: [Python] Serialize RangeIndex as metadata via Table.from_pandas instead of converting to a column of integers apache/arrow#3868

Closed

DOC: Add expanded index descriptors for specifying for RangeIndex-as-metadata in Parquet file schema #25709

Merged

jreback modified the milestones: Contributions Welcome, 0.25.0 Mar 14, 2019

bchu mentioned this issue Apr 2, 2019

Cannot read pyarrow RangeIndex dask/fastparquet#414

Closed

jorisvandenbossche removed this from the 0.25.0 milestone Jun 30, 2019

jorisvandenbossche added this to the 1.0 milestone Aug 8, 2019

jorisvandenbossche added a commit to wesm/pandas that referenced this issue Aug 8, 2019

Merge remote-tracking branch 'upstream/master' into pandas-devGH-25672-…

8f6d6a7

…parquet-metadata-for-range-index

jorisvandenbossche closed this as completed in #25709 Aug 8, 2019

rachtsingh mentioned this issue Mar 2, 2023

BUG: read_parquet does not respect index for arrow dtype backend #51726

Closed

5 tasks
