Expand Parquet pandas schema metadata to store RangeIndex without serialization #25672

Closed
wesm opened this issue Mar 11, 2019 · 2 comments · Fixed by #25709
Labels
IO Data IO issues that don't fit into a more specific label IO Parquet parquet, feather
Milestone

Comments

@wesm
Member

wesm commented Mar 11, 2019

In https://github.com/pandas-dev/pandas/blob/master/doc/source/development/developer.rst, there is no affordance for storing a RangeIndex without serializing it to a column of integers. This wastes both memory and time.

I'll propose an evolution of the metadata that permits "non-serialized" indexes like RangeIndex to be stored without a conversion step of some kind.

This will have to mind forward compatibility (so we can still read old files), but not backward compatibility, i.e. allowing new files to be read by old readers (see below). I would suggest changing index_columns to include dictionaries like

{
    'kind': 'range',
    'start': 0,
    'stop': 10,
    'step': 1
}

versus

{
    'kind': 'serialized',
    'field_name': '__index_level_0__'
}

So if a string is encountered in this field (instead of a dict), we know it is "old" metadata. New files will break old readers, but I think that is OK.
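A minimal sketch of how a reader might handle both forms, assuming the dictionary layout proposed above. The function name `resolve_index` and the `columns` mapping (a plain dict standing in for the stored table) are hypothetical and not part of pandas or pyarrow; Python's built-in `range` stands in for a RangeIndex so the example has no dependencies.

```python
def resolve_index(descriptor, columns):
    """Return the index values for one entry of index_columns.

    `columns` is a hypothetical stand-in for the stored table:
    a dict mapping column names to lists of values.
    """
    if isinstance(descriptor, str):
        # "Old" metadata: the entry is just the name of a serialized column.
        return columns[descriptor]

    kind = descriptor["kind"]
    if kind == "serialized":
        # New-style entry for an index that was written out as a data column.
        return columns[descriptor["field_name"]]
    if kind == "range":
        # New-style entry: rebuild the index from metadata alone,
        # without reading any column data.
        return range(descriptor["start"], descriptor["stop"], descriptor["step"])

    raise ValueError(f"unknown index descriptor kind: {kind!r}")


stored = {"__index_level_0__": [10, 20, 30]}

old_style = resolve_index("__index_level_0__", stored)
serialized = resolve_index(
    {"kind": "serialized", "field_name": "__index_level_0__"}, stored
)
ranged = resolve_index({"kind": "range", "start": 0, "stop": 10, "step": 1}, stored)
```

An old reader that expects every entry to be a string would fail on the dict entries, which is the backward-compatibility break mentioned above.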

Cross ref with https://issues.apache.org/jira/browse/ARROW-1639

cc @cpcloud @martindurant @kszucs @xhochy

@TomAugspurger TomAugspurger added IO Data IO issues that don't fit into a more specific label IO Parquet parquet, feather labels Mar 11, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Mar 11, 2019
@TomAugspurger
Contributor

For reference, when I benchmarked using Arrow for serializing DataFrames for dask, this seemed to be one of the few spots Arrow was "slower" than pickle.

@wesm
Member Author

wesm commented Mar 11, 2019

Makes sense. I'm working on a reference implementation for Arrow and will open a PR into pandas with the proposed metadata extensions

wesm added a commit to apache/arrow that referenced this issue Mar 13, 2019
…pandas instead of converting to a column of integers

This ended up being much more difficult than anticipated due to the spaghetti-like state (as the result of many hacks) of pyarrow/pandas_compat.py.

This is partly a performance and memory use optimization. It has consequences, though, namely tables will have some index data discarded when concatenated from multiple pandas DataFrame objects that were converted to Arrow. I think this is OK, though, since the preservation of pandas indexes is generally something that's handled at the granularity of a single DataFrame. One always has the option of calling `reset_index` to convert a RangeIndex if that's what is desired.

This patch also implements proposed extensions to the serialized pandas metadata to accommodate indexes-as-columns vs. indexes-represented-as-metadata, as described in

pandas-dev/pandas#25672

Author: Wes McKinney

Closes #3868 from wesm/ARROW-1639 and squashes the following commits:

ec929ae <Wes McKinney> Add pandas_metadata attribute to pyarrow.Schema to make interactions simpler
670dc6f <Wes McKinney> Add compatibility tests for pre-0.13 metadata. Add Arrow version to pandas metadata
0ca1bfc <Wes McKinney> Add benchmark
9ba4131 <Wes McKinney> Serialize RangeIndex as metadata via Table.from_pandas instead of converting to data column. This affects serialize_pandas and writing to Parquet format
@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 Mar 14, 2019
@jorisvandenbossche jorisvandenbossche removed this from the 0.25.0 milestone Jun 30, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Aug 8, 2019
jorisvandenbossche added a commit to wesm/pandas that referenced this issue Aug 8, 2019
4 participants