Expand Parquet pandas schema metadata to store RangeIndex without serialization #25672
For reference, when I benchmarked using Arrow for serializing DataFrames for dask, this seemed to be one of the few spots where Arrow was "slower" than pickle.
Makes sense. I'm working on a reference implementation for Arrow and will open a PR into pandas with the proposed metadata extensions.
wesm added a commit to apache/arrow that referenced this issue on Mar 13, 2019
…pandas instead of converting to a column of integers

This ended up being much more difficult than anticipated due to the spaghetti-like state (as the result of many hacks) of pyarrow/pandas_compat.py.

This is partly a performance and memory use optimization. It has consequences, though, namely tables will have some index data discarded when concatenated from multiple pandas DataFrame objects that were converted to Arrow. I think this is OK, though, since the preservation of pandas indexes is generally something that's handled at the granularity of a single DataFrame. One always has the option of calling `reset_index` to convert a RangeIndex if that's what is desired.

This patch also implements proposed extensions to the serialized pandas metadata to accommodate indexes-as-columns vs. indexes-represented-as-metadata, as described in pandas-dev/pandas#25672

Author: Wes McKinney
Closes #3868 from wesm/ARROW-1639 and squashes the following commits:
ec929ae <Wes McKinney> Add pandas_metadata attribute to pyarrow.Schema to make interactions simpler
670dc6f <Wes McKinney> Add compatibility tests for pre-0.13 metadata. Add Arrow version to pandas metadata
0ca1bfc <Wes McKinney> Add benchmark
9ba4131 <Wes McKinney> Serialize RangeIndex as metadata via Table.from_pandas instead of converting to data column. This affects serialize_pandas and writing to Parquet format
jorisvandenbossche added a commit to wesm/pandas that referenced this issue on Aug 8, 2019
…parquet-metadata-for-range-index
In https://github.com/pandas-dev/pandas/blob/master/doc/source/development/developer.rst, there is no affordance for storing RangeIndex without serializing it to a column of integers. This wastes both memory and time.
I'll propose an evolution of the metadata that permits "non-serialized" indexes like RangeIndex to be stored without a conversion step of some kind.
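This will have to mind forward compatibility (so we can read old files), but not backward compatibility (i.e. allowing new files to be read by old readers; see below). I would suggest changing `index_columns` to include dictionaries: one kind describing an index that is stored only as metadata (such as RangeIndex), versus another kind naming an index that was serialized as a column.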
So if a string is encountered in this field (instead of a dict), we know it is "old" metadata. This will break old readers, but I think that is OK.
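To make the string-versus-dict rule concrete, here is a minimal Python sketch. The dictionary keys (`kind`, `name`, `start`, `stop`, `step`, `field_name`) and both helper functions are assumptions made for illustration, not the layout that pandas or Arrow adopted, and a plain Python `range` stands in for RangeIndex so the snippet has no dependencies.

```python
# Illustrative sketch only: the key names and helper functions below are
# hypothetical, not the actual pandas/Arrow metadata specification.


def describe_range_index(index, name=None):
    """Describe a range-like index as metadata instead of materializing it.

    `index` is anything exposing start/stop/step (a built-in `range` here,
    standing in for a pandas RangeIndex).
    """
    return {
        "kind": "range",
        "name": name,
        "start": index.start,
        "stop": index.stop,
        "step": index.step,
    }


def interpret_index_entry(entry):
    """Interpret one element of `index_columns`.

    Returns ("column", field_name) when the index lives in a data column,
    or ("range", range_object) when it can be rebuilt from metadata alone.
    """
    if isinstance(entry, str):
        # "Old" metadata: a bare string names the serialized index column.
        return ("column", entry)
    if entry["kind"] == "range":
        # Non-serialized index: rebuild it without reading any column data.
        return ("range", range(entry["start"], entry["stop"], entry["step"]))
    if entry["kind"] == "serialized":
        # "New" metadata for an index that was still written as a column.
        return ("column", entry["field_name"])
    raise ValueError(f"unknown index descriptor kind: {entry['kind']!r}")


descriptor = describe_range_index(range(0, 100, 1), name="foo")
old_style = interpret_index_entry("__index_level_0__")
new_style = interpret_index_entry(descriptor)
```

A reader written this way handles both generations of files, which is the forward-compatibility goal above. An old reader that expects only strings would fail on the dictionaries, which is the breakage the proposal accepts.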
Cross ref with https://issues.apache.org/jira/browse/ARROW-1639
cc @cpcloud @martindurant @kszucs @xhochy