Commit 86f480a

ARROW-1639: [Python] Serialize RangeIndex as metadata via Table.from_pandas instead of converting to a column of integers
This ended up being much more difficult than anticipated due to the spaghetti-like state (as the result of many hacks) of pyarrow/pandas_compat.py.

This is partly a performance and memory use optimization. It has consequences, though, namely tables will have some index data discarded when concatenated from multiple pandas DataFrame objects that were converted to Arrow. I think this is OK, though, since the preservation of pandas indexes is generally something that's handled at the granularity of a single DataFrame. One always has the option of calling `reset_index` to convert a RangeIndex if that's what is desired.

This patch also implements proposed extensions to the serialized pandas metadata to accommodate indexes-as-columns vs. indexes-represented-as-metadata, as described in pandas-dev/pandas#25672

Author: Wes McKinney <[email protected]>

Closes #3868 from wesm/ARROW-1639 and squashes the following commits:

ec929ae <Wes McKinney> Add pandas_metadata attribute to pyarrow.Schema to make interactions simpler
670dc6f <Wes McKinney> Add compatibility tests for pre-0.13 metadata. Add Arrow version to pandas metadata
0ca1bfc <Wes McKinney> Add benchmark
9ba4131 <Wes McKinney> Serialize RangeIndex as metadata via Table.from_pandas instead of converting to data column. This affects serialize_pandas and writing to Parquet format
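The idea in the commit message, storing a RangeIndex as a small descriptor rather than as a column of integers, can be sketched in plain Python. This is an illustration only, not pyarrow's implementation: the helper names `describe_range_index` and `restore_range_index` are hypothetical, and the descriptor keys (`kind`, `name`, `start`, `stop`, `step`) are an assumption based on the metadata extension proposed in pandas-dev/pandas#25672.

```python
# Hypothetical helpers for illustration; not part of pyarrow or pandas.
# The descriptor keys are assumed from the proposal in pandas-dev/pandas#25672.


def describe_range_index(index, name=None):
    """Describe a range-like index as metadata instead of a data column."""
    return {
        'kind': 'range',
        'name': name,
        'start': index.start,
        'stop': index.stop,
        'step': index.step,
    }


def restore_range_index(descriptor):
    """Rebuild the range from its descriptor; no per-row data is stored."""
    if descriptor['kind'] != 'range':
        raise ValueError('not a range descriptor')
    return range(descriptor['start'], descriptor['stop'], descriptor['step'])


# Same length as the benchmark in this commit: 10 million rows.
index = range(0, 10000000)
descriptor = describe_range_index(index)
restored = restore_range_index(descriptor)
assert restored == index
```

The descriptor has a fixed size regardless of the number of rows, which is where the memory saving comes from. It also suggests why concatenating tables built from several DataFrames loses index data: a single descriptor can only describe one contiguous range.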
1 parent 0c4f857 commit 86f480a

8 files changed, +545 -237 lines changed

python/benchmarks/convert_pandas.py

Lines changed: 15 additions & 0 deletions
@@ -90,3 +90,18 @@ def time_deserialize_from_buffer(self):
 
     def time_deserialize_from_components(self):
         pa.deserialize_components(self.as_components)
+
+
+class SerializeDeserializePandas(object):
+
+    def setup(self):
+        # 10 million length
+        n = 10000000
+        self.df = pd.DataFrame({'data': np.random.randn(n)})
+        self.serialized = pa.serialize_pandas(self.df)
+
+    def time_serialize_pandas(self):
+        pa.serialize_pandas(self.df)
+
+    def time_deserialize_pandas(self):
+        pa.deserialize_pandas(self.serialized)