Commit 3b000b7
ARROW-3789: [Python] Use common conversion path for Arrow to pandas.Series/DataFrame. Zero-copy optimizations for DataFrame; add split_blocks and self_destruct options

The primary goal of this patch is to give some users a way to avoid memory doubling when converting from Arrow to pandas. This took me entirely too much time to get right, but partly I was attempting to disentangle some of the technical debt and overdue refactoring in arrow_to_pandas.cc. Summary of what's here:

- Refactor the ChunkedArray->Series and Table->DataFrame conversion paths to use exactly the same code, rather than two implementations of the same thing with slightly different behavior. The `ArrowDeserializer` helper class is now gone.
- Do zero-copy construction of internal DataFrame blocks for the case of a contiguous non-nullable array and a block with only one column represented.
- Add a `split_blocks` option to `to_pandas` which constructs one block per DataFrame column, resulting in more zero-copy opportunities. Note that pandas's internal "consolidation" can still cause memory doubling (see the discussion in pandas-dev/pandas#10556).
- Add a `self_destruct` option to `to_pandas` which releases the Table's internal buffers as soon as they are converted to the required pandas structure. This allows memory to be reclaimed by the OS while conversion is taking place, rather than forcing a memory doubling followed by after-the-fact reclamation (which has been causing OOM for some users).

The most conservative invocation of `to_pandas` is now `table.to_pandas(use_threads=False, split_blocks=True, self_destruct=True)`.

Note that the self_destruct option makes the `Table` object unsafe for further use. This is a bit dissatisfying, but I wasn't sure how else to provide this capability.

Closes #6067 from wesm/ARROW-3789 and squashes the following commits:

3b42602 <Wes McKinney> Code review comments
8f39cce <Wes McKinney> Add some documentation. Try fixing MSVC warnings
c22d280 <Wes McKinney> Fix one MSVC cast warning
4306803 <Wes McKinney> Add "split blocks" and "self destruct" options to Table.to_pandas, with zero-copy operations for improved memory use when converting from Arrow to pandas

Authored-by: Wes McKinney <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
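The self_destruct strategy described above can be sketched in plain Python without pyarrow: the converter drops its reference to each source column as soon as that column's converted form exists, so the two full copies never coexist. `convert_table` and its toy bytearray "columns" are illustrative stand-ins, not Arrow APIs.

```python
# Toy model of the self_destruct conversion strategy: release each source
# column's buffer as soon as its converted form exists. Names here are
# illustrative stand-ins, not actual Arrow APIs.

def convert_table(columns, self_destruct=False):
    """Convert a list of 'columns' (bytearrays) to converted copies,
    optionally destroying each source column right after conversion."""
    converted = []
    for i, col in enumerate(columns):
        converted.append(bytes(col))  # the "pandas" copy of this column
        if self_destruct:
            columns[i] = None  # drop the Arrow-side reference immediately
    return converted

cols = [bytearray(b"a" * 4), bytearray(b"b" * 4)]
result = convert_table(cols, self_destruct=True)
assert cols == [None, None]           # source buffers released incrementally
assert result == [b"aaaa", b"bbbb"]   # converted data survives
```

This is why self_destruct leaves the source object unusable afterward: the buffers are gone by design.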
1 parent b4c72fe commit 3b000b7

File tree: 13 files changed (+1419 −1384 lines)

cpp/src/arrow/python/arrow_to_pandas.cc

Lines changed: 1065 additions & 1261 deletions
Large diffs are not rendered by default.

cpp/src/arrow/python/arrow_to_pandas.h

Lines changed: 30 additions & 31 deletions
@@ -18,8 +18,7 @@
 // Functions for converting between pandas's NumPy-based data representation
 // and Arrow data structures

-#ifndef ARROW_PYTHON_ADAPTERS_PANDAS_H
-#define ARROW_PYTHON_ADAPTERS_PANDAS_H
+#pragma once

 #include "arrow/python/platform.h"

@@ -53,26 +52,43 @@ struct PandasOptions {
   bool date_as_object = false;
   bool use_threads = false;

+  /// Coerce all date and timestamp to datetime64[ns]
+  bool coerce_temporal_nanoseconds = false;
+
   /// \brief If true, do not create duplicate PyObject versions of equal
   /// objects. This only applies to immutable objects like strings or datetime
   /// objects
   bool deduplicate_objects = false;
+
+  /// \brief If true, create one block per column rather than consolidated
+  /// blocks (1 per data type). Do zero-copy wrapping when there are no
+  /// nulls. pandas currently will consolidate the blocks on its own, causing
+  /// increased memory use, so keep this in mind if you are working in a
+  /// memory-constrained situation.
+  bool split_blocks = false;
+
+  /// \brief If true, attempt to deallocate buffers in passed Arrow object if
+  /// it is the only remaining shared_ptr copy of it. See ARROW-3789 for
+  /// original context for this feature. Only currently implemented for Table
+  /// conversions
+  bool self_destruct = false;
+
+  // Columns that should be casted to categorical
+  std::unordered_set<std::string> categorical_columns;
+
+  // Columns that should be passed through to be converted to
+  // ExtensionArray/Block
+  std::unordered_set<std::string> extension_columns;
 };

 ARROW_PYTHON_EXPORT
-Status ConvertArrayToPandas(const PandasOptions& options,
-                            const std::shared_ptr<Array>& arr, PyObject* py_ref,
-                            PyObject** out);
+Status ConvertArrayToPandas(const PandasOptions& options, std::shared_ptr<Array> arr,
+                            PyObject* py_ref, PyObject** out);

 ARROW_PYTHON_EXPORT
 Status ConvertChunkedArrayToPandas(const PandasOptions& options,
-                                   const std::shared_ptr<ChunkedArray>& col,
-                                   PyObject* py_ref, PyObject** out);
-
-ARROW_PYTHON_EXPORT
-Status ConvertColumnToPandas(const PandasOptions& options,
-                             const std::shared_ptr<Column>& col, PyObject* py_ref,
-                             PyObject** out);
+                                   std::shared_ptr<ChunkedArray> col, PyObject* py_ref,
+                                   PyObject** out);

 // Convert a whole table as efficiently as possible to a pandas.DataFrame.
 //
@@ -81,25 +97,8 @@ Status ConvertColumnToPandas(const PandasOptions& options,
 //
 // tuple item: (indices: ndarray[int32], block: ndarray[TYPE, ndim=2])
 ARROW_PYTHON_EXPORT
-Status ConvertTableToPandas(const PandasOptions& options,
-                            const std::shared_ptr<Table>& table, PyObject** out);
-
-/// Convert a whole table as efficiently as possible to a pandas.DataFrame.
-///
-/// Explicitly name columns that should be a categorical
-/// This option is only used on conversions that are applied to a table.
-ARROW_PYTHON_EXPORT
-Status ConvertTableToPandas(const PandasOptions& options,
-                            const std::unordered_set<std::string>& categorical_columns,
-                            const std::shared_ptr<Table>& table, PyObject** out);
-
-ARROW_PYTHON_EXPORT
-Status ConvertTableToPandas(const PandasOptions& options,
-                            const std::unordered_set<std::string>& categorical_columns,
-                            const std::unordered_set<std::string>& extension_columns,
-                            const std::shared_ptr<Table>& table, PyObject** out);
+Status ConvertTableToPandas(const PandasOptions& options, std::shared_ptr<Table> table,
+                            PyObject** out);

 }  // namespace py
 }  // namespace arrow
-
-#endif  // ARROW_PYTHON_ADAPTERS_PANDAS_H
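The new options in `PandasOptions` above surface in Python as keyword arguments to `to_pandas`. A stdlib-only Python mirror of the struct, illustrative rather than the real binding (the actual struct is C++ and lives in arrow_to_pandas.h):

```python
from dataclasses import dataclass, field
from typing import Set

# Illustrative Python mirror of the C++ PandasOptions struct above;
# field names and defaults follow the header, but this is not pyarrow code.
@dataclass
class PandasOptions:
    date_as_object: bool = False
    use_threads: bool = False
    coerce_temporal_nanoseconds: bool = False
    deduplicate_objects: bool = False
    split_blocks: bool = False    # one DataFrame block per column
    self_destruct: bool = False   # free Arrow buffers during conversion
    categorical_columns: Set[str] = field(default_factory=set)
    extension_columns: Set[str] = field(default_factory=set)

# The most conservative combination, per the commit message
opts = PandasOptions(use_threads=False, split_blocks=True, self_destruct=True)
assert opts.split_blocks and opts.self_destruct and not opts.use_threads
```

Note that both new options default to off, so existing callers see unchanged behavior.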

cpp/src/arrow/python/type_traits.h

Lines changed: 25 additions & 4 deletions
@@ -32,6 +32,10 @@

 namespace arrow {
 namespace py {
+
+static constexpr int64_t kPandasTimestampNull = std::numeric_limits<int64_t>::min();
+constexpr int64_t kNanosecondsInDay = 86400000000000LL;
+
 namespace internal {

 //
@@ -86,6 +90,8 @@ struct npy_traits<NPY_FLOAT16> {
   using TypeClass = HalfFloatType;
   using BuilderClass = HalfFloatBuilder;

+  static constexpr npy_half na_sentinel = NPY_HALF_NAN;
+
   static constexpr bool supports_nulls = true;

   static inline bool isnull(npy_half v) { return v == NPY_HALF_NAN; }
@@ -97,6 +103,8 @@ struct npy_traits<NPY_FLOAT32> {
   using TypeClass = FloatType;
   using BuilderClass = FloatBuilder;

+  static constexpr float na_sentinel = NAN;
+
   static constexpr bool supports_nulls = true;

   static inline bool isnull(float v) { return v != v; }
@@ -108,6 +116,8 @@ struct npy_traits<NPY_FLOAT64> {
   using TypeClass = DoubleType;
   using BuilderClass = DoubleBuilder;

+  static constexpr double na_sentinel = NAN;
+
   static constexpr bool supports_nulls = true;

   static inline bool isnull(double v) { return v != v; }
@@ -208,10 +218,6 @@ struct arrow_traits<Type::DOUBLE> {
   typedef typename npy_traits<NPY_FLOAT64>::value_type T;
 };

-static constexpr int64_t kPandasTimestampNull = std::numeric_limits<int64_t>::min();
-
-constexpr int64_t kNanosecondsInDay = 86400000000000LL;
-
 template <>
 struct arrow_traits<Type::TIMESTAMP> {
   static constexpr int npy_type = NPY_DATETIME;
@@ -287,6 +293,21 @@ struct arrow_traits<Type::BINARY> {
   static constexpr bool supports_nulls = true;
 };

+static inline NPY_DATETIMEUNIT NumPyFrequency(TimeUnit::type unit) {
+  switch (unit) {
+    case TimestampType::Unit::SECOND:
+      return NPY_FR_s;
+    case TimestampType::Unit::MILLI:
+      return NPY_FR_ms;
+    case TimestampType::Unit::MICRO:
+      return NPY_FR_us;
+    default:
+      // NANO
+      return NPY_FR_ns;
+  }
+}
+
 static inline int NumPyTypeSize(int npy_type) {
   npy_type = fix_numpy_type_num(npy_type);
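The `NumPyFrequency` helper above is a straight mapping from Arrow time units to NumPy datetime64 frequency codes, with nanoseconds as the fallback. An illustrative Python analogue (unit names and the helper are stand-ins, not pyarrow APIs):

```python
# Illustrative Python analogue of the NumPyFrequency helper above: map an
# Arrow time unit to the corresponding NumPy datetime64 frequency code.
NUMPY_FREQUENCY = {
    "second": "s",
    "milli": "ms",
    "micro": "us",
    "nano": "ns",
}

def numpy_frequency(unit):
    # Unknown units fall back to nanoseconds, matching the C++ default case.
    return NUMPY_FREQUENCY.get(unit, "ns")

assert numpy_frequency("milli") == "ms"
assert numpy_frequency("unknown") == "ns"
```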

cpp/src/arrow/table.cc

Lines changed: 16 additions & 0 deletions
@@ -425,6 +425,22 @@ class SimpleTable : public Table {

 Table::Table() : num_rows_(0) {}

+std::vector<std::shared_ptr<ChunkedArray>> Table::columns() const {
+  std::vector<std::shared_ptr<ChunkedArray>> result;
+  for (int i = 0; i < this->num_columns(); ++i) {
+    result.emplace_back(this->column(i));
+  }
+  return result;
+}
+
+std::vector<std::shared_ptr<Field>> Table::fields() const {
+  std::vector<std::shared_ptr<Field>> result;
+  for (int i = 0; i < this->num_columns(); ++i) {
+    result.emplace_back(this->field(i));
+  }
+  return result;
+}
+
 std::shared_ptr<Table> Table::Make(
     const std::shared_ptr<Schema>& schema,
     const std::vector<std::shared_ptr<ChunkedArray>>& columns, int64_t num_rows) {
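The new `Table::columns()` and `Table::fields()` accessors simply collect the existing per-index accessors into vectors. The same pattern in Python terms, with `MiniTable` as an illustrative stand-in rather than the pyarrow Table class:

```python
# Illustrative model of Table::columns() / Table::fields(): gather the
# per-index accessors into lists. MiniTable is a stand-in, not pyarrow.
class MiniTable:
    def __init__(self, names, data):
        self._names = list(names)  # plays the role of schema fields
        self._data = list(data)    # plays the role of chunked columns

    def num_columns(self):
        return len(self._data)

    def column(self, i):
        return self._data[i]

    def field(self, i):
        return self._names[i]

    def columns(self):
        return [self.column(i) for i in range(self.num_columns())]

    def fields(self):
        return [self.field(i) for i in range(self.num_columns())]

t = MiniTable(["a", "b"], [[1, 2], [3, 4]])
assert t.columns() == [[1, 2], [3, 4]]
assert t.fields() == ["a", "b"]
assert MiniTable([], []).columns() == []  # zero-length case, as in the test
```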

cpp/src/arrow/table.h

Lines changed: 8 additions & 2 deletions
@@ -185,15 +185,21 @@ class ARROW_EXPORT Table {
   static Status FromChunkedStructArray(const std::shared_ptr<ChunkedArray>& array,
                                        std::shared_ptr<Table>* table);

-  /// Return the table schema
+  /// \brief Return the table schema
   std::shared_ptr<Schema> schema() const { return schema_; }

-  /// Return a column by index
+  /// \brief Return a column by index
   virtual std::shared_ptr<ChunkedArray> column(int i) const = 0;

+  /// \brief Return vector of all columns for table
+  std::vector<std::shared_ptr<ChunkedArray>> columns() const;
+
   /// Return a column's field by index
   std::shared_ptr<Field> field(int i) const { return schema_->field(i); }

+  /// \brief Return vector of all fields for table
+  std::vector<std::shared_ptr<Field>> fields() const;
+
   /// \brief Construct a zero-copy slice of the table with the
   /// indicated offset and length
   ///

cpp/src/arrow/table_test.cc

Lines changed: 23 additions & 0 deletions
@@ -273,6 +273,29 @@ TEST_F(TestTable, InvalidColumns) {
   ASSERT_RAISES(Invalid, table_->ValidateFull());
 }

+TEST_F(TestTable, AllColumnsAndFields) {
+  const int length = 100;
+  MakeExample1(length);
+  table_ = Table::Make(schema_, columns_);
+
+  auto columns = table_->columns();
+  auto fields = table_->fields();
+
+  for (int i = 0; i < table_->num_columns(); ++i) {
+    AssertChunkedEqual(*table_->column(i), *columns[i]);
+    AssertFieldEqual(*table_->field(i), *fields[i]);
+  }
+
+  // Zero length
+  std::vector<std::shared_ptr<Array>> t2_columns;
+  auto t2 = Table::Make(::arrow::schema({}), t2_columns);
+  columns = t2->columns();
+  fields = t2->fields();
+
+  ASSERT_EQ(0, columns.size());
+  ASSERT_EQ(0, fields.size());
+}
+
 TEST_F(TestTable, Equals) {
   const int length = 100;
   MakeExample1(length);

cpp/src/arrow/type_traits.h

Lines changed: 13 additions & 0 deletions
@@ -703,6 +703,19 @@ static inline bool is_primitive(Type::type type_id) {
   return false;
 }

+static inline bool is_base_binary_like(Type::type type_id) {
+  switch (type_id) {
+    case Type::BINARY:
+    case Type::LARGE_BINARY:
+    case Type::STRING:
+    case Type::LARGE_STRING:
+      return true;
+    default:
+      break;
+  }
+  return false;
+}
+
 static inline bool is_binary_like(Type::type type_id) {
   switch (type_id) {
     case Type::BINARY:
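The new `is_base_binary_like` predicate above widens the binary-like check to include the 64-bit-offset "large" variants. As a quick Python sketch of the same predicate (type-id strings are illustrative stand-ins for the C++ `Type::type` enum):

```python
# Illustrative Python analogue of is_base_binary_like above: membership in
# the set of binary-like types including the "large" (64-bit offset) variants.
BASE_BINARY_LIKE = {"binary", "large_binary", "string", "large_string"}

def is_base_binary_like(type_id):
    return type_id in BASE_BINARY_LIKE

assert is_base_binary_like("large_string")
assert not is_base_binary_like("int64")
```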

docs/source/python/pandas.rst

Lines changed: 72 additions & 0 deletions
@@ -221,3 +221,75 @@ Time types
 ~~~~~~~~~~

 TODO
+
+Memory Usage and Zero Copy
+--------------------------
+
+When converting from Arrow data structures to pandas objects using various
+``to_pandas`` methods, one must occasionally be mindful of issues related to
+performance and memory usage.
+
+Since pandas's internal data representation is generally different from the
+Arrow columnar format, zero copy conversions (where no memory allocation or
+computation is required) are only possible in certain limited cases.
+
+In the worst case scenario, calling ``to_pandas`` will result in two versions
+of the data in memory, one for Arrow and one for pandas, yielding approximately
+twice the memory footprint. We have implemented some mitigations for this case,
+particularly when creating large ``DataFrame`` objects, that we describe below.
+
+Zero Copy Series Conversions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Zero copy conversions from ``Array`` or ``ChunkedArray`` to NumPy arrays or
+pandas Series are possible in certain narrow cases:
+
+* The Arrow data is stored in an integer (signed or unsigned ``int8`` through
+  ``int64``) or floating point type (``float16`` through ``float64``). This
+  includes many numeric types as well as timestamps.
+* The Arrow data has no null values (since these are represented using bitmaps
+  which are not supported by pandas).
+* For ``ChunkedArray``, the data consists of a single chunk,
+  i.e. ``arr.num_chunks == 1``. Multiple chunks will always require a copy
+  because of pandas's contiguousness requirement.
+
+In these scenarios, ``to_pandas`` or ``to_numpy`` will be zero copy. In all
+other scenarios, a copy will be required.
+
+Reducing Memory Use in ``Table.to_pandas``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As of this writing, pandas applies a data management strategy called
+"consolidation" to collect like-typed DataFrame columns in two-dimensional
+NumPy arrays, referred to internally as "blocks". We have gone to great effort
+to construct the precise "consolidated" blocks so that pandas will not perform
+any further allocation or copies after we hand off the data to
+``pandas.DataFrame``. The obvious downside of this consolidation strategy is
+that it forces a "memory doubling".
+
+To limit the potential effects of "memory doubling" during
+``Table.to_pandas``, we provide a couple of options:
+
+* ``split_blocks=True``: when enabled, ``Table.to_pandas`` produces one
+  internal DataFrame "block" for each column, skipping the "consolidation"
+  step. Note that many pandas operations will trigger consolidation anyway,
+  but the peak memory use may be less than the worst case scenario of a full
+  memory doubling. As a result of this option, we are able to do zero copy
+  conversions of columns in the same cases where we can do zero copy with
+  ``Array`` and ``ChunkedArray``.
+* ``self_destruct=True``: this destroys the internal Arrow memory buffers in
+  each column of the ``Table`` object as they are converted to the
+  pandas-compatible representation, potentially releasing memory to the
+  operating system as soon as a column is converted. Note that this renders
+  the calling ``Table`` object unsafe for further use, and any further methods
+  called will cause your Python process to crash.
+
+Used together, the call
+
+.. code-block:: python
+
+   df = table.to_pandas(split_blocks=True, self_destruct=True)
+   del table  # not necessary, but a good practice
+
+will yield significantly lower memory usage in some scenarios. Without these
+options, ``to_pandas`` will always double memory.
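The consolidation-versus-split-blocks tradeoff the documentation describes can be modeled in plain Python: consolidating like-typed columns into one combined block necessarily copies every column, while split blocks can wrap each source buffer zero copy. This is a toy model using bytearrays and memoryviews, not pyarrow or pandas internals.

```python
# Toy model of pandas block consolidation vs. split_blocks. Consolidation
# copies every column into one combined block; split blocks can wrap each
# source buffer zero copy (modeled here with memoryview).

def consolidated_block(columns):
    # One combined buffer holding all columns: always a copy.
    block = bytearray()
    for col in columns:
        block += col
    return bytes(block)

def split_blocks(columns):
    # One block per column: a zero-copy view over each source buffer.
    return [memoryview(col) for col in columns]

cols = [bytearray(b"\x01" * 8), bytearray(b"\x02" * 8)]

blocks = split_blocks(cols)
cols[0][0] = 0xFF
assert blocks[0][0] == 0xFF  # views share memory with the source

combined = consolidated_block(cols)
cols[1][0] = 0x00
assert combined[8] == 0x02  # the consolidated copy is independent
```

The view-based path is why later mutation of the source shows through in the split case but not in the consolidated one: only the latter owns its own memory, at the cost of the extra copy.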
