Commit 129108f

PERF: Bypass chunking/validation logic in StringDtype.__from_arrow__ (#47781)
* Bypass chunking/validation logic in __from_arrow__

  Instead of casting each chunk to an array, converting it to a StringArray, and then concatenating the results, use pyarrow to concatenate the chunks and convert the result to numpy. Finally, bypass the validation logic by initializing NDArrayBacked instead of StringArray.

* Handle zero-chunks correctly & convert None to NA
* Add change to whatsnew
* Add to v2 whatsnew
* Add GH issue to comment about validation bypass
* Add # to GH issue
* Move release note to v2.1.0
1 parent 920c025 commit 129108f

2 files changed, +13 -9 lines changed

doc/source/whatsnew/v2.1.0.rst

+1
@@ -102,6 +102,7 @@ Performance improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~
 - Performance improvement in :meth:`DataFrame.first_valid_index` and :meth:`DataFrame.last_valid_index` for extension array dtypes (:issue:`51549`)
 - Performance improvement in :meth:`DataFrame.clip` and :meth:`Series.clip` (:issue:`51472`)
+- Performance improvement in :func:`read_parquet` on string columns when using ``use_nullable_dtypes=True`` (:issue:`47345`)
 -
 
 .. ---------------------------------------------------------------------------

pandas/core/arrays/string_.py

+12 -9
@@ -203,16 +203,19 @@ def __from_arrow__(
             # pyarrow.ChunkedArray
             chunks = array.chunks
 
-        results = []
-        for arr in chunks:
-            # using _from_sequence to ensure None is converted to NA
-            str_arr = StringArray._from_sequence(np.array(arr))
-            results.append(str_arr)
-
-        if results:
-            return StringArray._concat_same_type(results)
+        if len(chunks) == 0:
+            arr = np.array([], dtype=object)
         else:
-            return StringArray(np.array([], dtype="object"))
+            arr = pyarrow.concat_arrays(chunks).to_numpy(zero_copy_only=False)
+            arr = lib.convert_nans_to_NA(arr)
+        # Bypass validation inside StringArray constructor, see GH#47781
+        new_string_array = StringArray.__new__(StringArray)
+        NDArrayBacked.__init__(
+            new_string_array,
+            arr,
+            StringDtype(storage="python"),
+        )
+        return new_string_array
 
 
 class BaseStringArray(ExtensionArray):
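
The validation bypass in the hunk above relies on a general Python pattern: allocate the instance with `__new__`, then call the base-class initializer directly so the subclass's validating `__init__` never runs. A minimal pure-Python sketch of that pattern follows. The classes `Backed` and `ValidatedStrings` and the method `from_trusted` are hypothetical stand-ins for illustration, not pandas APIs.

```python
class Backed:
    """Hypothetical base class that only stores state (no validation)."""

    def __init__(self, values, dtype):
        self._values = values
        self._dtype = dtype


class ValidatedStrings(Backed):
    """Hypothetical container whose constructor validates every element."""

    def __init__(self, values):
        for v in values:
            if v is not None and not isinstance(v, str):
                raise ValueError("values must be str or None")
        super().__init__(values, "string")

    @classmethod
    def from_trusted(cls, values):
        # Allocate without running ValidatedStrings.__init__, then call the
        # base initializer directly. Only appropriate when the caller already
        # guarantees that the values are valid.
        obj = cls.__new__(cls)
        Backed.__init__(obj, values, "string")
        return obj


checked = ValidatedStrings(["x", None])
trusted = ValidatedStrings.from_trusted(["x", None])
```

The trade-off is that `from_trusted` accepts anything, so the guarantee of valid contents moves from the constructor to the caller.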
