REGR: DataFrame.transpose resulting in not contiguous data on nullable EAs #57474
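One way to observe the behavior named in the title is to transpose a small frame of nullable integers and inspect the memory layout of a resulting column. This is only a sketch: it reads the private `_data` attribute of the masked array, which is an implementation detail, and the printed flag depends on the installed pandas version, so no expected output is given.

```python
import pandas as pd

# Small frame backed by nullable (masked) extension arrays.
df = pd.DataFrame(
    {
        "a": pd.array([1, 2, 3], dtype="Int8"),
        "b": pd.array([4, 5, 6], dtype="Int8"),
    }
)
result = df.T

# Each column of the transposed frame is itself a masked array.
column = result[0].array

# _data is private; it is used here only to look at the NumPy flags.
is_contiguous = bool(column._data.flags["C_CONTIGUOUS"])
print("first transposed column contiguous:", is_contiguous)
```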
# Conflicts: # doc/source/whatsnew/v2.2.1.rst
pandas/core/arrays/masked.py
@@ -1661,7 +1661,9 @@ def transpose_homogeneous_masked_arrays(
     arr_type = dtype.construct_array_type()
     transposed_arrays: list[BaseMaskedArray] = []
     for i in range(transposed_values.shape[1]):
         transposed_arr = arr_type(transposed_values[:, i], mask=transposed_masks[:, i])
masked_arrays = list(masked_arrays)
dtype = masked_arrays[0].dtype
values = [arr._data.reshape(1, -1) for arr in masked_arrays]
# Write into an F-ordered buffer so each column slice below is contiguous.
transposed_values = np.concatenate(
    values,
    axis=0,
    out=np.empty(
        (len(masked_arrays), len(masked_arrays[0])),
        order="F",
        dtype=dtype.numpy_dtype,
    ),
)
masks = [arr._mask.reshape(1, -1) for arr in masked_arrays]
# Use the builtin bool; np.bool does not exist in NumPy 1.24-1.26.
transposed_masks = np.concatenate(
    masks, axis=0, out=np.empty_like(transposed_values, dtype=bool)
)
arr_type = dtype.construct_array_type()
transposed_arrays: list[BaseMaskedArray] = []
for i in range(transposed_values.shape[1]):
    transposed_arr = arr_type(
        transposed_values[:, i], mask=transposed_masks[:, i]
    )
    transposed_arrays.append(transposed_arr)
return transposed_arrays
This should be faster (it avoids the second copy), but I'm not sure I can come up with a meaningful benchmark...
import numpy as np
import pandas as pd
from pandas.core.arrays.masked import transpose_homogeneous_masked_arrays

arr = [
    pd.array(np.random.randint(1, 1_000_000, (100,)), dtype="Int64"),
    pd.array(np.random.randint(1, 1_000_000, (100,)), dtype="Int64"),
    pd.array(np.random.randint(1, 1_000_000, (100,)), dtype="Int64"),
    pd.array(np.random.randint(1, 1_000_000, (100,)), dtype="Int64"),
]
# Repeat the list to get 40_000 arrays of length 100.
arr = arr * 10_000
transpose_homogeneous_masked_arrays(arr)
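The effect of passing an F-ordered `out` buffer to `np.concatenate` can be seen with plain NumPy. The sketch below is not the pandas implementation; the array sizes and variable names are made up for illustration. It stacks three rows twice, once into the default C-ordered result and once into an F-ordered buffer, then compares the contiguity of a column slice from each.

```python
import numpy as np

# Three 1-D inputs, each reshaped to a single row, as in the suggestion above.
rows = [np.arange(5, dtype=np.int64) + 10 * k for k in range(3)]
stacked = [row.reshape(1, -1) for row in rows]

# Default: np.concatenate allocates a C-ordered (row-major) result.
c_buf = np.concatenate(stacked, axis=0)

# Alternative: write the result straight into an F-ordered buffer.
f_buf = np.concatenate(
    stacked,
    axis=0,
    out=np.empty((len(rows), len(rows[0])), order="F", dtype=np.int64),
)

# Same values, different memory layout for a column slice.
c_col = c_buf[:, 0]
f_col = f_buf[:, 0]
print(c_col.flags["C_CONTIGUOUS"], f_col.flags["C_CONTIGUOUS"])
```

Because the column slices of the F-ordered buffer are already contiguous, they need no further copy before being wrapped in an array object.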
Nice!
size = 100_000
df = pd.DataFrame(
{
"a": pd.array([0]*size, dtype="Int8"),
"b": pd.array([0]*size, dtype="Int8"),
}
)
%timeit df.T
# main
416 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# orig PR
477 ms ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# phofl
417 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Ah, nice that we stay consistent on the Python-dominated benchmark as well.
b3332bb to 7cf832c
thx @rhshadrach
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement.