Skip to content

REGR: DataFrame.transpose resulting in not contiguous data on nullable EAs #57474

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 18, 2024

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Regression Functionality that used to work in a prior pandas version labels Feb 17, 2024
@rhshadrach rhshadrach added this to the 2.2.1 milestone Feb 17, 2024
@@ -1661,7 +1661,9 @@ def transpose_homogeneous_masked_arrays(
arr_type = dtype.construct_array_type()
transposed_arrays: list[BaseMaskedArray] = []
for i in range(transposed_values.shape[1]):
transposed_arr = arr_type(transposed_values[:, i], mask=transposed_masks[:, i])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    masked_arrays = list(masked_arrays)
    dtype = masked_arrays[0].dtype

    values = [arr._data.reshape(1, -1) for arr in masked_arrays]
    transposed_values = np.concatenate(values, axis=0, out=np.empty((len(masked_arrays), len(masked_arrays[0])), order="F", dtype=dtype.numpy_dtype))

    masks = [arr._mask.reshape(1, -1) for arr in masked_arrays]
    transposed_masks = np.concatenate(masks, axis=0, out=np.empty_like(transposed_values, dtype=np.bool))

    arr_type = dtype.construct_array_type()
    transposed_arrays: list[BaseMaskedArray] = []
    for i in range(transposed_values.shape[1]):
        transposed_arr = arr_type(
            transposed_values[:, i], mask=transposed_masks[:, i]
        )
        transposed_arrays.append(transposed_arr)

    return transposed_arrays

this should be faster (avoids the second copy, but not sure if I can come up with a meaningful benchmark...)

arr = [
    pd.array(np.random.randint(1, 1_000_000, (100, )), dtype="Int64"),
    pd.array(np.random.randint(1, 1_000_000, (100, )), dtype="Int64"),
    pd.array(np.random.randint(1, 1_000_000, (100, )), dtype="Int64"),
    pd.array(np.random.randint(1, 1_000_000, (100, )), dtype="Int64"),
]

arr = arr * 10_000

transpose_homogeneous_masked_arrays(arr)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

size = 100_000
df = pd.DataFrame(
    {
        "a": pd.array([0]*size, dtype="Int8"),
        "b": pd.array([0]*size, dtype="Int8"),
    }
)
%timeit df.T

# main
416 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# orig PR
477 ms ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# phofl
417 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice that we keep consistent on the Python dominated benchmark as well

@phofl phofl merged commit abc3efb into pandas-dev:main Feb 18, 2024
@phofl
Copy link
Member

phofl commented Feb 18, 2024

thx @rhshadrach

Copy link

lumberbot-app bot commented Feb 18, 2024

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.2.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 abc3efb63f02814047a1b291ac2b1ee3cd70252f
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #57474: REGR: DataFrame.transpose resulting in not contiguous data on nullable EAs'
  1. Push to a named branch:
git push YOURFORK 2.2.x:auto-backport-of-pr-57474-on-2.2.x
  1. Create a PR against branch 2.2.x, I would have named this PR:

"Backport PR #57474 on branch 2.2.x (REGR: DataFrame.transpose resulting in not contiguous data on nullable EAs)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@rhshadrach rhshadrach deleted the regr_transpose_ea branch February 19, 2024 03:00
rhshadrach added a commit to rhshadrach/pandas that referenced this pull request Feb 19, 2024
phofl pushed a commit that referenced this pull request Feb 19, 2024
…ng in not contiguous data on nullable EAs) (#57496)

Backport PR #57474: REGR: DataFrame.transpose resulting in not contiguous data on nullable EAs
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: "ValueError: ndarray is not C-contiguous" when aggregating concatenated dataframe
3 participants