
PERF: dataframe construction from recarray is slow #44826

Closed
2 of 3 tasks
GYHHAHA opened this issue Dec 9, 2021 · 0 comments · Fixed by #44827
Labels
Needs Triage (Issue that has not been reviewed by a pandas team member), Performance (Memory or execution speed performance)

Comments

@GYHHAHA
Contributor

GYHHAHA commented Dec 9, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

I find that constructing a DataFrame directly from a recarray is slow, while from_records() on the same data is fast. This difference is unreasonable.

Suppose the recarray is generated as follows:

import numpy as np
import pandas as pd

n = 7
size = 10**n  # number of rows
df = pd.DataFrame(
    {
        "A": np.random.rand(size),
        "B": np.random.rand(size),
        "C": ["a"] * size,
    }
)
arr = df.to_records(index=False)  # record array without the index column

Time comparison:

# Takes nearly 1 minute
>>> df_new = pd.DataFrame(arr)
# Takes less than 0.1 second
>>> df_new = pd.DataFrame.from_records(arr)
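The comparison above can be reproduced at a smaller scale with a rough timing sketch like the one below. The helper names `build_records` and `time_call` are made up for illustration, and the row count is reduced to 10**5 so the script finishes quickly even on an affected pandas version. Absolute timings will depend on the machine and on the pandas version installed.

```python
import time

import numpy as np
import pandas as pd


def build_records(n_rows):
    """Build a record array shaped like the one in the report (hypothetical helper)."""
    df = pd.DataFrame(
        {
            "A": np.random.rand(n_rows),
            "B": np.random.rand(n_rows),
            "C": ["a"] * n_rows,
        }
    )
    return df.to_records(index=False)


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call (hypothetical helper)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


arr = build_records(10**5)
df_ctor, t_ctor = time_call(pd.DataFrame, arr)
df_rec, t_rec = time_call(pd.DataFrame.from_records, arr)

print(f"pd.DataFrame(arr):              {t_ctor:.4f} s")
print(f"pd.DataFrame.from_records(arr): {t_rec:.4f} s")
```

A single call per constructor is only a coarse measurement. For steadier numbers, repeating each call several times (for example with the standard library `timeit` module) would be a reasonable next step.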

The slowdown comes from pandas.core.internals.construction.rec_array_to_mgr, which passes the recarray to _get_names_from_index. That function runs a Python-level loop over every record in the array, which is unnecessary for a recarray. The two relevant lines are the call in rec_array_to_mgr and the loop in _get_names_from_index:

index = _get_names_from_index(fdata)

has_some_name = any(getattr(s, "name", None) is not None for s in data)
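To see why that `any(...)` expression is costly for a recarray, here is a small sketch that mimics the loop outside pandas. It assumes the records have no field or attribute called `name`, so `getattr` returns `None` for every row and the scan can never stop early. The function `scan_for_names` is a hypothetical stand-in written for this illustration, not the pandas implementation.

```python
import numpy as np


def scan_for_names(data):
    """Mimic the name check: return (found_a_name, number_of_elements_visited)."""
    visited = 0
    for s in data:
        visited += 1
        if getattr(s, "name", None) is not None:
            return True, visited
    return False, visited


n_rows = 10**5
arr = np.rec.fromarrays(
    [np.random.rand(n_rows), np.random.rand(n_rows)],
    names=["A", "B"],
)

found, visited = scan_for_names(arr)
print(found, visited)
```

Each iteration creates a Python-level record object and performs a failed attribute lookup, so the cost grows linearly with the number of rows.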

I believe this function is designed for nested data, since it is also called in nested_data_to_arrays:

index = _get_names_from_index(data)

So a possible fix is to use default_index(len(data)) directly.
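As a rough illustration of that idea using only public API, the sketch below builds the index from the length of the record array without inspecting any rows. It assumes that `pd.RangeIndex(len(arr))` is an adequate public stand-in for the internal `default_index(len(data))` helper, and the dict-of-columns construction is just a way to show the result, not the code path pandas uses internally.

```python
import numpy as np
import pandas as pd

# Small record array with two fields (illustrative data).
arr = np.rec.fromarrays(
    [np.arange(5), np.arange(5) * 2.0],
    names=["A", "B"],
)

# Assumption: a plain positional index of the same length is what is wanted.
# Its construction depends only on len(arr), not on the row contents.
index = pd.RangeIndex(len(arr))

# Pull each field out as a whole column instead of visiting rows one by one.
df = pd.DataFrame({name: arr[name] for name in arr.dtype.names}, index=index)
print(df)
```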

Installed Versions

1.3.4

Prior Performance

No response

GYHHAHA added the Needs Triage and Performance labels Dec 9, 2021