PERF: dataframe construction from recarray is slow #44826

GYHHAHA · 2021-12-09T12:43:19Z

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

I find that constructing a DataFrame directly from a recarray is slow, while DataFrame.from_records() on the same recarray is fast. Both paths start from the same data, so the gap is unexpected.

Suppose the recarray is generated as follows:

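import numpy as np
import pandas as pd

n = 7
df = pd.DataFrame(
    {
        "A": np.random.rand(int(10**n)),
        "B": np.random.rand(int(10**n)),
        "C": ["a"]*int(10**n)
    }
)
# to_records returns a numpy recarray; index=False leaves the index out of the fields
arr = df.to_records(index=False)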

Time comparison:

# Nearly 1 minute
>>> df_new = pd.DataFrame(arr)
# Less than 0.1 second
>>> df_new = pd.DataFrame.from_records(arr)
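To compare the two calls on the same machine, a minimal timing sketch is below. It assumes a smaller array (`n_rows = 10**5`, not the 10**7 used above) so that it finishes quickly on any pandas version; `timed` is a hypothetical helper written for this example, and the measured times will depend on the pandas version and hardware.

```python
import time

import numpy as np
import pandas as pd

n_rows = 10**5  # assumption: smaller than the 10**7 in the report, to keep the run short

df = pd.DataFrame(
    {
        "A": np.random.rand(n_rows),
        "B": np.random.rand(n_rows),
        "C": ["a"] * n_rows,
    }
)
arr = df.to_records(index=False)


def timed(build):
    """Hypothetical helper: call build() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = build()
    return result, time.perf_counter() - start


df_ctor, t_ctor = timed(lambda: pd.DataFrame(arr))
df_rec, t_rec = timed(lambda: pd.DataFrame.from_records(arr))

print(f"pd.DataFrame(arr):              {t_ctor:.4f} s")
print(f"pd.DataFrame.from_records(arr): {t_rec:.4f} s")
```

On a pandas version affected by this issue the first number should be much larger than the second; on a version that includes the fix the two should be close.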

The cause of this odd behaviour is in pandas.core.internals.construction.rec_array_to_mgr. It passes the recarray to _get_names_from_index, which iterates over every record in the array in a Python-level loop. For a recarray this loop is large and unnecessary.

pandas/pandas/core/internals/construction.py

Line 179 in a3702e2

index = _get_names_from_index(fdata)

pandas/pandas/core/internals/construction.py

Line 724 in a3702e2

has_some_name = any(getattr(s, "name", None) is not None for s in data)
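The cost of that line can be illustrated without pandas. The sketch below applies the same `any(...)` expression to a small recarray and counts how many records are visited; `counting_iter` and the five-row array are assumptions made for illustration, not pandas code. Since the records have no field called `name`, the lookup falls back to `None` every time and `any` cannot stop early.

```python
import numpy as np

# Small stand-in for the recarray in the report (fields "A" and "B" only).
arr = np.rec.fromarrays(
    [np.arange(5, dtype=float), np.arange(5, dtype=float)],
    names=["A", "B"],
)

visited = []


def counting_iter(data):
    """Hypothetical wrapper: yield each record and remember that it was visited."""
    for s in data:
        visited.append(s)
        yield s


# Same expression as in _get_names_from_index, with the counting wrapper added.
has_some_name = any(
    getattr(s, "name", None) is not None for s in counting_iter(arr)
)

print(has_some_name, len(visited), len(arr))
```

Every record is touched once from Python, so the work grows linearly with the number of rows, which matches the slowdown seen with 10**7 rows.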

I believe this function was designed for nested data, since it is also called in nested_data_to_arrays:

pandas/pandas/core/internals/construction.py

Line 518 in a3702e2

index = _get_names_from_index(data)

So a possible fix is to skip this call for recarrays and use default_index(len(data)) directly.
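For readers on an affected version, a user-side sketch of the same idea is below. It relies only on public API and assumes that the default index in question is a plain `RangeIndex` over the row count; `default_index` itself is an internal pandas helper and is not called here. Building the frame from a dict of fields avoids iterating over the records.

```python
import numpy as np
import pandas as pd

n_rows = 1000  # assumption: small size, for illustration only

arr = pd.DataFrame(
    {
        "A": np.random.rand(n_rows),
        "B": np.random.rand(n_rows),
        "C": ["a"] * n_rows,
    }
).to_records(index=False)

# One column per recarray field; no Python-level loop over the rows.
df_fields = pd.DataFrame(
    {name: arr[name] for name in arr.dtype.names},
    index=pd.RangeIndex(len(arr)),
)

# The already-fast path from the report, for comparison.
df_rec = pd.DataFrame.from_records(arr)

print(df_fields.index)
```

Both frames end up with the same columns and a range index of length `n_rows`; `DataFrame.from_records(arr)` remains the simpler workaround.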

Installed Versions

1.3.4

Prior Performance

No response

GYHHAHA added the Needs Triage (issue that has not been reviewed by a pandas team member) and Performance (memory or execution speed performance) labels Dec 9, 2021

GYHHAHA mentioned this issue Dec 9, 2021

PERF: faster dataframe construction from recarray #44827

Merged

4 tasks

jreback closed this as completed in #44827 Dec 9, 2021
