
Extremely slow construction of large DataFrames out of ndarrays? #8161


Closed
aldanor opened this issue Sep 2, 2014 · 5 comments
Labels
API Design Duplicate Report Duplicate issue or pull request Performance Memory or execution speed performance

Comments

@aldanor
Contributor

aldanor commented Sep 2, 2014

I was trying to figure out the most efficient way to create a dataframe out of a large numpy record array (I initially naively thought this could be done in a zero-copy way) and ran some simple tests:


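import numpy as np
import pandas as pd

n_columns = 2
n_records = int(15e6)

# Two int32 columns packed into a record array with fields 'f0' and 'f1'
data = np.zeros((n_columns, n_records), np.int32)
arr = np.rec.fromarrays(data)

# Construct directly from the record array
%timeit -n1 -r1 df = pd.DataFrame(arr)
%timeit -n1 -r1 df = pd.DataFrame(arr, copy=False)
# Baseline: copy each field into its own contiguous array
%timeit -n1 -r1 f0, f1 = arr['f0'].copy(), arr['f1'].copy()
f0, f1 = arr['f0'].copy(), arr['f1'].copy()
# Construct from a dict of the contiguous arrays
%timeit -n1 -r1 df = pd.DataFrame({'f0': f0, 'f1': f1})
%timeit -n1 -r1 df = pd.DataFrame({'f0': f0, 'f1': f1}, copy=False)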

1 loops, best of 1: 2.47 s per loop
1 loops, best of 1: 2.42 s per loop
1 loops, best of 1: 48.2 ms per loop
1 loops, best of 1: 4.25 s per loop
1 loops, best of 1: 4.57 s per loop

I wonder what the DataFrame constructor is doing for several seconds straight even when it's provided with a materialized ndarray (note that copying both arrays takes < 50ms)? This is on v0.14.1.

P.S. Is there a more efficient way to make a dataframe out of a recarray?
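
For anyone who wants to reproduce the comparison outside IPython, here is a minimal sketch of the same measurements as a plain Python script. The `time_once` helper is hypothetical (not part of pandas or numpy), the row count is reduced from the 15e6 in the report so it runs quickly, and `np.rec.fromarrays` is used as the public alias of `np.core.records.fromarrays`. Absolute timings will differ by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def time_once(label, fn):
    """Hypothetical helper: run fn once, return (label, elapsed seconds, result)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return label, elapsed, result


n_columns = 2
n_records = 100_000  # assumption: far fewer rows than the report, to keep the run short

data = np.zeros((n_columns, n_records), np.int32)
arr = np.rec.fromarrays(data)  # record array with default field names 'f0', 'f1'
f0, f1 = arr["f0"].copy(), arr["f1"].copy()

timings = [
    time_once("DataFrame(arr)", lambda: pd.DataFrame(arr)),
    time_once("DataFrame(dict)", lambda: pd.DataFrame({"f0": f0, "f1": f1})),
    time_once("copy both fields", lambda: (arr["f0"].copy(), arr["f1"].copy())),
]

for label, elapsed, _ in timings:
    print(f"{label:>18}: {elapsed * 1e3:.2f} ms")
```

A single run is noisy, so treat the numbers as a rough indication only; the standard-library `timeit` module is the better tool for anything more careful.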

@jreback
Contributor

jreback commented Sep 2, 2014

use .from_record

what you are doing forces copies regardless of the copy flag

@jreback
Contributor

jreback commented Sep 2, 2014

sorry : from_records

@aldanor
Contributor Author

aldanor commented Sep 2, 2014

Ok, indeed, that seems more like it:

%timeit -n1 -r1 df = pd.DataFrame.from_records({'f0': f0, 'f1': f1})
%timeit -n1 -r1 df = pd.DataFrame.from_records(arr)

1 loops, best of 1: 46.5 ms per loop
1 loops, best of 1: 46.2 ms per loop

I just came across #4916 today and wrongly assumed that __init__ does the same as .from_records for recarrays. It's also far from obvious that the constructor forces copying and ignores the copy flag in all the above cases (the last two in the first example in particular)... And why does it take 10x more time than is needed to actually copy the ndarrays?
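
One way to see whether a given construction path copied its input is to ask numpy whether the resulting column still shares memory with the source array. The following is a rough sketch under stated assumptions: it relies on `Series.to_numpy()` and `np.shares_memory`, both of which are newer than the v0.14.1 discussed here, and the answers it prints depend on the installed pandas version, so no particular output is claimed.

```python
import numpy as np
import pandas as pd

arr = np.rec.fromarrays(np.zeros((2, 1000), np.int32))  # fields 'f0', 'f1'
f0, f1 = arr["f0"].copy(), arr["f1"].copy()

# Each entry pairs a constructed frame with the array its 'f0' column came from.
cases = {
    "DataFrame(arr, copy=False)": (pd.DataFrame(arr, copy=False), arr["f0"]),
    "DataFrame(dict, copy=False)": (pd.DataFrame({"f0": f0, "f1": f1}, copy=False), f0),
    "from_records(arr)": (pd.DataFrame.from_records(arr), arr["f0"]),
}

shares = {}
for label, (df, source) in cases.items():
    column = df["f0"].to_numpy()  # underlying values of the column
    shares[label] = bool(np.shares_memory(column, source))
    print(f"{label:>28}: shares memory with input = {shares[label]}")
```

A `False` here means the data was copied somewhere along the way; it does not say how many copies were made or how long they took.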

@jreback
Contributor

jreback commented Sep 2, 2014

this is a longstanding API issue
pr is welcome (it's not really that hard actually, it just needs a dispatch from the DataFrame.__init__ constructor to actually call from_records, essentially)
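
As an illustration of the kind of dispatch described, here is a user-level sketch, not the actual change to pandas internals. The `make_frame` function is a hypothetical wrapper: it checks whether the input is a numpy array with named fields (a structured or record array) and, if so, routes it to `DataFrame.from_records`; anything else goes to the regular constructor.

```python
import numpy as np
import pandas as pd


def make_frame(data):
    """Hypothetical wrapper: send structured/record arrays to from_records."""
    # dtype.names is a tuple of field names for structured dtypes, None otherwise.
    if isinstance(data, np.ndarray) and data.dtype.names is not None:
        return pd.DataFrame.from_records(data)
    return pd.DataFrame(data)


arr = np.rec.fromarrays(np.zeros((2, 5), np.int32))  # fields 'f0', 'f1'
df = make_frame(arr)                                 # takes the from_records path
plain = make_frame(np.zeros((5, 2), np.int32))       # takes the regular constructor path

print(list(df.columns), df.shape)
print(plain.shape)
```

A real fix inside the constructor would also have to handle arguments such as `index`, `columns`, `dtype` and `copy`, which this sketch leaves out.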

@jreback
Contributor

jreback commented Sep 2, 2014

closing this as it's a dupe (will note in the other issue)

2 participants