Slow (and weird) empty dataframe creation #28188

astyonax · 2019-08-28T11:31:07Z

I measured the creation of an empty dataframe in three similar ways:

import numpy as np
import pandas as pd
cols = np.arange(100)
index = np.arange(1000)

%timeit pd.DataFrame(columns=cols, index=index)
# 100 loops, best of 3: 18.8 ms per loop
%timeit pd.DataFrame({}, columns=cols, index=index)
# 100 loops, best of 3: 18.7 ms per loop
%timeit pd.DataFrame(np.nan, columns=cols, index=index)
# 1000 loops, best of 3: 434 µs per loop

z1 = pd.DataFrame(columns=cols, index=index)          # dtype -> object
z2 = pd.DataFrame({}, columns=cols, index=index)      # dtype -> object
z3 = pd.DataFrame(np.nan, columns=cols, index=index)  # dtype -> float

If I understand correctly the code in here:

pandas/pandas/core/frame.py

Line 399 in 171c716

data = {}

the first 2 constructions are the same: the default value of data is an empty dictionary.
The output is a dataframe of type object.

The 3rd, with np.nan as initial value, is about 40 times faster, and the dtype is float, as expected.
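For anyone who wants to reproduce this outside IPython, here is a minimal sketch using the standard timeit module. The helper names (build_default, build_nan, time_call) are made up for illustration, and the absolute timings will depend on the machine and the pandas version, so newer releases may not show the same gap.

```python
import timeit

import numpy as np
import pandas as pd

cols = np.arange(100)
index = np.arange(1000)


def build_default():
    # No data argument: pandas falls back to its default and returns object columns.
    return pd.DataFrame(columns=cols, index=index)


def build_nan():
    # Scalar np.nan as data: pandas returns float64 columns filled with NaN.
    return pd.DataFrame(np.nan, columns=cols, index=index)


def time_call(func, number=20):
    """Return the average wall-clock seconds per call over `number` calls."""
    return timeit.timeit(func, number=number) / number


if __name__ == "__main__":
    for name, func in [("default", build_default), ("np.nan", build_nan)]:
        frame = func()
        dtypes = sorted({str(dtype) for dtype in frame.dtypes})
        print(f"{name}: {time_call(func) * 1e3:.3f} ms per call, dtypes={dtypes}")
```

The printed dtypes make the object vs. float difference visible next to the timings.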

So I would change:

1. the default value of data to np.nan. If one creates an empty dataframe with no specific dtype, then the best pandas can do is to return the cheapest and fastest dataframe that can contain "empty" values. A dataframe of type float filled with np.nan is therefore a good candidate.
2. the documentation, so that it reports the current behavior.
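Independently of any change to the default, the same result is already available in user code by passing the fill value explicitly. A minimal sketch, where empty_float_frame is a hypothetical helper name and not part of pandas:

```python
import numpy as np
import pandas as pd


def empty_float_frame(index, columns):
    """Hypothetical helper: build an all-NaN float64 frame via a scalar fill.

    Passing np.nan as data gives float64 columns directly, instead of the
    object columns produced when no data is passed.
    """
    return pd.DataFrame(np.nan, index=index, columns=columns)


if __name__ == "__main__":
    frame = empty_float_frame(np.arange(1000), np.arange(100))
    print(frame.shape, frame.dtypes.unique())
```

Casting the default (object) frame with astype(float) should give an equal frame, so the helper only changes how the frame is built, not what it contains.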

What is your opinion?

WillAyd · 2019-08-28T15:26:30Z

I vaguely recall there being an issue with the performance of frame construction from a dict (couldn't find it - @TomAugspurger might know), so this may be related to that. Though if you have a simple solution here as well, you are certainly welcome to submit a PR for review.

astyonax · 2019-08-28T15:32:09Z

Ok. I'll come up with something asap

machow · 2019-08-30T18:23:58Z

@WillAyd I wonder if this is related to the issue with _try_cast running too deep on Series construction #28145

For the slow case, it takes up a third of the construction time. _try_cast does expensive checks on object or int dtypes, to see whether it can cast them to a datetime.

Note how much time is spent sanitizing the array

Zooming in further, a big chunk of this is _try_cast
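A similar breakdown can be obtained without snakeviz by using the standard cProfile and pstats modules. This is only a sketch: profile_empty_construction is a made-up name, and the internal function names that show up (such as _try_cast) depend on the installed pandas version, so they may not appear in newer releases.

```python
import cProfile
import pstats

import numpy as np
import pandas as pd


def profile_empty_construction(n_rows=1000, n_cols=100, repeats=20):
    """Profile repeated construction of an empty frame and return the stats."""
    index = np.arange(n_rows)
    cols = np.arange(n_cols)
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        pd.DataFrame(columns=cols, index=index)
    profiler.disable()
    return pstats.Stats(profiler)


if __name__ == "__main__":
    stats = profile_empty_construction()
    # Sort by cumulative time to see which internal calls dominate.
    stats.sort_stats("cumulative").print_stats(15)
```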

astyonax · 2019-08-31T20:40:35Z

It may well be the case that with my changes _try_cast is not called anymore. Here is the snakeviz output (I didn't know snakeviz existed before, btw) for this code

%%snakeviz
for j in range(10000):
    pd.DataFrame(columns=cols, index=index)

and zooming in construction.py

I think the reason for the speedup is that by assigning np.nan to data, we avoid the init_dict function

BTW: I didn't have time to work on the tests yet

simonjayhawkins · 2020-06-05T17:26:42Z

closing as duplicate of #25887

WillAyd added the "Performance" label (Memory or execution speed performance) Aug 28, 2019

astyonax mentioned this issue Aug 29, 2019 in "PERF: data = np.nan to speed up empty dataframe creation #28225" (closed)

jbrockmendel added the "Constructors" label (Series/DataFrame/Index/pd.array Constructors) Feb 25, 2020

simonjayhawkins closed this as completed Jun 5, 2020

simonjayhawkins added the "Duplicate Report" label (Duplicate issue or pull request) Jun 5, 2020

simonjayhawkins mentioned this issue Jun 5, 2020 in "Empty dataframe creation #25887" (closed)