Slow (and weird) empty dataframe creation #28188

Closed
astyonax opened this issue Aug 28, 2019 · 5 comments
Labels
Constructors (Series/DataFrame/Index/pd.array Constructors), Duplicate Report (Duplicate issue or pull request), Performance (Memory or execution speed performance)

Comments

@astyonax
astyonax commented Aug 28, 2019

I measured the creation of an empty dataframe with 3 similar arguments:

import numpy as np
import pandas as pd

cols = np.arange(100)
index = np.arange(1000)

%timeit pd.DataFrame(columns=cols, index=index)
# 100 loops, best of 3: 18.8 ms per loop
%timeit pd.DataFrame({}, columns=cols, index=index)
# 100 loops, best of 3: 18.7 ms per loop
%timeit pd.DataFrame(np.nan, columns=cols, index=index)
# 1000 loops, best of 3: 434 µs per loop

z1 = pd.DataFrame(columns=cols, index=index)  # dtype -> object
z2 = pd.DataFrame({}, columns=cols, index=index)  # dtype -> object
z3 = pd.DataFrame(np.nan, columns=cols, index=index)  # dtype -> float

If I understand correctly the code in here:

data = {}

the first two constructions are equivalent: the default value of data is an empty dictionary.
In both cases the output is a dataframe of dtype object.

The third, with np.nan as the initial value, is about 40 times faster, and its dtype is float, as expected.
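The measurements above use the IPython `%timeit` magic. A minimal sketch of the same comparison with the standard-library `timeit` module follows, for anyone who wants to rerun it as a plain script. The repeat counts and the `constructors` and `results` names are assumptions made for illustration, and the timings will differ between pandas versions and machines.

```python
import timeit

import numpy as np
import pandas as pd

cols = np.arange(100)
index = np.arange(1000)

# The three constructions from the report, wrapped so timeit can call them.
constructors = {
    "no data": lambda: pd.DataFrame(columns=cols, index=index),
    "empty dict": lambda: pd.DataFrame({}, columns=cols, index=index),
    "np.nan": lambda: pd.DataFrame(np.nan, columns=cols, index=index),
}

results = {}
for name, build in constructors.items():
    frame = build()
    # Best of 3 repeats of 10 calls each, converted to seconds per call.
    per_call = min(timeit.repeat(build, number=10, repeat=3)) / 10
    dtypes = set(frame.dtypes.astype(str))
    results[name] = (frame.shape, dtypes, per_call)
    print(f"{name:>10}: {per_call * 1e3:8.3f} ms per call, dtypes={dtypes}")
```

Printing the set of column dtypes next to each timing makes it easy to see whether a given pandas version still produces object columns for the first two calls.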

So I would change:

  1. the default value of data to np.nan. If one creates an empty dataframe with no specific dtype, the best pandas can do is return the cheapest and fastest dataframe that can hold "empty" values. A float dataframe filled with np.nan is therefore a good candidate.

  2. the documentation to report the current behavior.
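Until something like the first point is decided, the same effect can be had in user code. The sketch below is only an illustration of the proposal, not a pandas API: `empty_frame` is a hypothetical helper name, and it simply passes np.nan as data when no dtype is requested.

```python
import numpy as np
import pandas as pd


def empty_frame(index, columns, dtype=None):
    """Hypothetical helper: build an all-missing frame of a given shape.

    When no dtype is requested, fill with np.nan so the result is float64,
    which is what the proposal above suggests as the default.
    """
    if dtype is None:
        return pd.DataFrame(np.nan, index=index, columns=columns)
    # An explicit dtype is passed through unchanged to the constructor.
    return pd.DataFrame(index=index, columns=columns, dtype=dtype)


frame = empty_frame(index=np.arange(1000), columns=np.arange(100))
print(frame.shape, set(frame.dtypes.astype(str)))
```

Keeping an explicit `dtype` argument means callers who really want object columns can still ask for them.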

What is your opinion?

@WillAyd
Member

WillAyd commented Aug 28, 2019

I vaguely recall there being an issue about the performance of frame construction from a dict (I couldn't find it; @TomAugspurger might know), so this may be related to that. That said, if you have a simple solution here you are certainly welcome to submit a PR for review.

WillAyd added the Performance label Aug 28, 2019
@astyonax
Author

Ok. I'll come up with something asap

@machow
machow commented Aug 30, 2019

@WillAyd I wonder if this is related to the issue with _try_cast running too deep on Series construction (#28145).

In the slow case, _try_cast takes up a third of the construction time. It does expensive checks on object or int dtypes to see whether the data can be cast to a datetime.

Note how much time is spent sanitizing the array

image

Zooming in further, a big chunk of this is _try_cast

image
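For readers without the screenshots, a rough equivalent of this profile can be produced with the standard-library `cProfile` and `pstats` modules. This is a sketch under assumptions: the loop count and the number of rows printed are arbitrary, and the internal function names that show up (such as _try_cast) depend on the installed pandas version.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

cols = np.arange(100)
index = np.arange(1000)

# Profile repeated construction of the slow case from the report.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    pd.DataFrame(columns=cols, index=index)
profiler.disable()

# Collect the 15 most expensive entries by cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(15)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the constructor and the helpers it calls at the top, which is the same view the zoomed-in images give.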

@astyonax
Author

astyonax commented Aug 31, 2019

It may well be the case that with my changes _try_cast is not called anymore. Here is the snakeviz output (a tool I didn't know existed before, btw) for this code:

%%snakeviz
for j in range(10000):
    pd.DataFrame(columns=cols, index=index)

Screenshot from 2019-08-31 22-35-23

and zooming in on construction.py:
Screenshot from 2019-08-31 22-35-56

I think the reason is that by assigning np.nan to data, we avoid the init_dict function.

BTW: I didn't have time to work on the tests yet

jbrockmendel added the Constructors label Feb 25, 2020
@simonjayhawkins
Member

closing as duplicate of #25887

simonjayhawkins added the Duplicate Report label Jun 5, 2020