Slow (and weird) empty dataframe creation #28188
Comments
I vaguely recall an issue about the performance of frame construction from a dict (I couldn't find it, @TomAugspurger might know), so this may be related to that. That said, if you have a simple solution here, you are certainly welcome to submit a PR for review.
Ok. I'll come up with something asap.
@WillAyd I wonder if this is related to the issue with [...]. For the slow case, it takes up a third of the construction time. `_try_cast` does expensive checks on object or int dtypes to see whether it can cast them to a datetime. Note how much time is spent sanitizing the array. Zooming in further, a big chunk of this is [...]
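For anyone who wants to reproduce this kind of breakdown, here is a minimal sketch using the standard-library `cProfile` and `pstats` modules. The 500 x 500 shape and the choice to show the top 15 entries are arbitrary assumptions, and the function names that appear in the report depend on the installed pandas version, so `_try_cast` may not show up at all.

```python
import cProfile
import io
import pstats

import pandas as pd

# Profile the slow path: no data argument, only index and columns.
# The 500 x 500 shape is an arbitrary choice for illustration.
profiler = cProfile.Profile()
profiler.enable()
pd.DataFrame(index=range(500), columns=range(500))
profiler.disable()

# Sort by cumulative time so the outer construction helpers come first,
# then capture the top 15 entries as text.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(15)
report = buffer.getvalue()
print(report)
```

Reading the cumulative column from the top down shows which internal helpers account for most of the construction time.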
Closing as a duplicate of #25887.
I measured the creation of an empty dataframe with 3 similar arguments:
If I understand the code at pandas/pandas/core/frame.py (line 399 in 171c716) correctly, the first 2 constructions are the same: the default value of data is an empty dictionary.
The output is a dataframe of type object.
The 3rd, with np.nan as initial value, is about 40 times faster, and the dtype is float, as expected.
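The original timing snippet is not preserved here, so the following is only a sketch of the comparison, assuming the three calls differ solely in the `data` argument (omitted, `{}`, and `np.nan`). The 300 x 300 shape, the helper names, and the repeat count are illustrative choices, and the measured ratio will vary with the pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative shape; this is an assumption, not the size used in the report.
index = range(300)
columns = range(300)


def default_data():
    # No data argument: pandas falls back to its default.
    return pd.DataFrame(index=index, columns=columns)


def empty_dict():
    # Explicit empty dict as data.
    return pd.DataFrame({}, index=index, columns=columns)


def nan_scalar():
    # Scalar np.nan broadcast over the whole frame.
    return pd.DataFrame(np.nan, index=index, columns=columns)


for build in (default_data, empty_dict, nan_scalar):
    # Average over a few runs to smooth out noise.
    seconds = timeit.timeit(build, number=5) / 5
    frame = build()
    dtypes = sorted({str(dtype) for dtype in frame.dtypes})
    print(f"{build.__name__}: {seconds * 1e3:.2f} ms per call, dtypes={dtypes}")
```

All three frames have the same shape and contain only missing values; what differs is the dtype and the construction time.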
So I would change:
- the default value of data to np.nan. If one creates an empty dataframe with no specific dtype, the best pandas can do is return the cheapest and fastest dataframe that can hold "empty" values. A float dataframe filled with np.nan is therefore a good candidate.
- the documentation, so that it reports the current behavior.
What is your opinion?