PERF: data = np.nan to speed up empty dataframe creation #28225

astyonax · 2019-08-29T20:57:08Z

closes Slow (and weird) empty dataframe creation #28188
Short summary
If we have enough information to know the shape of the final
dataframe, we set the value of data to np.nan if not set.
Longer explanation
If I understand the __init__ of DataFrame correctly, with this change
when the first if evaluates True, all following ifs evaluate False
until we enter the else in frame.py:459, and we are done because the dataframe is now initialized with an ndarray.
With the former behavior data always defaults to None,
hence the output dataframe has all dtypes object and its creation
is slow.
With the proposed code the output dataframe has all dtypes float and is
faster to create than before, but only when columns and index are given.
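
To make the dtype difference concrete, here is a minimal sketch that passes np.nan explicitly, which is what the patch would do implicitly. The index and column values are arbitrary examples, and the dtypes noted in the comments are what I expect from released pandas versions rather than something measured for this PR.

```python
import numpy as np
import pandas as pd

index = range(3)
columns = ["a", "b"]

# Current behavior: data defaults to None, so every column gets object dtype.
empty = pd.DataFrame(index=index, columns=columns)

# Explicit NaN scalar: pandas broadcasts it into a float64 block.
filled = pd.DataFrame(np.nan, index=index, columns=columns)

print(empty.dtypes.tolist())
print(filled.dtypes.tolist())
```

Both frames have the same shape and contain only missing values; only the column dtypes differ.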

I don't know how to loosen the current restriction of dtype=None. It would be nice to have a function that verifies that a type is compatible with nan.
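
One possible shape for such a check, as a rough sketch only: dtype_accepts_nan is a hypothetical name, not an existing pandas or NumPy function, and it only looks at plain NumPy dtypes (pandas extension dtypes are out of scope here and would need separate handling).

```python
import numpy as np


def dtype_accepts_nan(dtype) -> bool:
    """Hypothetical helper: True if a NumPy dtype can store np.nan unchanged."""
    if dtype is None:
        # No dtype requested, so the NaN scalar is free to become float64.
        return True
    # 'f' = float, 'c' = complex, 'O' = object; integer, bool, string and
    # datetime kinds have no NaN representation.
    return np.dtype(dtype).kind in "fcO"


print(dtype_accepts_nan(float), dtype_accepts_nan(int))
```

With a helper along these lines, the guard in the patch could in principle test dtype_accepts_nan(dtype) instead of dtype is None, though the data would still need to be cast to the requested dtype afterwards.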

Check that we have enough information to know the shape of the final dataframe and set the value of data to np.nan if not set. With the former behaviour data always defaults to None, hence the output dataframe has dtypes object and its creation is slow. With the proposed code the output dataframe has dtypes float and is faster to create, but only when columns and index are given.

jbrockmendel · 2019-08-29T21:27:09Z

first thing we look for is a test

dsaxton · 2019-08-30T00:23:09Z

pandas/core/frame.py

@@ -387,8 +387,12 @@ def _constructor_expanddim(self):
    # Constructors

    def __init__(self, data=None, index=None, columns=None, dtype=None, copy=False):
+        if data is None and index is not None and columns is not None and dtype is None:
+            data = np.nan
+
        if data is None:


I think this will always override the previous code block

No.
Because when the if in line 390 is true, data becomes np.nan
and the following if evaluates to False because np.nan is not None

>>> import numpy
>>> numpy.nan is None
False

Sorry, you're right
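
For anyone following the exchange above, the control flow can be sketched with a toy function. pick_branch is a hypothetical stand-in for the first two checks of the patched constructor, not the real pandas code, and it uses math.nan in place of np.nan so that it has no dependencies.

```python
import math


def pick_branch(data=None, index=None, columns=None, dtype=None):
    """Toy stand-in for the first two checks in the patched __init__."""
    if data is None and index is not None and columns is not None and dtype is None:
        data = math.nan  # like np.nan, this is a float and therefore not None
    if data is None:
        return "data is None branch"
    return "later branches"


print(pick_branch(index=[0], columns=["a"]))
print(pick_branch())
```

The new guard only diverts the call when index and columns are both given and dtype is left unset; every other call still reaches the original data is None branch.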

astyonax · 2019-08-30T06:47:21Z

first thing we look for is a test

Of course! But I also wanted to see the CI results to check what I'm breaking ;)

WillAyd · 2019-09-02T21:34:16Z

Hi @astyonax - thanks for the contribution. It looks like this might be an incomplete change, however, so I'm going to close it for the time being to keep our PR queue down.

Best advice I can give is to run the tests locally and make sure the change doesn't cause any regressions before pushing to CI. We have performance benchmarks you can run thereafter to measure improvements.

More info can be found in the contributing guide so be sure to give that a look:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#contributing-to-pandas
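
Before setting up the full benchmark suite, a quick local timing can give a first impression. The sketch below uses the standard library timeit as a stand-in, not the project's benchmark tooling; the frame size and repeat count are arbitrary assumptions, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

index = range(1000)
columns = range(100)

# Construction with data left as None (object columns).
t_none = timeit.timeit(lambda: pd.DataFrame(index=index, columns=columns), number=5)

# Construction with an explicit NaN scalar (float64 columns).
t_nan = timeit.timeit(
    lambda: pd.DataFrame(np.nan, index=index, columns=columns), number=5
)

print(f"data=None:   {t_none:.4f} s for 5 runs")
print(f"data=np.nan: {t_nan:.4f} s for 5 runs")
```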

kostyafarber · 2023-01-08T20:06:26Z

What kind of tests are needed on this one?

astyonax added 2 commits August 29, 2019 22:17

if shape given, data should be nan

8a0700d

astyonax changed the title ~~PERF: data = np.nan to speed up empty dataframe creation (#28188)~~ PERF: data = np.nan to speed up empty dataframe creation Aug 29, 2019

simonjayhawkins added the Performance (memory or execution speed performance) label Aug 30, 2019

dsaxton reviewed Aug 30, 2019

WillAyd closed this Sep 2, 2019
