PERF: data = np.nan to speed up empty dataframe creation #28225


Closed
wants to merge 2 commits into from

astyonax

  • closes Slow (and weird) empty dataframe creation #28188

  • Short summary
    If we have enough information to know the shape of the final
    dataframe, we set the value of data to np.nan if not set.

  • Longer explanation
    If I understand the __init__ of DataFrame correctly, with this change,
    when the first if evaluates to True, all the following ifs evaluate to False
    until we reach the else at frame.py:459, and we are done because the dataframe is now initialized from an ndarray.
    With the former behavior, data always defaults to None,
    so every column of the output dataframe has dtype object and its creation
    is slow.
    With the proposed code, every column of the output dataframe has dtype float and the frame is
    faster to create than before, but only when both columns and index are given.
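
To make the dtype difference concrete, here is a minimal sketch that builds the same all-NaN frame both ways. It passes np.nan explicitly instead of relying on the proposed default, so it should behave the same with or without this patch. The index and column values are arbitrary placeholders, not taken from the PR.

```python
import numpy as np
import pandas as pd

# Placeholder shape information (an assumption for illustration only).
index = range(3)
columns = ["a", "b"]

# data left as None: the frame is filled with NaN, and in the pandas
# version discussed here the columns come out with dtype object.
df_none = pd.DataFrame(index=index, columns=columns)

# data given as the scalar np.nan: the scalar is broadcast to the
# (len(index), len(columns)) shape and the columns are float64.
df_nan = pd.DataFrame(np.nan, index=index, columns=columns)

print(df_none.dtypes)
print(df_nan.dtypes)
```

Both frames hold only missing values; what differs is the column dtype, which is what the PR description ties to the construction speed.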

I don't know how to loosen the current restriction to dtype=None. It would be nice to have a function that verifies that a dtype is compatible with NaN.
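
One possible shape for such a check, as a rough sketch only: the helper name can_hold_nan is hypothetical and is not an existing pandas or NumPy function. It covers plain NumPy dtypes and deliberately ignores pandas extension dtypes and datetime-like dtypes (which use NaT rather than NaN).

```python
import numpy as np


def can_hold_nan(dtype):
    """Hypothetical helper: True if a NumPy dtype can store np.nan as-is."""
    if dtype is None:
        # No dtype requested, so the constructor is free to pick float.
        return True
    dt = np.dtype(dtype)
    # Object arrays can hold any Python object, including float("nan").
    # np.inexact covers the floating and complex floating dtypes.
    return dt.kind == "O" or np.issubdtype(dt, np.inexact)


for candidate in (None, "float32", "complex128", object, "int64", bool):
    print(candidate, can_hold_nan(candidate))
```

With a check like this, the guard in __init__ could in principle test can_hold_nan(dtype) instead of dtype is None, though whether that is safe for every constructor path is not something this sketch establishes.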

Check that we have enough information to know the shape of the final
dataframe, and set the value of data to np.nan if it is not set.
With the former behaviour, data always defaults to None,
hence the output dataframe has dtype object and its creation
is slow.
With the proposed code, the output dataframe has dtype float and is
faster to create, but only when both columns and index are given.
@astyonax astyonax changed the title PERF: data = np.nan to speed up empty dataframe creation (#28188) PERF: data = np.nan to speed up empty dataframe creation Aug 29, 2019
@jbrockmendel
Member

first thing we look for is a test

@simonjayhawkins simonjayhawkins added the Performance (Memory or execution speed performance) label Aug 30, 2019
@@ -387,8 +387,12 @@ def _constructor_expanddim(self):
    # Constructors

    def __init__(self, data=None, index=None, columns=None, dtype=None, copy=False):
        if data is None and index is not None and columns is not None and dtype is None:
            data = np.nan

        if data is None:
Member

I think this will always override the previous code block

Author

No.
Because when the if in line 390 is true, data becomes np.nan
and the following if evaluates to False because np.nan is not None

>>> import numpy
>>> numpy.nan is None
False
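
The same point as a self-contained sketch of the two consecutive checks. The function name and the returned branch labels are hypothetical stand-ins, and float("nan") stands in for np.nan; only the two if conditions mirror the diff.

```python
def pick_branch(data=None, index=None, columns=None, dtype=None):
    """Hypothetical stand-in for the first two checks in __init__."""
    # Proposed guard: the shape is known and no data or dtype was given.
    if data is None and index is not None and columns is not None and dtype is None:
        data = float("nan")

    # Existing check: only taken when data is still None at this point.
    if data is None:
        return "data-is-None branch"
    return "later branch (data is a NaN scalar or user-supplied)"


print(pick_branch(index=[0, 1], columns=["a"]))
print(pick_branch(index=[0, 1]))
print(pick_branch(index=[0, 1], columns=["a"], dtype="int64"))
```

Only the first call satisfies the new guard, so only that call skips the data-is-None branch; the other two keep the former behaviour.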

Member

Sorry, you're right

@astyonax
Author

first thing we look for is a test

Of course! But I also wanted to see the CI results to check what I'm breaking ;)

@WillAyd
Member

WillAyd commented Sep 2, 2019

Hi @astyonax - thanks for the contribution. It looks like this might be an incomplete change, however, so I'm going to close it for the time being to keep our PR queue down.

Best advice I can give is to run the tests locally and make sure the change doesn't cause any regressions before pushing to CI. We have performance benchmarks you can run thereafter to measure improvements.

More info can be found in the contributing guide so be sure to give that a look:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#contributing-to-pandas
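
For a quick local sanity check before running the project's benchmark suite, a timing comparison along these lines might help. This is a rough sketch using the standard-library timeit module, not the project's benchmark tooling; the frame size and repeat counts are arbitrary assumptions, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary frame size chosen for illustration.
index = range(1000)
columns = range(20)


def build_default():
    # Current behaviour: data is left as None.
    return pd.DataFrame(index=index, columns=columns)


def build_nan():
    # What the PR would do implicitly: start from the scalar np.nan.
    return pd.DataFrame(np.nan, index=index, columns=columns)


# Take the best of a few repeats to reduce noise from other processes.
t_default = min(timeit.repeat(build_default, number=20, repeat=3))
t_nan = min(timeit.repeat(build_nan, number=20, repeat=3))

print(f"data=None:   {t_default:.4f} s for 20 constructions")
print(f"data=np.nan: {t_nan:.4f} s for 20 constructions")
```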

@WillAyd WillAyd closed this Sep 2, 2019
@kostyafarber
Contributor

What kind of tests are needed on this one?
