PERF: data = np.nan to speed up empty dataframe creation #28225
Check that we have enough information to know the shape of the final dataframe and, if data is not set, set it to np.nan. With the former behaviour data always defaults to None, hence the output dataframe has object dtypes and is slow to create. With the proposed code the output dataframe has float dtypes and is faster to create, but only when both columns and index are given.
The first thing we look for is a test.
@@ -387,8 +387,12 @@ def _constructor_expanddim(self):
    # Constructors

    def __init__(self, data=None, index=None, columns=None, dtype=None, copy=False):
        if data is None and index is not None and columns is not None and dtype is None:
            data = np.nan

        if data is None:
I think this will always override the previous code block
No.
When the if on line 390 is true, data becomes np.nan,
and the following if evaluates to False because np.nan is not None:
>>> import numpy
>>> numpy.nan is None
False
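The point of this exchange can be sketched outside pandas. The function below is a hypothetical stand-in (resolve_data and its returned labels are not pandas names) that mirrors only the two checks shown in the diff; what the real constructor does after the second check is assumed here and is not taken from the diff.

```python
import numpy as np


def resolve_data(data=None, index=None, columns=None, dtype=None):
    """Hypothetical stand-in for the first two checks in the patched __init__."""
    # New check from the patch: when the shape is known and no dtype is
    # requested, replace the missing data with a NaN scalar.
    if data is None and index is not None and columns is not None and dtype is None:
        data = np.nan

    # Existing check: data is still None only if the new check did not fire.
    if data is None:
        return "data-is-None path"
    return "NaN-scalar path"


print(resolve_data(index=[0, 1], columns=["a"]))
print(resolve_data(index=[0, 1]))
print(resolve_data(index=[0, 1], columns=["a"], dtype="int64"))
```

Only the first call takes the NaN-scalar path; the other two still reach the existing branch with data set to None, so the new check does not change them.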
Sorry, you're right
Of course! But I also wanted to see the CI results to check what I'm breaking ;)
Hi @astyonax - thanks for the contribution. It looks like this might be an incomplete change, however, so I'm going to close it for the time being to keep our PR queue down. The best advice I can give is to run the tests locally and make sure the change doesn't cause any regressions before pushing to CI. We have performance benchmarks you can run thereafter to measure improvements. More info can be found in the contributing guide, so be sure to give that a look: https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#contributing-to-pandas
What kind of tests are needed on this one?
closes Slow (and weird) empty dataframe creation #28188
Short summary
If we have enough information to know the shape of the final
dataframe, we set the value of data to np.nan if not set.
Longer explanation
If I understand the __init__ of DataFrame correctly, with this change, when the first if evaluates to True, all following ifs evaluate to False
until we enter the else at frame.py:459, and we are done because the dataframe is now initialized with an ndarray.
With the former behavior data always defaults to None, hence the output dataframe has all dtypes object and its creation
is slow.
With the proposed code the output dataframe has all dtypes float and is
faster to create than before, but only when columns and index are given.
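The difference can be observed without patching pandas by passing np.nan explicitly, which is what the change would do internally. This is only a rough sketch: the frame size and repeat count are arbitrary choices, timings vary by machine and pandas version, and the object dtype of the default frame is what this PR describes rather than something the snippet guarantees for every release.

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary illustrative shape.
index = range(1000)
columns = [f"c{i}" for i in range(100)]

# Current default: data is None.
default_frame = pd.DataFrame(index=index, columns=columns)
# What the patch would do internally: fill with a NaN scalar.
nan_frame = pd.DataFrame(np.nan, index=index, columns=columns)

print("default dtypes:", default_frame.dtypes.unique())
print("np.nan dtypes: ", nan_frame.dtypes.unique())

# Rough timing comparison; not a substitute for the project's benchmarks.
t_default = timeit.timeit(
    lambda: pd.DataFrame(index=index, columns=columns), number=20
)
t_nan = timeit.timeit(
    lambda: pd.DataFrame(np.nan, index=index, columns=columns), number=20
)
print(f"data=None:   {t_default:.4f} s for 20 runs")
print(f"data=np.nan: {t_nan:.4f} s for 20 runs")
```

Both frames have the same shape and contain only missing values; they differ in dtype and construction cost.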
I don't know how to loosen the current restriction of dtype=None. It would be nice to have a function that verifies that a type is compatible with nan.
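One possible shape for such a check, as a minimal sketch: dtype_accepts_nan is a hypothetical name, not an existing pandas or NumPy function, and it only covers plain NumPy dtypes. It treats float, complex and object dtypes as able to hold np.nan, and everything else (integers, booleans, datetimes, which use NaT rather than NaN) as unable to.

```python
import numpy as np


def dtype_accepts_nan(dtype):
    """Hypothetical helper: True if a NumPy dtype can store np.nan as-is."""
    kind = np.dtype(dtype).kind
    # 'f' = floating point, 'c' = complex, 'O' = Python object
    return kind in ("f", "c", "O")


for candidate in ("float64", "float32", "complex128", object, "int64", "bool"):
    print(candidate, dtype_accepts_nan(candidate))
```

With a helper like this, the first condition could in principle accept a user-supplied dtype whenever the check passes, instead of requiring dtype to be None. Pandas extension dtypes would need separate handling.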