PERF: DataFrame dict constructor with columns #24387
When passing a dict and `columns=` to DataFrame, we previously passed the dict of {column: array} to the Series constructor. This eventually hit `construct_1d_object_array_from_listlike` [1]. For extension arrays, this ends up calling `ExtensionArray.__iter__`, iterating over the elements of the ExtensionArray, which is prohibitively slow.

---

```python
import pandas as pd
import numpy as np

a = pd.Series(np.arange(1000))
d = {i: a for i in range(30)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With Series holding sparse values instead, the problem is exacerbated (note the smaller and fewer series).

```python
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
213 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

after

```
4.41 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

---

We try to properly handle all the edge cases that we were papering over earlier by just passing the `data` to Series. We fix a bug or two along the way, but don't change any *tested* behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
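To make the failure mode concrete, here is a minimal sketch (a hypothetical helper, not the actual pandas internals) contrasting the two paths: the old one materializes each column into an object ndarray, which iterates the ExtensionArray element by element, while the fix keeps the array intact.

```python
import numpy as np
import pandas as pd

sparse = pd.array(np.arange(1000), dtype="Sparse[int]")

def to_object_array(values):
    # Mimics what construct_1d_object_array_from_listlike does: filling
    # an object ndarray consumes the input via __iter__, one
    # Python-level call per element, which is what made the old path slow.
    result = np.empty(len(values), dtype=object)
    result[:] = list(values)
    return result

slow_column = to_object_array(sparse)  # O(n) Python calls per column
fast_column = sparse                   # keep the ExtensionArray as-is
```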
Hello @TomAugspurger! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-26 02:13:22 UTC
Note that this passes the tests in
I don't really have a preference for whether this goes in 0.24 or 0.24.1.
Codecov Report

```diff
@@           Coverage Diff            @@
##           master   #24387    +/-   ##
========================================
- Coverage    92.3%   42.93%   -49.37%
========================================
  Files         162      162
  Lines       51875    51892      +17
========================================
- Hits        47883    22280   -25603
- Misses       3992    29612   +25620
```

Continue to review full report at Codecov.
Codecov Report

```diff
@@           Coverage Diff            @@
##           master   #24387    +/-   ##
========================================
- Coverage    92.3%   42.95%   -49.36%
========================================
  Files         162      162
  Lines       51875    51907      +32
========================================
- Hits        47883    22296   -25587
- Misses       3992    29611   +25619
```

Continue to review full report at Codecov.
ed70cef handles duplicates in
cc @toobaz if you have a chance to look at this. I think you did some work on simplifying the DataFrame constructor recently? I may be undoing some of that work :/
With the latest changes, all of pandas/tests/frame is passing...
Review thread on the diff:

```python
                            nan_dtype)
        arrays.loc[missing] = [val] * missing.sum()
```

```python
from pandas.core.series import Series
```
this is way way too complicated
...
What is that comment supposed to achieve? I've put a ton of time into this. The DataFrame constructor is complicated with way too many special cases. I agree. This is what it takes to avoid the issue.
Like come on? What was the point of that? Do you think I intentionally write complex code?
of course not
my point is that the perf fix is not worth it with this complexity
Could you point out specific parts you find too complex instead of dismissing the whole thing? As I pointed out in the initial issue, the previous approach of passing the dict of arrays to the Series constructor isn't workable because it eventually iterates element-wise over every value in every collection. If you have suggestions on how to avoid that, please put them forward.
I will have a look.
```python
# Columns may not be unique (even though we're in init_dict and
# dict keys have to be unique...). We have two possible strategies:
# 1.) Gracefully handle duplicates when going through data to build
```
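As a toy illustration of the situation the comment above describes (example data and names are made up, not the PR's code): the dict keys are unique, but `columns` itself may repeat a label, so any strategy has to tolerate resolving the same key more than once.

```python
# Toy stand-in: resolve a possibly-duplicated `columns` list against a
# dict with unique keys, reusing the same array for repeated labels.
data = {"a": [1, 2], "b": [3, 4]}
columns = ["a", "a", "b"]  # duplicates are legal in `columns`

resolved = [(col, data[col]) for col in columns]
# -> [('a', [1, 2]), ('a', [1, 2]), ('b', [3, 4])]
```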
so the problem i have with this complexity is that 1 & 2 are not really nice, nor is the current strategy, because of perf issues. Since we must generate arrays from this in the first place, why is that not a valid strategy? e.g. construct arrays of the data with a valid column mapping, THEN use columns, if provided, to re-order / select as a second op (you can actually make this better by eliminating columns on the first pass if columns is provided).

Further, on the selection of new dtypes (for columns that are not provided): why are you re-inventing the wheel and not just using `construct_1d_arraylike_from_scalar`? If you think that's wrong, then do that in a different PR.

The problem is the scope creep makes this impossible to review otherwise.

I agree that this is a pretty nasty function now, but this change as written is not better at all. Try the suggestion above. If not, then this will need a bug-fix pass first, to adjust what it is doing to avoid special casing, as precursor PRs rather than one large refactor.

I get that things like this take significant time. I have done many many PRs where I spent a lot of time and they don't make things simpler, but that is just a fact of life here. As I said, it is probably better to try to remove the special cases first.
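A rough sketch of the two-step strategy suggested above (illustrative only; `init_dict_two_step` is a hypothetical name, and the scalar-fill step mirrors the role `construct_1d_arraylike_from_scalar` plays inside pandas without calling it):

```python
import numpy as np

def init_dict_two_step(data, columns=None):
    # Step 1: construct arrays keyed by the dict's own (unique) keys,
    # with no reordering and no column selection yet.
    arrays = {k: np.asarray(v) for k, v in data.items()}
    length = max((len(a) for a in arrays.values()), default=0)
    if columns is None:
        return list(arrays), list(arrays.values())
    # Step 2: reorder / select per `columns`; labels missing from the
    # dict are filled from a scalar (here: an all-NaN object array).
    filled = [arrays.get(col, np.full(length, np.nan, dtype=object))
              for col in columns]
    return list(columns), filled
```

Duplicate labels in `columns` fall out naturally here, since each label is resolved independently on the second pass.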
@TomAugspurger still active?
Yep.
pushing the benchmarks here is probably worth it. other parts would need to be revisited. closing.
I'll keep working on this.
closing as stale. pls reopen if rebasing.