
PERF: DataFrame dict constructor with columns #24387


Closed
TomAugspurger wants to merge 9 commits

Conversation

TomAugspurger
Contributor

@TomAugspurger commented Dec 21, 2018


When passing a dict and `columns=` to DataFrame, we previously
passed the dict of {column: array} to the Series constructor. This
eventually hit `construct_1d_object_array_from_listlike`[1]. For
extension arrays, this ends up calling `ExtensionArray.__iter__`,
iterating over the elements of the ExtensionArray, which is
prohibitively slow.
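
For a sense of why the element-wise path hurts, here is an illustrative comparison (not from the PR; the `%timeit` lines assume an IPython session, and `.array` assumes pandas >= 0.24):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100_000), dtype="Sparse[int]")
arr = s.array  # the underlying SparseArray

# Building an object array element by element -- roughly what the old
# path triggers via ExtensionArray.__iter__ -- touches every value:
%timeit np.array(list(arr), dtype=object)

# Keeping the array intact is a single vectorized conversion:
%timeit np.asarray(arr)
```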

---

```python
import pandas as pd
import numpy as np

a = pd.Series(np.arange(1000))
d = {i: a for i in range(30)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
2.54 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With Series holding sparse values instead, the problem is exacerbated
(note the modest size and number of the Series involved).

```python
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

Before

```
213 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

after

```
4.41 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

---

We try to properly handle all the edge cases that were previously
papered over by just passing the `data` to Series.
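
Schematically, the idea is something like the sketch below (a simplified illustration, not the actual `init_dict` code; the helper name is made up):

```python
import numpy as np

def arrays_for_columns(data, columns, index):
    """Hypothetical sketch: pick arrays for the requested columns
    straight from the dict, instead of routing the whole dict through
    the Series constructor (which can end up iterating element-wise)."""
    arrays = []
    for col in columns:
        if col in data:
            # keep the stored array/Series as-is (no densification)
            arrays.append(data[col])
        else:
            # column not present in the dict: fill with NaN of the right length
            arrays.append(np.full(len(index), np.nan))
    return arrays
```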

We fix a bug or two along the way, but don't change any *tested*
behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
@TomAugspurger added the Performance (Memory or execution speed performance) and Dtype Conversions (Unexpected or buggy dtype conversions) labels on Dec 21, 2018
@pep8speaks

pep8speaks commented Dec 21, 2018

Hello @TomAugspurger! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-26 02:13:22 UTC

@TomAugspurger
Contributor Author

Note that this passes the tests in tests/frame/test_constructors.py, but other tests are failing. I'll be updating to handle additional edge cases that apparently aren't tested in constructors.

@TomAugspurger
Contributor Author

I don't really have a preference for whether this goes in 0.24 or 0.24.1.

@codecov

codecov bot commented Dec 21, 2018

Codecov Report

Merging #24387 into master will decrease coverage by 49.36%.
The diff coverage is 92.1%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24387       +/-   ##
===========================================
- Coverage    92.3%   42.93%   -49.37%     
===========================================
  Files         162      162               
  Lines       51875    51892       +17     
===========================================
- Hits        47883    22280    -25603     
- Misses       3992    29612    +25620
| Flag | Coverage Δ |
| --- | --- |
| #multiple | ? |
| #single | 42.93% <92.1%> (-0.07%) ⬇️ |

| Impacted Files | Coverage Δ |
| --- | --- |
| pandas/core/internals/construction.py | 63.7% <92.1%> (-32.95%) ⬇️ |
| pandas/io/formats/latex.py | 0% <0%> (-100%) ⬇️ |
| pandas/core/categorical.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/sas/sas_constants.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/plotting.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/converter.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/formats/html.py | 0% <0%> (-98.65%) ⬇️ |
| pandas/core/groupby/categorical.py | 0% <0%> (-95.46%) ⬇️ |
| pandas/io/sas/sas7bdat.py | 0% <0%> (-91.17%) ⬇️ |
| pandas/io/sas/sas_xport.py | 0% <0%> (-90.15%) ⬇️ |

... and 121 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c58817...ed70cef. Read the comment docs.

@codecov

codecov bot commented Dec 21, 2018

Codecov Report

Merging #24387 into master will decrease coverage by 49.35%.
The diff coverage is 92.45%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24387       +/-   ##
===========================================
- Coverage    92.3%   42.95%   -49.36%     
===========================================
  Files         162      162               
  Lines       51875    51907       +32     
===========================================
- Hits        47883    22296    -25587     
- Misses       3992    29611    +25619
| Flag | Coverage Δ |
| --- | --- |
| #multiple | ? |
| #single | 42.95% <92.45%> (-0.05%) ⬇️ |

| Impacted Files | Coverage Δ |
| --- | --- |
| pandas/core/internals/construction.py | 66.66% <92.45%> (-29.99%) ⬇️ |
| pandas/io/formats/latex.py | 0% <0%> (-100%) ⬇️ |
| pandas/core/categorical.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/sas/sas_constants.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/plotting.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/converter.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/formats/html.py | 0% <0%> (-98.65%) ⬇️ |
| pandas/core/groupby/categorical.py | 0% <0%> (-95.46%) ⬇️ |
| pandas/io/sas/sas7bdat.py | 0% <0%> (-91.17%) ⬇️ |
| pandas/io/sas/sas_xport.py | 0% <0%> (-90.15%) ⬇️ |

... and 121 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c58817...c1f2a58. Read the comment docs.

@TomAugspurger
Contributor Author

ed70cef handles duplicates in columns. It wasn't straightforward.
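
For illustration, this is the kind of call in question (a hypothetical example, not taken from the PR): the dict keys themselves are unique, but the requested `columns` can still repeat a label.

```python
import pandas as pd

data = {"a": [1, 2], "b": [3, 4]}
# dict keys are unique, yet the caller asks for a duplicated column label;
# exactly how this edge case behaves has varied across pandas versions
df = pd.DataFrame(data, columns=["a", "a", "b"])
```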

@TomAugspurger
Contributor Author

cc @toobaz if you have a chance to look at this. I think you did some work on simplifying the DataFrame constructor recently? I may be undoing some of that work :/

@TomAugspurger
Contributor Author

With the latest changes, all of pandas/tests/frame is passing...

The following review thread is attached to this diff hunk:

```python
nan_dtype)
arrays.loc[missing] = [val] * missing.sum()

from pandas.core.series import Series
```
Contributor

this is way way too complicated

Contributor Author

...

Contributor Author

What is that comment supposed to achieve? I've put a ton of time into this. The DataFrame constructor is complicated with way too many special cases. I agree. This is what it takes to avoid the issue.

Contributor Author

Like come on? What was the point of that? Do you think I intentionally write complex code?

Contributor

Of course not.
My point is that the perf fix is not worth it with this complexity.

Contributor Author

Could you point out specific parts you find too complex instead of dismissing the whole thing? As I pointed out in the initial issue, the previous approach of passing the dict of arrays to the Series constructor isn't workable because it eventually iterates element wise over every value in every collection. If you have suggestions on how to avoid that, please put them forward.

Contributor

I will have a look.


The next review comment is attached to this diff hunk:

```python
# Columns may not be unique (even though we're in init_dict and
# dict keys have to be unique...). We have two possible strategies
# 1.) Gracefully handle duplicates when going through data to build
```
Contributor

So the problem I have with this complexity is that strategies 1 and 2 are not really nice, nor is the current strategy, because of perf issues. Since we must generate arrays from this in the first place, why is that not a valid strategy? e.g. construct arrays of the data with a valid column mapping, THEN use columns, if provided, to re-order / select as a second op (you can make this better by restricting to the provided columns on the first pass).

Further, for the selection of dtypes for columns that are not provided: why are you re-inventing the wheel instead of just using `construct_1d_arraylike_from_scalar`? If you think that's wrong, then change it in a different PR.

The problem is that the scope creep makes this impossible to review otherwise.

I agree that this is a pretty nasty function now, but this change as written is not better at all. Try the suggestion above. If not, then this will need a bug-fix pass first, adjusting what it does to avoid special-casing, as precursor PRs rather than one large refactor.

I get that things like this take significant time. I have done many PRs where I spent a lot of time and they didn't make things simpler, but that is just a fact of life here. As I said, it is probably better to try to remove the special cases first.
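
A rough sketch of the strategy being suggested here (hypothetical code, expressed at the public-API level rather than inside `init_dict`): build the columns from the dict first, then treat `columns` as a pure re-order/select step.

```python
import pandas as pd

def frame_from_dict(data, columns=None):
    # Step 1: construct the frame from the dict alone; every key becomes
    # a column, with no per-key matching against `columns` at this stage.
    df = pd.DataFrame(data)

    # Step 2: if columns were requested, re-order / select them as a
    # second op; labels missing from the dict come back as all-NaN.
    if columns is not None:
        df = df.reindex(columns=columns)
    return df
```

As the comment notes, the first pass could also be restricted to the provided columns up front, to avoid building arrays that would then be dropped.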

@WillAyd
Member

WillAyd commented Feb 27, 2019

@TomAugspurger still active?

@TomAugspurger
Contributor Author

TomAugspurger commented Mar 5, 2019 via email

@jreback
Contributor

jreback commented Mar 26, 2019

Pushing the benchmarks here is probably worth it; other parts would need to be revisited. Closing.

@jreback closed this Mar 26, 2019
@TomAugspurger
Contributor Author

I'll keep working on this.

@TomAugspurger reopened this Mar 26, 2019
@jreback
Contributor

jreback commented Jun 8, 2019

Closing as stale. Please reopen if rebasing.
