
PERF: DataFrame dict constructor with columns #24387


Closed
TomAugspurger wants to merge 9 commits

Conversation

TomAugspurger
Contributor

@TomAugspurger commented Dec 21, 2018


When passing a dict and `columns=` to DataFrame, we previously
passed the dict of {column: array} to the Series constructor. This
eventually hit `construct_1d_object_array_from_listlike`[1]. For
extension arrays, this ends up calling `ExtensionArray.__iter__`,
iterating over the elements of the ExtensionArray, which is
prohibitively slow.
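
For a sense of why the element-wise path hurts, here is an illustrative comparison (not from the PR; the `%timeit` lines assume an IPython session, and `.array` assumes pandas >= 0.24):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100_000), dtype="Sparse[int]")
arr = s.array  # the underlying SparseArray

# Building an object array element by element -- roughly what the old
# path triggers via ExtensionArray.__iter__ -- touches every value:
%timeit np.array(list(arr), dtype=object)

# Keeping the array intact is a single vectorized conversion:
%timeit np.asarray(arr)
```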

---

```python
import pandas as pd
import numpy as np

a = pd.Series(np.arange(1000))
d = {i: a for i in range(30)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
2.54 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With Series holding sparse values instead, the problem is exacerbated
(note the modest size and number of the Series involved).

```python
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

Before

```
213 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

after

```
4.41 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

---

We try to properly handle all the edge cases that were previously
papered over by just passing the `data` to Series.
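
Schematically, the idea is something like the sketch below (a simplified illustration, not the actual `init_dict` code; the helper name is made up):

```python
import numpy as np

def arrays_for_columns(data, columns, index):
    """Hypothetical sketch: pick arrays for the requested columns
    straight from the dict, instead of routing the whole dict through
    the Series constructor (which can end up iterating element-wise)."""
    arrays = []
    for col in columns:
        if col in data:
            # keep the stored array/Series as-is (no densification)
            arrays.append(data[col])
        else:
            # column not present in the dict: fill with NaN of the right length
            arrays.append(np.full(len(index), np.nan))
    return arrays
```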

We fix a bug or two along the way, but don't change any *tested*
behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
@TomAugspurger added the Performance (Memory or execution speed performance) and Dtype Conversions (Unexpected or buggy dtype conversions) labels on Dec 21, 2018
@pep8speaks

pep8speaks commented Dec 21, 2018

Hello @TomAugspurger! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-26 02:13:22 UTC

@TomAugspurger
Contributor Author

Note that this passes the tests in tests/frame/test_constructors.py, but other tests are failing. I'll be updating to handle additional edge cases that apparently aren't tested in constructors.

@TomAugspurger
Contributor Author

I don't really have a preference for whether this goes in 0.24 or 0.24.1.

@codecov

codecov bot commented Dec 21, 2018

Codecov Report

Merging #24387 into master will decrease coverage by 49.36%.
The diff coverage is 92.1%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24387       +/-   ##
===========================================
- Coverage    92.3%   42.93%   -49.37%     
===========================================
  Files         162      162               
  Lines       51875    51892       +17     
===========================================
- Hits        47883    22280    -25603     
- Misses       3992    29612    +25620
| Flag | Coverage Δ |
| --- | --- |
| #multiple | ? |
| #single | 42.93% <92.1%> (-0.07%) ⬇️ |

| Impacted Files | Coverage Δ |
| --- | --- |
| pandas/core/internals/construction.py | 63.7% <92.1%> (-32.95%) ⬇️ |
| pandas/io/formats/latex.py | 0% <0%> (-100%) ⬇️ |
| pandas/core/categorical.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/sas/sas_constants.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/plotting.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/converter.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/formats/html.py | 0% <0%> (-98.65%) ⬇️ |
| pandas/core/groupby/categorical.py | 0% <0%> (-95.46%) ⬇️ |
| pandas/io/sas/sas7bdat.py | 0% <0%> (-91.17%) ⬇️ |
| pandas/io/sas/sas_xport.py | 0% <0%> (-90.15%) ⬇️ |

... and 121 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c58817...ed70cef. Read the comment docs.

@codecov

codecov bot commented Dec 21, 2018

Codecov Report

Merging #24387 into master will decrease coverage by 49.35%.
The diff coverage is 92.45%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24387       +/-   ##
===========================================
- Coverage    92.3%   42.95%   -49.36%     
===========================================
  Files         162      162               
  Lines       51875    51907       +32     
===========================================
- Hits        47883    22296    -25587     
- Misses       3992    29611    +25619
| Flag | Coverage Δ |
| --- | --- |
| #multiple | ? |
| #single | 42.95% <92.45%> (-0.05%) ⬇️ |

| Impacted Files | Coverage Δ |
| --- | --- |
| pandas/core/internals/construction.py | 66.66% <92.45%> (-29.99%) ⬇️ |
| pandas/io/formats/latex.py | 0% <0%> (-100%) ⬇️ |
| pandas/core/categorical.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/sas/sas_constants.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/plotting.py | 0% <0%> (-100%) ⬇️ |
| pandas/tseries/converter.py | 0% <0%> (-100%) ⬇️ |
| pandas/io/formats/html.py | 0% <0%> (-98.65%) ⬇️ |
| pandas/core/groupby/categorical.py | 0% <0%> (-95.46%) ⬇️ |
| pandas/io/sas/sas7bdat.py | 0% <0%> (-91.17%) ⬇️ |
| pandas/io/sas/sas_xport.py | 0% <0%> (-90.15%) ⬇️ |

... and 121 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c58817...c1f2a58. Read the comment docs.

@TomAugspurger
Contributor Author

ed70cef handles duplicates in columns. It wasn't straightforward.
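
For illustration, this is the kind of call in question (a hypothetical example, not taken from the PR): the dict keys themselves are unique, but the requested `columns` can still repeat a label.

```python
import pandas as pd

data = {"a": [1, 2], "b": [3, 4]}
# dict keys are unique, yet the caller asks for a duplicated column label;
# exactly how this edge case behaves has varied across pandas versions
df = pd.DataFrame(data, columns=["a", "a", "b"])
```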

@TomAugspurger
Contributor Author

cc @toobaz if you have a chance to look at this. I think you did some work on simplifying the DataFrame constructor recently? I may be undoing some of that work :/

@TomAugspurger
Contributor Author

With the latest changes, all of pandas/tests/frame is passing...

The following review thread is attached to this diff hunk:

```python
nan_dtype)
arrays.loc[missing] = [val] * missing.sum()

from pandas.core.series import Series
```
Contributor

this is way way too complicated

Contributor Author

...

Contributor Author

What is that comment supposed to achieve? I've put a ton of time into this. The DataFrame constructor is complicated with way too many special cases. I agree. This is what it takes to avoid the issue.

Contributor Author

Like come on? What was the point of that? Do you think I intentionally write complex code?

Contributor

Of course not.
My point is that the perf fix is not worth it with this complexity.

Contributor Author

Could you point out specific parts you find too complex instead of dismissing the whole thing? As I pointed out in the initial issue, the previous approach of passing the dict of arrays to the Series constructor isn't workable because it eventually iterates element wise over every value in every collection. If you have suggestions on how to avoid that, please put them forward.

Contributor

I will have a look.


The next review comment is attached to this diff hunk:

```python
# Columns may not be unique (even though we're in init_dict and
# dict keys have to be unique...). We have two possible strategies
# 1.) Gracefully handle duplicates when going through data to build
```
Contributor

So the problem I have with this complexity is that strategies 1 and 2 are not really nice, nor is the current strategy, because of perf issues. Since we must generate arrays from this in the first place, why is that not a valid strategy? e.g. construct arrays of the data with a valid column mapping, THEN use columns, if provided, to re-order / select as a second op (you can make this better by restricting to the provided columns on the first pass).

Further, for the selection of dtypes for columns that are not provided: why are you re-inventing the wheel instead of just using `construct_1d_arraylike_from_scalar`? If you think that's wrong, then change it in a different PR.

The problem is that the scope creep makes this impossible to review otherwise.

I agree that this is a pretty nasty function now, but this change as written is not better at all. Try the suggestion above. If not, then this will need a bug-fix pass first, adjusting what it does to avoid special-casing, as precursor PRs rather than one large refactor.

I get that things like this take significant time. I have done many PRs where I spent a lot of time and they didn't make things simpler, but that is just a fact of life here. As I said, it is probably better to try to remove the special cases first.
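
A rough sketch of the strategy being suggested here (hypothetical code, expressed at the public-API level rather than inside `init_dict`): build the columns from the dict first, then treat `columns` as a pure re-order/select step.

```python
import pandas as pd

def frame_from_dict(data, columns=None):
    # Step 1: construct the frame from the dict alone; every key becomes
    # a column, with no per-key matching against `columns` at this stage.
    df = pd.DataFrame(data)

    # Step 2: if columns were requested, re-order / select them as a
    # second op; labels missing from the dict come back as all-NaN.
    if columns is not None:
        df = df.reindex(columns=columns)
    return df
```

As the comment notes, the first pass could also be restricted to the provided columns up front, to avoid building arrays that would then be dropped.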

@WillAyd
Member

WillAyd commented Feb 27, 2019

@TomAugspurger still active?

@TomAugspurger
Contributor Author

TomAugspurger commented Mar 5, 2019 via email

@jreback
Contributor

jreback commented Mar 26, 2019

Pushing the benchmarks here is probably worth it; other parts would need to be revisited. Closing.

@jreback closed this Mar 26, 2019
@TomAugspurger
Contributor Author

I'll keep working on this.

@TomAugspurger reopened this Mar 26, 2019
@jreback
Contributor

jreback commented Jun 8, 2019

Closing as stale. Please reopen if rebasing.
