DataFrame constructor with dict of series misbehaving when columns specified #24368
Comments
We spend basically all our time in sanitize_array / _try_cast. And of that, all the time is in NumPy.
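For reference, one way to reproduce that profile (a sketch using the sparse reproducer from the benchmarks later in this thread; `cProfile` is just one option, `%prun` in IPython works too):

```python
import cProfile

import numpy as np
import pandas as pd

a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

# On the slow path, the cumulative time is dominated by
# sanitize_array / _try_cast and, below that, NumPy.
cProfile.runctx(
    "pd.DataFrame(d, columns=list(range(len(d))))",
    globals(),
    locals(),
    sort="cumulative",
)
```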
When passing a dict and `columns=` to DataFrame, we previously passed the dict of {column: array} to the Series constructor. This eventually hit `construct_1d_object_array_from_listlike` [1]. For extension arrays, this ends up calling `ExtensionArray.__iter__`, iterating over the elements of the ExtensionArray, which is prohibitively slow.

---

```python
import pandas as pd
import numpy as np

a = pd.Series(np.arange(1000))
d = {i: a for i in range(30)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With Series with sparse values instead, the problem is exacerbated (note the smaller and fewer series).

```python
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

Before

```
213 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

after

```
4.41 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

---

We try to properly handle all the edge cases that we were papering over earlier by just passing the `data` to Series. We fix a bug or two along the way, but don't change any *tested* behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
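To make the slow path concrete, here is a rough sketch of what `construct_1d_object_array_from_listlike` effectively does with an extension-backed column (not the exact pandas internals): it allocates an object-dtype array and fills it element by element, which forces `ExtensionArray.__iter__` to produce every sparse value as a Python object.

```python
import numpy as np
import pandas as pd

a = pd.Series(np.arange(1000), dtype="Sparse[int]")

# Rough sketch of the old per-column path: materialize each element of the
# extension array as a Python object to fill an object-dtype ndarray.
out = np.empty(len(a), dtype=object)
out[:] = list(a.array)  # triggers ExtensionArray.__iter__, element by element
```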
I would like to add my vote to actually going through with this performance enhancement (more like a performance bug fix): constructing a DataFrame with index and columns as Index objects is around 200 times slower than constructing with np.zeros and then assigning to index and columns; see below. It's really weird and counter-intuitive given how (seemingly?) simple the operation is.
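The snippet referenced by "see below" isn't captured in this thread; a hypothetical comparison along the lines described might look like the following (the shapes, the zero-filled data, and the use of `%timeit` are assumptions, not the commenter's actual code):

```python
import numpy as np
import pandas as pd

idx = pd.Index(range(1000))
cols = pd.Index(range(100))

# Reported slow path: hand the Index objects to the constructor up front.
%timeit pd.DataFrame(np.zeros((len(idx), len(cols))), index=idx, columns=cols)

# Reported fast path: build from the bare ndarray, then assign labels afterwards.
%timeit df = pd.DataFrame(np.zeros((len(idx), len(cols)))); df.index = idx; df.columns = cols
```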
Yeah, we'll get there, probably for 0.24.1. Just need to do it in a way that's maintainable.
We take a strange path...
I'm guessing we don't want to be passing a dict of Series into the Series constructor there
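As a quick illustration of why that path looks wrong (a minimal sketch, not the actual internal call): handing a dict of Series straight to the Series constructor produces an object-dtype Series whose elements are themselves Series, which is rarely what the DataFrame path wants.

```python
import numpy as np
import pandas as pd

a = pd.Series(np.arange(5))
d = {i: a for i in range(3)}

s = pd.Series(d)          # dict keys become the index, values the data
print(s.dtype)            # object
print(type(s.iloc[0]))    # <class 'pandas.core.series.Series'>
```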