DataFrame constructor with dict of series misbehaving when columns specified #24368

Closed
TomAugspurger opened this issue Dec 20, 2018 · 3 comments · Fixed by #57205
Labels
Constructors (Series/DataFrame/Index/pd.array Constructors), DataFrame (DataFrame data structure), Performance (Memory or execution speed performance)

Comments

@TomAugspurger
Contributor

We take a strange path...

diff --git a/pandas/core/internals/construction.py b/pandas/core/internals/construction.py
index c43745679..54108ec58 100644
--- a/pandas/core/internals/construction.py
+++ b/pandas/core/internals/construction.py
@@ -552,6 +552,7 @@ def sanitize_array(data, index, dtype=None, copy=False,
             data = data.copy()
 
     # GH#846
+    import pdb; pdb.set_trace()
     if isinstance(data, (np.ndarray, Index, ABCSeries)):
 
         if dtype is not None:
In [1]: import pandas as pd; import numpy as np

In [2]: a = pd.Series(np.array([1, 2, 3]))
> /Users/taugspurger/sandbox/pandas/pandas/core/internals/construction.py(556)sanitize_array()
-> if isinstance(data, (np.ndarray, Index, ABCSeries)):
(Pdb) c

In [3]: r = pd.DataFrame({"A": a, "B": a}, columns=['A', 'B'])
> /Users/taugspurger/sandbox/pandas/pandas/core/internals/construction.py(556)sanitize_array()
-> if isinstance(data, (np.ndarray, Index, ABCSeries)):
(Pdb) up
> /Users/taugspurger/sandbox/pandas/pandas/core/series.py(259)__init__()
-> raise_cast_failure=True)
(Pdb)
> /Users/taugspurger/sandbox/pandas/pandas/core/series.py(301)_init_dict()
-> s = Series(values, index=keys, dtype=dtype)
(Pdb)
> /Users/taugspurger/sandbox/pandas/pandas/core/series.py(204)__init__()
-> data, index = self._init_dict(data, index, dtype)
(Pdb)
> /Users/taugspurger/sandbox/pandas/pandas/core/internals/construction.py(176)init_dict()
-> arrays = Series(data, index=columns, dtype=object)
(Pdb)
> /Users/taugspurger/sandbox/pandas/pandas/core/frame.py(387)__init__()
-> mgr = init_dict(data, index, columns, dtype=dtype)
(Pdb) data
{'A': 0    1
1    2
2    3
dtype: int64, 'B': 0    1
1    2
2    3
dtype: int64}
(Pdb) d
> /Users/taugspurger/sandbox/pandas/pandas/core/internals/construction.py(176)init_dict()
-> arrays = Series(data, index=columns, dtype=object)
(Pdb) l
171         Segregate Series based on type and coerce into matrices.
172         Needs to handle a lot of exceptional cases.
173         """
174         if columns is not None:
175             from pandas.core.series import Series
176  ->         arrays = Series(data, index=columns, dtype=object)
177             data_names = arrays.index
178
179             missing = arrays.isnull()
180             if index is None:
181                 # GH10856
(Pdb) data
{'A': 0    1
1    2
2    3
dtype: int64, 'B': 0    1
1    2
2    3
dtype: int64}

I'm guessing we don't want to be passing a dict of Series into the Series constructor there.
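
To make the observation above concrete, here is a minimal sketch of what that call produces when run outside the DataFrame constructor. It reuses the `a` and `columns` values from the session; the variable names `data` and `arrays` follow the traceback, and the behavior may differ between pandas versions.

```python
import numpy as np
import pandas as pd

a = pd.Series(np.array([1, 2, 3]))
data = {"A": a, "B": a}
columns = ["A", "B"]

# Same shape of call as line 176 in the traceback: the whole dict of Series
# is handed to the Series constructor with dtype=object.
arrays = pd.Series(data, index=columns, dtype=object)

# The result is an object-dtype Series whose elements are themselves Series,
# so every column has to be boxed into a Python object slot first.
print(arrays.dtype)
print(type(arrays["A"]))
```

Building this intermediate container is pure overhead when the caller only needs the column arrays in the requested order.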

@TomAugspurger TomAugspurger added the Performance Memory or execution speed performance label Dec 20, 2018
@TomAugspurger
Contributor Author

We spend basically all our time in sanitize_array / _try_cast / construct_1d_object_array_from_listlike

And of that, all the time is in NumPy.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1189                                           def construct_1d_object_array_from_listlike(values):
  1190                                               """
  1191                                               Transform any list-like object in a 1-dimensional numpy array of object
  1192                                               dtype.
  1193                                           
  1194                                               Parameters
  1195                                               ----------
  1196                                               values : any iterable which has a len()
  1197                                           
  1198                                               Raises
  1199                                               ------
  1200                                               TypeError
  1201                                                   * If `values` does not have a len()
  1202                                           
  1203                                               Returns
  1204                                               -------
  1205                                               1-dimensional numpy array of dtype object
  1206                                               """
  1207                                               # numpy will try to interpret nested lists as further dimensions, hence
  1208                                               # making a 1D array that contains list-likes is a bit tricky:
  1209         3         29.0      9.7      0.0      result = np.empty(len(values), dtype='object')
  1210         3     723782.0 241260.7    100.0      result[:] = values
  1211         3          3.0      1.0      0.0      return result
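
The two profiled lines can be tried in isolation. The sketch below uses plain nested lists as a stand-in for the list of Series (an assumption made to keep the example small) and shows why the function assigns into a pre-allocated array instead of calling `np.array` directly.

```python
import numpy as np

values = [[1, 2, 3], [4, 5, 6]]

# Letting NumPy infer the shape treats the inner lists as a second dimension.
as_2d = np.array(values, dtype=object)
print(as_2d.shape)

# Pre-allocating a 1-D object array and assigning into it keeps each inner
# list as a single element, which is what the profiled function does.
result = np.empty(len(values), dtype="object")
result[:] = values
print(result.shape)
print(result[0])
```

The assignment on `result[:] = values` is the step that accounts for essentially all of the time in the profile above.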

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 21, 2018
When passing a dict and `columns=` to DataFrame, we previously
passed the dict of {column: array} to the Series constructor. This
eventually hit `construct_1d_object_array_from_listlike`[1]. For
extension arrays, this ends up calling `ExtensionArray.__iter__`,
iterating over the elements of the ExtensionArray, which is
prohibitively slow.

We try to properly handle all the edge cases that we were papering over
earlier by just passing the `data` to Series.

We fix a bug or two along the way, but don't change any *tested*
behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 21, 2018
When passing a dict and `columns=` to DataFrame, we previously
passed the dict of {column: array} to the Series constructor. This
eventually hit `construct_1d_object_array_from_listlike`[1]. For
extension arrays, this ends up calling `ExtensionArray.__iter__`,
iterating over the elements of the ExtensionArray, which is
prohibitively slow.

---

```python
import pandas as pd
import numpy as np

a = pd.Series(np.arange(1000))
d = {i: a for i in range(30)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

When the Series hold sparse values instead, the problem is exacerbated (note
the smaller and fewer series).

```python
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
213 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

after

```
4.41 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
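
The `%timeit` lines above need IPython. A rough equivalent in plain Python, using the standard `timeit` module, might look like the sketch below. The helper name `build` and the repeat count are assumptions for illustration, and the absolute numbers will depend on the pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd


def build(dtype):
    """Construct a DataFrame from a dict of Series with explicit columns."""
    a = pd.Series(np.arange(1000), dtype=dtype)
    d = {i: a for i in range(50)}
    return pd.DataFrame(d, columns=list(range(len(d))))


# Compare a plain integer dtype with the sparse extension dtype.
for dtype in ("int64", "Sparse[int]"):
    repeats = 5
    seconds = timeit.timeit(lambda: build(dtype), number=repeats) / repeats
    print(f"{dtype}: {seconds * 1e3:.2f} ms per call")
```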

---

We try to properly handle all the edge cases that we were papering over
earlier by just passing the `data` to Series.

We fix a bug or two along the way, but don't change any *tested*
behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
@misyntropy

I would like to add my vote to actually go through with this performance enhancement (more like a performance bug fix): constructing a DataFrame with index and columns as Index objects is around 200 times slower than constructing with np.zeros and then assigning index and columns (see below). It's really weird and counter-intuitive given how (seemingly?) simple the operation is.

df_a = pd.DataFrame(index=list(range(1000)))
df_b = pd.DataFrame(index=list(range(10000)))

%timeit df_c = pd.DataFrame(index=df_a.index,columns=df_b.index)
800 ms ± 70 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_d = pd.DataFrame(np.zeros((df_a.shape[0],df_b.shape[0]))); df_d.index = df_a.index; df_d.columns = df_b.index
4.36 ms ± 52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
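
As a self-contained sketch of the workaround described above, the snippet below builds the frame from a NumPy array and attaches the labels afterwards. The sizes are reduced and the names `row_index`, `col_index`, and `fast` are assumptions for illustration. Note that the result holds zeros, whereas `pd.DataFrame(index=..., columns=...)` produces an all-missing frame, so the two are not interchangeable.

```python
import numpy as np
import pandas as pd

row_index = pd.Index(range(100))
col_index = pd.Index(range(1000))

# Allocate the values as one 2-D block, then attach the labels afterwards.
fast = pd.DataFrame(np.zeros((len(row_index), len(col_index))))
fast.index = row_index
fast.columns = col_index

print(fast.shape)
```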

@TomAugspurger
Contributor Author

Yeah, we'll get there, probably for 0.24.1. Just need to do it in a way that's maintainable.

@jreback jreback added this to the Contributions Welcome milestone Jun 8, 2019
@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 23, 2019
@mroeschke mroeschke added the DataFrame DataFrame data structure label Jun 25, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022