Series / DataFrame constructors inconsistent with data=None and dtype #24385

Open
TomAugspurger opened this issue Dec 21, 2018 · 4 comments
Labels: Constructors (Series/DataFrame/Index/pd.array Constructors), Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement, Error Reporting (Incorrect or improved errors from pandas)

Comments

@TomAugspurger
Contributor

TomAugspurger commented Dec 21, 2018

The series constructed below should (I think) be a special case of the DataFrame example, but they differ.

In [4]: pd.DataFrame(None, index=[1, 2, 3], columns=['a'], dtype=int)
Out[4]:
    a
1 NaN
2 NaN
3 NaN

In [5]: pd.Series(None, index=[1, 2, 3], dtype=int)
Out[5]:
1    0
2    0
3    0
dtype: int64

I don't know which makes more sense.
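To see which behavior a given pandas installation shows, a minimal sketch like the one below may help. It reports whatever comes back instead of assuming a result, since the output can differ between pandas versions. The helper name `describe` is made up for illustration and is not part of pandas.

```python
import pandas as pd


def describe(build):
    """Run a zero-argument builder and summarize the resulting Series, or the error it raised."""
    try:
        ser = build()
    except Exception as exc:  # raising is one of the outcomes discussed in this issue
        return f"raised {type(exc).__name__}"
    return f"dtype={ser.dtype}, values={ser.tolist()}"


index = [1, 2, 3]

# The two constructor calls from the report above, with data=None and dtype=int.
series_result = describe(lambda: pd.Series(None, index=index, dtype=int))
frame_result = describe(
    lambda: pd.DataFrame(None, index=index, columns=["a"], dtype=int)["a"]
)

print("Series:   ", series_result)
print("DataFrame:", frame_result)
```

If the two printed lines disagree, the installed version still has the inconsistency described here.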

@TomAugspurger TomAugspurger added the Dtype Conversions Unexpected or buggy dtype conversions label Dec 21, 2018
@WillAyd
Member

WillAyd commented Dec 21, 2018

Kind of nuanced but I'd side with the DF constructor here

@TomAugspurger
Contributor Author

TomAugspurger commented Dec 21, 2018

Interesting, I was going to go with the Series one, since dtype=int is an explicit request from the user, whereas data is implicit. Though neither is especially intuitive.

Perhaps raising is the best action here :) We would still allow the implicit reindexing if we forced the user to write Series(0, index=[1, 2, 3], dtype=int).
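A rough sketch of that idea, under the assumption that the check lives in a user-side wrapper. `strict_series` is a hypothetical helper, not a pandas API, and pandas itself does not raise here as far as this thread states.

```python
import pandas as pd


def strict_series(data=None, index=None, dtype=None):
    """Hypothetical wrapper: refuse the ambiguous data=None + index + dtype combination."""
    if data is None and index is not None and dtype is not None:
        raise ValueError(
            "data=None with an index and a dtype is ambiguous; pass an explicit fill value"
        )
    return pd.Series(data, index=index, dtype=dtype)


# An explicit scalar is broadcast across the index, so the intent is unambiguous.
explicit = strict_series(0, index=[1, 2, 3], dtype=int)
print(explicit.tolist())

# The ambiguous call is rejected by the wrapper instead of guessing a fill value.
try:
    strict_series(None, index=[1, 2, 3], dtype=int)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Whether such a check could be applied generally inside the constructors is exactly the open question in this thread.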

@WillAyd
Member

WillAyd commented Dec 21, 2018

Yeah, raising is a good option if there’s a general way of doing it. Either way is ambiguous.

@TomAugspurger
Contributor Author

Also, apparently the result of DataFrame(None, index=[1, 2], columns=['a'], dtype=object) has object dtype, so the dtype= is only sometimes ignored.
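One way to check which requested dtypes survive is to loop over a few of them and record the dtype that actually comes back. This is only a sketch: the set of dtypes tried is an arbitrary choice, and the results for the non-object dtypes may vary by pandas version, so they are printed and not assumed.

```python
import pandas as pd

index = [1, 2]
observed = {}

# Request several dtypes with data=None and record the resulting column dtype.
for requested in ("object", "int64", "float64", "datetime64[ns]"):
    try:
        df = pd.DataFrame(None, index=index, columns=["a"], dtype=requested)
        observed[requested] = str(df["a"].dtype)
    except Exception as exc:  # some combinations may raise in some versions
        observed[requested] = f"raised {type(exc).__name__}"

for requested, actual in observed.items():
    marker = "kept" if requested == actual else "changed"
    print(f"requested {requested!r:>18} -> {actual} ({marker})")
```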

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 21, 2018
When passing a dict and `columns=` to DataFrame, we previously
passed the dict of {column: array} to the Series constructor. This
eventually hit `construct_1d_object_array_from_listlike`[1]. For
extension arrays, this ends up calling `ExtensionArray.__iter__`,
iterating over the elements of the ExtensionArray, which is
prohibitively slow.

We try to properly handle all the edge cases that we were papering over
earlier by just passing the `data` to Series.

We fix a bug or two along the way, but don't change any *tested*
behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 21, 2018
When passing a dict and `columns=` to DataFrame, we previously
passed the dict of {column: array} to the Series constructor. This
eventually hit `construct_1d_object_array_from_listlike`[1]. For
extension arrays, this ends up calling `ExtensionArray.__iter__`,
iterating over the elements of the ExtensionArray, which is
prohibitively slow.

---

```python
import pandas as pd
import numpy as np

a = pd.Series(np.arange(1000))
d = {i: a for i in range(30)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With Series with sparse values instead, the problem is exacerbated (note
the smaller and fewer series).

```python
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
213 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

after

```
4.41 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

---

We try to properly handle all the edge cases that we were papering over
earlier by just passing the `data` to Series.

We fix a bug or two along the way, but don't change any *tested*
behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
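
The `%timeit` lines in the commit message only run inside IPython. A plain-Python approximation of the sparse benchmark, using the standard-library `timeit` module, might look like the sketch below. The repeat counts are arbitrary choices, and the absolute timings depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same setup as the sparse benchmark: 50 columns sharing one sparse Series.
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}


def build():
    """Construct the DataFrame from a dict while also passing columns=."""
    return pd.DataFrame(d, columns=list(range(len(d))))


# Take the best of a few repeats and convert to seconds per single construction.
runs_per_repeat = 5
best = min(timeit.repeat(build, number=runs_per_repeat, repeat=3)) / runs_per_repeat
print(f"{best * 1e3:.2f} ms per construction")
```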
@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Oct 17, 2019
@mroeschke mroeschke added the Bug label Jun 28, 2020
@mroeschke mroeschke added Enhancement Error Reporting Incorrect or improved errors from pandas and removed Bug labels Jun 25, 2021