DataFrame constructor ignores integer dtype when dict-data and non-overlapping columns #24386

Closed
TomAugspurger opened this issue Dec 21, 2018 · 5 comments · Fixed by #34886
Labels: Constructors, Dtype Conversions, good first issue, Needs Tests, Reshaping

@TomAugspurger
Contributor

This is close to #24385, but we don't have a test saying otherwise, so I'm actually going to fix this one :)

In [21]: pd.DataFrame({"a": [1, 2]}, columns=['b'], dtype=int).dtypes
Out[21]:
b    float64
dtype: object

In [22]: pd.DataFrame({"a": [1, 2]}, columns=['b'], dtype='datetime64[ns]').dtypes
Out[22]:
b    datetime64[ns]
dtype: object

Out[21] should be int64.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 21, 2018
When passing a dict and `columns=` to DataFrame, we previously
passed the dict of {column: array} to the Series constructor. This
eventually hit `construct_1d_object_array_from_listlike`[1]. For
extension arrays, this ends up calling `ExtensionArray.__iter__`,
iterating over the elements of the ExtensionArray, which is
prohibitively slow.

---

```python
import pandas as pd
import numpy as np

a = pd.Series(np.arange(1000))
d = {i: a for i in range(30)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With Series with sparse values instead, the problem is exacerbated (note
the smaller and fewer series).

```python
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
213 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

after

```
4.41 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

---

We try to properly handle all the edge cases that we were papering over
earlier by just passing the `data` to Series.

We fix a bug or two along the way, but don't change any *tested*
behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
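
For readers without IPython, the two `%timeit` measurements in the commit message can be approximated with the standard-library `timeit` module. This is only a sketch: the helper name `build_frame` and the repeat counts are assumptions made for illustration, and the numbers it prints depend on the installed pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd


def build_frame(values: pd.Series, n_columns: int) -> pd.DataFrame:
    """Build a frame from a dict of identical Series, passing columns= explicitly."""
    data = {i: values for i in range(n_columns)}
    return pd.DataFrame(data, columns=list(range(n_columns)))


# Same inputs as the commit message: a dense Series and a sparse one.
dense = pd.Series(np.arange(1000))
sparse = pd.Series(np.arange(1000), dtype="Sparse[int]")

for label, values, n_columns in [("dense", dense, 30), ("sparse", sparse, 50)]:
    # Take the best of 3 runs of 5 calls each and report the per-call time.
    runs = timeit.repeat(lambda: build_frame(values, n_columns), number=5, repeat=3)
    print(f"{label}: {min(runs) / 5 * 1e3:.2f} ms per call")
```

Running it on a build from before and after the change should show whether the sparse case still pays the per-element iteration cost.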
@TomAugspurger added the Dtype Conversions and DataFrame labels Dec 21, 2018
@chengsoonong

It also seems to happen when constructing using index and columns only (no data).

pd.DataFrame(index=[0], columns=['b'], dtype=int).dtypes

returns

b    float64
dtype: object
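
One plausible reading of this case, offered as an assumption rather than anything confirmed in this thread: with `index=[0]` and no data, the frame has one row whose value is missing, and a NumPy integer dtype cannot represent a missing value, so the column comes back as `float64`. The sketch below checks which dtypes can hold `NaN` and shows the nullable `"Int64"` extension dtype as a possible workaround. The helper name `can_hold_nan` is hypothetical.

```python
import numpy as np
import pandas as pd


def can_hold_nan(dtype) -> bool:
    """Return True if a one-element NaN Series can be cast to dtype."""
    try:
        pd.Series([np.nan]).astype(dtype)
    except (ValueError, TypeError):
        return False
    return True


for name in ["float64", "int64", "Int64"]:
    print(f"{name}: can hold NaN = {can_hold_nan(name)}")

# Possible workaround: build the all-missing frame as float, then cast the
# column to the nullable integer dtype, which stores the gap as <NA>.
frame = pd.DataFrame(index=[0], columns=["b"], dtype=float)
as_nullable = frame.astype("Int64")
print(as_nullable.dtypes)
```

This differs from the original report, where the non-overlapping `columns=` leaves the frame with no rows at all, so nothing prevents the column from being `int64`.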

@jreback added this to the Contributions Welcome milestone Jun 8, 2019
@jbrockmendel added the Constructors label Jul 23, 2019
@mroeschke
Member

This looks to be fixed on master. Could use a test.

In [17]: pd.DataFrame({"a": [1, 2]}, columns=['b'], dtype=int).dtypes
Out[17]:
b    int64
dtype: object

In [18]: pd.__version__
Out[18]: '1.1.0.dev0+1313.g1c0cc62e3'

@mroeschke added the good first issue and Needs Tests labels and removed the Constructors, DataFrame, and Dtype Conversions labels Apr 20, 2020
@avinashpancham
Contributor

take

@jreback added the Constructors, Dtype Conversions, and Reshaping labels Jun 20, 2020
@jreback modified the milestones: Contributions Welcome, 1.1 Jun 20, 2020
@avinashpancham
Contributor

take

jreback pushed a commit that referenced this issue Jun 20, 2020
…#34886)

* TST: Ensure dtypes are set correctly for empty integer columns #24386

* Add comment to refer to GH issue tracker

* Refactor check, use == instead of is

* Moved file to test_constructors.py and added test for other dtypes

* Add support for more dtypes

* Refactor testing for data types using containers in _testing.py
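
A rough sketch of the kind of check the bullets above describe, written without pytest so it runs as a plain script. The dtype list and the helper name `empty_column_dtype` are assumptions chosen for illustration, not the contents of the actual test, and the check is expected to hold only on pandas versions that include the fix.

```python
import numpy as np
import pandas as pd

# Illustrative subset of dtypes; the real test covers more.
CANDIDATE_DTYPES = ["int64", "float64", "bool", "datetime64[ns]"]


def empty_column_dtype(dtype):
    """Build a frame from dict data whose key does not overlap columns=."""
    frame = pd.DataFrame({"a": [1, 2]}, columns=["b"], dtype=dtype)
    return frame["b"].dtype


results = {name: empty_column_dtype(name) for name in CANDIDATE_DTYPES}
for name, actual in results.items():
    # Use == rather than `is`, as in the commit's refactor note.
    status = "ok" if actual == np.dtype(name) else "MISMATCH"
    print(f"{name:>15} -> {actual} ({status})")
```
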
@zabdorff

I'm still seeing the issue @chengsoonong raised, in 1.2.4.

It also seems to happen when constructing using index and columns only (no data).

pd.DataFrame(index=[0], columns=['b'], dtype=int).dtypes

returns

b    float64
dtype: object
