DataFrame constructor ignores integer dtype when dict-data and non-overlapping columns #24386

TomAugspurger · 2018-12-21T18:27:35Z

This is close to #24385, but we don't have a test saying otherwise, so I'm actually going to fix this one :)

In [21]: pd.DataFrame({"a": [1, 2]}, columns=['b'], dtype=int).dtypes
Out[21]:
b    float64
dtype: object

In [22]: pd.DataFrame({"a": [1, 2]}, columns=['b'], dtype='datetime64[ns]').dtypes
Out[22]:
b    datetime64[ns]
dtype: object

Out[21] should be int64.

When passing a dict and `columns=` to DataFrame, we previously passed the dict of {column: array} to the Series constructor. This eventually hit `construct_1d_object_array_from_listlike`[1]. For extension arrays, this ends up calling `ExtensionArray.__iter__`, iterating over the elements of the ExtensionArray, which is prohibitively slow.

We try to properly handle all the edge cases that we were papering over earlier by just passing the `data` to Series. We fix a bug or two along the way, but don't change any *tested* behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386

When passing a dict and `columns=` to DataFrame, we previously passed the dict of {column: array} to the Series constructor. This eventually hit `construct_1d_object_array_from_listlike`[1]. For extension arrays, this ends up calling `ExtensionArray.__iter__`, iterating over the elements of the ExtensionArray, which is prohibitively slow.

---

```python
import pandas as pd
import numpy as np

a = pd.Series(np.arange(1000))
d = {i: a for i in range(30)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With Series with sparse values instead, the problem is exacerbated (note the smaller and fewer series).

```python
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
213 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

after

```
4.41 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

---

We try to properly handle all the edge cases that we were papering over earlier by just passing the `data` to Series. We fix a bug or two along the way, but don't change any *tested* behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386

chengsoonong · 2019-01-16T03:39:01Z

It also seems to happen when constructing using index and columns only (no data).

pd.DataFrame(index=[0], columns=['b'], dtype=int).dtypes

returns

b    float64
dtype: object
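A note on this variant: unlike the dict case in the original report, passing `index=[0]` means the column has to hold one missing value, and a NumPy integer column cannot represent NaN, which is a plausible reason for the fallback to float64. The sketch below is one possible workaround rather than a fix. It assumes a concrete fill value (0 here, chosen only for illustration) is acceptable, so that no missing value has to be stored.

```python
import pandas as pd

# Hypothetical workaround: pass a scalar fill value (0 is an arbitrary choice)
# so the integer column never has to store a missing value.
df = pd.DataFrame(0, index=[0], columns=["b"], dtype="int64")

# Inspect the resulting column dtype.
print(df.dtypes)
```

Whether a fill value is appropriate depends on the use case; it sidesteps the missing-value question instead of answering it.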

mroeschke · 2020-04-20T00:23:13Z

This looks to be fixed on master. It could use a test.

In [17]: pd.DataFrame({"a": [1, 2]}, columns=['b'], dtype=int).dtypes
Out[17]:
b    int64
dtype: object

In [18]: pd.__version__
Out[18]: '1.1.0.dev0+1313.g1c0cc62e3'
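
A regression check for the behavior shown above might look roughly like the following. This is only a sketch, not the test that was eventually merged: the helper name `empty_column_dtype` is made up for illustration, and it assumes a pandas version where the fix is present (the output above is from a 1.1.0 development build).

```python
import numpy as np
import pandas as pd


def empty_column_dtype(dtype):
    """Return the dtype of column 'b' built from dict data with no matching key.

    `empty_column_dtype` is a hypothetical helper name, not a pandas API.
    """
    # The only dict key ("a") is not listed in `columns`, as in the report,
    # so the resulting column "b" has no rows.
    df = pd.DataFrame({"a": [1, 2]}, columns=["b"], dtype=dtype)
    return df["b"].dtype


# "int64" is spelled out so the check does not depend on the platform's
# default integer width.
result = empty_column_dtype("int64")
assert result == np.dtype("int64"), result
```

The same helper can be called with other dtypes, such as `"datetime64[ns]"` from the original report, to check that the requested dtype is preserved.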

avinashpancham · 2020-06-20T09:59:35Z

take


avinashpancham · 2020-06-20T13:02:38Z

take

…#34886)

* TST: Ensure dtypes are set correctly for empty integer columns #24386
* Add comment to refer to GH issue tracker
* Refactor check, use == instead of is
* Moved file to test_constructors.py and added test for other dtypes
* Add support for more dtypes
* Refactor testing for data types using containers in _testing.py

zabdorff · 2021-05-12T19:26:31Z

I'm still seeing the issue @chengsoonong raised, in 1.2.4.

It also seems to happen when constructing using index and columns only (no data).
pd.DataFrame(index=[0], columns=['b'], dtype=int).dtypes
returns
b    float64
dtype: object

TomAugspurger mentioned this issue Dec 21, 2018

PERF: DataFrame dict constructor with columns #24387

Closed

TomAugspurger added Dtype Conversions Unexpected or buggy dtype conversions DataFrame DataFrame data structure labels Dec 21, 2018

jreback added this to the Contributions Welcome milestone Jun 8, 2019

jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 23, 2019

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Constructors Series/DataFrame/Index/pd.array Constructors DataFrame DataFrame data structure Dtype Conversions Unexpected or buggy dtype conversions labels Apr 20, 2020

github-actions bot assigned avinashpancham Jun 20, 2020

avinashpancham added a commit to avinashpancham/pandas that referenced this issue Jun 20, 2020

TST: Ensure dtypes are set correctly for empty integer columns pandas…

931968f

…-dev#24386

avinashpancham added a commit to avinashpancham/pandas that referenced this issue Jun 20, 2020

TST: Ensure dtypes are set correctly for empty integer columns pandas…

66433d5

…-dev#24386

avinashpancham mentioned this issue Jun 20, 2020

TST: Ensure dtypes are set correctly for empty integer columns #24386 #34886

Merged

jreback added Constructors Series/DataFrame/Index/pd.array Constructors Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 20, 2020

jreback modified the milestones: Contributions Welcome, 1.1 Jun 20, 2020

jreback closed this as completed in #34886 Jun 20, 2020
