DataFrame constructor ignores integer dtype when dict-data and non-overlapping columns #24386
Labels
Constructors
Series/DataFrame/Index/pd.array Constructors
Dtype Conversions
Unexpected or buggy dtype conversions
good first issue
Needs Tests
Unit test(s) needed to prevent regressions
Reshaping
Concat, Merge/Join, Stack/Unstack, Explode
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 21, 2018
When passing a dict and `columns=` to DataFrame, we previously passed the dict of {column: array} to the Series constructor. This eventually hit `construct_1d_object_array_from_listlike`[1]. For extension arrays, this ends up calling `ExtensionArray.__iter__`, iterating over the elements of the ExtensionArray, which is prohibitively slow.

We try to properly handle all the edge cases that we were papering over earlier by just passing the `data` to Series. We fix a bug or two along the way, but don't change any *tested* behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 21, 2018
When passing a dict and `columns=` to DataFrame, we previously passed the dict of {column: array} to the Series constructor. This eventually hit `construct_1d_object_array_from_listlike`[1]. For extension arrays, this ends up calling `ExtensionArray.__iter__`, iterating over the elements of the ExtensionArray, which is prohibitively slow.

---

```python
import pandas as pd
import numpy as np

a = pd.Series(np.arange(1000))
d = {i: a for i in range(30)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With Series with sparse values instead, the problem is exacerbated (note the smaller and fewer series).

```python
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
213 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

after

```
4.41 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

---

We try to properly handle all the edge cases that we were papering over earlier by just passing the `data` to Series. We fix a bug or two along the way, but don't change any *tested* behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
It also seems to happen when constructing using index and columns only (no data).
returns
This looks to be fixed on master. Could use a test.
take
avinashpancham added a commit to avinashpancham/pandas that referenced this issue Jun 20, 2020
avinashpancham added a commit to avinashpancham/pandas that referenced this issue Jun 20, 2020
take
jreback pushed a commit that referenced this issue Jun 20, 2020
…#34886)
* TST: Ensure dtypes are set correctly for empty integer columns #24386
* Add comment to refer to GH issue tracker
* Refactor check, use == instead of is
* Moved file to test_constructors.py and added test for other dtypes
* Add support for more dtypes
* Refactor testing for data types using containers in _testing.py
I'm still seeing the issue @chengsoonong raised, in 1.2.4.
This is close to #24385, but we don't have a test saying otherwise, so I'm actually going to fix this one :)
Out[21] should be int64.
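The original snippet for this report did not survive on this page, so the following is only a minimal sketch of the case named in the title: dict data whose keys do not overlap the requested `columns=`, combined with an integer `dtype=`. The names `"a"` and `"b"` are placeholders chosen for illustration, not taken from the issue, and the final check assumes a pandas version that includes the fix referenced in this thread.

```python
import pandas as pd

# Assumed reproduction: the dict key "a" does not overlap with columns=["b"],
# so column "b" receives no data and only the requested dtype describes it.
df = pd.DataFrame({"a": [1, 2]}, columns=["b"], dtype="int64")

# Inspect the resulting dtype of the data-less column.
print(df.dtypes)

# Regression-style check; expected to hold on versions that include the fix.
assert df["b"].dtype == "int64"
```

On an affected version the assertion would be the line that fails, which makes this usable as a quick check when reporting against a specific release.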