Series / DataFrame constructors inconsistent with data=None and dtype #24385
Comments
Kind of nuanced but I'd side with the DF constructor here |
Interesting, I was going to go with the Series one, since Perhaps raising is the best action here :) We still allow the implicit reindexing if we force the user to do |
Yea raising is a good option if there’s a general way of doing it. Either way is ambiguous |
Also, apparently the output of |
When passing a dict and `columns=` to DataFrame, we previously passed the dict of {column: array} to the Series constructor. This eventually hit `construct_1d_object_array_from_listlike`[1]. For extension arrays, this ends up calling `ExtensionArray.__iter__`, iterating over the elements of the ExtensionArray, which is prohibitively slow.

---

```python
import pandas as pd
import numpy as np

a = pd.Series(np.arange(1000))
d = {i: a for i in range(30)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after

```
4.06 ms ± 53.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With Series with sparse values instead, the problem is exacerbated (note the smaller and fewer series).

```python
a = pd.Series(np.arange(1000), dtype="Sparse[int]")
d = {i: a for i in range(50)}

%timeit df = pd.DataFrame(d, columns=list(range(len(d))))
```

before

```
213 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

after

```
4.41 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

---

We try to properly handle all the edge cases that we were papering over earlier by just passing the `data` to Series. We fix a bug or two along the way, but don't change any *tested* behavior, even if it looks fishy (e.g. pandas-dev#24385).

[1]: pandas-dev#24368 (comment)

Closes pandas-dev#24368
Closes pandas-dev#24386
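The `%timeit` lines in the commit message only run inside IPython. A rough sketch of the same comparison as a plain script, using the standard-library `timeit` module, might look like the following. The helper name `time_dict_constructor` and the repeat count are assumptions made for illustration; absolute timings will depend on the pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd


def time_dict_constructor(values, n_columns, repeats=3):
    """Best wall-clock time in seconds to build a DataFrame from a dict
    of identical Series while also passing `columns=`.

    Hypothetical helper, not part of pandas.
    """
    data = {i: values for i in range(n_columns)}
    columns = list(range(n_columns))
    timer = timeit.Timer(lambda: pd.DataFrame(data, columns=columns))
    # Take the minimum over a few single runs to reduce noise.
    return min(timer.repeat(repeat=repeats, number=1))


dense = pd.Series(np.arange(1000))
sparse = pd.Series(np.arange(1000), dtype="Sparse[int]")

dense_seconds = time_dict_constructor(dense, n_columns=30)
sparse_seconds = time_dict_constructor(sparse, n_columns=50)

print(f"dense:  {dense_seconds * 1e3:.2f} ms")
print(f"sparse: {sparse_seconds * 1e3:.2f} ms")
```

On a pandas build that includes the fix, the two timings would be expected to land in the same order of magnitude; on a build without it, the sparse case should be noticeably slower.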
The series constructed below should (I think) be a special case of the DataFrame example, but they differ.
I don't know which makes more sense.
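The original example is not reproduced in this excerpt. As a rough illustration of the kind of comparison the title describes, the sketch below builds a Series and a one-column DataFrame with `data=None`, an explicit index, and an explicit `dtype`, then reports the resulting dtype (or the exception raised). The helper name `resulting_dtypes`, the index `[0, 1]`, the column name `"a"`, and the choice of `"int64"` are all assumptions, not taken from the issue; the outcome may differ between pandas versions, so no expected output is shown.

```python
import pandas as pd


def resulting_dtypes(index, dtype):
    """Construct a Series and a one-column DataFrame with no data and
    report the dtype each ends up with, or the exception it raised.

    Hypothetical helper for illustration only.
    """
    outcomes = {}

    try:
        series = pd.Series(None, index=index, dtype=dtype)
        outcomes["Series"] = str(series.dtype)
    except Exception as exc:  # report rather than abort the comparison
        outcomes["Series"] = f"raised {type(exc).__name__}"

    try:
        frame = pd.DataFrame(None, index=index, columns=["a"], dtype=dtype)
        outcomes["DataFrame"] = str(frame["a"].dtype)
    except Exception as exc:
        outcomes["DataFrame"] = f"raised {type(exc).__name__}"

    return outcomes


outcomes = resulting_dtypes(index=[0, 1], dtype="int64")
for constructor, result in outcomes.items():
    print(f"{constructor}: {result}")
```

If the two lines printed differ, that is the inconsistency under discussion: the Series case is conceptually a single-column DataFrame, so one would expect both constructors to agree or both to raise.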