PERF: DataFrame dict constructor with columns #24387
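For orientation, the construction pattern whose performance this PR targets is building a DataFrame from a dict while also passing `columns`. A small illustration (my own, not a benchmark from the PR; sizes are arbitrary):

```python
import numpy as np
import pandas as pd

data = {i: np.arange(1_000) for i in range(100)}

# Without `columns`, the constructor just takes the dict's own keys.
df1 = pd.DataFrame(data)

# With `columns`, the keys have to be matched to positions, reordered, and
# any missing columns filled in -- this is the path the PR rewrites.
df2 = pd.DataFrame(data, columns=range(100))
```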
Changes from all commits: a7606e7, 29a638c, 025fb91, ed70cef, bfadc0b, 5e1da4b, 559afc7, dc43a1e, c1f2a58
@@ -21,7 +21,8 @@
 from pandas.core.dtypes.common import (
     is_categorical_dtype, is_datetime64tz_dtype, is_dtype_equal,
     is_extension_array_dtype, is_extension_type, is_float_dtype,
-    is_integer_dtype, is_iterator, is_list_like, is_object_dtype, pandas_dtype)
+    is_integer_dtype, is_iterator, is_list_like, is_object_dtype,
+    is_string_dtype, pandas_dtype)
 from pandas.core.dtypes.generic import (
     ABCDataFrame, ABCDatetimeIndex, ABCIndexClass, ABCPeriodIndex, ABCSeries,
     ABCTimedeltaIndex)
@@ -171,44 +172,150 @@ def init_dict(data, index, columns, dtype=None):
     Segregate Series based on type and coerce into matrices.
     Needs to handle a lot of exceptional cases.
     """
-    if columns is not None:
-        from pandas.core.series import Series
-        arrays = Series(data, index=columns, dtype=object)
-        data_names = arrays.index
+    from pandas.core.series import Series
+
+    # Converting a dict of arrays to a list of arrays sounds easy enough,
+    # right? Well, it's a bit more nuanced than that. Some problems:
+    # a. Pandas allows missing values in the keys. If a user provides a dict
+    #    where the keys never compare equal (np.nan, pd.NaT, float('nan')),
+    #    we can't ever do a `data[key]`. So we *have* to iterate over the
+    #    key, value pairs of `data`, no way around it.
+    # b. The key, value pairs of `data` may have
+    #    1. A subset of the desired columns
+    #    2. A superset of the columns
+    #    3. Just the right columns
+    #    And may or may not be in the right order (or ordered, period).
+    #    So we need to get a mapping from `key in data -> position`.
+    # c. Inconsistencies between the Series and DataFrame constructors
+    #    w.r.t. dtypes make for a lot of special casing later on.
+    # But the basic strategy we use is
+    # 1. Build a mapping `positions` from {key_in_data: position}
+    # 2. Build a mapping `new_data` from {position: array}
+    # 3. Update `new_data` with newly-created arrays from `columns`
+    # 4. Convert `new_data` to a list of arrays.
-        missing = arrays.isnull()
-        if index is None:
-            # GH10856
-            # raise ValueError if only scalars in dict
-            index = extract_index(arrays[~missing])
-        else:
-            index = ensure_index(index)
+    if columns is None:
+        columns = list(data)
-        # no obvious "empty" int column
-        if missing.any() and not is_integer_dtype(dtype):
-            if dtype is None or np.issubdtype(dtype, np.flexible):
-                # GH#1783
-                nan_dtype = object
-            else:
-                nan_dtype = dtype
-            val = construct_1d_arraylike_from_scalar(np.nan, len(index),
-                                                     nan_dtype)
-            arrays.loc[missing] = [val] * missing.sum()
+    if not isinstance(columns, Index):
+        # check for isinstance, else we lose the identity of user-provided
+        # `columns`.
+        columns = ensure_index(columns)
+    # Columns may not be unique (even though we're in init_dict and
+    # dict keys have to be unique...). We have two possible strategies:
+    # 1.) Gracefully handle duplicates when going through data to build
+    #     new_data.
+    # 2.) Focus only on unique values on a first pass, and insert duplicates
+    #     in the correct positions after the uniques have been handled.
+    # We take option 2.

Review comment on this block:

So the problem I have with this complexity is that 1 & 2 are not really nice, nor is the current strategy, because of perf issues. Since we must generate arrays from this in the first place, why is that not a valid strategy? e.g. construct arrays of the data with a valid column mapping, THEN use columns if provided to re-order / select as a second op (you can make this better by eliminating columns on the first pass if columns is provided). Further, the selection of new dtypes (for columns that are not provided); why are you re-inventing the wheel and not just using ...? The problem is the scope creep makes this impossible to review otherwise. I agree that this is a pretty nasty function now, but this change as written is not better at all. Try the suggestion above. If not, then this will need a bug-fix pass first, to adjust what it is doing to avoid special casing, as precursor PRs rather than a large refactor. I get that things like this take significant time. I have done many many PRs where I spent a lot of time and they don't make things simpler, but that is just a fact of life here. As I said, it is probably better to try to remove the special cases first.
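A minimal sketch of the approach the reviewer describes: build the arrays straight from `data` first, then re-order / select against `columns` as a second step. This is my own illustration, not code from the PR; the helper name and the all-NaN object fill for missing columns are assumptions, and it assumes the dict values are equal-length array-likes.

```python
import numpy as np

def build_then_select(data, columns=None):
    # First pass: pull the arrays out in the dict's own key order.
    keys = list(data)
    arrays = [np.asarray(data[k]) for k in keys]
    if columns is None:
        return keys, arrays

    # Second pass: re-order / select to match `columns`, creating an
    # all-NaN object column for anything not present in `data`.
    key_to_pos = {k: i for i, k in enumerate(keys)}
    length = len(arrays[0]) if arrays else 0
    out = []
    for col in columns:
        if col in key_to_pos:
            out.append(arrays[key_to_pos[col]])
        else:
            out.append(np.full(length, np.nan, dtype=object))
    return list(columns), out
```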
+    if not columns.is_unique:
+        columns_with_duplicates = columns.copy()
+        columns = columns.unique()
+    else:
+        columns_with_duplicates = None
-        for key in data:
-            if (isinstance(data[key], ABCDatetimeIndex) and
-                    data[key].tz is not None):
-                # GH#24096 need copy to be deep for datetime64tz case
-                # TODO: See if we can avoid these copies
-                data[key] = data[key].copy(deep=True)
-
-        keys = com.dict_keys_to_ordered_list(data)
-        columns = data_names = Index(keys)
-        arrays = [data[k] for k in keys]
-
-    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
+    if data:
+        normalized_keys = Index(data.keys(), copy=False)
+        positions = Series(columns.get_indexer_for(normalized_keys),
+                           index=normalized_keys)
+    else:
+        positions = Series()
+
+    new_data = {}
+    index_len = 0 if index is None else len(index)
+
+    for key, val in data.items():
+        position = positions[key]
+        if position < 0:
+            # Something like data={"A": [...]}, columns={"B"}
+            continue
+        if (isinstance(val, ABCDatetimeIndex) and
+                data[key].tz is not None):
+            # GH#24096 need copy to be deep for datetime64tz case
+            # TODO: See if we can avoid these copies
+            val = val.copy(deep=True)
+
+        elif val is None:
+            # Users may provide scalars as values. These are broadcast to
+            # the right shape to align with `index`. We would use the Series
+            # constructor, but Series(None, index=index) is converted to
+            # NaNs. In DataFrame,
+            # DataFrame({"A": None}, index=[1, 2], columns=["A"])
+            # is an array of Nones.
+            val = Series([None] * index_len, index=index,
+                         dtype=dtype or object)
+
+        elif index_len and lib.is_scalar(val):
+            val = Series(val, index=index, dtype=dtype)
+
+        new_data[position] = val
+    # OK, so the user-provided columns in `data` are taken care of. Let's
+    # move on to "extra" columns as defined by `columns`. First, we figure
+    # out the positions of the holes we're filling in.
+    extra_positions = np.arange(len(columns))
+    mask = ~Series(extra_positions).isin(positions).values
+    extra_positions = extra_positions[mask]
+
+    # And now, what should the dtype of these new guys be? Well, that's
+    # tricky.
+    # 1. User provided dtype, just use that...
+    #    unless the user provided dtype=int and an index (GH-24385)
+    #    - DataFrame(None, index=idx, columns=cols, dtype=int) :: float
+    #    - DataFrame(None, index=idx, columns=cols, dtype=object) :: object
+    # 2. Empty data.keys() & columns is object (unless specified by the user)
+    # 3. No data and no dtype is object (unless specified by the user).
+    # 4. For string-like `dtype`, things are even more subtle.
+    #    a.) We rely on arrays_to_mgr to coerce values to strings when
+    #        the user provides dtype=str.
+    #    b.) But we don't want the values coercion for newly-created
+    #        columns. This only partly works. See
+    #        https://github.com/pandas-dev/pandas/issues/24388 for more.
+
+    empty_columns = len(positions.index & columns) == 0
+    any_new_columns = len(extra_positions)
+    if empty_columns and dtype is None:
+        dtype = object
+    elif (index_len
+          and is_integer_dtype(dtype)
+          and any_new_columns):
+        dtype = float
+    elif not data and dtype is None:
+        dtype = np.dtype('object')
+
+    elif (empty_columns
+          and is_string_dtype(dtype)
+          and not is_categorical_dtype(dtype)):
+        # For user-provided `dtype=str`, we want to preserve that so that
+        # arrays_to_mgr handles the *values* coercion from user-provided
+        # values to strings. *But* we don't want that coercion for columns
+        # that were newly created. There's a bug here: we only handle this
+        # correctly when all the columns are newly created. See
+        # https://github.com/pandas-dev/pandas/issues/24388 for more.
+        dtype = np.dtype("object")
+
+    for position in extra_positions:
+        new_data[position] = Series(index=index, dtype=dtype)
+
+    arrays = [new_data[i] for i in range(len(columns))]
+
+    if columns_with_duplicates is not None:
+        duplicated = columns_with_duplicates.duplicated()
+        duplicate_positions = np.arange(len(duplicated))[duplicated]
+        offset = 0
+
+        for position in duplicate_positions:
+            key = columns_with_duplicates[position]
+            loc = columns.get_loc(key)
+            arrays.insert(position, arrays[loc])
+            offset += 1
+
+        columns = columns_with_duplicates
+
+    return arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
 # ---------------------------------------------------------------------
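To make the dtype rules sketched in the new comments concrete, here is a small demo of the behaviors described there. This is my own illustration, not part of the PR, and exact results can vary across pandas versions:

```python
import pandas as pd

idx = pd.Index([1, 2])
cols = pd.Index(["A", "B"])

# dtype=int plus an index: the newly created, all-missing columns come
# back as float rather than int (GH-24385).
print(pd.DataFrame(None, index=idx, columns=cols, dtype=int).dtypes)

# dtype=object is preserved for the new columns.
print(pd.DataFrame(None, index=idx, columns=cols, dtype=object).dtypes)

# No data and no dtype: the empty columns default to object.
print(pd.DataFrame({}, index=idx, columns=cols).dtypes)
```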
this is way way too complicated

What is that comment supposed to achieve? I've put a ton of time into this. The DataFrame constructor is complicated with way too many special cases. I agree. This is what it takes to avoid the issue.

Like come on? What was the point of that? Do you think I intentionally write complex code?

Of course not. My point is that the perf fix is not worth it with this complexity.

Could you point out specific parts you find too complex instead of dismissing the whole thing? As I pointed out in the initial issue, the previous approach of passing the dict of arrays to the Series constructor isn't workable, because it eventually iterates element-wise over every value in every collection. If you have suggestions on how to avoid that, please put them forward.
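For context, a rough sketch of the two paths being contrasted here. This is my own illustration, not code from the PR, and the sizes are arbitrary:

```python
import numpy as np
import pandas as pd

data = {i: np.arange(10_000) for i in range(500)}
columns = list(range(500))

# Previous approach: wrap the dict in an object-dtype Series keyed by
# `columns`. Each array becomes one element of an object ndarray, and the
# later missing-value/alignment handling ends up touching values one by one.
arrays = pd.Series(data, index=columns, dtype=object)

# Approach in this PR (simplified): leave the arrays alone and only build
# a key -> position mapping against `columns`.
positions = {key: pos for pos, key in enumerate(columns)}
ordered = [data[key] for key in columns if key in data]
```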
I will have a look.