
PERF: DataFrame dict constructor with columns #24387


Closed · wants to merge 9 commits

Changes from 5 commits
4 changes: 4 additions & 0 deletions asv_bench/benchmarks/frame_ctor.py
@@ -17,6 +17,7 @@ def setup(self):
frame = DataFrame(np.random.randn(N, K), index=self.index,
columns=self.columns)
self.data = frame.to_dict()
self.series_data = frame.to_dict(orient='series')
self.dict_list = frame.to_dict(orient='records')
self.data2 = {i: {j: float(j) for j in range(100)}
for i in range(2000)}
@@ -33,6 +34,9 @@ def time_nested_dict_index(self):
def time_nested_dict_columns(self):
DataFrame(self.data, columns=self.columns)

def time_nested_dict_columns_series(self):
DataFrame(self.series_data, columns=self.columns)

def time_nested_dict_index_columns(self):
DataFrame(self.data, index=self.index, columns=self.columns)
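
For context, a quick look at what the new fixture holds; a sketch with a toy frame, not the benchmark's:

import pandas as pd

# orient='series' keeps each column as a Series rather than a nested
# dict, so the new benchmark exercises the dict-of-Series path.
frame = pd.DataFrame({'a': [1, 2]})
print(frame.to_dict())                            # {'a': {0: 1, 1: 2}}
print(type(frame.to_dict(orient='series')['a']))  # <class 'pandas.core.series.Series'>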

3 changes: 2 additions & 1 deletion doc/source/whatsnew/v0.24.0.rst
@@ -1401,10 +1401,11 @@ Numeric
- Logical operations ``&, |, ^`` between :class:`Series` and :class:`Index` will no longer raise ``ValueError`` (:issue:`22092`)
- Checking PEP 3141 numbers in :func:`~pandas.api.types.is_scalar` function returns ``True`` (:issue:`22903`)

Conversion
^^^^^^^^^^

- Bug in :meth:`DataFrame.combine_first` in which column types were unexpectedly converted to float (:issue:`20699`)
- Bug in :meth:`DataFrame.__init__` where passing ``dict`` data, ``columns`` that do not overlap the keys in ``data``, and an integer ``dtype`` returned a DataFrame with floating-point values (:issue:`24386`)

Strings
^^^^^^^
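A minimal reproduction of the fix, mirroring the new test_constructor_dtype_non_overlapping_columns test below:

import pandas as pd

# 'A' is dropped because it is not in `columns`; the empty 'B' column
# should honour dtype=int instead of degrading to float64 (GH-24386).
df = pd.DataFrame({"A": [1, 2]}, columns=["B"], dtype=int)
print(df.dtypes["B"])  # int64 with this PR; float64 before it
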
173 changes: 136 additions & 37 deletions pandas/core/internals/construction.py
@@ -21,7 +21,8 @@
from pandas.core.dtypes.common import (
is_categorical_dtype, is_datetime64tz_dtype, is_dtype_equal,
is_extension_array_dtype, is_extension_type, is_float_dtype,
is_integer_dtype, is_iterator, is_list_like, is_object_dtype, pandas_dtype)
is_integer_dtype, is_iterator, is_list_like, is_object_dtype,
is_string_dtype, pandas_dtype)
from pandas.core.dtypes.generic import (
ABCDataFrame, ABCDatetimeIndex, ABCIndexClass, ABCPeriodIndex, ABCSeries,
ABCTimedeltaIndex)
@@ -171,44 +172,142 @@ def init_dict(data, index, columns, dtype=None):
Segregate Series based on type and coerce into matrices.
Needs to handle a lot of exceptional cases.
"""
if columns is not None:
from pandas.core.series import Series
arrays = Series(data, index=columns, dtype=object)
data_names = arrays.index

missing = arrays.isnull()
if index is None:
# GH10856
# raise ValueError if only scalars in dict
index = extract_index(arrays[~missing])
else:
index = ensure_index(index)

# no obvious "empty" int column
if missing.any() and not is_integer_dtype(dtype):
if dtype is None or np.issubdtype(dtype, np.flexible):
# GH#1783
nan_dtype = object
else:
nan_dtype = dtype
val = construct_1d_arraylike_from_scalar(np.nan, len(index),
nan_dtype)
arrays.loc[missing] = [val] * missing.sum()

from pandas.core.series import Series
Contributor:

this is way way too complicated

Contributor Author:

...

Contributor Author:

What is that comment supposed to achieve? I've put a ton of time into this. The DataFrame constructor is complicated with way too many special cases. I agree. This is what it takes to avoid the issue.

Contributor Author:

Like, come on, what was the point of that? Do you think I intentionally write complex code?

Contributor:
Of course not.
My point is that the perf fix is not worth it with this complexity.

Contributor Author:

Could you point out specific parts you find too complex instead of dismissing the whole thing? As I pointed out in the initial issue, the previous approach of passing the dict of arrays to the Series constructor isn't workable because it eventually iterates element-wise over every value in every collection. If you have suggestions on how to avoid that, please put them forward.
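
For reference, a rough timing harness for the constructor path under discussion; a sketch only, with arbitrary sizes:

import numpy as np
import pandas as pd
from timeit import timeit

# A dict of arrays plus an explicit `columns` hits the slow path that
# this PR targets.
data = {i: np.random.randn(10000) for i in range(100)}
cols = list(range(100))
print(timeit(lambda: pd.DataFrame(data, columns=cols), number=10))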

Contributor:

I will have a look.


# Converting a dict of arrays to a list of arrays sounds easy enough,
# right? Well, it's a bit more nuanced than that. Some problems:
# 1. Pandas allows missing values in the keys. If a user provides a dict
# where the keys never compare equal (np.nan, pd.NaT, float('nan'))
# we can't ever do a `data[key]`. So we *have* to iterate over the
# key-value pairs of `data`, no way around it.
# 2. The key-value pairs of `data` may have
# 1. A subset of the desired columns
# 2. A superset of the columns
# 3. Just the right columns
# And may or may not be in the right order (or ordered, period).
# So we need to get a mapping from `key in data -> position`.
# 3. Inconsistencies between the Series and DataFrame constructors
# w.r.t. dtypes make for a lot of special casing later on.
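
Point 1 in isolation; this is plain Python dict behaviour, not pandas-specific:

# Lookup succeeds only for the *same* NaN object, because dicts check
# identity before equality and NaN != NaN.
nan = float('nan')
d = {nan: [1, 2]}
print(d[nan])             # [1, 2]
print(float('nan') in d)  # False, so data[key] is not reliable here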
if columns is None:
columns = list(data)

if not isinstance(columns, Index):
columns = Index(columns, copy=False)

# Ugh, columns may not be unique (even though we're in init_dict and
# dict keys have to be unique...). We have two possible strategies:
# 1.) Gracefully handle duplicates when going through data to build
Contributor:

So the problem I have with this complexity is that 1 & 2 are not really nice, nor is the current strategy, because of perf issues. Since we must generate arrays from this in the first place, why is that not a valid strategy? e.g. construct arrays of the data with a valid column mapping, THEN use columns, if provided, to re-order / select as a second op (you can make this better by eliminating columns on the first pass if columns is provided).

Further, the selection of new dtypes (for columns that are not provided): why are you re-inventing the wheel and not just using construct_1d_arraylike_from_scalar? If you think that's wrong, then do that in a different PR.

The problem is that the scope creep makes this impossible to review otherwise.

I agree that this is a pretty nasty function now, but this change as written is not better at all. Try the suggestion above. If not, then this will need a bug-fix pass first, to adjust what it is doing to avoid special casing, as precursor PRs rather than one large refactor.

I get that things like this take significant time. I have done many, many PRs where I spent a lot of time and they don't make things simpler, but that is just a fact of life here. As I said, it is probably better to try to remove the special cases first.

# new_data.
# 2.) Focus only on unique values on a first pass, and insert duplicates
# in the correct positions after the uniques have been handled.
# We take option 2.
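
What option 2 must produce in the end, mirroring the new duplicate-columns tests at the bottom of this diff:

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]},
                  columns=['c', 'b', 'a', 'a', 'b', 'c'])
print(list(df.columns))     # ['c', 'b', 'a', 'a', 'b', 'c']
print(df.iloc[0].tolist())  # [3, 2, 1, 1, 2, 3]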

if not columns.is_unique:
columns_with_duplicates = columns.copy()
columns = columns.unique()
else:
columns_with_duplicates = None

for key in data:
if (isinstance(data[key], ABCDatetimeIndex) and
data[key].tz is not None):
# GH#24096 need copy to be deep for datetime64tz case
# TODO: See if we can avoid these copies
data[key] = data[key].copy(deep=True)

keys = com.dict_keys_to_ordered_list(data)
columns = data_names = Index(keys)
arrays = [data[k] for k in keys]

return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
if data:
normalized_keys = Index(data.keys(), copy=False)
positions = Series(columns.get_indexer_for(normalized_keys),
index=normalized_keys)
else:
positions = Series()
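
What `positions` holds, shown with made-up names; `get_indexer_for` maps each key in `data` to its location in `columns`, returning -1 for keys that were not requested:

import pandas as pd

cols = pd.Index(['a', 'b', 'c'])
keys = pd.Index(['c', 'x'])
print(cols.get_indexer_for(keys))  # [ 2 -1] -> 'c' is column 2, 'x' is dropped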

new_data = {}
index_len = 0 if index is None else len(index)

for key, val in data.items():
position = positions[key]
if position < 0:
# Something like data={"A": [...]}, columns={"B"}
continue
if (isinstance(val, ABCDatetimeIndex) and
val.tz is not None):
# GH#24096 need copy to be deep for datetime64tz case
# TODO: See if we can avoid these copies
val = val.copy(deep=True)

elif val is None:
# Users may provide scalars as values. These are broadcast to the
# length of `index`. We would use the Series constructor, but
# Series(None, index=index) is converted to NaNs. In DataFrame,
# DataFrame({"A": None}, index=[1, 2], columns=["A"])
# is an array of Nones.
val = Series([None] * index_len, index=index,
dtype=dtype or object)
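
The asymmetry this branch works around, per the comment's own example; behaviour as of this era of pandas:

import pandas as pd

print(pd.Series(None, index=[1, 2]).dtype)  # float64: None becomes NaN
df = pd.DataFrame({"A": None}, index=[1, 2], columns=["A"])
print(df["A"].tolist())                     # [None, None], object dtype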

elif index_len and lib.is_scalar(val):
val = Series(val, index=index, dtype=dtype)

new_data[position] = val

# OK, so the user-provided columns in `data` are taken care of. Let's
# move on to the "extra" columns as defined by `columns`. First, we
# figure out the positions of the holes we're filling in.
extra_positions = np.arange(len(columns))
mask = np.isin(extra_positions, positions, invert=True)
extra_positions = extra_positions[mask]

# And now, what should the dtype of these new columns be? Well, that's
# tricky.
# 1. User provided dtype, just use that...
# unless the user provided dtype=int and an index (GH-24385)
# - DataFrame(None, index=idx, columns=cols, dtype=int) :: float
# - DataFrame(None, index=idx, columns=cols, dtype=object) :: object
# 2. Empty data.keys() & columns is object (unless specified by the user)
# 3. No data and No dtype is object (unless specified by the user).
# 4. For string-like `dtype`, things are even more subtle.
# a.) We rely on arrays_to_mgr to coerce the provided values to
# strings when the user passes dtype=str
# b.) But we don't want that coercion for newly-created
# columns. This only partly works. See
# https://github.com/pandas-dev/pandas/issues/24388 for more.
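
Rules 1 and 3 in concrete terms, matching the parametrized test_constructor_dtype cases below:

import pandas as pd

# Rule 1's exception (GH-24385): an all-missing column cannot hold
# integers, so dtype=int is relaxed to float when an index is present.
df = pd.DataFrame(None, index=range(10), columns=['a', 'b'], dtype=int)
print(df.dtypes['a'])  # float64
# Rule 3: no data and no dtype falls back to object.
print(pd.DataFrame({}, columns=['foo', 'bar']).dtypes['foo'])  # object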

empty_columns = len(positions.index & columns) == 0
any_new_columns = len(extra_positions)

if empty_columns and dtype is None:
dtype = object
elif (index_len
and is_integer_dtype(dtype)
and any_new_columns):
dtype = float
elif not data and dtype is None:
dtype = np.dtype('object')

elif (empty_columns
and is_string_dtype(dtype)
and not is_categorical_dtype(dtype)):
# For user-provided `dtype=str`, we want to preserve that so that
# arrays_to_mgr handles the coercion of the *values* the user
# provided to strings. *But* we don't want that coercion for
# newly-created columns. And there's the bug: we only handle this
# correctly when *all* the columns are newly created. See
# https://github.com/pandas-dev/pandas/issues/24388 for more.
dtype = np.dtype("object")

for position in extra_positions:
new_data[position] = Series(index=index, dtype=dtype)

arrays = [new_data[i] for i in range(len(columns))]

if columns_with_duplicates is not None:
duplicated = columns_with_duplicates.duplicated()
duplicate_positions = np.arange(len(duplicated))[duplicated]

for position in duplicate_positions:
key = columns_with_duplicates[position]
loc = columns.get_loc(key)
arrays.insert(position, arrays[loc])

columns = columns_with_duplicates

return arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)


# ---------------------------------------------------------------------
52 changes: 45 additions & 7 deletions pandas/tests/frame/test_constructors.py
@@ -330,15 +330,15 @@ def test_constructor_dict_nan_tuple_key(self, value):
idx = Index([('a', value), (value, 2)])
values = [[0, 3], [1, 4], [2, 5]]
data = {cols[c]: Series(values[c], index=idx) for c in range(3)}
result = (DataFrame(data)
.sort_values((11, 21))
.sort_values(('a', value), axis=1))
# result = (DataFrame(data)
# .sort_values((11, 21))
# .sort_values(('a', value), axis=1))
expected = DataFrame(np.arange(6, dtype='int64').reshape(2, 3),
index=idx, columns=cols)
tm.assert_frame_equal(result, expected)
# tm.assert_frame_equal(result, expected)

result = DataFrame(data, index=idx).sort_values(('a', value), axis=1)
tm.assert_frame_equal(result, expected)
# result = DataFrame(data, index=idx).sort_values(('a', value), axis=1)
# tm.assert_frame_equal(result, expected)

result = DataFrame(data, index=idx, columns=cols)
tm.assert_frame_equal(result, expected)
@@ -810,12 +810,25 @@ def test_constructor_corner_shape(self):
(None, None, ['a', 'b'], 'int64', np.dtype('int64')),
(None, lrange(10), ['a', 'b'], int, np.dtype('float64')),
({}, None, ['foo', 'bar'], None, np.object_),
({'b': 1}, lrange(10), list('abc'), int, np.dtype('float64'))
({'b': 1}, lrange(10), list('abc'), int, np.dtype('float64')),
({'a': [0, 1]}, [0, 1], None, np.int16, np.dtype('int16')),
])
def test_constructor_dtype(self, data, index, columns, dtype, expected):
df = DataFrame(data, index, columns, dtype)
assert df.values.dtype == expected

@pytest.mark.parametrize('dtype', [
np.dtype("int64"),
np.dtype("float32"),
np.dtype("object"),
np.dtype("datetime64[ns]"),
"category"
])
def test_constructor_dtype_non_overlapping_columns(self, dtype):
df = DataFrame({"A": [1, 2]}, columns=['B'], dtype=dtype)
result = df.dtypes['B']
assert result == dtype

def test_constructor_scalar_inference(self):
data = {'int': 1, 'bool': True,
'float': 3., 'complex': 4j, 'object': 'foo'}
@@ -1389,6 +1402,26 @@ def test_constructor_column_duplicates(self):
pytest.raises(ValueError, DataFrame.from_dict,
OrderedDict([('b', 8), ('a', 5), ('a', 6)]))

def test_constructor_column_dict_duplicates(self):
result = DataFrame({}, columns=['A', 'B', 'A']).columns
expected = pd.Index(['A', 'B', 'A'])
tm.assert_index_equal(result, expected)

def test_constructor_column_dict_duplicates_data(self):
df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]},
columns=['c', 'b', 'a', 'a', 'b', 'c'])
# do this in pieces to avoid constructing an expected that
# might hit the same code path.
columns = pd.Index(['c', 'b', 'a', 'a', 'b', 'c'])
tm.assert_index_equal(df.columns, columns)

tm.assert_series_equal(df.iloc[:, 0], pd.Series([3], name='c'))
tm.assert_series_equal(df.iloc[:, 1], pd.Series([2], name='b'))
tm.assert_series_equal(df.iloc[:, 2], pd.Series([1], name='a'))
tm.assert_series_equal(df.iloc[:, 3], pd.Series([1], name='a'))
tm.assert_series_equal(df.iloc[:, 4], pd.Series([2], name='b'))
tm.assert_series_equal(df.iloc[:, 5], pd.Series([3], name='c'))

def test_constructor_empty_with_string_dtype(self):
# GH 9428
expected = DataFrame(index=[0, 1], columns=[0, 1], dtype=object)
@@ -1402,6 +1435,11 @@ def test_constructor_empty_with_string_dtype(self):
df = DataFrame(index=[0, 1], columns=[0, 1], dtype='U5')
tm.assert_frame_equal(df, expected)

def test_constructor_string_dtype_coerces_values(self):
result = pd.DataFrame({"A": [1, 2]}, dtype=str)
expected = pd.DataFrame({"A": ['1', '2']}, dtype=object)
tm.assert_frame_equal(result, expected)

def test_constructor_single_value(self):
# expecting single value upcasting here
df = DataFrame(0., index=[1, 2, 3], columns=['a', 'b', 'c'])