DEPR: concat ignoring empty objects #52532

Merged
merged 28 commits into from
Jul 10, 2023
Changes from 6 commits
28 commits
63292d4 DEPR: concat with empty objects (jbrockmendel, Apr 7, 2023)
2ace79c xfail on 32bit (jbrockmendel, Apr 8, 2023)
6258adf missing reason (jbrockmendel, Apr 8, 2023)
bfd969f Merge branch 'main' into depr-concat-empty (jbrockmendel, Apr 10, 2023)
51e6d36 Fix AM build (jbrockmendel, Apr 10, 2023)
52ce0d7 post-merge fixup (jbrockmendel, Apr 10, 2023)
f8dc81e Merge branch 'main' into depr-concat-empty (jbrockmendel, Apr 11, 2023)
163bf8a catch more specifically (jbrockmendel, Apr 11, 2023)
03a0641 un-xfail (jbrockmendel, Apr 12, 2023)
49a7146 Merge branch 'main' into depr-concat-empty (jbrockmendel, Apr 12, 2023)
7e2e995 mypy fixup (jbrockmendel, Apr 12, 2023)
7c0c715 Merge branch 'main' into depr-concat-empty (jbrockmendel, Apr 13, 2023)
7f2977a Merge branch 'main' into depr-concat-empty (jbrockmendel, Apr 13, 2023)
0eaf359 Merge branch 'main' into depr-concat-empty (jbrockmendel, Apr 17, 2023)
a878fea Merge branch 'main' into depr-concat-empty (jbrockmendel, Apr 18, 2023)
75d5041 update test (jbrockmendel, Apr 18, 2023)
9e2de8f Merge branch 'main' into depr-concat-empty (jbrockmendel, May 4, 2023)
392b40a Fix broken test (jbrockmendel, May 4, 2023)
465c141 Merge branch 'main' into depr-concat-empty (jbrockmendel, May 16, 2023)
3666bca remove duplicate whatsnew entries (jbrockmendel, May 16, 2023)
390d4ef Merge branch 'main' into depr-concat-empty (jbrockmendel, May 22, 2023)
aa5794f Merge branch 'main' into depr-concat-empty (jbrockmendel, May 23, 2023)
1277b26 Merge branch 'main' into depr-concat-empty (jbrockmendel, May 24, 2023)
5cddae9 Merge branch 'main' into depr-concat-empty (jbrockmendel, May 24, 2023)
8e58bff Merge branch 'main' into depr-concat-empty (jbrockmendel, May 25, 2023)
47a17b3 Merge branch 'main' into depr-concat-empty (jbrockmendel, May 25, 2023)
e696c53 remove unused (jbrockmendel, May 25, 2023)
7f07121 Merge branch 'main' into depr-concat-empty (jbrockmendel, May 26, 2023)
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.1.0.rst
@@ -202,6 +202,7 @@ Deprecations
 - Deprecated silently dropping unrecognized timezones when parsing strings to datetimes (:issue:`18702`)
 - Deprecated :meth:`DataFrame._data` and :meth:`Series._data`, use public APIs instead (:issue:`33333`)
 - Deprecated :meth:`.Groupby.all` and :meth:`.GroupBy.any` with datetime64 or :class:`PeriodDtype` values, matching the :class:`Series` and :class:`DataFrame` deprecations (:issue:`34479`)
+- Deprecated :func:`concat` behavior when any of the objects being concatenated have length 0; in the past the dtypes of empty objects were ignored when determining the resulting dtype, in a future version they will not (:issue:`39122`)
 - Deprecating pinning ``group.name`` to each group in :meth:`SeriesGroupBy.aggregate` aggregations; if your operation requires utilizing the groupby keys, iterate over the groupby object instead (:issue:`41090`)
 - Deprecated the behavior of :func:`concat` with both ``len(keys) != len(objs)``, in a future version this will raise instead of truncating to the shorter of the two sequences (:issue:`43485`)
 - Deprecated the default of ``observed=False`` in :meth:`DataFrame.groupby` and :meth:`Series.groupby`; this will default to ``True`` in a future version (:issue:`43999`)
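
For code that relies on empty objects being ignored, the advice in the new warning text (exclude the empty entries before the concat operation) could look roughly like the sketch below. `concat_nonempty` is a hypothetical helper name, not a pandas API, and the fallback for the all-empty case is an assumption about what a caller might want.

```python
import pandas as pd


def concat_nonempty(objs):
    """Concatenate only the non-empty objects (hypothetical helper)."""
    objs = list(objs)
    non_empty = [obj for obj in objs if len(obj) > 0]
    if not non_empty:
        # Assumption: when everything is empty, fall back to a plain concat.
        return pd.concat(objs)
    return pd.concat(non_empty)


s1 = pd.Series([1, 2, 3])
s2 = pd.Series([], dtype="float64")

# The empty float64 Series never reaches pd.concat, so its dtype cannot
# influence the result, whichever behavior the installed pandas implements.
result = concat_nonempty([s1, s2])
```

Filtering up front keeps the result dtype the same before and after the deprecation is enforced, at the cost of silently discarding the dtype information carried by the empty inputs.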
21 changes: 21 additions & 0 deletions pandas/core/dtypes/concat.py
@@ -8,10 +8,12 @@
     Sequence,
     cast,
 )
+import warnings

 import numpy as np

 from pandas._libs import lib
+from pandas.util._exceptions import find_stack_level

 from pandas.core.dtypes.astype import astype_array
 from pandas.core.dtypes.cast import (
@@ -42,6 +44,9 @@
 )


+_dtype_obj = np.dtype(object)
+
+
 def _is_nonempty(x, axis) -> bool:
     # filter empty arrays
     # 1-d dtypes always are included here
@@ -104,6 +109,22 @@ def concat_compat(
     non_empties = [x for x in to_concat if _is_nonempty(x, axis)]
     if non_empties and axis == 0 and not ea_compat_axis:
         # ea_compat_axis see GH#39574
+        if len(non_empties) < len(to_concat) and not any(
+            obj.dtype == _dtype_obj for obj in non_empties
+        ):
+            # Check for object dtype is an imperfect proxy for checking if
+            # the result dtype is going to change once the deprecation is
+            # enforced.
+            # GH#39122
+            warnings.warn(
+                "The behavior of array concatenation with empty entries is "
+                "deprecated. In a future version, this will no longer exclude "
+                "empty items when determining the result dtype. "
+                "To retain the old behavior, exclude the empty entries before "
+                "the concat operation.",
+                FutureWarning,
+                stacklevel=find_stack_level(),
+            )
         to_concat = non_empties

     dtypes = {obj.dtype for obj in to_concat}

Review comment on the "# the result dtype is going to change once the deprecation is" line:

@jorisvandenbossche (Member), Apr 11, 2023:

Can't we verify this exactly? (checking what the common dtype would be with or without the empties?)

Because I think the check above will easily give false positives (unless I am misreading it): for example, for just a mixture of int types, of which one is empty, will trigger the warning? (since none of them is object dtype) While it will typically not change the resulting dtype (except if the input arrays with the largest bitwidth are all empty)

Reply (Member, Author):

Worth a try! Will take a look
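
The exact check suggested in the review comment (compare the common dtype with and without the empties) can be sketched outside pandas as follows. This is only an illustration: `result_dtype_would_change` is a hypothetical name, and `np.result_type` stands in for pandas' own dtype resolution, which also has to handle extension dtypes.

```python
import numpy as np


def result_dtype_would_change(arrays):
    """Return True if counting empty arrays changes the common dtype.

    Hypothetical helper; np.result_type is a stand-in for pandas'
    internal common-dtype logic.
    """
    non_empty = [arr for arr in arrays if arr.size > 0]
    if not non_empty or len(non_empty) == len(arrays):
        # Everything is empty, or nothing is: no empties are being dropped.
        return False
    dtype_without_empties = np.result_type(*(arr.dtype for arr in non_empty))
    dtype_with_empties = np.result_type(*(arr.dtype for arr in arrays))
    return dtype_without_empties != dtype_with_empties


ints = np.array([1, 2, 3], dtype="int64")
empty_small_int = np.array([], dtype="int32")
empty_float = np.array([], dtype="float64")

# An empty int32 array does not widen int64, so no warning would be needed.
same = result_dtype_would_change([ints, empty_small_int])
# An empty float64 array would turn the result into float64.
changed = result_dtype_would_change([ints, empty_float])
```

Under this check the mixed-integer case from the review comment stays silent, while the object-dtype proxy in the hunk above would have warned for it.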
57 changes: 52 additions & 5 deletions pandas/core/internals/concat.py
@@ -6,6 +6,7 @@
     Sequence,
     cast,
 )
+import warnings

 import numpy as np

@@ -16,6 +17,7 @@
 )
 from pandas._libs.missing import NA
 from pandas.util._decorators import cache_readonly
+from pandas.util._exceptions import find_stack_level

 from pandas.core.dtypes.astype import astype_array
 from pandas.core.dtypes.cast import (
@@ -429,7 +431,9 @@ def is_na(self) -> bool:

         values = blk.values
         if values.size == 0:
+            # GH#39122 this case will return False once deprecation is enforced
             return True
+
         if isinstance(values.dtype, SparseDtype):
             return False

@@ -447,6 +451,19 @@ def is_na(self) -> bool:
                 return False
         return all(isna_all(row) for row in values)

+    @cache_readonly
+    def is_na_after_size_deprecation(self) -> bool:
+        """
+        Will self.is_na be True after values.size == 0 deprecation is enforced?
+        """
+        blk = self.block
+        if blk.dtype.kind == "V":
+            return True
+
+        if not blk._can_hold_na:
+            return False
+        return self.is_na and blk.values.size != 0
+
     def get_reindexed_values(self, empty_dtype: DtypeObj, upcasted_na) -> ArrayLike:
         values: ArrayLike

@@ -526,7 +543,7 @@ def _concatenate_join_units(join_units: list[JoinUnit], copy: bool) -> ArrayLike
     """
     Concatenate values from several join units along axis=1.
     """
-    empty_dtype = _get_empty_dtype(join_units)
+    empty_dtype, empty_dtype_future = _get_empty_dtype(join_units)

     has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)
     upcasted_na = _dtype_to_na_value(empty_dtype, has_none_blocks)
@@ -565,6 +582,18 @@ def _concatenate_join_units(join_units: list[JoinUnit], copy: bool) -> ArrayLike
     else:
         concat_values = concat_compat(to_concat, axis=1)

+    if empty_dtype != empty_dtype_future:
+        if empty_dtype == concat_values.dtype:
+            # GH#39122
+            warnings.warn(
+                "The behavior of DataFrame concatenation with empty entries is "
+                "deprecated. In a future version, this will no longer exclude "
+                "empty frames when determining the result dtypes. "
+                "To retain the old behavior, exclude the empty entries before "
+                "the concat operation.",
+                FutureWarning,
+                stacklevel=find_stack_level(),
+            )
     return concat_values


@@ -591,7 +620,7 @@ def _dtype_to_na_value(dtype: DtypeObj, has_none_blocks: bool):
     raise NotImplementedError


-def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
+def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> tuple[DtypeObj, DtypeObj]:
     """
     Return dtype and N/A values to use when concatenating specified units.

@@ -603,11 +632,11 @@ def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
     """
     if len(join_units) == 1:
         blk = join_units[0].block
-        return blk.dtype
+        return blk.dtype, blk.dtype

     if _is_uniform_reindex(join_units):
         empty_dtype = join_units[0].block.dtype
-        return empty_dtype
+        return empty_dtype, empty_dtype

     has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)

@@ -620,7 +649,25 @@ def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
     dtype = find_common_type(dtypes)
     if has_none_blocks:
         dtype = ensure_dtype_can_hold_na(dtype)
-    return dtype
+
+    dtype_future = dtype
+    if len(dtypes) != len(join_units):
+        dtypes_future = [
+            unit.block.dtype
+            for unit in join_units
+            if not unit.is_na_after_size_deprecation
+        ]
+        if not len(dtypes_future):
+            dtypes_future = [
+                unit.block.dtype for unit in join_units if unit.block.dtype.kind != "V"
+            ]
+
+        if len(dtypes) != len(dtypes_future):
+            dtype_future = find_common_type(dtypes_future)
+            if has_none_blocks:
+                dtype_future = ensure_dtype_can_hold_na(dtype_future)
+
+    return dtype, dtype_future


 def _is_uniform_join_units(join_units: list[JoinUnit]) -> bool:
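
The "compute the dtype both ways and warn only when they differ" idea in `_get_empty_dtype` can be illustrated with a toy model. Everything below is a simplification written for this note: `Unit` and `current_and_future_dtype` are hypothetical names, `np.result_type` stands in for `find_common_type`, and the void-dtype and `_can_hold_na` branches of the real code are left out.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Unit:
    """Toy stand-in for a join unit: a dtype, a size, and an all-missing flag."""

    dtype: np.dtype
    size: int
    all_missing: bool = False

    @property
    def is_na(self) -> bool:
        # Current behavior: zero-size units count as "NA" and are skipped.
        return self.size == 0 or self.all_missing

    @property
    def is_na_after_size_deprecation(self) -> bool:
        # Future behavior: being zero-size alone no longer makes a unit "NA".
        return self.is_na and self.size != 0


def current_and_future_dtype(units):
    current = [u.dtype for u in units if not u.is_na]
    future = [u.dtype for u in units if not u.is_na_after_size_deprecation]
    # Fall back to every dtype when the filter removed all units.
    current = current or [u.dtype for u in units]
    future = future or [u.dtype for u in units]
    return np.result_type(*current), np.result_type(*future)


units = [Unit(np.dtype("int64"), size=2), Unit(np.dtype("float64"), size=0)]
dtype_now, dtype_future = current_and_future_dtype(units)
should_warn = dtype_now != dtype_future
```

Returning both dtypes from one function lets the caller keep today's result while deciding separately whether a `FutureWarning` is justified. The patch additionally requires `empty_dtype == concat_values.dtype` before warning, which this sketch does not model.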
7 changes: 5 additions & 2 deletions pandas/tests/dtypes/test_concat.py
@@ -12,8 +12,11 @@ def test_concat_mismatched_categoricals_with_empty():
     ser1 = Series(["a", "b", "c"], dtype="category")
     ser2 = Series([], dtype="category")

-    result = _concat.concat_compat([ser1._values, ser2._values])
-    expected = pd.concat([ser1, ser2])._values
+    msg = "The behavior of array concatenation with empty entries is deprecated"
+    with tm.assert_produces_warning(FutureWarning, match=msg):
+        result = _concat.concat_compat([ser1._values, ser2._values])
+    with tm.assert_produces_warning(FutureWarning, match=msg):
+        expected = pd.concat([ser1, ser2])._values
     tm.assert_categorical_equal(result, expected)
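
The tests in this PR use `tm.assert_produces_warning`, which comes from pandas' internal testing module. Outside the pandas test suite, a similar check can be approximated with the standard library alone. In the sketch below, `legacy_concat` is a hypothetical stand-in that warns in the same style as the patched code; it is not pandas code.

```python
import re
import warnings

MSG = "The behavior of array concatenation with empty entries is deprecated"


def legacy_concat(chunks):
    """Hypothetical stand-in that warns whenever an empty chunk is passed."""
    if any(len(chunk) == 0 for chunk in chunks):
        warnings.warn(MSG + ".", FutureWarning, stacklevel=2)
    return [item for chunk in chunks for item in chunk]


with warnings.catch_warnings(record=True) as caught:
    # Record every warning, even ones the default filters would hide.
    warnings.simplefilter("always")
    result = legacy_concat([[1, 2], []])

# Keep only FutureWarnings whose text matches the expected message.
matched = [
    w
    for w in caught
    if issubclass(w.category, FutureWarning) and re.search(MSG, str(w.message))
]
```

Matching on both the category and the message text, as the tests above do, avoids passing on an unrelated `FutureWarning` raised elsewhere in the call.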


10 changes: 7 additions & 3 deletions pandas/tests/groupby/test_groupby.py
@@ -362,8 +362,11 @@ def f3(x):
     df2 = DataFrame({"a": [3, 2, 2, 2], "b": range(4), "c": range(5, 9)})

     # correct result
-    result1 = df.groupby("a").apply(f1)
-    result2 = df2.groupby("a").apply(f1)
+    depr_msg = "The behavior of array concatenation with empty entries is deprecated"
+    with tm.assert_produces_warning(FutureWarning, match=depr_msg):
+        result1 = df.groupby("a").apply(f1)
+    with tm.assert_produces_warning(FutureWarning, match=depr_msg):
+        result2 = df2.groupby("a").apply(f1)
     tm.assert_frame_equal(result1, result2)

     # should fail (not the same number of levels)
@@ -377,7 +380,8 @@ def f3(x):
     with pytest.raises(AssertionError, match=msg):
         df.groupby("a").apply(f3)
     with pytest.raises(AssertionError, match=msg):
-        df2.groupby("a").apply(f3)
+        with tm.assert_produces_warning(FutureWarning, match=depr_msg):
+            df2.groupby("a").apply(f3)


 def test_attr_wrapper(ts):
4 changes: 3 additions & 1 deletion pandas/tests/indexes/test_base.py
@@ -616,7 +616,9 @@ def test_append_empty_preserve_name(self, name, expected):
         left = Index([], name="foo")
         right = Index([1, 2, 3], name=name)

-        result = left.append(right)
+        msg = "The behavior of array concatenation with empty entries is deprecated"
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            result = left.append(right)
         assert result.name == expected

     @pytest.mark.parametrize(
1 change: 1 addition & 0 deletions pandas/tests/io/formats/test_info.py
@@ -495,6 +495,7 @@ def test_info_int_columns():
     assert result == expected


+@pytest.mark.xfail(not IS64, reason="concat will cast to int64")
 def test_memory_usage_empty_no_warning():
     # GH#50066
     df = DataFrame(index=["a", "b"])
4 changes: 3 additions & 1 deletion pandas/tests/reshape/concat/test_append.py
@@ -162,7 +162,9 @@ def test_append_preserve_index_name(self):
         df2 = DataFrame(data=[[1, 4, 7], [2, 5, 8], [3, 6, 9]], columns=["A", "B", "C"])
         df2 = df2.set_index(["A"])

-        result = df1._append(df2)
+        msg = "The behavior of array concatenation with empty entries is deprecated"
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            result = df1._append(df2)
         assert result.index.name == "A"

     indexes_can_append = [
21 changes: 13 additions & 8 deletions pandas/tests/reshape/concat/test_append_common.py
@@ -693,11 +693,14 @@ def test_concat_categorical_empty(self):
         s1 = Series([], dtype="category")
         s2 = Series([1, 2], dtype="category")

-        tm.assert_series_equal(pd.concat([s1, s2], ignore_index=True), s2)
-        tm.assert_series_equal(s1._append(s2, ignore_index=True), s2)
+        msg = "The behavior of array concatenation with empty entries is deprecated"
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            tm.assert_series_equal(pd.concat([s1, s2], ignore_index=True), s2)
+            tm.assert_series_equal(s1._append(s2, ignore_index=True), s2)

-        tm.assert_series_equal(pd.concat([s2, s1], ignore_index=True), s2)
-        tm.assert_series_equal(s2._append(s1, ignore_index=True), s2)
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            tm.assert_series_equal(pd.concat([s2, s1], ignore_index=True), s2)
+            tm.assert_series_equal(s2._append(s1, ignore_index=True), s2)

         s1 = Series([], dtype="category")
         s2 = Series([], dtype="category")
@@ -719,11 +722,13 @@ def test_concat_categorical_empty(self):

         # empty Series is ignored
         exp = Series([np.nan, np.nan])
-        tm.assert_series_equal(pd.concat([s1, s2], ignore_index=True), exp)
-        tm.assert_series_equal(s1._append(s2, ignore_index=True), exp)
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            tm.assert_series_equal(pd.concat([s1, s2], ignore_index=True), exp)
+            tm.assert_series_equal(s1._append(s2, ignore_index=True), exp)

-        tm.assert_series_equal(pd.concat([s2, s1], ignore_index=True), exp)
-        tm.assert_series_equal(s2._append(s1, ignore_index=True), exp)
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            tm.assert_series_equal(pd.concat([s2, s1], ignore_index=True), exp)
+            tm.assert_series_equal(s2._append(s1, ignore_index=True), exp)

     def test_categorical_concat_append(self):
         cat = Categorical(["a", "b"], categories=["a", "b"])
10 changes: 9 additions & 1 deletion pandas/tests/reshape/concat/test_concat.py
@@ -747,7 +747,15 @@ def test_concat_ignore_empty_object_float(empty_dtype, df_dtype):
     # https://github.com/pandas-dev/pandas/issues/45637
     df = DataFrame({"foo": [1, 2], "bar": [1, 2]}, dtype=df_dtype)
     empty = DataFrame(columns=["foo", "bar"], dtype=empty_dtype)
-    result = concat([empty, df])
+
+    msg = "The behavior of DataFrame concatenation with empty entries is deprecated"
+    warn = None
+    if df_dtype == "datetime64[ns]" or (
+        df_dtype == "float64" and empty_dtype != "float64"
+    ):
+        warn = FutureWarning
+    with tm.assert_produces_warning(warn, match=msg):
+        result = concat([empty, df])
     expected = df
     if df_dtype == "int64":
         # TODO what exact behaviour do we want for integer eventually?
4 changes: 3 additions & 1 deletion pandas/tests/reshape/concat/test_datetimes.py
@@ -567,7 +567,9 @@ def test_concat_float_datetime64(using_array_manager):

     if not using_array_manager:
         expected = DataFrame({"A": pd.array(["2000"], dtype="datetime64[ns]")})
-        result = concat([df_time, df_float.iloc[:0]])
+        msg = "The behavior of DataFrame concatenation with empty entries is deprecated"
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            result = concat([df_time, df_float.iloc[:0]])
         tm.assert_frame_equal(result, expected)
     else:
         expected = DataFrame({"A": pd.array(["2000"], dtype="datetime64[ns]")}).astype(
12 changes: 8 additions & 4 deletions pandas/tests/reshape/concat/test_empty.py
@@ -58,7 +58,9 @@ def test_concat_empty_series(self):

         s1 = Series([1, 2, 3], name="x")
         s2 = Series(name="y", dtype="float64")
-        res = concat([s1, s2], axis=0)
+        msg = "The behavior of array concatenation with empty entries is deprecated"
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            res = concat([s1, s2], axis=0)
         # name will be reset
         exp = Series([1, 2, 3])
         tm.assert_series_equal(res, exp)
@@ -238,9 +240,11 @@ def test_concat_inner_join_empty(self):
         df_a = DataFrame({"a": [1, 2]}, index=[0, 1], dtype="int64")
         df_expected = DataFrame({"a": []}, index=RangeIndex(0), dtype="int64")

-        for how, expected in [("inner", df_expected), ("outer", df_a)]:
-            result = concat([df_a, df_empty], axis=1, join=how)
-            tm.assert_frame_equal(result, expected)
+        result = concat([df_a, df_empty], axis=1, join="inner")
+        tm.assert_frame_equal(result, df_expected)
+
+        result = concat([df_a, df_empty], axis=1, join="outer")
+        tm.assert_frame_equal(result, df_a)

     def test_empty_dtype_coerce(self):
         # xref to #12411
4 changes: 3 additions & 1 deletion pandas/tests/reshape/concat/test_series.py
@@ -40,7 +40,9 @@ def test_concat_empty_and_non_empty_series_regression(self):
         s2 = Series([], dtype=object)

         expected = s1
-        result = concat([s1, s2])
+        msg = "The behavior of array concatenation with empty entries is deprecated"
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            result = concat([s1, s2])
         tm.assert_series_equal(result, expected)

     def test_concat_series_axis1(self):
9 changes: 7 additions & 2 deletions pandas/tests/reshape/merge/test_merge.py
@@ -682,8 +682,13 @@ def test_join_append_timedeltas(self, using_array_manager):
             {"d": [datetime(2013, 11, 5, 5, 56)], "t": [timedelta(0, 22500)]}
         )
         df = DataFrame(columns=list("dt"))
-        df = concat([df, d], ignore_index=True)
-        result = concat([df, d], ignore_index=True)
+        msg = "The behavior of DataFrame concatenation with empty entries is deprecated"
+        warn = FutureWarning
+        if using_array_manager:
+            warn = None
+        with tm.assert_produces_warning(warn, match=msg):
+            df = concat([df, d], ignore_index=True)
+            result = concat([df, d], ignore_index=True)
         expected = DataFrame(
             {
                 "d": [datetime(2013, 11, 5, 5, 56), datetime(2013, 11, 5, 5, 56)],
8 changes: 6 additions & 2 deletions pandas/tests/series/methods/test_combine_first.py
@@ -63,7 +63,9 @@ def test_combine_first(self):
         # corner case
         ser = Series([1.0, 2, 3], index=[0, 1, 2])
         empty = Series([], index=[], dtype=object)
-        result = ser.combine_first(empty)
+        msg = "The behavior of array concatenation with empty entries is deprecated"
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            result = ser.combine_first(empty)
         ser.index = ser.index.astype("O")
         tm.assert_series_equal(ser, result)

@@ -110,7 +112,9 @@ def test_combine_first_timezone_series_with_empty_series(self):
         )
         s1 = Series(range(10), index=time_index)
         s2 = Series(index=time_index)
-        result = s1.combine_first(s2)
+        msg = "The behavior of array concatenation with empty entries is deprecated"
+        with tm.assert_produces_warning(FutureWarning, match=msg):
+            result = s1.combine_first(s2)
         tm.assert_series_equal(result, s1)

     def test_combine_first_preserves_dtype(self):