DEPR: concat ignoring all-NA columns #52613


Merged: 10 commits, Apr 18, 2023
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -280,6 +280,7 @@ if one of the DataFrames was empty or had all-NA values, its dtype was
consistently *not* ignored (:issue:`43507`).

.. ipython:: python
:okwarning:
@mroeschke (Member) commented on Apr 15, 2023:

Can you turn this into a .. code-block::?


df1 = pd.DataFrame({"bar": [pd.Timestamp("2013-01-01")]}, index=range(1))
df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))
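For context, the snippet above can be exercised as plain Python (roughly what the requested ``.. code-block::`` conversion would show). This is an editorial sketch, not part of the diff; the resulting dtype of ``bar`` is exactly what the surrounding behavior change and this PR's deprecation are about, so it is described in comments rather than asserted.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"bar": [pd.Timestamp("2013-01-01")]}, index=range(1))
df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))  # all-NA, float64

res = pd.concat([df1, df2])
# If the all-NA float column is ignored (the behavior this PR deprecates),
# "bar" stays datetime64[ns] with NaT for the missing row; once the all-NA
# column participates, the common dtype (object for datetime64 + float64)
# is used instead.  On pandas 2.1 this concat is expected to emit a
# FutureWarning.
print(res.dtypes)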
10 changes: 10 additions & 0 deletions doc/source/whatsnew/v2.1.0.rst
@@ -224,13 +224,23 @@ Deprecations
- Deprecated :func:`is_int64_dtype`, check ``dtype == np.dtype(np.int64)`` instead (:issue:`52564`)
- Deprecated :func:`is_interval_dtype`, check ``isinstance(dtype, pd.IntervalDtype)`` instead (:issue:`52607`)
- Deprecated :meth:`DataFrame.swapaxes` and :meth:`Series.swapaxes`, use :meth:`DataFrame.transpose` or :meth:`Series.transpose` instead (:issue:`51946`)
- Deprecated :meth:`DataFrame.swapaxes` and :meth:`Series.swapaxes`, use :meth:`DataFrame.transpose` or :meth:`Series.transpose` instead (:issue:`51946`)
- Deprecated ``freq`` parameter in :class:`PeriodArray` constructor, pass ``dtype`` instead (:issue:`52462`)
A reviewer (Member) commented:

Looks like duplicate entries were created

- Deprecated ``freq`` parameter in :class:`PeriodArray` constructor, pass ``dtype`` instead (:issue:`52462`)
- Deprecated behavior of :func:`concat` when :class:`DataFrame` has columns that are all-NA, in a future version these will not be discarded when determining the resulting dtype (:issue:`40893`)
- Deprecated behavior of :meth:`Series.dt.to_pydatetime`, in a future version this will return a :class:`Series` containing python ``datetime`` objects instead of an ``ndarray`` of datetimes; this matches the behavior of other :meth:`Series.dt` properties (:issue:`20306`)
- Deprecated behavior of :meth:`Series.dt.to_pydatetime`, in a future version this will return a :class:`Series` containing python ``datetime`` objects instead of an ``ndarray`` of datetimes; this matches the behavior of other :meth:`Series.dt` properties (:issue:`20306`)
A reviewer (Member) commented:

Looks like there's a few other duplicates in this file

- Deprecated logical operations (``|``, ``&``, ``^``) between pandas objects and dtype-less sequences (e.g. ``list``, ``tuple``), wrap a sequence in a :class:`Series` or numpy array before operating instead (:issue:`51521`)
- Deprecated logical operations (``|``, ``&``, ``^``) between pandas objects and dtype-less sequences (e.g. ``list``, ``tuple``), wrap a sequence in a :class:`Series` or numpy array before operating instead (:issue:`51521`)
- Deprecated making :meth:`Series.apply` return a :class:`DataFrame` when the passed-in callable returns a :class:`Series` object. In the future this will return a :class:`Series` whose values are themselves :class:`Series`. This pattern was very slow and it's recommended to use alternative methods to achieve the same goal (:issue:`52116`)
- Deprecated making :meth:`Series.apply` return a :class:`DataFrame` when the passed-in callable returns a :class:`Series` object. In the future this will return a :class:`Series` whose values are themselves :class:`Series`. This pattern was very slow and it's recommended to use alternative methods to achieve the same goal (:issue:`52116`)
- Deprecated parameter ``convert_type`` in :meth:`Series.apply` (:issue:`52140`)
- Deprecated parameter ``convert_type`` in :meth:`Series.apply` (:issue:`52140`)
- Deprecated passing a dictionary to :meth:`.SeriesGroupBy.agg`; pass a list of aggregations instead (:issue:`50684`)
- Deprecated passing a dictionary to :meth:`.SeriesGroupBy.agg`; pass a list of aggregations instead (:issue:`50684`)
- Deprecated the "fastpath" keyword in :class:`Categorical` constructor, use :meth:`Categorical.from_codes` instead (:issue:`20110`)
- Deprecated the "fastpath" keyword in :class:`Categorical` constructor, use :meth:`Categorical.from_codes` instead (:issue:`20110`)
- Deprecated the methods :meth:`Series.bool` and :meth:`DataFrame.bool` (:issue:`51749`)
- Deprecated the methods :meth:`Series.bool` and :meth:`DataFrame.bool` (:issue:`51749`)
- Deprecated unused "closed" and "normalize" keywords in the :class:`DatetimeIndex` constructor (:issue:`52628`)
- Deprecated unused "closed" keyword in the :class:`TimedeltaIndex` constructor (:issue:`52628`)
54 changes: 49 additions & 5 deletions pandas/core/internals/concat.py
@@ -5,6 +5,7 @@
TYPE_CHECKING,
Sequence,
)
import warnings

import numpy as np

@@ -15,6 +16,7 @@
)
from pandas._libs.missing import NA
from pandas.util._decorators import cache_readonly
from pandas.util._exceptions import find_stack_level

from pandas.core.dtypes.astype import astype_array
from pandas.core.dtypes.cast import (
@@ -448,6 +450,19 @@ def is_na(self) -> bool:
return False
return all(isna_all(row) for row in values)

@cache_readonly
def is_na_without_isna_all(self) -> bool:
blk = self.block
if blk.dtype.kind == "V":
return True
if not blk._can_hold_na:
return False

values = blk.values
if values.size == 0:
return True
return False

def get_reindexed_values(self, empty_dtype: DtypeObj, upcasted_na) -> ArrayLike:
values: ArrayLike

@@ -496,7 +511,7 @@ def _concatenate_join_units(join_units: list[JoinUnit], copy: bool) -> ArrayLike
"""
Concatenate values from several join units along axis=1.
"""
empty_dtype = _get_empty_dtype(join_units)
empty_dtype, empty_dtype_future = _get_empty_dtype(join_units)

has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)
upcasted_na = _dtype_to_na_value(empty_dtype, has_none_blocks)
@@ -535,6 +550,19 @@ def _concatenate_join_units(join_units: list[JoinUnit], copy: bool) -> ArrayLike
else:
concat_values = concat_compat(to_concat, axis=1)

if empty_dtype != empty_dtype_future:
if empty_dtype == concat_values.dtype:
# GH#40893
warnings.warn(
"The behavior of DataFrame concatenation with all-NA entries is "
"deprecated. In a future version, this will no longer exclude "
"all-NA columns when determining the result dtypes. "
"To retain the old behavior, cast the all-NA columns to the "
"desired dtype before the concat operation.",
FutureWarning,
stacklevel=find_stack_level(),
)

return concat_values


@@ -561,7 +589,7 @@ def _dtype_to_na_value(dtype: DtypeObj, has_none_blocks: bool):
raise NotImplementedError


def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> tuple[DtypeObj, DtypeObj]:
"""
Return dtype and N/A values to use when concatenating specified units.

@@ -573,11 +601,11 @@ def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
"""
if len(join_units) == 1:
blk = join_units[0].block
return blk.dtype
return blk.dtype, blk.dtype

if _is_uniform_reindex(join_units):
empty_dtype = join_units[0].block.dtype
return empty_dtype
return empty_dtype, empty_dtype

has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)

@@ -590,7 +618,23 @@ def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
dtype = find_common_type(dtypes)
if has_none_blocks:
dtype = ensure_dtype_can_hold_na(dtype)
return dtype

dtype_future = dtype
if len(dtypes) != len(join_units):
dtypes_future = [
unit.block.dtype for unit in join_units if not unit.is_na_without_isna_all
]
if not len(dtypes_future):
dtypes_future = [
unit.block.dtype for unit in join_units if unit.block.dtype.kind != "V"
]

if len(dtypes) != len(dtypes_future):
dtype_future = find_common_type(dtypes_future)
if has_none_blocks:
dtype_future = ensure_dtype_can_hold_na(dtype_future)

return dtype, dtype_future


def _is_uniform_join_units(join_units: list[JoinUnit]) -> bool:
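The heart of the change is that ``_get_empty_dtype`` now computes two candidate result dtypes, one that skips all-NA blocks (the current behavior) and one that keeps them (the future behavior), and ``_concatenate_join_units`` warns when they disagree. Below is a standalone editorial sketch of that idea, using ``np.result_type`` as a rough stand-in for pandas' internal ``find_common_type`` (the two do not agree for every dtype combination, e.g. datetimes).

import numpy as np

def sketch_result_dtypes(block_dtypes, block_is_all_na):
    """Illustrative only: approximate the (current, future) dtypes that
    _get_empty_dtype now returns for a set of blocks being concatenated."""
    # current behavior: all-NA blocks do not influence the common dtype
    kept = [dt for dt, all_na in zip(block_dtypes, block_is_all_na) if not all_na]
    current = np.result_type(*(kept or block_dtypes))
    # future behavior: every block's dtype participates
    future = np.result_type(*block_dtypes)
    return current, future

# A float64 block concatenated with an all-NA object block:
cur, fut = sketch_result_dtypes(
    [np.dtype("float64"), np.dtype("object")], [False, True]
)
print(cur, fut)  # float64 object -> the two dtypes differ, so concat now warns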
20 changes: 18 additions & 2 deletions pandas/tests/reshape/concat/test_concat.py
@@ -747,7 +749,9 @@ def test_concat_ignore_empty_object_float(empty_dtype, df_dtype):
# https://github.com/pandas-dev/pandas/issues/45637
df = DataFrame({"foo": [1, 2], "bar": [1, 2]}, dtype=df_dtype)
empty = DataFrame(columns=["foo", "bar"], dtype=empty_dtype)

result = concat([empty, df])

expected = df
if df_dtype == "int64":
# TODO what exact behaviour do we want for integer eventually?
@@ -764,14 +766,24 @@ def test_concat_ignore_all_na_object_float(empty_dtype, df_dtype):
def test_concat_ignore_all_na_object_float(empty_dtype, df_dtype):
df = DataFrame({"foo": [1, 2], "bar": [1, 2]}, dtype=df_dtype)
empty = DataFrame({"foo": [np.nan], "bar": [np.nan]}, dtype=empty_dtype)
result = concat([empty, df], ignore_index=True)

if df_dtype == "int64":
# TODO what exact behaviour do we want for integer eventually?
if empty_dtype == "object":
df_dtype = "object"
else:
df_dtype = "float64"

msg = "The behavior of DataFrame concatenation with all-NA entries"
warn = None
if empty_dtype != df_dtype and empty_dtype is not None:
warn = FutureWarning
elif df_dtype == "datetime64[ns]":
warn = FutureWarning

with tm.assert_produces_warning(warn, match=msg):
result = concat([empty, df], ignore_index=True)

expected = DataFrame({"foo": [None, 1, 2], "bar": [None, 1, 2]}, dtype=df_dtype)
tm.assert_frame_equal(result, expected)

@@ -782,7 +794,11 @@ def test_concat_ignore_empty_from_reindex():
df1 = DataFrame({"a": [1], "b": [pd.Timestamp("2012-01-01")]})
df2 = DataFrame({"a": [2]})

result = concat([df1, df2.reindex(columns=df1.columns)], ignore_index=True)
aligned = df2.reindex(columns=df1.columns)

msg = "The behavior of DataFrame concatenation with all-NA entries"
with tm.assert_produces_warning(FutureWarning, match=msg):
result = concat([df1, aligned], ignore_index=True)
expected = DataFrame({"a": [1, 2], "b": [pd.Timestamp("2012-01-01"), pd.NaT]})
tm.assert_frame_equal(result, expected)

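Finally, the workaround that the new warning message points to (casting the all-NA columns to the desired dtype before concatenating) can be sketched as follows; the column names and dtypes are illustrative, mirroring the tests above rather than quoting them.

import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": [1.0, 2.0], "bar": [1.0, 2.0]})               # float64
all_na = pd.DataFrame({"foo": [np.nan], "bar": [np.nan]}, dtype=object)  # object

# Casting the all-NA frame to the dtype you actually want makes the result
# dtype unambiguous, so the deprecated "ignore all-NA columns" path is never
# taken and no FutureWarning should be emitted.
result = pd.concat([all_na.astype("float64"), df], ignore_index=True)
print(result.dtypes)  # foo float64, bar float64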