Skip to content

BUG: DataFrame.all(bool_only=True) inconsistency with object dtype #37799

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 14, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -236,6 +236,54 @@ Other enhancements
- Improve numerical stability for :meth:`Rolling.skew()`, :meth:`Rolling.kurt()`, :meth:`Expanding.skew()` and :meth:`Expanding.kurt()` through implementation of Kahan summation (:issue:`6929`)
- Improved error reporting for subsetting columns of a :class:`DataFrameGroupBy` with ``axis=1`` (:issue:`37725`)

.. ---------------------------------------------------------------------------

.. _whatsnew_120.notable_bug_fixes:

Notable bug fixes
~~~~~~~~~~~~~~~~~

These are bug fixes that might have notable behavior changes.

Consistency of DataFrame Reductions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:meth:`DataFrame.any` and :meth:`DataFrame.all` with ``bool_only=True`` now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this PR number here, and is there an actual issue? ping on green.

determines whether to exclude object-dtype columns on a column-by-column basis,
instead of checking if *all* object-dtype columns can be considered boolean.

This prevents pathological behavior where applying the reduction on a subset
of columns could result in a larger :class:`Series` result. See (:issue:`37799`).

.. ipython:: python

df = pd.DataFrame({"A": ["foo", "bar"], "B": [True, False]}, dtype=object)
df["C"] = pd.Series([True, True])


*Previous behavior*:

.. code-block:: ipython

In [5]: df.all(bool_only=True)
Out[5]:
C True
dtype: bool

In [6]: df[["B", "C"]].all(bool_only=True)
Out[6]:
B False
C True
dtype: bool

*New behavior*:

.. ipython:: python

In [5]: df.all(bool_only=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need the prompts (remove in followon / other PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so remove the "In [5]: "? do i also remove the "Out [5]: "?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep i python will generate appropriately


In [6]: df[["B", "C"]].all(bool_only=True)


.. _whatsnew_120.api_breaking.python:

Increased minimum version for Python
Expand Down
14 changes: 14 additions & 0 deletions pandas/core/internals/blocks.py
Original file line number Diff line number Diff line change
Expand Up @@ -450,6 +450,20 @@ def f(mask, val, idx):

return self.split_and_operate(None, f, inplace)

def _split(self) -> List["Block"]:
"""
Split a block into a list of single-column blocks.
"""
assert self.ndim == 2

new_blocks = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be worth adding some unit tests for this (e.g. on a block that is empty, since column and multilple)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, was accidentally making a copy instead of a view!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, testing is good!

for i, ref_loc in enumerate(self.mgr_locs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be done as a list comprehension

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but nbd

vals = self.values[slice(i, i + 1)]

nb = self.make_block(vals, [ref_loc])
new_blocks.append(nb)
return new_blocks

def split_and_operate(self, mask, f, inplace: bool) -> List["Block"]:
"""
split the block per-column, and apply the callable f
Expand Down
19 changes: 17 additions & 2 deletions pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -713,13 +713,28 @@ def is_view(self) -> bool:

def get_bool_data(self, copy: bool = False) -> "BlockManager":
"""
Select blocks that are bool-dtype and columns from object-dtype blocks
that are all-bool.

Parameters
----------
copy : bool, default False
Whether to copy the blocks
"""
self._consolidate_inplace()
return self._combine([b for b in self.blocks if b.is_bool], copy)

new_blocks = []

for blk in self.blocks:
if blk.dtype == bool:
new_blocks.append(blk)

elif blk.is_object:
nbs = blk._split()
for nb in nbs:
if nb.is_bool:
new_blocks.append(nb)

return self._combine(new_blocks, copy)

def get_numeric_data(self, copy: bool = False) -> "BlockManager":
"""
Expand Down
31 changes: 31 additions & 0 deletions pandas/tests/frame/test_reductions.py
Original file line number Diff line number Diff line change
Expand Up @@ -1118,6 +1118,37 @@ def test_any_all_object(self):
result = np.any(DataFrame(columns=["a", "b"])).item()
assert result is False

def test_any_all_object_bool_only(self):
df = DataFrame({"A": ["foo", 2], "B": [True, False]}).astype(object)
df._consolidate_inplace()
df["C"] = Series([True, True])

# The underlying bug is in DataFrame._get_bool_data, so we check
# that while we're here
res = df._get_bool_data()
expected = df[["B", "C"]]
tm.assert_frame_equal(res, expected)

res = df.all(bool_only=True, axis=0)
expected = Series([False, True], index=["B", "C"])
tm.assert_series_equal(res, expected)

# operating on a subset of columns should not produce a _larger_ Series
res = df[["B", "C"]].all(bool_only=True, axis=0)
tm.assert_series_equal(res, expected)

assert not df.all(bool_only=True, axis=None)

res = df.any(bool_only=True, axis=0)
expected = Series([True, True], index=["B", "C"])
tm.assert_series_equal(res, expected)

# operating on a subset of columns should not produce a _larger_ Series
res = df[["B", "C"]].any(bool_only=True, axis=0)
tm.assert_series_equal(res, expected)

assert df.any(bool_only=True, axis=None)

@pytest.mark.parametrize("method", ["any", "all"])
def test_any_all_level_axis_none_raises(self, method):
df = DataFrame(
Expand Down
21 changes: 20 additions & 1 deletion pandas/tests/internals/test_internals.py
Original file line number Diff line number Diff line change
Expand Up @@ -264,6 +264,25 @@ def test_delete(self):
with pytest.raises(IndexError, match=None):
newb.delete(3)

def test_split(self):
# GH#37799
values = np.random.randn(3, 4)
blk = make_block(values, placement=[3, 1, 6])
result = blk._split()

# check that we get views, not copies
values[:] = -9999
assert (blk.values == -9999).all()

assert len(result) == 3
expected = [
make_block(values[[0]], placement=[3]),
make_block(values[[1]], placement=[1]),
make_block(values[[2]], placement=[6]),
]
for res, exp in zip(result, expected):
assert_block_equal(res, exp)


class TestBlockManager:
def test_attrs(self):
Expand Down Expand Up @@ -667,7 +686,7 @@ def test_get_bool_data(self):
mgr.iset(6, np.array([True, False, True], dtype=np.object_))

bools = mgr.get_bool_data()
tm.assert_index_equal(bools.items, Index(["bool"]))
tm.assert_index_equal(bools.items, Index(["bool", "dt"]))
tm.assert_almost_equal(
mgr.iget(mgr.items.get_loc("bool")).internal_values(),
bools.iget(bools.items.get_loc("bool")).internal_values(),
Expand Down