-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: DataFrame.(any|all) inconsistency #34918
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
40a55ec
ba0c526
e73204b
faf8bc8
e9a9d1a
fa398e0
52902a6
b5ecafb
a65cac0
da19314
887f6d1
1d72508
4d6749b
103d446
5adff4e
bf3d940
fd286dc
a772fb7
03bb7a9
2c4c011
3f2cf67
214b362
3804b0d
3f141b5
b39e533
5f8086e
eb8c75d
2dc84b9
1a27257
aca1f84
7bfadbc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,6 +18,7 @@ | |
) | ||
from pandas.core.dtypes.common import ( | ||
DT64NS_DTYPE, | ||
is_bool_dtype, | ||
is_datetimelike_v_numeric, | ||
is_extension_array_dtype, | ||
is_list_like, | ||
|
@@ -688,8 +689,9 @@ def get_bool_data(self, copy: bool = False) -> "BlockManager": | |
copy : bool, default False | ||
Whether to copy the blocks | ||
""" | ||
self._consolidate_inplace() | ||
return self._combine([b for b in self.blocks if b.is_bool], copy) | ||
# Note: use is_bool_dtype instead of blk.is_bool to exclude | ||
# object-dtype blocks containing all-bool entries. | ||
return self._combine([b for b in self.blocks if is_bool_dtype(b.dtype)], copy) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this may actually be a bug, since it implies the following inconsistent behavior in master:
adding columns shouldnt make get_bool_data smaller. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you add a test for this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. test added |
||
|
||
def get_numeric_data(self, copy: bool = False) -> "BlockManager": | ||
""" | ||
|
@@ -698,7 +700,6 @@ def get_numeric_data(self, copy: bool = False) -> "BlockManager": | |
copy : bool, default False | ||
Whether to copy the blocks | ||
""" | ||
self._consolidate_inplace() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did you check the impact of this? |
||
return self._combine([b for b in self.blocks if b.is_numeric], copy) | ||
|
||
def _combine(self, blocks: List[Block], copy: bool = True) -> "BlockManager": | ||
|
@@ -875,12 +876,7 @@ def to_dict(self, copy: bool = True): | |
------- | ||
values : a dict of dtype -> BlockManager | ||
|
||
Notes | ||
----- | ||
This consolidates based on str(dtype) | ||
""" | ||
self._consolidate_inplace() | ||
|
||
bd: Dict[str, List[Block]] = {} | ||
for b in self.blocks: | ||
bd.setdefault(str(b.dtype), []).append(b) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think consolidating here might be the better option, as it at least ensures consistent behaviour when you have multiple columns independent of consolidation status. The inconsistency between a single column and multiple column is still present of course, but IMO there is nothing to do about this giving our consolidated blocks (well, we could deprecate object dtype being regarded as bool ..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR does that (well changes outright, not deprecates). AFAICT thats the only way to make the behavior independent of whether the presence of another object-dtype-but-not-bool-like column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should deprecate this first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only non-test place where get_bool_data is called from is in
DataFrame._reduce
which I know you've been working on recently. The topic of how to handle object-dtype blocks has come up there, too. If we end up handling object-dtype blocks column-wise there, that would render this distinction irrelevant.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what exactly are you suggesting we deprecate? do you have an example of something that breaks on this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche can you respond here re what specific behavior you'd like to deprecate?