BUG: True cannot be cast to bool in read_excel #58994

Merged
merged 17 commits into from
Jun 25, 2024
Changes from 8 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -547,6 +547,7 @@ I/O
 - Bug in :meth:`DataFrame.to_stata` when writing :class:`DataFrame` and ``byteorder=`big```. (:issue:`58969`)
 - Bug in :meth:`DataFrame.to_string` that raised ``StopIteration`` with nested DataFrames. (:issue:`16098`)
 - Bug in :meth:`read_csv` raising ``TypeError`` when ``index_col`` is specified and ``na_values`` is a dict containing the key ``None``. (:issue:`57547`)
+- Bug in :meth:`read_excel` raising ``ValueError`` when passing array of boolean values when ``dtype=pd.BooleanDtype.name``. (:issue:`58159`)
Member

Suggested change
- Bug in :meth:`read_excel` raising ``ValueError`` when passing array of boolean values when ``dtype=pd.BooleanDtype.name``. (:issue:`58159`)
- Bug in :meth:`read_excel` raising ``ValueError`` when passing array of boolean values when ``dtype="boolean"``. (:issue:`58159`)

 - Bug in :meth:`read_stata` raising ``KeyError`` when input file is stored in big-endian format and contains strL data. (:issue:`58638`)

 Period
16 changes: 14 additions & 2 deletions pandas/core/arrays/boolean.py
@@ -297,7 +297,13 @@ class BooleanArray(BaseMaskedArray):
     """

     _TRUE_VALUES = {"True", "TRUE", "true", "1", "1.0"}
-    _FALSE_VALUES = {"False", "FALSE", "false", "0", "0.0"}
+    _FALSE_VALUES = {
Member

Could you undo the formatting here?

+        "False",
+        "FALSE",
+        "false",
+        "0",
+        "0.0",
+    }

     @classmethod
     def _simple_new(cls, values: np.ndarray, mask: npt.NDArray[np.bool_]) -> Self:
@@ -329,15 +335,21 @@ def _from_sequence_of_strings(
         copy: bool = False,
         true_values: list[str] | None = None,
         false_values: list[str] | None = None,
+        none_values: list[str] | None = None,
     ) -> BooleanArray:
         true_values_union = cls._TRUE_VALUES.union(true_values or [])
         false_values_union = cls._FALSE_VALUES.union(false_values or [])

-        def map_string(s) -> bool:
+        if none_values is None:
+            none_values = []
+
+        def map_string(s) -> bool | None:
             if s in true_values_union:
                 return True
             elif s in false_values_union:
                 return False
+            elif s in none_values:
+                return None
             else:
                 raise ValueError(f"{s} cannot be cast to bool")
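
The change to `map_string` makes strings listed in `none_values` map to a missing value instead of raising. A minimal standalone sketch of that lookup order in plain Python, without pandas, might look like the following. The function name `parse_nullable_bool` and the module-level sets are illustrative assumptions, not pandas API.

```python
from __future__ import annotations

# Illustrative defaults that mirror the sets shown in the diff.
TRUE_VALUES = {"True", "TRUE", "true", "1", "1.0"}
FALSE_VALUES = {"False", "FALSE", "false", "0", "0.0"}


def parse_nullable_bool(
    strings: list[str],
    true_values: list[str] | None = None,
    false_values: list[str] | None = None,
    none_values: list[str] | None = None,
) -> list[bool | None]:
    """Map each string to True, False, or None; raise on anything else."""
    true_union = TRUE_VALUES.union(true_values or [])
    false_union = FALSE_VALUES.union(false_values or [])
    none_set = set(none_values or [])

    def map_string(s: str) -> bool | None:
        # Order matters: true and false markers win over missing markers.
        if s in true_union:
            return True
        if s in false_union:
            return False
        if s in none_set:
            return None
        raise ValueError(f"{s} cannot be cast to bool")

    return [map_string(s) for s in strings]


result = parse_nullable_bool(["True", "None", "0"], none_values=["None"])
print(result)  # [True, None, False]
```

Without `none_values=["None"]`, the same call would raise `ValueError`, which is the failure mode named in the issue title.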

4 changes: 3 additions & 1 deletion pandas/io/parsers/base_parser.py
@@ -745,11 +745,13 @@ def _cast_types(self, values: ArrayLike, cast_type: DtypeObj, column) -> ArrayLi
                 if isinstance(cast_type, BooleanDtype):
                     # error: Unexpected keyword argument "true_values" for
                     # "_from_sequence_of_strings" of "ExtensionArray"
+                    values_str = [str(val) for val in values]
                     return array_type._from_sequence_of_strings(  # type: ignore[call-arg]
-                        values,
+                        values_str,
                         dtype=cast_type,
                         true_values=self.true_values,
                         false_values=self.false_values,
Comment on lines +750 to 753
Contributor

none_values is not passed anything here. Can the equivalent na_values and keep_default_na be passed in here, or is it irrelevant?

Contributor Author

@rmhowe425 rmhowe425 Jun 16, 2024

I think I see what you're saying. You're suggesting to propagate na_values and keep_default_na from read_excel() to _cast_types()?

I think that's a good idea. I'll make that change. That should allow me to get rid of _NONE_VALUES in boolean.py, and allow me to dynamically handle any values that equate to None.

Contributor

Those params are already available in _cast_types since it's an instance method (so self.na_values and self.keep_default_na are available here, similar to self.true_values and self.false_values). I was thinking about propagating them to _from_sequence_of_strings().

Contributor Author

Yep we're on the same page. Making those changes now
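
Since the base_parser.py hunk converts every cell with `str(val)` before parsing, missing cells reach the string parser as text. The snippet below uses only built-in Python to show what those strings look like. It suggests why entries such as "None" and "nan" would need to be present in whatever list is forwarded as `none_values`. The cell values are made up for illustration.

```python
# Hypothetical cell values as an Excel reader might hand them over.
cells = [True, False, None, float("nan"), 1.0, 0]

# Same conversion as in the diff: every value becomes its str() form.
values_str = [str(val) for val in cells]
print(values_str)  # ['True', 'False', 'None', 'nan', '1.0', '0']
```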

+                        none_values=self.na_values,
                     )
                 else:
                     return array_type._from_sequence_of_strings(values, dtype=cast_type)
Binary file added pandas/tests/io/data/excel/test7.xlsx
Binary file not shown.
29 changes: 29 additions & 0 deletions pandas/tests/io/excel/test_readers.py
@@ -164,6 +164,35 @@ def xfail_datetimes_with_pyxlsb(engine, request):


 class TestReaders:
+    @pytest.mark.parametrize("col", [[True, None, False], [True], [True, False]])
+    def test_read_excel_type_check(self, col):
+        # GH 58159
+        df = DataFrame({"bool_column": col}, dtype=pd.BooleanDtype.name)
Member

Suggested change
df = DataFrame({"bool_column": col}, dtype=pd.BooleanDtype.name)
df = DataFrame({"bool_column": col}, dtype="boolean")

And could you change other usages here?

+        df.to_excel("test-type.xlsx", index=False)
Member

Can you use the tmp_excel fixture for this path?
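
Writing to a fixed relative path such as "test-type.xlsx" leaves a file behind in the working directory and can collide when tests run in parallel. The rough, stdlib-only sketch below shows the kind of unique temporary path a fixture can hand to a test. `make_tmp_excel_path` is a hypothetical helper written for this illustration. It is not pandas' `tmp_excel` fixture.

```python
import tempfile
import uuid
from pathlib import Path


def make_tmp_excel_path(directory: Path, ext: str = ".xlsx") -> str:
    """Return a unique, not-yet-used file path inside `directory`."""
    return str(directory / f"{uuid.uuid4()}{ext}")


with tempfile.TemporaryDirectory() as tmp_dir:
    path_a = make_tmp_excel_path(Path(tmp_dir))
    path_b = make_tmp_excel_path(Path(tmp_dir))
    Path(path_a).write_bytes(b"")  # stand-in for df.to_excel(path_a)
    file_existed = Path(path_a).exists()

# The directory and everything in it are removed once the block exits.
file_cleaned_up = not Path(path_a).exists()
```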

+        df2 = pd.read_excel(
+            "test-type.xlsx",
+            dtype={"bool_column": pd.BooleanDtype.name},
+            engine="openpyxl",
+        )
+        tm.assert_frame_equal(df, df2)
+
+    def test_pass_none_type(self):
+        # GH 58159
+        with pd.ExcelFile("test7.xlsx") as excel:
Member

Can you use the datapath fixture to access test7.xlsx? Also, could you give test7.xlsx a better name that reflects its contents?

+            parsed = pd.read_excel(
+                excel,
+                sheet_name="Sheet1",
+                keep_default_na=True,
+                na_values=["nan", "None", "abcd"],
+                dtype=pd.BooleanDtype.name,
+                engine="openpyxl",
+            )
+        expected = DataFrame(
+            {"Test": [True, None, False, None, False, None, True]},
+            dtype=pd.BooleanDtype.name,
+        )
+        tm.assert_frame_equal(parsed, expected)
+
     @pytest.fixture(autouse=True)
     def cd_and_set_engine(self, engine, datapath, monkeypatch):
         """