Skip to content

ENH: Add check for character limmit #57103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Feb 1, 2024
4 changes: 4 additions & 0 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -2363,6 +2363,10 @@ def to_excel(
Once a workbook has been saved it is not possible to write further
data without rewriting the whole workbook.

Pandas will check the number of rows, columns,
and cell character count does not exceed Excel's limitations.
All other limitations must be checked by the user.

Examples
--------

Expand Down
7 changes: 7 additions & 0 deletions pandas/io/excel/_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -1326,6 +1326,13 @@ def _value_with_fmt(
fmt = "0"
else:
val = str(val)
# GH#56954
if len(val) > 32767:
warnings.warn(
"String value too long, truncated to 32767 characters",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can potentially come from other data types, "String" may end up being confusing if that's the case. I think a more generic "Cell contents too long" is better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a quick question regarding this: should we also explore validating other data types, like float or int? The current code seems to focus solely on validating the length for specific data types. Do we need to extend these checks to other types mentioned above? Alternatively, have we conducted similar checks elsewhere, or is there a best practice that should be adhered to in this context?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also explore validating other data types, like float or int?

I think Excel's limitation is specifically for strings.

Alternatively, have we conducted similar checks elsewhere, or is there a best practice that should be adhered to in this context?

When you say elsewhere, are you thinking of outside the excel code? I do not know of any formal best practice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I grasp your perspective. I want to ensure we deliberate on whether we should also account for the length of other data types. For now, let's concentrate on checking the length specifically for strings in this pull request; that should suffice.

UserWarning,
stacklevel=find_stack_level(),
)

return val, fmt

Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/io/excel/test_writers.py
Original file line number Diff line number Diff line change
Expand Up @@ -1385,6 +1385,17 @@ def test_to_excel_empty_frame(self, engine, ext):
expected = DataFrame()
tm.assert_frame_equal(result, expected)

def test_to_excel_raising_warning_when_cell_character_exceed_limit(
self, path, engine
):
# GH#56954
df = DataFrame({"A": ["a" * 32768]})
msg = "String value too long, truncated to 32767 characters"
with tm.assert_produces_warning(
UserWarning, match=msg, raise_on_extra_warnings=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What extra warning is raised here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From commit babdf29,

the check https://github.com/pandas-dev/pandas/actions/runs/7678882560/job/20929191559 rise pandas/tests/io/excel/test_writers.py::TestExcelWriter::test_to_excel_raising_warning_when_cell_character_exceed_limit[openpyxl-.xlsx] - AssertionError: Caused unexpected warning(s): [('EncodingWarning', EncodingWarning("'encoding' argument not specified"), '/home/runner/micromamba/envs/test/lib/python3.10/site-packages/openpyxl/writer/excel.py', 199)]

and check https://github.com/pandas-dev/pandas/actions/runs/7678882560/job/20929192274 rasie pandas/tests/io/excel/test_writers.py::TestExcelWriter::test_to_excel_raising_warning_when_cell_character_exceed_limit[openpyxl-.xlsm] - AssertionError: Caused unexpected warning(s): [('DeprecationWarning', DeprecationWarning('datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).'), '/home/runner/micromamba/envs/test/lib/python3.12/site-packages/openpyxl/packaging/core.py', 99), ('DeprecationWarning', DeprecationWarning('datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).'), '/home/runner/micromamba/envs/test/lib/python3.12/site-packages/openpyxl/writer/excel.py', 292)]

pandas/tests/io/excel/test_writers.py::TestExcelWriter::test_to_excel_raising_warning_when_cell_character_exceed_limit is what we added in this pr. I just wrote the test following the other's in same file. If there is a better way to avoid these extra warnings, I am very happy to change codes!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Yeah since these are external warnings this is fine for now

):
df.to_excel(path, engine=engine)


class TestExcelWriterEngineTests:
@pytest.mark.parametrize(
Expand Down