-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add check for character limmit #57103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add check for character limmit #57103
Conversation
As discussed in the context of #56954 (comment), we have incorporated a validation for the maximum row/column number within the code. pandas/pandas/io/formats/excel.py Lines 929 to 934 in 9008ee5
Furthermore, I have introduced a character limit check and included corresponding notifications in the documentation. |
The CI test failure appears to be linked to the changes made in the pull request, but I'm unable to reproduce it on my local machine (WSL2). Assistance may be needed. |
Caused by warnings from other packages, and should be resolved by incorporating |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, looks good! Some requests below, also needs a note in the 3.0.0 whatsnew in the Other Enhancements section.
pandas/io/excel/_base.py
Outdated
# GH#56954 | ||
if len(val) > 32767: | ||
warnings.warn( | ||
"String value too long, truncated to 32767 characters", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can potentially come from other data types, "String" may end up being confusing if that's the case. I think a more generic "Cell contents too long" is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a quick question regarding this: should we also explore validating other data types, like float or int? The current code seems to focus solely on validating the length for specific data types. Do we need to extend these checks to other types mentioned above? Alternatively, have we conducted similar checks elsewhere, or is there a best practice that should be adhered to in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also explore validating other data types, like float or int?
I think Excel's limitation is specifically for strings.
Alternatively, have we conducted similar checks elsewhere, or is there a best practice that should be adhered to in this context?
When you say elsewhere, are you thinking of outside the excel code? I do not know of any formal best practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I grasp your perspective. I want to ensure we deliberate on whether we should also account for the length of other data types. For now, let's concentrate on checking the length specifically for strings in this pull request; that should suffice.
pandas/io/excel/_base.py
Outdated
if len(val) > 32767: | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a reference where 32767
comes from?
Also I think this should be a UserWarning
and not a ValueError
. We shouldn't completely stop writing the excel file due to excel's limitation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had suggested an error over a warning in #57103 (comment) primarily due to the fact that we raise an error if the frame is too long or wide already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. The reason I suggested warning was due to a similar scenario I remember a while back where we used to raise an error if we detected whether a stack/unstack
operation would create "too many elements" and users wanted to bypass that limitation. We ended up changing it to a PerformanceWarning
#45084
I slightly prefer to still raise a UserWarning
to make pandas less opinionated here (the Excel output is still valid and maybe a truncated value is OK for some users). But if you feel strongly about a ValueError
, it's not a blocker for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually - raising here would need a deprecation too, right? That makes me think we should raise a warning. @luke396 - can you turn this back into a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually - raising here would need a deprecation too, right?
Conservatively yes, but it would "fix" an undesired (external) behavior ("bug").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Truth be told, I didn't initially utilize 'UserWarning' to the extent you both recommended when I first started out. However, I gained valuable insights from the discussions you provided.
Co-authored-by: Matthew Roeschke <[email protected]>
df = DataFrame({"A": ["a" * 32768]}) | ||
msg = "Cell contents too long, truncated to 32767 characters" | ||
with tm.assert_produces_warning( | ||
UserWarning, match=msg, raise_on_extra_warnings=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What extra warning is raised here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From commit babdf29,
the check https://github.com/pandas-dev/pandas/actions/runs/7678882560/job/20929191559 rise pandas/tests/io/excel/test_writers.py::TestExcelWriter::test_to_excel_raising_warning_when_cell_character_exceed_limit[openpyxl-.xlsx] - AssertionError: Caused unexpected warning(s): [('EncodingWarning', EncodingWarning("'encoding' argument not specified"), '/home/runner/micromamba/envs/test/lib/python3.10/site-packages/openpyxl/writer/excel.py', 199)]
and check https://github.com/pandas-dev/pandas/actions/runs/7678882560/job/20929192274 rasie pandas/tests/io/excel/test_writers.py::TestExcelWriter::test_to_excel_raising_warning_when_cell_character_exceed_limit[openpyxl-.xlsm] - AssertionError: Caused unexpected warning(s): [('DeprecationWarning', DeprecationWarning('datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).'), '/home/runner/micromamba/envs/test/lib/python3.12/site-packages/openpyxl/packaging/core.py', 99), ('DeprecationWarning', DeprecationWarning('datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).'), '/home/runner/micromamba/envs/test/lib/python3.12/site-packages/openpyxl/writer/excel.py', 292)]
pandas/tests/io/excel/test_writers.py::TestExcelWriter::test_to_excel_raising_warning_when_cell_character_exceed_limit
is what we added in this pr. I just wrote the test following the other's in same file. If there is a better way to avoid these extra warnings, I am very happy to change codes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. Yeah since these are external warnings this is fine for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge when ready @rhshadrach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @luke396! |
Co-authored-by: Matthew Roeschke <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.