Skip to content

ENH: Add check for character limmit #57103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Feb 1, 2024

Conversation

luke396
Copy link
Contributor

@luke396 luke396 commented Jan 27, 2024

@luke396 luke396 requested a review from rhshadrach as a code owner January 27, 2024 08:07
@luke396
Copy link
Contributor Author

luke396 commented Jan 27, 2024

As discussed in the context of #56954 (comment), we have incorporated a validation for the maximum row/column number within the code.

num_rows, num_cols = self.df.shape
if num_rows > self.max_rows or num_cols > self.max_cols:
raise ValueError(
f"This sheet is too large! Your sheet size is: {num_rows}, {num_cols} "
f"Max sheet size is: {self.max_rows}, {self.max_cols}"
)

Furthermore, I have introduced a character limit check and included corresponding notifications in the documentation.

@luke396
Copy link
Contributor Author

luke396 commented Jan 27, 2024

The CI test failure appears to be linked to the changes made in the pull request, but I'm unable to reproduce it on my local machine (WSL2). Assistance may be needed.

@luke396
Copy link
Contributor Author

luke396 commented Jan 28, 2024

The CI test failure appears to be linked to the changes made in the pull request

Caused by warnings from other packages, and should be resolved by incorporating raise_on_extra_warnings=False

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, looks good! Some requests below, also needs a note in the 3.0.0 whatsnew in the Other Enhancements section.

# GH#56954
if len(val) > 32767:
warnings.warn(
"String value too long, truncated to 32767 characters",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can potentially come from other data types, "String" may end up being confusing if that's the case. I think a more generic "Cell contents too long" is better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a quick question regarding this: should we also explore validating other data types, like float or int? The current code seems to focus solely on validating the length for specific data types. Do we need to extend these checks to other types mentioned above? Alternatively, have we conducted similar checks elsewhere, or is there a best practice that should be adhered to in this context?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also explore validating other data types, like float or int?

I think Excel's limitation is specifically for strings.

Alternatively, have we conducted similar checks elsewhere, or is there a best practice that should be adhered to in this context?

When you say elsewhere, are you thinking of outside the excel code? I do not know of any formal best practice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I grasp your perspective. I want to ensure we deliberate on whether we should also account for the length of other data types. For now, let's concentrate on checking the length specifically for strings in this pull request; that should suffice.

Comment on lines 1330 to 1331
if len(val) > 32767:
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a reference where 32767 comes from?

Also I think this should be a UserWarning and not a ValueError. We shouldn't completely stop writing the excel file due to excel's limitation

Copy link
Member

@rhshadrach rhshadrach Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had suggested an error over a warning in #57103 (comment) primarily due to the fact that we raise an error if the frame is too long or wide already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. The reason I suggested warning was due to a similar scenario I remember a while back where we used to raise an error if we detected whether a stack/unstack operation would create "too many elements" and users wanted to bypass that limitation. We ended up changing it to a PerformanceWarning #45084

I slightly prefer to still raise a UserWarning to make pandas less opinionated here (the Excel output is still valid and maybe a truncated value is OK for some users). But if you feel strongly about a ValueError, it's not a blocker for me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually - raising here would need a deprecation too, right? That makes me think we should raise a warning. @luke396 - can you turn this back into a warning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually - raising here would need a deprecation too, right?

Conservatively yes, but it would "fix" an undesired (external) behavior ("bug").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truth be told, I didn't initially utilize 'UserWarning' to the extent you both recommended when I first started out. However, I gained valuable insights from the discussions you provided.

@mroeschke mroeschke added the IO Excel read_excel, to_excel label Jan 29, 2024
df = DataFrame({"A": ["a" * 32768]})
msg = "Cell contents too long, truncated to 32767 characters"
with tm.assert_produces_warning(
UserWarning, match=msg, raise_on_extra_warnings=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What extra warning is raised here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From commit babdf29,

the check https://github.com/pandas-dev/pandas/actions/runs/7678882560/job/20929191559 rise pandas/tests/io/excel/test_writers.py::TestExcelWriter::test_to_excel_raising_warning_when_cell_character_exceed_limit[openpyxl-.xlsx] - AssertionError: Caused unexpected warning(s): [('EncodingWarning', EncodingWarning("'encoding' argument not specified"), '/home/runner/micromamba/envs/test/lib/python3.10/site-packages/openpyxl/writer/excel.py', 199)]

and check https://github.com/pandas-dev/pandas/actions/runs/7678882560/job/20929192274 rasie pandas/tests/io/excel/test_writers.py::TestExcelWriter::test_to_excel_raising_warning_when_cell_character_exceed_limit[openpyxl-.xlsm] - AssertionError: Caused unexpected warning(s): [('DeprecationWarning', DeprecationWarning('datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).'), '/home/runner/micromamba/envs/test/lib/python3.12/site-packages/openpyxl/packaging/core.py', 99), ('DeprecationWarning', DeprecationWarning('datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).'), '/home/runner/micromamba/envs/test/lib/python3.12/site-packages/openpyxl/writer/excel.py', 292)]

pandas/tests/io/excel/test_writers.py::TestExcelWriter::test_to_excel_raising_warning_when_cell_character_exceed_limit is what we added in this pr. I just wrote the test following the other's in same file. If there is a better way to avoid these extra warnings, I am very happy to change codes!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Yeah since these are external warnings this is fine for now

@mroeschke mroeschke added this to the 3.0 milestone Feb 1, 2024
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge when ready @rhshadrach

@rhshadrach rhshadrach added Enhancement Warnings Warnings that appear or should be added to pandas labels Feb 1, 2024
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach merged commit bb42fc0 into pandas-dev:main Feb 1, 2024
@rhshadrach
Copy link
Member

Thanks @luke396!

@luke396 luke396 deleted the add-character-check-to-excel-cell branch February 2, 2024 02:27
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel Warnings Warnings that appear or should be added to pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: show warning when exporting dataframe and hitting excel truncating limit
3 participants