ENH: allow user to infer SAS file encoding; add correct encoding names #48050
else:
    self.file_encoding = f"unknown (code={buf})"
    self.inferred_encoding = f"unknown (code={buf})"
Aside, it doesn't look like this attribute is used anywhere else?
Yes, I found that weird as well.
I also chose to rename it because (1) it has no implications for other code and (2) it reduces confusion, since there are `file_encoding`, `encoding` and `default_encoding`. The name `inferred_encoding` includes the source of this variable's content.
Okay, makes sense. It would take a follow-up PR to remove the stateful attributes.
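To make the thread above concrete, here is a minimal sketch of how an encoding code read from the file header could be turned into an `inferred_encoding` value, with the `unknown (code=...)` fallback from the diff. The table `ENCODING_NAMES` and the helpers `infer_encoding` and `resolve_encoding` are hypothetical names for illustration; the table is a small assumed subset and is not copied from pandas, whose actual handling may differ.

```python
# Hypothetical subset of a header-code -> Python codec-name table.
# This is an assumption for illustration, not pandas' real constants.
ENCODING_NAMES = {20: "utf-8", 29: "latin1", 62: "cp1252"}


def infer_encoding(code):
    """Map a header code to a codec name, or to an 'unknown' marker."""
    return ENCODING_NAMES.get(code, f"unknown (code={code})")


def resolve_encoding(requested, code):
    """Use the inferred encoding only when the caller asks for "infer"."""
    if requested == "infer":
        return infer_encoding(code)
    return requested


print(resolve_encoding("infer", 62))
print(resolve_encoding("latin1", 62))
print(resolve_encoding("infer", 999))
```

Keeping a single attribute that is named after where its value comes from avoids having to tell apart several similarly named encoding fields.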
pandas/tests/io/sas/test_sas7bdat.py
Outdated
fname = datapath("io", "sas", "data", "test1.sas7bdat")

# check if inferred correctly
df1_reader: SAS7BDATReader = pd.read_sas(fname, encoding="infer", iterator=True)
We usually don't use typing in tests, so I think the annotation is unnecessary.
pandas/tests/io/sas/test_sas7bdat.py
Outdated
df1_reader: SAS7BDATReader = pd.read_sas(fname, encoding="infer", iterator=True)
assert (
    df1_reader.inferred_encoding == "cp1252"
), f"""
Likewise, we usually just have the plain `assert` without a message.
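A rough sketch of the shape the two review comments ask for: no type annotation, a plain `assert`, and the reader used as a context manager. `StubReader` is a hypothetical stand-in so the snippet runs without a SAS file; in the actual test the object would come from `pd.read_sas(fname, encoding="infer", iterator=True)`.

```python
class StubReader:
    """Hypothetical stand-in for the reader; it only mimics the parts used here."""

    def __init__(self, inferred_encoding):
        self.inferred_encoding = inferred_encoding
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Record that the context manager released the reader.
        self.closed = True
        return False


def test_encoding_infer():
    # Assumption: the real test would build the reader with
    # pd.read_sas(fname, encoding="infer", iterator=True) instead.
    reader = StubReader("cp1252")
    with reader as df1_reader:
        # Plain assert, no annotation and no custom message.
        assert df1_reader.inferred_encoding == "cp1252"
    assert reader.closed


test_encoding_infer()
```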
I have included release information; this would fit into an enhancement section, right? Additionally, if you think that this issue/PR is worthwhile, I'd include all the other encoding options from the SAS documentation as well, to cover all use cases.
I have now included the other encoding names, though I don't know how to factually test them; I just know that using these within Python's
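One inexpensive check for a list of encoding names, short of having a sample file per encoding, is to confirm that Python's codec registry recognises each name. The sketch below does that with the standard-library `codecs.lookup`. `CANDIDATE_CODECS` is an illustrative list and not the set of names added in this PR, and `unknown_codecs` is a hypothetical helper.

```python
import codecs

# Illustrative names only; not the list added in the PR.
CANDIDATE_CODECS = ["utf-8", "latin1", "cp1250", "cp1251", "cp1252", "shift_jis"]


def unknown_codecs(names):
    """Return the names that Python's codec registry does not recognise."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
    return missing


# A round trip shows that a recognised codec can encode and decode text.
sample = "Grüße"
round_trip = sample.encode("cp1252").decode("cp1252")

print(unknown_codecs(CANDIDATE_CODECS))
print(round_trip == sample)
```

This only shows that a name is usable in Python. It does not show that a given SAS encoding code maps to the right name, which would still need real sample files.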
@mroeschke Is this done in your opinion? :-) The
Would you mind merging in main and moving the release note to
The team is releasing the 1.5rc in the next few days, so we want to make minimal changes. It may have a chance to make it, though, depending on whether other team members find it appropriate.
…d correct encoding names
91bb9cc to 0d93338
Have rebased (and uncluttered the PR from the other commits) and moved into the
Sorry for the delay here. This just needs the merge conflict resolved; otherwise LGTM.
@mroeschke It has been done!
Thanks @YYYasin19!
pandas-dev#48050)
* ENH: allow user to infer SAS file encoding; use detected encoding; add correct encoding names
* remove typing annotation and assert error message
* include initial release documentation
* rewrite test: no type hinting + context manager call
* add encoding constant for whole documentation -- incl. explicit not found references
* moved release note to 1.6.0
`read_sas` to infer encoding from file, then use encoding #48048