Skip to content

ENH: allow user to infer SAS file encoding; add correct encoding names #48050

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Sep 19, 2022

Conversation

YYYasin19
Copy link
Contributor

@YYYasin19 YYYasin19 commented Aug 11, 2022

@mroeschke mroeschke added the IO SAS SAS: read_sas label Aug 12, 2022
else:
self.file_encoding = f"unknown (code={buf})"
self.inferred_encoding = f"unknown (code={buf})"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside, it doesn't look like this attribute is used anywhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, found that weird as well.
I also chose to rename it because (1) it has no implications to other code (2) reduce confusion, since there is file_encoding, encoding and default_encoding. The name inferred_encoding includes the source of this variables content.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay makes sense. Would take a followup PR to remove the statefullness attributes.

fname = datapath("io", "sas", "data", "test1.sas7bdat")

# check if inferred correctly
df1_reader: SAS7BDATReader = pd.read_sas(fname, encoding="infer", iterator=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually don't using typing in test so I think the annotation is unnecessary

df1_reader: SAS7BDATReader = pd.read_sas(fname, encoding="infer", iterator=True)
assert (
df1_reader.inferred_encoding == "cp1252"
), f"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, we usually just have the plain assert without message

@YYYasin19
Copy link
Contributor Author

I have included release information; this would fit into an enhancement section, right?

Additionally: If you think that this issue/PR is worthwile, I'd include all the other encoding options from the SAS documentation as well to be complete with all use cases.

@YYYasin19
Copy link
Contributor Author

Additionally: If you think that this issue/PR is worthwile, I'd include all the other encoding options from the SAS documentation as well to be complete with all use cases.

I have now included the other encoding names, though I don't know how to factually test them; I just know that using these within Python's encode(...) does not throw an error :)

@YYYasin19
Copy link
Contributor Author

@mroeschke Is this done in your opinion? :-) The Data Manager check seems to have failed for unrelated reasons; do you mind restarting?

@mroeschke
Copy link
Member

@mroeschke Is this done in your opinion? :-) The Data Manager check seems to have failed for unrelated reasons; do you mind restarting?

Would you mind merging in main and moving the release note to 1.6.0.rst? We are cutting a 1.5.0 fairly soon so we are freezing changes like these into the 1.5 release

@YYYasin19
Copy link
Contributor Author

Does this not make the 1.5 release because of the SAS7BDAT reader refactoring here (#47403) (and others)?

But of course, will do once the 1.6.0 doc is merged / started (#48221)

@mroeschke
Copy link
Member

Does this not make the 1.5 release because of the SAS7BDAT reader refactoring here (#47403) (and others)?

But of course, will do once the 1.6.0 doc is merged / started (#48221)

The team is releasing the 1.5rc in the next few days so we want to make minimal changes. It may have a change to make it though depending if other team members find it appropriate.

@YYYasin19 YYYasin19 force-pushed the detect-sas-encoding branch from 91bb9cc to 0d93338 Compare August 31, 2022 16:13
@YYYasin19
Copy link
Contributor Author

Have rebased (and uncluttered the PR from the other commits) and moved into the 1.6.0 release notes. ✅

@mroeschke
Copy link
Member

Sorry for the delay here. Just needs to resolve the merge conflict otherwise LGTM

@YYYasin19
Copy link
Contributor Author

@mroeschke It has been done!

@mroeschke mroeschke added this to the 1.6 milestone Sep 19, 2022
@mroeschke mroeschke merged commit 5de2448 into pandas-dev:main Sep 19, 2022
@mroeschke
Copy link
Member

Thanks @YYYasin19!

@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
pandas-dev#48050)

* ENH: allow user to infer SAS file encoding; use detected encoding; add correct encoding names

* remove typing annotation and assert error message

* include initial release documentation

* rewrite test: no type hinting + context manager call

* add encoding constant for whole documentation -- incl. explicit not found references

* moved release note to 1.6.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SAS SAS: read_sas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add option to read_sas to infer encoding from file, then use encoding
2 participants