ENH: allow user to infer SAS file encoding; add correct encoding names #48050

YYYasin19 · 2022-08-11T22:32:14Z

closes ENH: Add option to read_sas to infer encoding from file, then use encoding #48048
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

mroeschke · 2022-08-12T18:10:46Z

pandas/io/sas/sas7bdat.py

        else:
-            self.file_encoding = f"unknown (code={buf})"
+            self.inferred_encoding = f"unknown (code={buf})"


Aside, it doesn't look like this attribute is used anywhere else?

Yes, found that weird as well.
I also chose to rename it because (1) it has no implications for other code and (2) it reduces confusion, since there are already file_encoding, encoding and default_encoding. The name inferred_encoding indicates the source of this variable's content.

Okay, makes sense. It would take a follow-up PR to remove the stateful attributes.
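For readers following this thread, here is a minimal sketch of the pattern under discussion: a header code is looked up in a table, and an "unknown (code=...)" marker is stored when the code is not recognised. The `ENCODING_NAMES` table and the `infer_encoding` helper are hypothetical stand-ins for illustration, not the pandas internals, and the two table entries are assumptions rather than the real SAS mapping.

```python
# Hypothetical lookup table: header byte code -> Python codec name.
# The entries are illustrative assumptions, not the full SAS table.
ENCODING_NAMES = {
    20: "utf-8",
    62: "cp1252",
}


def infer_encoding(code: int) -> str:
    """Return the codec name for a header code, or an 'unknown' marker."""
    try:
        return ENCODING_NAMES[code]
    except KeyError:
        # Same fallback string shape as in the diff above.
        return f"unknown (code={code})"
```

Calling `infer_encoding(62)` returns a usable codec name, while an unrecognised code yields a string that is only suitable for display, so a caller would need to check for it before decoding.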

mroeschke · 2022-08-12T18:11:19Z

pandas/tests/io/sas/test_sas7bdat.py

+    fname = datapath("io", "sas", "data", "test1.sas7bdat")
+
+    # check if inferred correctly
+    df1_reader: SAS7BDATReader = pd.read_sas(fname, encoding="infer", iterator=True)


We usually don't use typing in tests, so I think the annotation is unnecessary.

mroeschke · 2022-08-12T18:11:39Z

pandas/tests/io/sas/test_sas7bdat.py

+    df1_reader: SAS7BDATReader = pd.read_sas(fname, encoding="infer", iterator=True)
+    assert (
+        df1_reader.inferred_encoding == "cp1252"
+    ), f"""


Likewise, we usually just have a plain assert without a message.
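Taken together, the two review comments point toward a test shaped roughly like the sketch below: no type annotation on the reader, a plain assert, and (as a later commit message mentions) a context manager around the reader. The helper name `check_inferred_encoding` and its `path` and `expected` parameters are hypothetical; pandas is imported inside the function only so that the snippet can be loaded without pandas installed, and `encoding="infer"` assumes a pandas version that includes this PR.

```python
def check_inferred_encoding(path, expected="cp1252"):
    """Open a SAS file lazily and compare the encoding inferred from it."""
    # Imported here only to keep the sketch importable without pandas.
    import pandas as pd

    # iterator=True returns a reader object instead of a DataFrame;
    # the with-block closes the underlying file handle afterwards.
    with pd.read_sas(path, encoding="infer", iterator=True) as reader:
        # Plain assert, no custom message, as suggested in review.
        assert reader.inferred_encoding == expected
```

In the pandas test suite the path would come from the `datapath` fixture, as in the diff above.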

YYYasin19 · 2022-08-12T23:03:40Z

I have included release information; this would fit into an enhancement section, right?

Additionally: If you think that this issue/PR is worthwhile, I'd include all the other encoding options from the SAS documentation as well to be complete with all use cases.

doc/source/whatsnew/v1.4.4.rst

pandas/tests/io/sas/test_sas7bdat.py

YYYasin19 · 2022-08-16T19:52:59Z

I have now included the other encoding names, though I don't know how to factually test them; I just know that using these within Python's encode(...) does not throw an error :)
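The kind of check described here can be sketched with the standard library alone. The snippet below asks Python's codec registry whether each name is known; it does not verify that a name is the right match for a given SAS encoding, which is the part that remains hard to test. The `unknown_codecs` helper and the `candidates` list are hypothetical examples, not the names added in this PR.

```python
import codecs


def unknown_codecs(names):
    """Return the names that Python's codec registry does not recognise."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
    return missing


# Hypothetical sample of candidate names; not the list from the PR.
candidates = ["utf-8", "latin1", "cp1252", "shift_jis", "not-a-codec"]
print(unknown_codecs(candidates))
```

An empty result only means every name resolves to some Python codec.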

YYYasin19 · 2022-08-23T20:32:07Z

@mroeschke Is this done in your opinion? :-) The Data Manager check seems to have failed for unrelated reasons; do you mind restarting?

mroeschke · 2022-08-23T21:02:29Z

Would you mind merging in main and moving the release note to 1.6.0.rst? We are cutting a 1.5.0 fairly soon so we are freezing changes like these into the 1.5 release

YYYasin19 · 2022-08-23T21:54:06Z

Does this not make the 1.5 release because of the SAS7BDAT reader refactoring here (#47403) (and others)?

But of course, will do once the 1.6.0 doc is merged / started (#48221)

mroeschke · 2022-08-23T22:53:39Z

The team is releasing the 1.5rc in the next few days, so we want to make minimal changes. It may still have a chance to make it, though, depending on whether other team members find it appropriate.

YYYasin19 · 2022-08-31T16:23:02Z

Have rebased (and uncluttered the PR from the other commits) and moved into the 1.6.0 release notes. ✅

mroeschke · 2022-09-12T18:53:49Z

Sorry for the delay here. This just needs the merge conflict resolved; otherwise LGTM.

YYYasin19 · 2022-09-19T19:38:43Z

@mroeschke It has been done!

mroeschke · 2022-09-19T23:22:04Z

Thanks @YYYasin19!

pandas-dev#48050)
* ENH: allow user to infer SAS file encoding; use detected encoding; add correct encoding names
* remove typing annotation and assert error message
* include initial release documentation
* rewrite test: no type hinting + context manager call
* add encoding constant for whole documentation -- incl. explicit not found references
* moved release note to 1.6.0

mroeschke added the IO SAS SAS: read_sas label Aug 12, 2022

mroeschke reviewed Aug 12, 2022

mroeschke reviewed Aug 15, 2022: doc/source/whatsnew/v1.4.4.rst (outdated, resolved)

mroeschke reviewed Aug 15, 2022: pandas/tests/io/sas/test_sas7bdat.py (outdated, resolved)

mroeschke reviewed Aug 16, 2022: pandas/tests/io/sas/test_sas7bdat.py (outdated, resolved)

YYYasin19 added 5 commits August 31, 2022 18:02

ENH: allow user to infer SAS file encoding; use detected encoding; add correct encoding names

f5293f7

remove typing annotation and assert error message

87fb92c

include initial release documentation

f637661

rewrite test: no type hinting + context manager call

b4d7346

add encoding constant for whole documentation -- incl. explicit not found references

0d93338

YYYasin19 force-pushed the detect-sas-encoding branch from 91bb9cc to 0d93338 Compare August 31, 2022 16:13

moved release note to 1.6.0

4f77c09

Merge branch 'main' into detect-sas-encoding

fde8ce9

mroeschke added this to the 1.6 milestone Sep 19, 2022

mroeschke approved these changes Sep 19, 2022

mroeschke merged commit 5de2448 into pandas-dev:main Sep 19, 2022

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
