ENH: allow user to infer SAS file encoding; add correct encoding names #48050

Merged
merged 7 commits into from
Sep 19, 2022
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.6.0.rst
@@ -28,6 +28,7 @@ enhancement2
 
 Other enhancements
 ^^^^^^^^^^^^^^^^^^
+- :func:`read_sas` now supports using ``encoding='infer'`` to correctly read and use the encoding specified by the sas file. (:issue:`48048`)
 - :meth:`.DataFrameGroupBy.quantile` and :meth:`.SeriesGroupBy.quantile` now preserve nullable dtypes instead of casting to numpy dtypes (:issue:`37493`)
 - :meth:`Series.add_suffix`, :meth:`DataFrame.add_suffix`, :meth:`Series.add_prefix` and :meth:`DataFrame.add_prefix` support an ``axis`` argument. If ``axis`` is set, the default behaviour of which axis to consider can be overwritten (:issue:`47819`)
 - :func:`assert_frame_equal` now shows the first element where the DataFrames differ, analogously to ``pytest``'s output (:issue:`47910`)
12 changes: 8 additions & 4 deletions pandas/io/sas/sas7bdat.py
@@ -147,8 +147,10 @@ class SAS7BDATReader(ReaderBase, abc.Iterator):
     chunksize : int, defaults to None
         Return SAS7BDATReader object for iterations, returns chunks
         with given number of lines.
-    encoding : string, defaults to None
-        String encoding.
+    encoding : str, 'infer', defaults to None
+        String encoding acc. to python standard encodings,
+        encoding='infer' tries to detect the encoding from the file header,
+        encoding=None will leave the data in binary format.
     convert_text : bool, defaults to True
         If False, text variables are left as raw bytes.
     convert_header_text : bool, defaults to True
@@ -265,9 +267,11 @@ def _get_properties(self) -> None:
         # Get encoding information
         buf = self._read_bytes(const.encoding_offset, const.encoding_length)[0]
         if buf in const.encoding_names:
-            self.file_encoding = const.encoding_names[buf]
+            self.inferred_encoding = const.encoding_names[buf]
+            if self.encoding == "infer":
+                self.encoding = self.inferred_encoding
         else:
-            self.file_encoding = f"unknown (code={buf})"
+            self.inferred_encoding = f"unknown (code={buf})"
Member
Aside, it doesn't look like this attribute is used anywhere else?

Contributor Author
Yes, I found that odd as well.
I also chose to rename it because (1) it has no implications for other code and (2) it reduces confusion, since there are already file_encoding, encoding and default_encoding. The name inferred_encoding reflects the source of this variable's content.

Member
Okay, makes sense. It would take a follow-up PR to remove the stateful attributes.


         # Get platform information
         buf = self._read_bytes(const.platform_offset, const.platform_length)
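For readers who want to follow the encoding hunk without the reader class around it, the resolution logic can be sketched as a standalone function. This is only an illustration: `resolve_encoding` and `ENCODING_NAMES` are hypothetical names, and the table is a three-entry excerpt of the mapping in this PR, not the full dictionary.

```python
from typing import Optional, Tuple

# Hypothetical excerpt of the code-to-codec table from this PR (not complete).
ENCODING_NAMES = {20: "utf-8", 29: "latin1", 62: "cp1252"}


def resolve_encoding(header_code: int, requested: Optional[str]) -> Tuple[Optional[str], str]:
    """Return (encoding used for decoding, encoding inferred from the header)."""
    if header_code in ENCODING_NAMES:
        inferred = ENCODING_NAMES[header_code]
        # Only an explicit "infer" request is replaced by the header value;
        # None or a concrete codec name is passed through unchanged.
        used = inferred if requested == "infer" else requested
    else:
        inferred = f"unknown (code={header_code})"
        # As in the hunk, an unrecognised code leaves the request untouched,
        # so a request of "infer" stays the literal string "infer".
        used = requested
    return used, inferred


print(resolve_encoding(62, "infer"))
print(resolve_encoding(62, None))
```

The second call shows the `encoding=None` case from the docstring: the header is still inspected, but nothing is selected for decoding, so text stays as bytes.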
61 changes: 55 additions & 6 deletions pandas/io/sas/sas_constants.py
@@ -107,15 +107,64 @@
 compression_literals: Final = [rle_compression, rdc_compression]
 
 # Incomplete list of encodings, using SAS nomenclature:
-# http://support.sas.com/documentation/cdl/en/nlsref/61893/HTML/default/viewer.htm#a002607278.htm
+# https://support.sas.com/documentation/onlinedoc/dfdmstudio/2.6/dmpdmsug/Content/dfU_Encodings_SAS.html
+# corresponding to the Python documentation of standard encodings
+# https://docs.python.org/3/library/codecs.html#standard-encodings
 encoding_names: Final = {
-    29: "latin1",
     20: "utf-8",
+    29: "latin1",
+    30: "latin2",
+    31: "latin3",
+    32: "latin4",
     33: "cyrillic",
-    60: "wlatin2",
-    61: "wcyrillic",
-    62: "wlatin1",
-    90: "ebcdic870",
+    34: "arabic",
+    35: "greek",
+    36: "hebrew",
+    37: "latin5",
+    38: "latin6",
+    39: "cp874",
+    40: "latin9",
+    41: "cp437",
+    42: "cp850",
+    43: "cp852",
+    44: "cp857",
+    45: "cp858",
+    46: "cp862",
+    47: "cp864",
+    48: "cp865",
+    49: "cp866",
+    50: "cp869",
+    51: "cp874",
+    # 52: "", # not found
+    # 53: "", # not found
+    # 54: "", # not found
+    55: "cp720",
+    56: "cp737",
+    57: "cp775",
+    58: "cp860",
+    59: "cp863",
+    60: "cp1250",
+    61: "cp1251",
+    62: "cp1252",
+    63: "cp1253",
+    64: "cp1254",
+    65: "cp1255",
+    66: "cp1256",
+    67: "cp1257",
+    68: "cp1258",
+    118: "cp950",
+    # 119: "", # not found
+    123: "big5",
+    125: "gb2312",
+    126: "cp936",
+    134: "euc_jp",
+    136: "cp932",
+    138: "shift_jis",
+    140: "euc-kr",
+    141: "cp949",
+    227: "latin8",
+    # 228: "", # not found
+    # 229: "" # not found
 }
 
 
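The "correct encoding names" part of the title refers to replacing SAS names such as `wlatin1` with names that Python's codec registry accepts. A quick way to see the difference is `codecs.lookup` from the standard library; the helper `unresolvable` and the two sample lists below are illustrative only and cover just a handful of the names in the table.

```python
import codecs

# SAS-style names removed by this PR, and a few of their replacements.
old_names = ["wlatin2", "wcyrillic", "wlatin1", "ebcdic870"]
new_names = ["cp1250", "cp1251", "cp1252", "utf-8", "latin1"]


def unresolvable(names):
    """Return the names that Python's codec registry does not recognise."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
    return missing


print("old names Python cannot resolve:", unresolvable(old_names))
print("new names Python cannot resolve:", unresolvable(new_names))
```

With a stock CPython codec registry, none of the old SAS names resolve, which is why passing them on to `bytes.decode` could not have worked.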
15 changes: 15 additions & 0 deletions pandas/tests/io/sas/test_sas7bdat.py
@@ -136,6 +136,21 @@ def test_encoding_options(datapath):
         assert x == y.decode()
 
 
+def test_encoding_infer(datapath):
+    fname = datapath("io", "sas", "data", "test1.sas7bdat")
+
+    with pd.read_sas(fname, encoding="infer", iterator=True) as df1_reader:
+        # check: is encoding inferred correctly from file
+        assert df1_reader.inferred_encoding == "cp1252"
+        df1 = df1_reader.read()
+
+    with pd.read_sas(fname, encoding="cp1252", iterator=True) as df2_reader:
+        df2 = df2_reader.read()
+
+    # check: reader reads correct information
+    tm.assert_frame_equal(df1, df2)
+
+
 def test_productsales(datapath):
     fname = datapath("io", "sas", "data", "productsales.sas7bdat")
     df = pd.read_sas(fname, encoding="utf-8")