Issue 20927 fix resolves read_sas error for dates/datetimes greater than 2262-04-11 #28047

Merged
merged 43 commits into from
May 25, 2020
31b8770
BUG GH20927 Fixes read_sas error for dates/datetimes greater than 226…
paullilley Aug 20, 2019
41ea312
BUG GH20927 Fixes read_sas error for dates/datetimes greater than 226…
paullilley Aug 20, 2019
1e3a71a
DOC fixed typo
paullilley Aug 20, 2019
6860ed3
Merge branch 'master' into issue_20927_fix
paul-lilley Aug 20, 2019
0b4a97e
fixed failing test due to column ordering
paullilley Aug 22, 2019
d7ad4e7
black style fix
paullilley Aug 22, 2019
1aca08a
Return datetime.datetime and raise UserWarning for all dates/datetime…
paullilley Aug 22, 2019
6b3c175
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Aug 22, 2019
f9dfb44
fixed PEP8 formating issue
paullilley Aug 22, 2019
5260a11
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Aug 27, 2019
1865cb3
DOC fixed attribute and class markdown
paullilley Aug 27, 2019
6881114
fixed linting issue
paullilley Aug 27, 2019
6ad0d29
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Aug 27, 2019
2b8b1bd
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Oct 25, 2019
b0976af
CLN: minor refactor for reading sas files, updated whatsnew
paullilley Oct 25, 2019
fef7f0b
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Oct 25, 2019
096ae1c
removed warnings
paullilley Dec 11, 2019
ab85185
merged upstream/master
paullilley Dec 11, 2019
922c9d3
tidied trailing whitespace
paullilley Dec 11, 2019
828e687
moved static method to module level function
paullilley Jan 27, 2020
b7158d3
merged master
paullilley Jan 27, 2020
6279ea7
fixed import sorting
paullilley Jan 29, 2020
43fb9c5
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Feb 5, 2020
5e1a542
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Feb 10, 2020
b38c165
moved whatsnew to 1.1, added space before comment
paullilley Feb 10, 2020
3eaa34d
tidied float constants
paullilley Feb 10, 2020
ce8616f
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Feb 10, 2020
deb313c
fixed linting
paullilley Feb 10, 2020
63991b4
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Feb 10, 2020
6c6144a
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Feb 13, 2020
22e06d4
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Feb 19, 2020
4dce007
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Feb 24, 2020
28ae4cd
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley Mar 20, 2020
1628ffc
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley May 20, 2020
80acc7d
fixed failing test
paullilley May 21, 2020
010f672
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley May 21, 2020
e6403ba
fixed failing test
paullilley May 21, 2020
7b7984a
fixed linting
paullilley May 21, 2020
6ca1577
fixed import order
paullilley May 21, 2020
2afbdee
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley May 25, 2020
17b6515
Added return type. Removed space before issue number.
paullilley May 25, 2020
a654327
Fixed mypy error.
paullilley May 25, 2020
c730cb3
Merge remote-tracking branch 'upstream/master' into issue_20927_fix
paullilley May 25, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -773,6 +773,7 @@ I/O
   timestamps with ``version="2.0"`` (:issue:`31652`).
 - Bug in :meth:`read_csv` was raising `TypeError` when `sep=None` was used in combination with `comment` keyword (:issue:`31396`)
 - Bug in :class:`HDFStore` that caused it to set to ``int64`` the dtype of a ``datetime64`` column when reading a DataFrame in Python 3 from fixed format written in Python 2 (:issue:`31750`)
+- :func:`read_sas()` now handles dates and datetimes larger than :attr:`Timestamp.max` returning them as :class:`datetime.datetime` objects (:issue:`20927`)
 - Bug in :meth:`DataFrame.to_json` where ``Timedelta`` objects would not be serialized correctly with ``date_format="iso"`` (:issue:`28256`)
 - :func:`read_csv` will raise a ``ValueError`` when the column names passed in `parse_dates` are missing in the Dataframe (:issue:`31251`)
 - Bug in :meth:`read_excel` where a UTF-8 string with a high surrogate would cause a segmentation violation (:issue:`23809`)
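
For readers wondering where the 2262-04-11 bound in the PR title comes from: it is consistent with timestamps being stored as signed 64-bit nanosecond counts from 1970-01-01. The sketch below reproduces that arithmetic with the standard library only; `approx_ns_timestamp_max` is an illustrative name, not a pandas function, and the result is truncated to microseconds because `datetime.datetime` has no nanosecond field.

```python
from datetime import datetime, timedelta

INT64_MAX = 2**63 - 1            # largest signed 64-bit integer
UNIX_EPOCH = datetime(1970, 1, 1)


def approx_ns_timestamp_max() -> datetime:
    """Illustrative helper (not a pandas API).

    Latest instant reachable with int64 nanoseconds since the Unix epoch,
    truncated to whole microseconds.
    """
    return UNIX_EPOCH + timedelta(microseconds=INT64_MAX // 1000)


print(approx_ns_timestamp_max())  # 2262-04-11 23:47:16.854775
```

SAS dates such as 31DEC9999 lie far beyond this instant, which is why the reader falls back to `datetime.datetime` objects for them.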
39 changes: 30 additions & 9 deletions pandas/io/sas/sas7bdat.py
@@ -14,12 +14,12 @@
 http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/CUJ/1992/9210/ross/ross.htm
 """
 from collections import abc
-from datetime import datetime
+from datetime import datetime, timedelta
 import struct

 import numpy as np

-from pandas.errors import EmptyDataError
+from pandas.errors import EmptyDataError, OutOfBoundsDatetime

 import pandas as pd

@@ -29,6 +29,32 @@
 from pandas.io.sas.sasreader import ReaderBase


+def _convert_datetimes(sas_datetimes: pd.Series, unit: str):
+    """
+    Convert to Timestamp if possible, otherwise to datetime.datetime.
+    SAS float64 lacks precision for more than ms resolution so the fit
+    to datetime.datetime is ok.
+
+    Parameters
+    ----------
+    sas_datetimes : {Series, Sequence[float]}
+        Dates or datetimes in SAS
+    unit : {str}
+        "d" if the floats represent dates, "s" for datetimes
+    """
+    try:
+        return pd.to_datetime(sas_datetimes, unit=unit, origin="1960-01-01")
+    except OutOfBoundsDatetime:
+        if unit == "s":
+            return sas_datetimes.apply(
+                lambda sas_float: datetime(1960, 1, 1) + timedelta(seconds=sas_float)
+            )
+        elif unit == "d":
+            return sas_datetimes.apply(
+                lambda sas_float: datetime(1960, 1, 1) + timedelta(days=sas_float)
+            )
+
+
 class _subheader_pointer:
     pass

@@ -706,15 +732,10 @@ def _chunk_to_dataframe(self):
                 rslt[name] = self._byte_chunk[jb, :].view(dtype=self.byte_order + "d")
                 rslt[name] = np.asarray(rslt[name], dtype=np.float64)
                 if self.convert_dates:
-                    unit = None
                     if self.column_formats[j] in const.sas_date_formats:
-                        unit = "d"
+                        rslt[name] = _convert_datetimes(rslt[name], "d")
                     elif self.column_formats[j] in const.sas_datetime_formats:
-                        unit = "s"
-                    if unit:
-                        rslt[name] = pd.to_datetime(
-                            rslt[name], unit=unit, origin="1960-01-01"
-                        )
+                        rslt[name] = _convert_datetimes(rslt[name], "s")
                 jb += 1
             elif self._column_types[j] == b"s":
                 rslt[name] = self._string_chunk[js, :]
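
The fallback branch of `_convert_datetimes` is plain epoch arithmetic: a SAS date is a count of days, and a SAS datetime a count of seconds, from 1960-01-01. A minimal standard-library sketch of that per-value step follows; `sas_float_to_datetime` and `SAS_EPOCH` are hypothetical names used only for illustration, and the sample values are the ones that appear in the tests for this PR.

```python
from datetime import datetime, timedelta

SAS_EPOCH = datetime(1960, 1, 1)  # origin used by SAS dates and datetimes


def sas_float_to_datetime(value: float, unit: str) -> datetime:
    """Hypothetical scalar helper mirroring the fallback arithmetic.

    unit="d": value is days since 1960-01-01 (SAS date)
    unit="s": value is seconds since 1960-01-01 (SAS datetime)
    """
    if unit == "d":
        return SAS_EPOCH + timedelta(days=value)
    if unit == "s":
        return SAS_EPOCH + timedelta(seconds=value)
    # Assumption of this sketch: reject anything else explicitly.
    raise ValueError(f"unsupported unit: {unit!r}")


print(sas_float_to_datetime(21762.0, "d"))    # 2019-08-01 00:00:00
print(sas_float_to_datetime(2936547.0, "d"))  # 9999-12-29 00:00:00
```

Because `datetime.datetime` covers years 1 through 9999, this arithmetic can represent the whole range of dates that the test fixture exercises, at microsecond rather than nanosecond resolution.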
Binary file added pandas/tests/io/sas/data/max_sas_date.sas7bdat
Binary file not shown.
92 changes: 92 additions & 0 deletions pandas/tests/io/sas/test_sas7bdat.py
@@ -3,6 +3,7 @@
 import os
 from pathlib import Path

+import dateutil.parser
 import numpy as np
 import pytest

@@ -214,3 +215,94 @@ def test_zero_variables(datapath):
     fname = datapath("io", "sas", "data", "zero_variables.sas7bdat")
     with pytest.raises(EmptyDataError):
         pd.read_sas(fname)
+
+
+def round_datetime_to_ms(ts):
+    if isinstance(ts, datetime):
+        return ts.replace(microsecond=int(round(ts.microsecond, -3) / 1000) * 1000)
+    elif isinstance(ts, str):
+        _ts = dateutil.parser.parse(timestr=ts)
+        return _ts.replace(microsecond=int(round(_ts.microsecond, -3) / 1000) * 1000)
+    else:
+        return ts
+
+
+def test_max_sas_date(datapath):
+    # GH 20927
+    # NB. max datetime in SAS dataset is 31DEC9999:23:59:59.999
+    #    but this is read as 29DEC9999:23:59:59.998993 by a buggy
+    #    sas7bdat module
+    fname = datapath("io", "sas", "data", "max_sas_date.sas7bdat")
+    df = pd.read_sas(fname, encoding="iso-8859-1")
+
+    # SAS likes to left pad strings with spaces - lstrip before comparing
+    df = df.applymap(lambda x: x.lstrip() if isinstance(x, str) else x)
+    # GH 19732: Timestamps imported from sas will incur floating point errors
+    try:
+        df["dt_as_dt"] = df["dt_as_dt"].dt.round("us")
+    except pd._libs.tslibs.np_datetime.OutOfBoundsDatetime:
+        df = df.applymap(round_datetime_to_ms)
+    except AttributeError:
+        df["dt_as_dt"] = df["dt_as_dt"].apply(round_datetime_to_ms)
+    # if there are any date/times > pandas.Timestamp.max then ALL in that chunk
+    # are returned as datetime.datetime
+    expected = pd.DataFrame(
+        {
+            "text": ["max", "normal"],
+            "dt_as_float": [253717747199.999, 1880323199.999],
+            "dt_as_dt": [
+                datetime(9999, 12, 29, 23, 59, 59, 999000),
+                datetime(2019, 8, 1, 23, 59, 59, 999000),
+            ],
+            "date_as_float": [2936547.0, 21762.0],
+            "date_as_date": [datetime(9999, 12, 29), datetime(2019, 8, 1)],
+        },
+        columns=["text", "dt_as_float", "dt_as_dt", "date_as_float", "date_as_date"],
+    )
+    tm.assert_frame_equal(df, expected)
+
+
+def test_max_sas_date_iterator(datapath):
+    # GH 20927
+    # when called as an iterator, only those chunks with a date > pd.Timestamp.max
+    # are returned as datetime.datetime, if this happens that whole chunk is returned
+    # as datetime.datetime
+    col_order = ["text", "dt_as_float", "dt_as_dt", "date_as_float", "date_as_date"]
+    fname = datapath("io", "sas", "data", "max_sas_date.sas7bdat")
+    results = []
+    for df in pd.read_sas(fname, encoding="iso-8859-1", chunksize=1):
+        # SAS likes to left pad strings with spaces - lstrip before comparing
+        df = df.applymap(lambda x: x.lstrip() if isinstance(x, str) else x)
+        # GH 19732: Timestamps imported from sas will incur floating point errors
+        try:
+            df["dt_as_dt"] = df["dt_as_dt"].dt.round("us")
+        except pd._libs.tslibs.np_datetime.OutOfBoundsDatetime:
+            df = df.applymap(round_datetime_to_ms)
+        except AttributeError:
+            df["dt_as_dt"] = df["dt_as_dt"].apply(round_datetime_to_ms)
+        df.reset_index(inplace=True, drop=True)
+        results.append(df)
+    expected = [
+        pd.DataFrame(
+            {
+                "text": ["max"],
+                "dt_as_float": [253717747199.999],
+                "dt_as_dt": [datetime(9999, 12, 29, 23, 59, 59, 999000)],
+                "date_as_float": [2936547.0],
+                "date_as_date": [datetime(9999, 12, 29)],
+            },
+            columns=col_order,
+        ),
+        pd.DataFrame(
+            {
+                "text": ["normal"],
+                "dt_as_float": [1880323199.999],
+                "dt_as_dt": [np.datetime64("2019-08-01 23:59:59.999")],
+                "date_as_float": [21762.0],
+                "date_as_date": [np.datetime64("2019-08-01")],
+            },
+            columns=col_order,
+        ),
+    ]
+    for result, expected in zip(results, expected):
+        tm.assert_frame_equal(result, expected)
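
A side note on the `round_datetime_to_ms` test helper: it passes the rounded microsecond count straight to `datetime.replace`, which only accepts values from 0 to 999999. The fixture values (for example .998993) round to 999000 and are fine, but an input at or above .999500 would round to 1000000 and appears likely to raise `ValueError`. The following is a hedged sketch of a variant that carries the overflow into the seconds field; `round_to_ms_with_carry` is a hypothetical name, and it rounds half up, whereas the built-in `round` used by the helper rounds half to even.

```python
from datetime import datetime, timedelta


def round_to_ms_with_carry(ts: datetime) -> datetime:
    """Hypothetical variant of the test helper (not part of this PR).

    Rounds to the nearest millisecond (half up) and lets timedelta
    arithmetic carry into seconds when the result reaches 1_000_000 us.
    Note: values within 0.5 ms of datetime.max would overflow.
    """
    rounded_us = (ts.microsecond + 500) // 1000 * 1000  # may be 1_000_000
    return ts.replace(microsecond=0) + timedelta(microseconds=rounded_us)


print(round_to_ms_with_carry(datetime(2019, 8, 1, 23, 59, 59, 998993)))
print(round_to_ms_with_carry(datetime(2019, 8, 1, 23, 59, 59, 999600)))
```

The first call yields the .999000 value the tests expect; the second rolls over to midnight of the following day instead of failing.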