
read_sas OverflowError: int too big to convert with '31DEC9999'd formatted as DATE9. #20927


Closed
paul-lilley opened this issue May 2, 2018 · 8 comments · Fixed by #28047
Labels: Bug, IO SAS, SAS: read_sas

@paul-lilley (Contributor)

Python code to reproduce:

```python
import os
import pandas as pd

sasDS_works = 'highDate_works.sas7bdat'
sasDS_fails = 'highDate_fails.sas7bdat'
path = r'C:\temp'

print(sasDS_works)
df = pd.read_sas(os.path.join(path, sasDS_works))
print(df)

print(sasDS_fails)
df = pd.read_sas(os.path.join(path, sasDS_fails))
print(df)
```

SAS code used to create the test datasets:

```sas
data temp.highDate_works;
	highDate_num = '31DEC9999'd;
	highDate_date = '31DEC9999'd;
run;

data temp.highDate_fails;
	highDate_num = '31DEC9999'd;
	highDate_date = '31DEC9999'd;
	format highDate_date date9.;
run;
```

The sas7bdat files are zipped and uploaded.
highdate_works.zip

Problem description

```
Traceback (most recent call last):
File "pandas/_libs/tslib.pyx", line 2075, in pandas._libs.tslib.array_with_unit_to_datetime
File "pandas/_libs/tslibs/timedeltas.pyx", line 96, in pandas._libs.tslibs.timedeltas.cast_from_unit
OverflowError: int too big to convert

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/xxxx/aargh_sasData_31DEC9999.py", line 27, in
df = pd.read_sas(os.path.join(path, sasDS_fails))
File "C:\xxxx\AppData\Local\Continuum\Anaconda3\envs\ratabasePython\lib\site-packages\pandas\io\sas\sasreader.py", line 68, in read_sas
data = reader.read()
File "C:\xxx\AppData\Local\Continuum\Anaconda3\envs\ratabasePython\lib\site-packages\pandas\io\sas\sas7bdat.py", line 614, in read
rslt = self._chunk_to_dataframe()
File "C:\xxxx\AppData\Local\Continuum\Anaconda3\envs\ratabasePython\lib\site-packages\pandas\io\sas\sas7bdat.py", line 666, in _chunk_to_dataframe
origin="1960-01-01")
File "C:\xxxx\AppData\Local\Continuum\Anaconda3\envs\ratabasePython\lib\site-packages\pandas\core\tools\datetimes.py", line 373, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "C:\xxxx\AppData\Local\Continuum\Anaconda3\envs\ratabasePython\lib\site-packages\pandas\core\tools\datetimes.py", line 229, in _convert_listlike
errors=errors)
File "pandas/_libs/tslib.pyx", line 2001, in pandas._libs.tslib.array_with_unit_to_datetime
File "pandas/_libs/tslib.pyx", line 2078, in pandas._libs.tslib.array_with_unit_to_datetime
pandas._libs.tslib.OutOfBoundsDatetime: cannot convert input 2932894.0 with the unit 'd'
```

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added IO SAS SAS: read_sas Datetime Datetime data dtype labels May 8, 2018
@paul-lilley (Contributor Author)

I've dug a bit more into this.

The exception is thrown from
File "pandas/_libs/tslibs/timedeltas.pyx", line 237, in pandas._libs.tslibs.timedeltas.cast_from_unit
which tries to cast the number of days from the SAS epoch (midnight, 1960-01-01) to 9999-12-31 into nanoseconds and return it as an int64_t.
(As an aside, the correct number of days is 2932896, but 2932894 is passed in - I think this is a follow-through error from the SAS7BDAT package, which pandas uses: direct use of SAS7BDAT converts 9999-12-31 to 9999-12-29, i.e. the same two-day difference.)
In any case, 2932894 * 86400 * 1000000000 is too large for an int64_t, hence the "OverflowError: int too big to convert".

The question is: why convert a date difference in days to nanoseconds?
The likely answer is that _libs.tslibs.timedeltas.cast_from_unit is generic and is not used only for importing data from SAS.

According to http://support.sas.com/documentation/cdl/en/etsug/63939/HTML/default/viewer.htm#etsug_tsdata_sect006.htm,
SAS datetimes are the number of seconds from midnight, 1 January 1960.
All numerics in SAS are floating point, so milliseconds in a SAS datetime can be stored in the fractional part, but there is insufficient precision to store nanoseconds.
(See https://support.sas.com/resources/papers/proceedings16/SAS2802-2016.pdf for a suggested workaround.)
In other words, nanosecond granularity is never needed when converting SAS dates or datetimes.

Possible solutions:
a) return an int128_t from pandas._libs.tslibs.timedeltas.cast_from_unit
b) have a separate conversion for SAS dates - maybe just use what SAS7BDAT gives? (and fix the out-by-two-days error in SAS7BDAT rather than in pandas)

PS - Not sure the Timeseries label is appropriate for this.
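The overflow arithmetic described above can be checked directly in plain Python (a minimal check, independent of pandas internals):

```python
# Day count passed to cast_from_unit for '31DEC9999'd (from the traceback)
days = 2932894

# cast_from_unit scales days -> seconds -> nanoseconds before casting to int64_t
as_ns = days * 86_400 * 1_000_000_000

INT64_MAX = 2**63 - 1  # the largest value an int64_t can represent

print(as_ns)              # 253402041600000000000
print(as_ns > INT64_MAX)  # True - hence "OverflowError: int too big to convert"
```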

@paul-lilley (Contributor Author)

Oops - didn't mean to close.

@paul-lilley (Contributor Author)

The SAS import issue is NOT related to the format of the very high dates - i.e. they fail whether formatted as 31DEC9999 or as 9999-12-31 in SAS.
The root problem is the conversion via nanoseconds in _libs.tslibs.timedeltas.cast_from_unit, which is used by to_datetime().

@paul-lilley (Contributor Author)

Thanks to a comment from @jschendel on #21972, a possible resolution may be to catch the pandas._libs.tslib.OutOfBoundsDatetime exception and convert larger dates to Periods: https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-oob
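A hedged sketch of the catch-and-fall-back idea, using to_datetime's errors="coerce" (which turns out-of-bounds values into NaT) rather than Periods, which would need access to the raw values:

```python
import pandas as pd

# SAS stores dates as float64 day counts from the 1960-01-01 epoch;
# 2932894.0 is the out-of-range value from the traceback above
sas_days = pd.Series([0.0, 2932894.0])

# errors="coerce" yields NaT for values beyond Timestamp bounds instead of raising
dates = pd.to_datetime(sas_days, unit="D", origin="1960-01-01", errors="coerce")
print(dates)
```

The first value converts to 1960-01-01 and the out-of-range one becomes NaT instead of raising OutOfBoundsDatetime.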

@paul-lilley (Contributor Author)

On second thoughts, I think a better approach is exposing the convert_dates=bool parameter from the underlying SAS7BDATReader up to read_sas().
This retains current functionality but also allows catching the exception for very high datetimes/dates and re-reading with the dates/datetimes as SAS native float64 values which can then be converted/cleansed/capped as required.

I've forked, altered and filed a pull-request - first time I've done this so any feedback is most welcome.

@sunilkalmadi
Hi @paul-lilley
Thanks for your work here.
I am currently facing this issue. Did you solve the problem?

@paul-lilley (Contributor Author)

paul-lilley commented Aug 20, 2019

Hi @sunilkalmadi
Sorry for the long delay in replying - I've now developed a fix and will create a pull request for it.
The fix converts dates > 2262-04-11 to datetime.date rather than np.datetime64 so that they can be read in.
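A minimal sketch of that idea in plain Python (the helper name is illustrative; 2262-04-11 is the practical upper bound of a nanosecond-resolution pd.Timestamp):

```python
import datetime

SAS_EPOCH = datetime.date(1960, 1, 1)       # SAS day counts are days from this date
TIMESTAMP_MAX = datetime.date(2262, 4, 11)  # approx. upper bound of np.datetime64[ns]

def sas_days_to_date(days):
    """Convert a SAS day count to a datetime.date, which (unlike
    np.datetime64[ns]) can represent dates all the way to 9999-12-31."""
    return SAS_EPOCH + datetime.timedelta(days=int(days))

print(sas_days_to_date(0))  # 1960-01-01
```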

@langhimebaugh
Still an issue? Am I doing something wrong?

```python
import pandas as pd

df_highdate_works = pd.read_sas('highdate_works.sas7bdat')
print(df_highdate_works)
```

gives:

```
   highDate_num  highDate_date
0     2936547.0      2936547.0
```

but:

```python
df_highdate_fails = pd.read_sas('highdate_fails.sas7bdat')
```

raises:

```
OutOfBoundsDatetime: cannot convert input 2932894.0 with the unit 'd'
```

with:

```python
print(pd.__version__)
```

```
1.0.1
```

On another data set I get:

```
OutOfBoundsDatetime: cannot convert input 253402041600.0 with the unit 's'
```

@mroeschke mroeschke added Bug and removed Datetime Datetime data dtype labels Apr 5, 2020

6 participants