to_datetime cannot convert dates larger than 2262-04-11T00:00:00.000000000 #21972

Closed
paul-lilley opened this issue Jul 18, 2018 · 2 comments
Comments

@paul-lilley
Contributor

paul-lilley commented Jul 18, 2018

Code Sample, a copy-pastable example if possible

import datetime

import pandas as pd

d0 = datetime.date(1970, 1, 1)    # python epoch
d1 = datetime.date(9999, 12, 31)  # oft-used max-date in SAS / DB2

for days in range(106752, (d1 - d0).days):  # 2932896 is the number of days
    try:
        df = pd.DataFrame({'days_from_python_epoch': [days]})
        df['date_as_date'] = pd.to_datetime(
            df['days_from_python_epoch'], unit='d', origin='1970-01-01'
        )
    except pd.errors.OutOfBoundsDatetime:
        print(f'exception with {days}')
        z = d0 + datetime.timedelta(days=days - 1)
        print(f'looks like {days - 1} -> {z} is the max possible')  # 2262-04-11 is the max possible
        raise  # the very first value, days=106752, already raises

Problem description

This causes import of some SAS datasets to fail, and potentially imports/conversions from other sources (such as DB2) where an arbitrarily high date (often 9999-12-31) is used in data warehousing to mark currently valid rows with to_date=9999-12-31.

See related issue #20927 (reposted here because the problem is more generic than SAS dataset imports alone), and possibly #15836.

I'm pretty sure the root cause is that to_datetime() converts via nanoseconds (in _libs.tslibs.timedeltas.cast_from_unit), and very large dates expressed in nanoseconds do not fit into int64 space.
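A quick back-of-the-envelope check supports this (my own arithmetic, not pandas internals): int64 nanoseconds run out right around day 106751 after the epoch, which matches the 2262-04-11 bound seen above.

```python
import numpy as np

# int64 nanoseconds cover roughly +/- 292 years around the epoch.
ns_per_day = 24 * 60 * 60 * 10**9  # 86_400_000_000_000 ns in a day
max_days = np.iinfo(np.int64).max // ns_per_day
print(max_days)  # 106751, i.e. 1970-01-01 + 106751 days = 2262-04-11
```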

Expected Output

Correctly parsed dates up to and including 9999-12-31.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.1
pytest: 3.6.2
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.3
numpy: 1.14.5
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.5
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.4
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.5
lxml: 4.2.2
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jschendel
Member

This is the expected behavior. See the Timestamp Limitations section of the documentation, as well as the Representing Out-of-Bounds Spans section for a workaround using period_range/PeriodIndex.
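For reference, a minimal sketch of that workaround (based on the docs section linked above; the variable names are mine): daily periods are stored as int64 day counts rather than nanoseconds, so they reach far beyond the Timestamp limit.

```python
import pandas as pd

# Daily-frequency periods count days, not nanoseconds, so a span can
# run all the way to 9999-12-31 without overflowing int64.
span = pd.period_range(start='2262-01-01', end='9999-12-31', freq='D')
print(span[0], span[-1])
```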

@jschendel jschendel added this to the No action milestone Jul 18, 2018
@paul-lilley
Contributor Author

Thanks - the Representing Out-of-Bounds Spans section gives a good hint for how issue #20927 might be resolved.
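One possible shape for such a fix (a hypothetical sketch, not the actual patch for #20927): convert day counts with Period arithmetic instead of routing them through the nanosecond-based to_datetime path.

```python
import pandas as pd

# Hypothetical sketch: map days-from-epoch counts to daily Periods.
# 106752 is the first day count that overflows a ns Timestamp, and
# 2932896 corresponds to 9999-12-31; both convert fine as Periods.
epoch = pd.Period('1970-01-01', freq='D')
df = pd.DataFrame({'days_from_python_epoch': [0, 106752, 2932896]})
df['date_as_period'] = pd.PeriodIndex(
    [epoch + int(d) for d in df['days_from_python_epoch']]
)
print(df['date_as_period'].tolist())
```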
