ENH: Improve `OverflowError` message when `pandas.read_stata` fails to convert dates #36096

miker985 · 2020-09-03T17:00:40Z

Is your feature request related to a problem?

Calling pd.read_stata on a DTA file with a bad date format code leads to a minimally useful error message. The error is raised from this line.

What I see:

>>> import pandas as pd
>>> pd.read_stata("date_overflow.DTA")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 1933, in read_stata
    data = reader.read()
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 1682, in read
    data[col], self.fmtlist[i]
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 333, in _stata_elapsed_date_to_datetime_vec
    conv_dates = convert_delta_safe(base, days, "d")
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 294, in convert_delta_safe
    values = [base + relativedelta(days=int(d)) for d in deltas]
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 294, in <listcomp>
    values = [base + relativedelta(days=int(d)) for d in deltas]
  File "/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/dateutil/relativedelta.py", line 405, in __radd__
    return self.__add__(other)
  File "/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/dateutil/relativedelta.py", line 392, in __add__
    microseconds=self.microseconds))
OverflowError: date value out of range

Describe the solution you'd like

I'd like a clearer message indicating the problem column. It would be easy to catch the OverflowError in the loop doing the date conversion and include the column name.

Desired detailed error: OverflowError: Failed converting column tiempo_gen for format %dD_m_Y: date value out of range
Alternative error: OverflowError: date value out of range for column tiempo_gen
date_overflow.DTA.gz

I would be happy to submit a patch for this if the solution is approved.

API breaking implications

I don't believe this would break or alter any APIs

Describe alternatives you've considered

Add a flag skip_invalid_dates which changes behavior to skip converting dates IFF they would error. Instead of raising an exception it would be caught and a warning emitted along the lines of "date value out of range for column tiempo_gen - skipping date conversion". This would allow users access to as much converted data as possible. This flag would default to False and be backwards compatible with the APIs.
- I would be happy to submit a patch for this if approved
Refactor the pandas.io.stata API to allow for manual date conversion without using private functions.

Additional context

Output of pd.show_versions()

/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/setuptools/distutils_patch.py:26: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first. "Distutils was imported before Setuptools. This usage is discouraged "

INSTALLED VERSIONS

commit : 76f74d5
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-112-generic
Version : #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0.dev0+230.g76f74d53b.dirty
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

This was the test I put together to hammer out all the details so I knew what was happening with the file

tl;dr

The column contains values in the millions
The date units are days, meaning the dates occur after the year 56,000

import pandas as pd


def test_loading_bad_format():
    """
    Demonstrate OverflowError is raised with pd.read_stata
    """
    dta = "date_overflow.DTA"

    try:
        pd.read_stata(dta)
    except OverflowError as e:
        assert len(e.args) == 1
        # this message is minimally useful.
        # it should include at minimum the column causing the error
        assert e.args[0] == "date value out of range"

    # demonstrate cause
    reader = pd.io.stata.StataReader(dta, convert_dates=False)
    # date conversion is the cause here
    df = reader.read()
    # for one column
    assert len(df.columns) == 1 and 'tiempo_gen' in df
    min_val = df['tiempo_gen'].min()
    # with very large values ...
    assert min_val > 20_588_866
    assert min_val < 20_588_867
    # ... and a format indicating the units are days
    # (20.5 million days is ~56,000 years)
    assert len(reader.fmtlist) == 1 and reader.fmtlist[0] == '%dD_m_Y'

Example file - a subset of a real life file I ran into this issue with: date_overflow.DTA.gz

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2020-09-04T11:02:26Z

cc @bashtage if you have thoughts.

bashtage · 2020-09-04T11:08:29Z

The flag would be OK but it should only be part of Stata Reader, read stata. Read stata api is already quite crowded and adding a rarely used feature would just make things worse.

Would it skip entire columns or just observations? If entire columns, would the columns raw values be returned? If single. Observations presumably these would be converted to NAT.

Also, what produced this dta file? Is the date really out of range or does Stata not comply with its own specification (I've seen this once before, so not impossible).

Finally, another option would be to

bashtage · 2020-09-04T11:10:44Z

The other option, possibly better would be to return object columns with datetime, when it data is out of range.

miker985 · 2020-09-04T14:51:49Z

The flag would be OK but it should only be part of Stata Reader, read stata. Read stata api is already quite crowded and adding a rarely used feature would just make things worse.

👍

Would it skip entire columns or just observations? If entire columns, would the columns raw values be returned? If single. Observations presumably these would be converted to NAT.... The other option, possibly better would be to return object columns with datetime, when it data is out of range.

Preference to returning datetime objects (great idea) with raw values second.

Also, what produced this dta file?

The file name indicates it comes from Mexican National Health and Nutrition Surveys (ENSANUT).

Is the date really out of range or does Stata not comply with its own specification (I've seen this once before, so not impossible).

I've opened the file with both Stata 13 and Stata 15. In both cases Stata displays the data as a numeric value in spite of the fact it carries a date format. Using Stata I dropped most rows and all but one column to create the file I uploaded and I received no warnings or errors.

bashtage · 2020-09-06T14:41:00Z

I've opened the file with both Stata 13 and Stata 15. In both cases Stata displays the data as a numeric value in spite of the fact it carries a date format. Using Stata I dropped most rows and all but one column to create the file I uploaded and I received no warnings or errors.

Does Stata display the other dates in the variable correctly, or is the entire column numeric?

miker985 · 2020-09-08T14:35:20Z

The minimum value is well into the future - they all display as numeric.

bashtage · 2020-09-08T15:26:56Z

Could you check what happens if you have some valid dates and some large values? Might help decide what return would have the highest fidelity with Stata.

miker985 · 2020-09-08T15:50:08Z

Setting 1 value to a reasonable date results in it displaying as a formatted date.

bashtage · 2020-09-08T15:54:07Z

This suggests a path where out-of-range are left unmodified and in-range are returned as object. Thinking about the api, it could be a 3-way argument: date_errors with choices "raise" (default), "ignore" (mixed object with datetime and number), "coerce" all converted to dates or NaT if out of range, with a standard datetime column.

miker985 · 2020-09-08T16:07:04Z

How about a fourth argument value that returns regular/slow datetime.datetime objects?

bashtage · 2020-09-08T16:09:20Z

Do even out-of-range are converted? the hard part is knowing whether this is correct since Stata seems to refuse to convert these values.

miker985 · 2020-09-08T16:36:33Z

Testing indicates STATA will not do date operations on these columns e.g., gen year = year(tiempo_gen) results in missing values for the unformatted distant future dates.

"raise" and "coerce" cover the cases where I don't want to investigate the data manually. As near as I can tell "coerce" most closely mirrors Stata's behavior.

For the case I want to investigate the data "ignore" leaves me wanting. There are no public methods in stata.py to do date conversion and I'm keen to have some way to be able to see what the dates would be in case manual inspection can provide context to correct the data e.g., by consulting the documentation that came with it.

bashtage · 2020-09-08T22:10:15Z

Well, you can always read the data without converting the dates. What is unclear to me is what sense does it make to convert dates that are outside of the stata spec?

miker985 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020

TomAugspurger added IO Stata read_stata, to_stata Datetime Datetime data dtype and removed Needs Triage Issue that has not been reviewed by a pandas team member Enhancement labels Sep 4, 2020

bashtage changed the title ~~Improve OverflowError message when pandas.read_stata fails to convert dates~~ ENH: Improve OverflowError message when pandas.read_stata fails to convert dates May 18, 2021

bashtage added the Enhancement label May 18, 2021

jbrockmendel added Error Reporting Incorrect or improved errors from pandas and removed Enhancement labels Jun 19, 2021

mroeschke added the Enhancement label Aug 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Improve `OverflowError` message when `pandas.read_stata` fails to convert dates #36096

ENH: Improve `OverflowError` message when `pandas.read_stata` fails to convert dates #36096

miker985 commented Sep 3, 2020

INSTALLED VERSIONS

TomAugspurger commented Sep 4, 2020

bashtage commented Sep 4, 2020

bashtage commented Sep 4, 2020

miker985 commented Sep 4, 2020

bashtage commented Sep 6, 2020

miker985 commented Sep 8, 2020

bashtage commented Sep 8, 2020

miker985 commented Sep 8, 2020

bashtage commented Sep 8, 2020

miker985 commented Sep 8, 2020

bashtage commented Sep 8, 2020

miker985 commented Sep 8, 2020

bashtage commented Sep 8, 2020

ENH: Improve OverflowError message when pandas.read_stata fails to convert dates #36096

ENH: Improve OverflowError message when pandas.read_stata fails to convert dates #36096

Comments

miker985 commented Sep 3, 2020

Is your feature request related to a problem?

Describe the solution you'd like

API breaking implications

Describe alternatives you've considered

Additional context

INSTALLED VERSIONS

TomAugspurger commented Sep 4, 2020

bashtage commented Sep 4, 2020

bashtage commented Sep 4, 2020

miker985 commented Sep 4, 2020

bashtage commented Sep 6, 2020

miker985 commented Sep 8, 2020

bashtage commented Sep 8, 2020

miker985 commented Sep 8, 2020

bashtage commented Sep 8, 2020

miker985 commented Sep 8, 2020

bashtage commented Sep 8, 2020

miker985 commented Sep 8, 2020

bashtage commented Sep 8, 2020

ENH: Improve `OverflowError` message when `pandas.read_stata` fails to convert dates #36096

ENH: Improve `OverflowError` message when `pandas.read_stata` fails to convert dates #36096