Skip to content

ENH: Improve OverflowError message when pandas.read_stata fails to convert dates #36096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
miker985 opened this issue Sep 3, 2020 · 13 comments
Labels
Datetime Datetime data dtype Enhancement Error Reporting Incorrect or improved errors from pandas IO Stata read_stata, to_stata

Comments

@miker985
Copy link
Contributor

miker985 commented Sep 3, 2020

Is your feature request related to a problem?

Calling pd.read_stata on a DTA file with a bad date format code leads to a minimally useful error message. The error is raised from this line.

What I see:

>>> import pandas as pd
>>> pd.read_stata("date_overflow.DTA")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 1933, in read_stata
    data = reader.read()
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 1682, in read
    data[col], self.fmtlist[i]
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 333, in _stata_elapsed_date_to_datetime_vec
    conv_dates = convert_delta_safe(base, days, "d")
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 294, in convert_delta_safe
    values = [base + relativedelta(days=int(d)) for d in deltas]
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 294, in <listcomp>
    values = [base + relativedelta(days=int(d)) for d in deltas]
  File "/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/dateutil/relativedelta.py", line 405, in __radd__
    return self.__add__(other)
  File "/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/dateutil/relativedelta.py", line 392, in __add__
    microseconds=self.microseconds))
OverflowError: date value out of range

Describe the solution you'd like

I'd like a clearer message indicating the problem column. It would be easy to catch the OverflowError in the loop doing the date conversion and include the column name.

Desired detailed error: OverflowError: Failed converting column tiempo_gen for format %dD_m_Y: date value out of range
Alternative error: OverflowError: date value out of range for column tiempo_gen
date_overflow.DTA.gz

I would be happy to submit a patch for this if the solution is approved.

API breaking implications

I don't believe this would break or alter any APIs

Describe alternatives you've considered

  • Add a flag skip_invalid_dates which changes behavior to skip converting dates IFF they would error. Instead of raising an exception it would be caught and a warning emitted along the lines of "date value out of range for column tiempo_gen - skipping date conversion". This would allow users access to as much converted data as possible. This flag would default to False and be backwards compatible with the APIs.

    • I would be happy to submit a patch for this if approved
  • Refactor the pandas.io.stata API to allow for manual date conversion without using private functions.

Additional context

Output of pd.show_versions()

/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/setuptools/distutils_patch.py:26: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first. "Distutils was imported before Setuptools. This usage is discouraged "

INSTALLED VERSIONS

commit : 76f74d5
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-112-generic
Version : #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0.dev0+230.g76f74d53b.dirty
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

This was the test I put together to hammer out all the details so I knew what was happening with the file

tl;dr

  • The column contains values in the millions
  • The date units are days, meaning the dates occur after the year 56,000
import pandas as pd


def test_loading_bad_format():
    """
    Demonstrate OverflowError is raised with pd.read_stata
    """
    dta = "date_overflow.DTA"

    try:
        pd.read_stata(dta)
    except OverflowError as e:
        assert len(e.args) == 1
        # this message is minimally useful.
        # it should include at minimum the column causing the error
        assert e.args[0] == "date value out of range"

    # demonstrate cause
    reader = pd.io.stata.StataReader(dta, convert_dates=False)
    # date conversion is the cause here
    df = reader.read()
    # for one column
    assert len(df.columns) == 1 and 'tiempo_gen' in df
    min_val = df['tiempo_gen'].min()
    # with very large values ...
    assert min_val > 20_588_866
    assert min_val < 20_588_867
    # ... and a format indicating the units are days
    # (20.5 million days is ~56,000 years)
    assert len(reader.fmtlist) == 1 and reader.fmtlist[0] == '%dD_m_Y'

Example file - a subset of a real life file I ran into this issue with: date_overflow.DTA.gz

@miker985 miker985 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020
@TomAugspurger
Copy link
Contributor

cc @bashtage if you have thoughts.

@TomAugspurger TomAugspurger added IO Stata read_stata, to_stata Datetime Datetime data dtype and removed Needs Triage Issue that has not been reviewed by a pandas team member Enhancement labels Sep 4, 2020
@bashtage
Copy link
Contributor

bashtage commented Sep 4, 2020

The flag would be OK but it should only be part of Stata Reader, read stata. Read stata api is already quite crowded and adding a rarely used feature would just make things worse.

Would it skip entire columns or just observations? If entire columns, would the columns raw values be returned? If single. Observations presumably these would be converted to NAT.

Also, what produced this dta file? Is the date really out of range or does Stata not comply with its own specification (I've seen this once before, so not impossible).

Finally, another option would be to

@bashtage
Copy link
Contributor

bashtage commented Sep 4, 2020

The other option, possibly better would be to return object columns with datetime, when it data is out of range.

@miker985
Copy link
Contributor Author

miker985 commented Sep 4, 2020

The flag would be OK but it should only be part of Stata Reader, read stata. Read stata api is already quite crowded and adding a rarely used feature would just make things worse.

👍

Would it skip entire columns or just observations? If entire columns, would the columns raw values be returned? If single. Observations presumably these would be converted to NAT.... The other option, possibly better would be to return object columns with datetime, when it data is out of range.

Preference to returning datetime objects (great idea) with raw values second.

Also, what produced this dta file?

The file name indicates it comes from Mexican National Health and Nutrition Surveys (ENSANUT).

Is the date really out of range or does Stata not comply with its own specification (I've seen this once before, so not impossible).

I've opened the file with both Stata 13 and Stata 15. In both cases Stata displays the data as a numeric value in spite of the fact it carries a date format. Using Stata I dropped most rows and all but one column to create the file I uploaded and I received no warnings or errors.

@bashtage
Copy link
Contributor

bashtage commented Sep 6, 2020

I've opened the file with both Stata 13 and Stata 15. In both cases Stata displays the data as a numeric value in spite of the fact it carries a date format. Using Stata I dropped most rows and all but one column to create the file I uploaded and I received no warnings or errors.

Does Stata display the other dates in the variable correctly, or is the entire column numeric?

@miker985
Copy link
Contributor Author

miker985 commented Sep 8, 2020

The minimum value is well into the future - they all display as numeric.

@bashtage
Copy link
Contributor

bashtage commented Sep 8, 2020

Could you check what happens if you have some valid dates and some large values? Might help decide what return would have the highest fidelity with Stata.

@miker985
Copy link
Contributor Author

miker985 commented Sep 8, 2020

image

Setting 1 value to a reasonable date results in it displaying as a formatted date.

@bashtage
Copy link
Contributor

bashtage commented Sep 8, 2020

This suggests a path where out-of-range are left unmodified and in-range are returned as object. Thinking about the api, it could be a 3-way argument: date_errors with choices "raise" (default), "ignore" (mixed object with datetime and number), "coerce" all converted to dates or NaT if out of range, with a standard datetime column.

@miker985
Copy link
Contributor Author

miker985 commented Sep 8, 2020

How about a fourth argument value that returns regular/slow datetime.datetime objects?

@bashtage
Copy link
Contributor

bashtage commented Sep 8, 2020

Do even out-of-range are converted? the hard part is knowing whether this is correct since Stata seems to refuse to convert these values.

@miker985
Copy link
Contributor Author

miker985 commented Sep 8, 2020

Testing indicates STATA will not do date operations on these columns e.g., gen year = year(tiempo_gen) results in missing values for the unformatted distant future dates.

"raise" and "coerce" cover the cases where I don't want to investigate the data manually. As near as I can tell "coerce" most closely mirrors Stata's behavior.

For the case I want to investigate the data "ignore" leaves me wanting. There are no public methods in stata.py to do date conversion and I'm keen to have some way to be able to see what the dates would be in case manual inspection can provide context to correct the data e.g., by consulting the documentation that came with it.

@bashtage
Copy link
Contributor

bashtage commented Sep 8, 2020

Well, you can always read the data without converting the dates. What is unclear to me is what sense does it make to convert dates that are outside of the stata spec?

@bashtage bashtage changed the title Improve OverflowError message when pandas.read_stata fails to convert dates ENH: Improve OverflowError message when pandas.read_stata fails to convert dates May 18, 2021
@jbrockmendel jbrockmendel added Error Reporting Incorrect or improved errors from pandas and removed Enhancement labels Jun 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Enhancement Error Reporting Incorrect or improved errors from pandas IO Stata read_stata, to_stata
Projects
None yet
Development

No branches or pull requests

5 participants