Skip to content

StataReader selectively ignores format details #17797

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
miker985 opened this issue Oct 5, 2017 · 5 comments · Fixed by #17990
Closed

StataReader selectively ignores format details #17797

miker985 opened this issue Oct 5, 2017 · 5 comments · Fixed by #17990
Labels
Datetime Datetime data dtype IO Stata read_stata, to_stata
Milestone

Comments

@miker985
Copy link
Contributor

miker985 commented Oct 5, 2017

Code Sample, a copy-pastable example if possible

link to current commit

    def _read_header(self):
        # ...
        # remove format details from %td
        self.fmtlist = ["%td" if x.startswith("%td") else x
                        for x in self.fmtlist]

which allows for this check to succeed when format details are present

def _stata_elapsed_date_to_datetime_vec(dates, fmt):
    # ...
    elif fmt in ["%td", "td", "%d", "d"]:  # Delta days relative to base
        # ...

Problem description

It looks like pandas is attempting to the right thing by dropping format codes from %td formats so that they will match the correct format in _stata_elapsed_date_to_datetime_vec

I handle a large number of Stata files and arbitrarily supporting display formats with only %td is problematic.

Expected Output

Ideally none of these four assert statements would fail

import datetime as dt
import pandas

# File attached below
d1, d2, d3, d4 = pandas.read_stata('STATA_DATES.DTA').iloc[0]

expected = dt.datetime(2006, 11, 21)

assert d1 == expected  # fails    - format is %dD_m_Y
assert d2 == expected  # succeeds - format is %tdD_m_Y
assert d3 == expected  # fails    - format is %tcD_m_Y
assert d4 == expected  # succeeds - format is %tc

Proposed Fixes

  1. Strip formats from other types by adding extra list comprehensions in _read_header. My current workaround is to subclass StataReader and add this logic after _read_header.

  2. Use str.startswith's lesser-known tuple argument
    In _stata_elapsed_date_to_datetime_vec change lines like
    elif fmt in ["%td", "td", "%d", "d"]: # Delta days relative to base
    to
    elif fmt.startswith(("%td", "td", "%d", "d")): # Delta days relative to base
    this obviates the need for the list comprehensions in the first place.

Output of pd.show_versions()

>>> pandas.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

Sample Files

STATA_DATES.DTA.zip the DTA file used for the Expected Output. It's a single row with 4 values

stata date types

The file was generated from this CSV file with the following commands.
stata_dates.csv.zip

load ~/tmp/stata_dates.csv
format date_d %dD_m_Y
format date_td %tdD_m_Y


format datetime_format %tcD_m_Y
format datetime_noformat %tc
save ~/tmp/STATA_DATE_D_TD.DTA
@chris-b1
Copy link
Contributor

chris-b1 commented Oct 6, 2017

cc @bashtage - I don't know anything about Stata but at a surface level this seems reasonable, I definitely welcome you to submit a PR.

@chris-b1 chris-b1 added the IO Stata read_stata, to_stata label Oct 6, 2017
@bashtage
Copy link
Contributor

bashtage commented Oct 6, 2017

Sounds reasonable -- seems like pure additional capability without any changes to existing features. Only small concern is whether it is easy to comprehensively test this feature?

@bashtage
Copy link
Contributor

bashtage commented Oct 6, 2017

Is the proposal to do 1 or 2? or 1 and 2? If it is -or- I think 2 is probably cleaner -- handle dates where dates are handled.

@miker985
Copy link
Contributor Author

miker985 commented Oct 6, 2017

Unless requested otherwise I will pursue proposal 2.

It does change existing behavior - instead of returning a numeric value a datetime will be returned instead. Will that be a problem?

For comprehensive testing, what do you have in mind? I see a pandas/tests/io/data directory containing fixtures for (among others) stata. I'm happy to generate a file similar to the attached earlier with a larger set of fields.

Fundamentally I know of two types of Stata date/time fields - those with day precision (val += 1 adds a single day) and with microsecond precision.

Everything beyond that I understand only to be formatting options similar to strftime.

Also there's the wrinkle of %tC accounting for leap seconds and %tc ignoring leap seconds, but pandas already handles this

@jreback jreback added this to the Next Major Release milestone Oct 8, 2017
@bashtage
Copy link
Contributor

I agree that this is a minor change in behavior, but I don't think many people would really want numbers for %tdD...

miker985 added a commit to miker985/pandas that referenced this issue Oct 26, 2017
@jreback jreback modified the milestones: Next Major Release, 0.21.1 Oct 27, 2017
miker985 added a commit to miker985/pandas that referenced this issue Oct 27, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Dec 8, 2017
TomAugspurger pushed a commit that referenced this issue Dec 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype IO Stata read_stata, to_stata
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants