StataReader selectively ignores format details #17797

miker985 · 2017-10-05T19:51:03Z

Code Sample, a copy-pastable example if possible

    def _read_header(self):
        # ...
        # remove format details from %td
        self.fmtlist = ["%td" if x.startswith("%td") else x
                        for x in self.fmtlist]

which allows for this check to succeed when format details are present

def _stata_elapsed_date_to_datetime_vec(dates, fmt):
    # ...
    elif fmt in ["%td", "td", "%d", "d"]:  # Delta days relative to base
        # ...

Problem description

It looks like pandas is attempting to do the right thing by dropping format details from %td formats so that they match the check in _stata_elapsed_date_to_datetime_vec. However, it does this only for %td.

I handle a large number of Stata files, and supporting display format details only for %td is problematic.
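To make the mismatch concrete, here is a minimal sketch that mirrors the two snippets quoted above using plain lists. The helper names `normalize_like_read_header` and `is_recognized` are hypothetical, and the `%tc` membership list is an assumption made for illustration; this is not the pandas implementation itself.

```python
# Illustrative only: mimics the quoted logic, not the full pandas code path.
DAY_FORMATS = ["%td", "td", "%d", "d"]   # taken from the quoted elif branch
DATETIME_FORMATS = ["%tc", "tc"]         # assumed equivalent list for %tc


def normalize_like_read_header(fmtlist):
    """Strip display details, but only from formats that start with %td."""
    return ["%td" if x.startswith("%td") else x for x in fmtlist]


def is_recognized(fmt):
    """Exact-match check, as in the quoted `fmt in [...]` test."""
    return fmt in DAY_FORMATS or fmt in DATETIME_FORMATS


# The four formats used in the sample file from this report.
formats = ["%dD_m_Y", "%tdD_m_Y", "%tcD_m_Y", "%tc"]
normalized = normalize_like_read_header(formats)
recognized = {orig: is_recognized(norm) for orig, norm in zip(formats, normalized)}

for orig, ok in recognized.items():
    print(orig, "->", "converted" if ok else "left as a number")
```

Under these assumptions only `%tdD_m_Y` and `%tc` pass the exact-match check, which lines up with the two assertions that succeed in the Expected Output section.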

Expected Output

Ideally none of these four assert statements would fail

import datetime as dt
import pandas

# File attached below
d1, d2, d3, d4 = pandas.read_stata('STATA_DATES.DTA').iloc[0]

expected = dt.datetime(2006, 11, 21)

assert d1 == expected  # fails    - format is %dD_m_Y
assert d2 == expected  # succeeds - format is %tdD_m_Y
assert d3 == expected  # fails    - format is %tcD_m_Y
assert d4 == expected  # succeeds - format is %tc

Proposed Fixes

1. Strip formats from other types by adding extra list comprehensions in _read_header. My current workaround is to subclass StataReader and add this logic after _read_header.
2. Use str.startswith's lesser-known tuple argument. In _stata_elapsed_date_to_datetime_vec, change lines like
       elif fmt in ["%td", "td", "%d", "d"]: # Delta days relative to base
   to
       elif fmt.startswith(("%td", "td", "%d", "d")): # Delta days relative to base
   This obviates the need for the list comprehensions in the first place.
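The second proposal (prefix matching with `str.startswith` and a tuple) could look roughly like the sketch below. The function `classify_stata_date_format` and its return labels are hypothetical names used only for illustration; the real change would live inside `_stata_elapsed_date_to_datetime_vec`.

```python
def classify_stata_date_format(fmt):
    """Return a coarse type label for a Stata display format, or None.

    Hypothetical helper: prefix matching means trailing display details
    such as "D_m_Y" no longer prevent a match.
    """
    # str.startswith accepts a tuple and returns True if any prefix matches.
    if fmt.startswith(("%tc", "tc")):
        return "datetime"   # millisecond-based date-times
    if fmt.startswith(("%td", "td", "%d", "d")):
        return "days"       # delta days relative to the base date
    return None             # not a date format handled in this sketch


for fmt in ["%dD_m_Y", "%tdD_m_Y", "%tcD_m_Y", "%tc", "%9.0g"]:
    print(fmt, "->", classify_stata_date_format(fmt))
```

One caveat worth checking in a real patch: the bare `"d"` prefix matches any string that begins with `d`, so prefix matching is only safe if non-date formats starting with that letter cannot reach this branch.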

Output of `pd.show_versions()`

>>> pandas.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

Sample Files

STATA_DATES.DTA.zip the DTA file used for the Expected Output. It's a single row with 4 values

The file was generated from this CSV file with the following commands.
stata_dates.csv.zip

load ~/tmp/stata_dates.csv
format date_d %dD_m_Y
format date_td %tdD_m_Y


format datetime_format %tcD_m_Y
format datetime_noformat %tc
save ~/tmp/STATA_DATE_D_TD.DTA


chris-b1 · 2017-10-06T15:11:59Z

cc @bashtage - I don't know anything about Stata, but at a surface level this seems reasonable. You're definitely welcome to submit a PR.

bashtage · 2017-10-06T15:47:02Z

Sounds reasonable -- seems like pure additional capability without any changes to existing features. Only small concern is whether it is easy to comprehensively test this feature?

bashtage · 2017-10-06T16:54:07Z

Is the proposal to do 1 or 2, or both 1 and 2? If it is one or the other, I think 2 is probably cleaner -- handle dates where dates are handled.

miker985 · 2017-10-06T16:55:09Z

Unless requested otherwise I will pursue proposal 2.

It does change existing behavior: a datetime will be returned where a numeric value was returned before. Will that be a problem?

For comprehensive testing, what do you have in mind? I see a pandas/tests/io/data directory containing fixtures for (among others) Stata. I'm happy to generate a file similar to the one attached earlier, with a larger set of fields.

Fundamentally I know of two types of Stata date/time fields: those with day precision (val += 1 adds a single day) and those with millisecond precision.

Everything beyond that I understand only to be formatting options similar to strftime.

Also there's the wrinkle of %tC accounting for leap seconds and %tc ignoring leap seconds, but pandas already handles this.
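As a rough illustration of those two storage types, the sketch below converts elapsed values to datetimes with the standard library only. It relies on Stata's documented base date of 01jan1960, with %td counting days and %tc counting milliseconds; the function names are hypothetical and leap seconds (%tC) are not handled.

```python
import datetime as dt

# Stata's base date for elapsed date and date-time values.
STATA_EPOCH = dt.datetime(1960, 1, 1)


def stata_days_to_datetime(days):
    """%td-style value: whole days elapsed since 01jan1960."""
    return STATA_EPOCH + dt.timedelta(days=days)


def stata_ms_to_datetime(ms):
    """%tc-style value: milliseconds elapsed since 01jan1960 (no leap seconds)."""
    return STATA_EPOCH + dt.timedelta(milliseconds=ms)


# The date used in the Expected Output section of this report.
target = dt.datetime(2006, 11, 21)
days = (target - STATA_EPOCH).days
ms = days * 24 * 60 * 60 * 1000

print(days, stata_days_to_datetime(days))
print(ms, stata_ms_to_datetime(ms))
```

Everything after the type prefix (for example `D_m_Y`) affects only how Stata displays the value, so the same two conversions apply regardless of the display details.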

bashtage · 2017-10-10T08:29:57Z

I agree that this is a minor change in behavior, but I don't think many people would really want numbers for %tdD...


chris-b1 added the IO Stata read_stata, to_stata label Oct 6, 2017

jreback added Difficulty Intermediate Datetime Datetime data dtype labels Oct 8, 2017

jreback added this to the Next Major Release milestone Oct 8, 2017

miker985 added a commit to miker985/pandas that referenced this issue Oct 26, 2017

BUG: Fix parsing of stata dates (pandas-dev#17797)

9ab0681

miker985 mentioned this issue Oct 26, 2017

BUG: Fix parsing of stata dates (#17797) #17990

Merged

4 tasks

jreback modified the milestones: Next Major Release, 0.21.1 Oct 27, 2017

miker985 added a commit to miker985/pandas that referenced this issue Oct 27, 2017

BUG: Fix parsing of stata dates (pandas-dev#17797)

b5c2b48

jreback closed this as completed in #17990 Oct 31, 2017

jreback pushed a commit that referenced this issue Oct 31, 2017

BUG: Fix parsing of stata dates (#17797) (#17990)

e886af5

peterpanmj pushed a commit to peterpanmj/pandas that referenced this issue Oct 31, 2017

BUG: Fix parsing of stata dates (pandas-dev#17797) (pandas-dev#17990)

fc41f3b

No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

BUG: Fix parsing of stata dates (pandas-dev#17797) (pandas-dev#17990)

bcf8d57

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Dec 8, 2017

BUG: Fix parsing of stata dates (pandas-dev#17797) (pandas-dev#17990)

6912d25

(cherry picked from commit e886af5)

TomAugspurger pushed a commit that referenced this issue Dec 11, 2017

BUG: Fix parsing of stata dates (#17797) (#17990)

eb0deaa

(cherry picked from commit e886af5)
