StataReader selectively ignores format details #17797
Comments
cc @bashtage - I don't know anything about Stata but at a surface level this seems reasonable, I definitely welcome you to submit a PR.
Sounds reasonable -- seems like pure additional capability without any changes to existing features. Only small concern is whether it is easy to comprehensively test this feature?
Is the proposal to do 1 or 2? or 1 and 2? If it is -or- I think 2 is probably cleaner -- handle dates where dates are handled.
Unless requested otherwise I will pursue proposal 2. It does change existing behavior: a datetime will be returned instead of a numeric value. Will that be a problem? For comprehensive testing, what do you have in mind? I see a Fundamentally I know of two types of Stata date/time fields - those with day precision (val += 1 adds a single day) and with microsecond precision. Everything beyond that I understand only to be formatting options similar to Also there's the wrinkle of
I agree that this is a minor change in behavior, but I don't think many people would really want numbers for
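For readers less familiar with Stata's elapsed dates, the two families of fields discussed above can be sketched roughly as follows. This assumes Stata's documented convention that `%td` values count days and `%tc` values count milliseconds from 01jan1960; the helper names are hypothetical and this is not the pandas implementation.

```python
from datetime import datetime, timedelta

# Stata's reference point for elapsed date/time values (assumed: 01jan1960).
STATA_EPOCH = datetime(1960, 1, 1)


def stata_days_to_datetime(value):
    """Hypothetical helper: treat `value` as a %td-style day count."""
    return STATA_EPOCH + timedelta(days=value)


def stata_ms_to_datetime(value):
    """Hypothetical helper: treat `value` as a %tc-style millisecond count."""
    return STATA_EPOCH + timedelta(milliseconds=value)


if __name__ == "__main__":
    print(stata_days_to_datetime(1))        # one day after the epoch
    print(stata_ms_to_datetime(86_400_000)) # the same instant, in milliseconds
```

Under this reading, any extra characters after the `%td` or `%tc` prefix only affect how Stata displays the value, not how the stored number should be converted.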
(cherry picked from commit e886af5)
Code Sample, a copy-pastable example if possible
link to current commit
which allows for this check to succeed when format details are present
Problem description
It looks like `pandas` is attempting to do the right thing by dropping format codes from `%td` formats so that they will match the correct format in `_stata_elapsed_date_to_datetime_vec`.
I handle a large number of Stata files, and supporting arbitrary display formats only for `%td` is problematic.
Expected Output
Ideally none of these four assert statements would fail
Proposed Fixes
1. Strip formats from other types by adding extra list comprehensions in `_read_header`. My current workaround is to subclass `StataReader` and add this logic after `_read_header`.
2. Use `str.startswith`'s lesser-known tuple argument. In `_stata_elapsed_date_to_datetime_vec`, change lines like

    elif fmt in ["%td", "td", "%d", "d"]: # Delta days relative to base

   to

    elif fmt.startswith(("%td", "td", "%d", "d")): # Delta days relative to base

   This obviates the need for the list comprehensions in the first place.
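To illustrate the difference behind the second proposal, here is a minimal sketch outside pandas. The function names `classify_exact` and `classify_prefix` are hypothetical, and `%tdD_m_Y` is only an example of a `%td` format that carries display details.

```python
# Format prefixes taken from the branch quoted above.
DAY_FORMATS = ("%td", "td", "%d", "d")


def classify_exact(fmt):
    """Membership test: only matches a bare format with no display details."""
    return "days" if fmt in DAY_FORMATS else "unknown"


def classify_prefix(fmt):
    """Prefix test: str.startswith accepts a tuple and matches any entry."""
    return "days" if fmt.startswith(DAY_FORMATS) else "unknown"


if __name__ == "__main__":
    for fmt in ("%td", "%tdD_m_Y", "%tc", "%9.0g"):
        print(fmt, classify_exact(fmt), classify_prefix(fmt))
```

One thing to watch with the prefix approach: bare prefixes such as `d` match any string beginning with that letter, so the order of the `elif` branches and the set of prefixes in each tuple would need checking against the formats Stata can actually emit.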
Output of `pd.show_versions()`
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
Sample Files
STATA_DATES.DTA.zip the DTA file used for the Expected Output. It's a single row with 4 values
The file was generated from this CSV file with the following commands.
stata_dates.csv.zip