-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add class to write dta format 117 files #20844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @bashtage! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on May 01, 2018 at 18:57 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #20844 +/- ##
=========================================
Coverage ? 91.78%
=========================================
Files ? 153
Lines ? 49349
Branches ? 0
=========================================
Hits ? 45295
Misses ? 4054
Partials ? 0
Continue to review full report at Codecov.
|
927beda
to
e87e64c
Compare
Is it intentional that |
what does this mean? the coverage that prints on the issue is odd (and has always been that way), but you can click thru and see the actual reports. |
pandas/core/frame.py
Outdated
|
||
convert_strl : list, optional | ||
List of column names to convert to string columns to Stata StrL | ||
format. Only available if version is 117. Storign strings in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing
pandas/core/frame.py
Outdated
@@ -1801,6 +1802,23 @@ def to_stata(self, fname, convert_dates=None, write_index=True, | |||
|
|||
.. versionadded:: 0.19.0 | |||
|
|||
version : {114, 117} | |||
dta version to use in the output file. Version 114 can be used | |||
read by Stata 10 and later. Version 117 can be read by Stata 13 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we just support version 117 and forward? how old are these respective versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old version is supported from 2007+. The new version is 2013+. If it was only possible to support 1 version I would stick with the old version since compatibility is more important than features for an export format IMO.
The biggest advantage of this PR is that it provides a stepping stone to supporting future export formats which have more useful features for full compatibility with pandas, like unicode support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how much simplfication do we get from only supporting new format? (2013 is pretty 'old' though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would probably drop ~100-200 lines that have been are overridden by the methods in the StataWriter117 class from the base StataWriter class. The formats are fairly similar in terms of how the binary blob parts are stored and so this is all shared.
For some reasn it isn't showing up in codecov at all. See https://codecov.io/gh/pandas-dev/pandas/pull/20844/tree/pandas/io for coverage fo this PR. stata.py is missng as are other files in io.py. The master coverage is also wrong. https://codecov.io/gh/pandas-dev/pandas/tree/master/pandas/io Might be a bug in codecov.
|
13d2897
to
d13e32d
Compare
Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450
Fix typo Enhance compliance of related docstrings usign validator
Fix incorrect skipping in strl writer Fix incorrect byteorder when exporting bigendian Fix incorrect byteorder parsing when importing bigendian Improve test coverage for errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments. Might consiser renaming StataWriter to StataWriter114 (and essentially make it private). Otherwise lgtm.
# byteorder | ||
bio.write(self._tag(byteorder == ">" and "MSF" or "LSF", 'byteorder')) | ||
# number of vars, 2 bytes | ||
assert self.nvar < 2 ** 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a blank line before comments, easier to read
pandas/tests/io/test_stata.py
Outdated
|
||
import numpy as np | ||
import pytest | ||
from pandas._libs.tslib import NaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import from pandas
I think this would need a dep cycle since |
Fix import location and add whitespace
@jreback Green and ready. |
@@ -1801,6 +1802,23 @@ def to_stata(self, fname, convert_dates=None, write_index=True, | |||
|
|||
.. versionadded:: 0.19.0 | |||
|
|||
version : {114, 117} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, should these be strings? Is this exposed anywhere in stats itself? Do they use integers? (when I see version number, I think string).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, https://www.stata.com/support/faqs/data-management/save-for-previous-version/ seems to suggest that stata uses integers? version(13)
. OK then, let's follow that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use 10 and 13 which are the Stata release versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple small comments. OK to do as a followup (I'm tagging the RC momentarily.)
writer = StataWriter(fname, self, convert_dates=convert_dates, | ||
kwargs = {} | ||
if version not in (114, 117): | ||
raise ValueError('Only formats 114 and 117 supported.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to include the user passed version
in the error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can push in a little bit.
columns=['long1' * 10, 'long', 1]) | ||
original.index.name = 'index' | ||
|
||
with warnings.catch_warnings(record=True) as w: # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What warnings are you catching here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An invalid name. The writer mungs the name to comply with Stata rules and issues a warning.
@jreback are you OK with merging this as is? I'll fix the git conflict on merge. |
yep this is fine |
K. @bashtage I'm going to fix the conflict and then merge. A followup PR would be welcome. |
Add export for dta 117 files which add support for long strings
Refactor StataWriter to simplify new writer
closes #16450
git diff upstream/master -u -- "*.py" | flake8 --diff