ENH: Add class to write dta format 117 files #20844
Changes from 3 commits
d54541a
900c9f7
a5f1653
4397ae7
2d54ded
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1769,27 +1769,28 @@ def to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='', | |
|
||
def to_stata(self, fname, convert_dates=None, write_index=True, | ||
encoding="latin-1", byteorder=None, time_stamp=None, | ||
data_label=None, variable_labels=None): | ||
data_label=None, variable_labels=None, version=114, | ||
convert_strl=None): | ||
""" | ||
A class for writing Stata binary dta files from array-like objects | ||
Export Stata binary dta files. | ||
|
||
Parameters | ||
---------- | ||
fname : str or buffer | ||
String path of file-like object | ||
String path or file-like object. | ||
convert_dates : dict | ||
Dictionary mapping columns containing datetime types to stata | ||
internal format to use when writing the dates. Options are 'tc', | ||
'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer | ||
or a name. Datetime columns that do not have a conversion type | ||
specified will be converted to 'tc'. Raises NotImplementedError if | ||
a datetime column has timezone information | ||
a datetime column has timezone information. | ||
write_index : bool | ||
Write the index to Stata dataset. | ||
encoding : str | ||
Default is latin-1. Unicode is not supported | ||
Default is latin-1. Unicode is not supported. | ||
byteorder : str | ||
Can be ">", "<", "little", or "big". default is `sys.byteorder` | ||
Can be ">", "<", "little", or "big". default is `sys.byteorder`. | ||
time_stamp : datetime | ||
A datetime to use as file creation date. Default is the current | ||
time. | ||
|
@@ -1801,6 +1802,23 @@ def to_stata(self, fname, convert_dates=None, write_index=True, | |
|
||
.. versionadded:: 0.19.0 | ||
|
||
version : {114, 117} | ||
Version to use in the output dta file. Version 114 can be | ||
read by Stata 10 and later. Version 117 can be read by Stata 13 | ||
Should we just support version 117 and forward? How old are these respective versions?
The old version is supported from 2007+. The new version is 2013+. If it was only possible to support 1 version I would stick with the old version, since compatibility is more important than features for an export format IMO. The biggest advantage of this PR is that it provides a stepping stone to supporting future export formats which have more useful features for full compatibility with pandas, like unicode support.
How much simplification do we get from only supporting the new format? (2013 is pretty 'old' though)
Would probably drop ~100-200 lines that are overridden by the methods in the StataWriter117 class from the base StataWriter class. The formats are fairly similar in terms of how the binary blob parts are stored, so this is all shared. |
||
or later. Version 114 limits string variables to 244 characters or | ||
fewer while 117 allows strings with lengths up to 2,000,000 | ||
characters. | ||
|
||
.. versionadded:: 0.23.0 | ||
|
||
convert_strl : list, optional | ||
List of names of string columns to convert to Stata StrL | ||
format. Only available if version is 117. Storing strings in the | ||
StrL format can produce smaller dta files if strings have more than | ||
8 characters and values are repeated. | ||
|
||
.. versionadded:: 0.23.0 | ||
|
||
Raises | ||
------ | ||
NotImplementedError | ||
|
@@ -1814,6 +1832,12 @@ def to_stata(self, fname, convert_dates=None, write_index=True, | |
|
||
.. versionadded:: 0.19.0 | ||
|
||
See Also | ||
-------- | ||
pandas.read_stata : Import Stata data files | ||
pandas.io.stata.StataWriter : low-level writer for Stata data files | ||
pandas.io.stata.StataWriter117 : low-level writer for version 117 files | ||
|
||
Examples | ||
-------- | ||
>>> data.to_stata('./data_file.dta') | ||
|
@@ -1832,12 +1856,23 @@ def to_stata(self, fname, convert_dates=None, write_index=True, | |
>>> writer = StataWriter('./date_data_file.dta', data, {2 : 'tw'}) | ||
>>> writer.write_file() | ||
""" | ||
from pandas.io.stata import StataWriter | ||
writer = StataWriter(fname, self, convert_dates=convert_dates, | ||
kwargs = {} | ||
if version not in (114, 117): | ||
raise ValueError('Only formats 114 and 117 supported.') | ||
Would be nice to include the user passed
I can push in a little bit. |
||
if version == 114: | ||
if convert_strl is not None: | ||
raise ValueError('strl support is only available when using ' | ||
'format 117') | ||
from pandas.io.stata import StataWriter as statawriter | ||
else: | ||
from pandas.io.stata import StataWriter117 as statawriter | ||
kwargs['convert_strl'] = convert_strl | ||
|
||
writer = statawriter(fname, self, convert_dates=convert_dates, | ||
encoding=encoding, byteorder=byteorder, | ||
time_stamp=time_stamp, data_label=data_label, | ||
write_index=write_index, | ||
variable_labels=variable_labels) | ||
variable_labels=variable_labels, **kwargs) | ||
writer.write_file() | ||
|
||
def to_feather(self, fname): | ||
|
Hmm, should these be strings? Is this exposed anywhere in Stata itself? Do they use integers? (When I see a version number, I think string.)
Huh, https://www.stata.com/support/faqs/data-management/save-for-previous-version/ seems to suggest that Stata uses integers? `version(13)`. OK then, let's follow that.
Could use 10 and 13 which are the Stata release versions.
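To make the trade-off in this last suggestion concrete, here is a rough sketch of how Stata release numbers could be translated to dta format numbers. The mapping is taken only from the docstring in this PR (format 114 readable by Stata 10 and later, format 117 by Stata 13 and later); `MIN_RELEASE_FOR_FORMAT` and `format_for_release` are hypothetical names and are not part of pandas.

```python
# Minimum Stata release able to read each dta format, as stated in the
# docstring of this PR. Assumption: no other formats are considered.
MIN_RELEASE_FOR_FORMAT = {114: 10, 117: 13}


def format_for_release(release):
    """Return the newest dta format that the given Stata release can read."""
    readable = [
        fmt
        for fmt, min_release in MIN_RELEASE_FOR_FORMAT.items()
        if release >= min_release
    ]
    if not readable:
        raise ValueError(
            "No supported dta format for Stata release {!r}.".format(release)
        )
    return max(readable)
```

With a helper like this, a caller could pass `format_for_release(13)` as `version`, while the keyword itself keeps accepting the format integers 114 and 117.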