ENH: Add class to write dta format 117 files #20844

Merged
merged 5 commits into from
May 1, 2018
Changes from 3 commits
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -450,6 +450,8 @@ Other Enhancements
- Updated :meth:`DataFrame.to_gbq` and :meth:`pandas.read_gbq` signature and documentation to reflect changes from
the Pandas-GBQ library version 0.4.0. Adds intersphinx mapping to Pandas-GBQ
library. (:issue:`20564`)
- Added new writer for exporting Stata dta files in version 117, ``StataWriter117``. This format supports exporting strings with lengths up to 2,000,000 characters (:issue:`16450`)


.. _whatsnew_0230.api_breaking:

53 changes: 44 additions & 9 deletions pandas/core/frame.py
@@ -1769,27 +1769,28 @@ def to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='',

def to_stata(self, fname, convert_dates=None, write_index=True,
encoding="latin-1", byteorder=None, time_stamp=None,
data_label=None, variable_labels=None):
data_label=None, variable_labels=None, version=114,
convert_strl=None):
"""
A class for writing Stata binary dta files from array-like objects
Export Stata binary dta files.

Parameters
----------
fname : str or buffer
String path of file-like object
String path or file-like object.
convert_dates : dict
Dictionary mapping columns containing datetime types to stata
internal format to use when writing the dates. Options are 'tc',
'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer
or a name. Datetime columns that do not have a conversion type
specified will be converted to 'tc'. Raises NotImplementedError if
a datetime column has timezone information
a datetime column has timezone information.
write_index : bool
Write the index to Stata dataset.
encoding : str
Default is latin-1. Unicode is not supported
Default is latin-1. Unicode is not supported.
byteorder : str
Can be ">", "<", "little", or "big". default is `sys.byteorder`
Can be ">", "<", "little", or "big". Default is `sys.byteorder`.
time_stamp : datetime
A datetime to use as file creation date. Default is the current
time.
@@ -1801,6 +1802,23 @@ def to_stata(self, fname, convert_dates=None, write_index=True,

.. versionadded:: 0.19.0

version : {114, 117}
Contributor

Hmm, should these be strings? Is this exposed anywhere in Stata itself? Do they use integers? (When I see a version number, I think string.)

Contributor

Huh, https://www.stata.com/support/faqs/data-management/save-for-previous-version/ seems to suggest that Stata uses integers, e.g. version(13). OK then, let's follow that.

Contributor Author

Could use 10 and 13 which are the Stata release versions.
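
The correspondence under discussion (format 114 is readable by Stata 10 and later, format 117 by Stata 13 and later, per the docstring in this PR) can be sketched as a small lookup. The helper name `dta_format_for_release` and the table `MIN_RELEASE_FOR_FORMAT` are hypothetical and not part of pandas; the PR itself keeps 114 and 117 as the accepted values.

```python
# Hypothetical helper, not part of pandas: pick the newest dta format
# (of the two discussed in this PR) that a given Stata release can read.
# The pairs come from the docstring in this PR: format 114 is readable by
# Stata 10 and later, format 117 by Stata 13 and later.
MIN_RELEASE_FOR_FORMAT = {114: 10, 117: 13}


def dta_format_for_release(stata_release):
    """Return the newest supported dta format readable by ``stata_release``."""
    readable = [fmt for fmt, min_release in MIN_RELEASE_FOR_FORMAT.items()
                if stata_release >= min_release]
    if not readable:
        raise ValueError(
            'No supported dta format is readable by Stata release '
            '{0}'.format(stata_release))
    return max(readable)
```

Under these assumptions, a user targeting Stata 12 would get 114, and a user targeting Stata 13 or later would get 117.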

Version to use in the output dta file. Version 114 can be
read by Stata 10 and later. Version 117 can be read by Stata 13
Contributor

should we just support version 117 and forward? how old are these respective versions?

Contributor Author

The old version is supported from 2007+. The new version is 2013+. If it was only possible to support 1 version I would stick with the old version since compatibility is more important than features for an export format IMO.

The biggest advantage of this PR is that it provides a stepping stone to supporting future export formats which have more useful features for full compatibility with pandas, like unicode support.

Contributor
@jreback, Apr 30, 2018

how much simplification do we get from only supporting the new format? (2013 is pretty 'old' though)

Contributor Author

Would probably drop ~100-200 lines from the base StataWriter class that are overridden by methods in the StataWriter117 class. The formats are fairly similar in how the binary blob parts are stored, so that code is all shared.
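
The sharing described in this comment can be pictured as a base writer whose format-specific steps are overridden in a subclass. This is only a sketch: the class and method names (`SketchWriter114`, `SketchWriter117`, `_header`, `_data`) are made up for illustration and do not mirror the actual `StataWriter` internals, and the header bytes are placeholders rather than valid dta headers.

```python
class SketchWriter114(object):
    """Hypothetical base writer: shared data handling, own header."""

    def __init__(self, rows):
        self.rows = rows

    def _header(self):
        # Placeholder bytes, not a valid dta 114 header.
        return b'HEADER-114'

    def _data(self):
        # Shared by both formats in this sketch.
        return b''.join(row.encode('latin-1') for row in self.rows)

    def build(self):
        # Assemble the output from the format-specific and shared parts.
        return self._header() + self._data()


class SketchWriter117(SketchWriter114):
    """Hypothetical subclass: overrides only the part that differs."""

    def _header(self):
        # Placeholder bytes, not a valid dta 117 header.
        return b'HEADER-117'
```

In this picture, dropping the older format would remove only the overridden base-class methods, while the shared `_data` step would stay.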

or later. Version 114 limits string variables to 244 characters or
fewer while 117 allows strings with lengths up to 2,000,000
characters.

.. versionadded:: 0.23.0

convert_strl : list, optional
List of names of string columns to convert to Stata StrL format.
Only available if version is 117. Storing strings in the
StrL format can produce smaller dta files if strings have more than
8 characters and values are repeated.

.. versionadded:: 0.23.0

Raises
------
NotImplementedError
@@ -1814,6 +1832,12 @@ def to_stata(self, fname, convert_dates=None, write_index=True,

.. versionadded:: 0.19.0

See Also
--------
pandas.read_stata : Import Stata data files
pandas.io.stata.StataWriter : low-level writer for Stata data files
pandas.io.stata.StataWriter117 : low-level writer for version 117 files

Examples
--------
>>> data.to_stata('./data_file.dta')
@@ -1832,12 +1856,23 @@ def to_stata(self, fname, convert_dates=None, write_index=True,
>>> writer = StataWriter('./date_data_file.dta', data, {2 : 'tw'})
>>> writer.write_file()
"""
from pandas.io.stata import StataWriter
writer = StataWriter(fname, self, convert_dates=convert_dates,
kwargs = {}
if version not in (114, 117):
raise ValueError('Only formats 114 and 117 supported.')
Contributor

Would be nice to include the user passed version in the error message.

Contributor Author

I can push in a little bit.
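
One way to act on the reviewer's suggestion is to format the rejected value into the message. The function name `check_dta_version`, the constant `SUPPORTED_DTA_VERSIONS`, and the exact wording are assumptions for illustration, not the code that was eventually pushed.

```python
SUPPORTED_DTA_VERSIONS = (114, 117)


def check_dta_version(version):
    """Hypothetical helper: reject unsupported versions, naming the input."""
    if version not in SUPPORTED_DTA_VERSIONS:
        # {0!r} shows the value as passed, so the string '117' is
        # distinguishable from the integer 117 in the message.
        raise ValueError(
            'Only formats 114 and 117 supported; got {0!r}.'.format(version))
    return version
```

Using `repr` formatting also makes the earlier string-versus-integer question visible to the user: passing `'117'` as a string is rejected, and the quotes appear in the error.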

if version == 114:
if convert_strl is not None:
raise ValueError('strl support is only available when using '
'format 117')
from pandas.io.stata import StataWriter as statawriter
else:
from pandas.io.stata import StataWriter117 as statawriter
kwargs['convert_strl'] = convert_strl

writer = statawriter(fname, self, convert_dates=convert_dates,
encoding=encoding, byteorder=byteorder,
time_stamp=time_stamp, data_label=data_label,
write_index=write_index,
variable_labels=variable_labels)
variable_labels=variable_labels, **kwargs)
writer.write_file()

def to_feather(self, fname):