Skip to content

ENH: Add class to write dta format 117 files #20844

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 1, 2018

Conversation

bashtage
Copy link
Contributor

Add export for dta 117 files which add support for long strings
Refactor StataWriter to simplify new writer

closes #16450

@pep8speaks
Copy link

pep8speaks commented Apr 27, 2018

Hello @bashtage! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 01, 2018 at 18:57 Hours UTC

@codecov
Copy link

codecov bot commented Apr 27, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@ade293d). Click here to learn what that means.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #20844   +/-   ##
=========================================
  Coverage          ?   91.78%           
=========================================
  Files             ?      153           
  Lines             ?    49349           
  Branches          ?        0           
=========================================
  Hits              ?    45295           
  Misses            ?     4054           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.17% <80%> (?)
#single 41.93% <0%> (?)
Impacted Files Coverage Δ
pandas/core/frame.py 97.13% <80%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ade293d...2d54ded. Read the comment docs.

@bashtage bashtage changed the title ENH: Add class to write da format 117 files ENH: Add class to write dta format 117 files Apr 28, 2018
@bashtage bashtage force-pushed the strl-support branch 4 times, most recently from 927beda to e87e64c Compare April 29, 2018 09:09
@bashtage
Copy link
Contributor Author

Is it intentional that stata.py is not covered? Seems a bit strange.

@jreback jreback added the IO Stata read_stata, to_stata label Apr 29, 2018
@jreback
Copy link
Contributor

jreback commented Apr 29, 2018

Is it intentional that stata.py is not covered? Seems a bit strange.

what does this mean? the coverage that prints on the issue is odd (and has always been that way), but you can click thru and see the actual reports.


convert_strl : list, optional
List of column names to convert to string columns to Stata StrL
format. Only available if version is 117. Storign strings in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing

@@ -1801,6 +1802,23 @@ def to_stata(self, fname, convert_dates=None, write_index=True,

.. versionadded:: 0.19.0

version : {114, 117}
dta version to use in the output file. Version 114 can be used
read by Stata 10 and later. Version 117 can be read by Stata 13
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just support version 117 and forward? how old are these respective versions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old version is supported from 2007+. The new version is 2013+. If it was only possible to support 1 version I would stick with the old version since compatibility is more important than features for an export format IMO.

The biggest advantage of this PR is that it provides a stepping stone to supporting future export formats which have more useful features for full compatibility with pandas, like unicode support.

Copy link
Contributor

@jreback jreback Apr 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much simplfication do we get from only supporting new format? (2013 is pretty 'old' though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would probably drop ~100-200 lines that have been are overridden by the methods in the StataWriter117 class from the base StataWriter class. The formats are fairly similar in terms of how the binary blob parts are stored and so this is all shared.

@bashtage
Copy link
Contributor Author

what does this mean? the coverage that prints on the issue is odd (and has always been that way), but you can click thru and see the actual reports.

For some reasn it isn't showing up in codecov at all.

See

https://codecov.io/gh/pandas-dev/pandas/pull/20844/tree/pandas/io

for coverage fo this PR. stata.py is missng as are other files in io.py.

The master coverage is also wrong.

https://codecov.io/gh/pandas-dev/pandas/tree/master/pandas/io

Might be a bug in codecov.

excel, pickle, stata, and sql are all missing.

@bashtage bashtage force-pushed the strl-support branch 5 times, most recently from 13d2897 to d13e32d Compare May 1, 2018 06:58
bashtage added 2 commits May 1, 2018 08:33
Add export for dta 117 files which add support for long strings
Refactor StataWriter to simplify new writer

closes pandas-dev#16450
Fix typo
Enhance compliance of related docstrings usign validator
Fix incorrect skipping in strl writer
Fix incorrect byteorder when exporting bigendian
Fix incorrect byteorder parsing when importing bigendian
Improve test coverage for errors
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments. Might consiser renaming StataWriter to StataWriter114 (and essentially make it private). Otherwise lgtm.

# byteorder
bio.write(self._tag(byteorder == ">" and "MSF" or "LSF", 'byteorder'))
# number of vars, 2 bytes
assert self.nvar < 2 ** 16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a blank line before comments, easier to read


import numpy as np
import pytest
from pandas._libs.tslib import NaT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import from pandas

@jreback jreback added this to the 0.23.0 milestone May 1, 2018
@bashtage
Copy link
Contributor Author

bashtage commented May 1, 2018

minor comments. Might consiser renaming StataWriter to StataWriter114 (and essentially make it private). Otherwise lgtm.

I think this would need a dep cycle since StataWriter has been public. Can do this if you want.

Fix import location and add whitespace
@bashtage
Copy link
Contributor Author

bashtage commented May 1, 2018

@jreback Green and ready.

@TomAugspurger TomAugspurger mentioned this pull request May 1, 2018
71 tasks
@@ -1801,6 +1802,23 @@ def to_stata(self, fname, convert_dates=None, write_index=True,

.. versionadded:: 0.19.0

version : {114, 117}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, should these be strings? Is this exposed anywhere in stats itself? Do they use integers? (when I see version number, I think string).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, https://www.stata.com/support/faqs/data-management/save-for-previous-version/ seems to suggest that stata uses integers? version(13). OK then, let's follow that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use 10 and 13 which are the Stata release versions.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple small comments. OK to do as a followup (I'm tagging the RC momentarily.)

writer = StataWriter(fname, self, convert_dates=convert_dates,
kwargs = {}
if version not in (114, 117):
raise ValueError('Only formats 114 and 117 supported.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to include the user passed version in the error message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can push in a little bit.

columns=['long1' * 10, 'long', 1])
original.index.name = 'index'

with warnings.catch_warnings(record=True) as w: # noqa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What warnings are you catching here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An invalid name. The writer mungs the name to comply with Stata rules and issues a warning.

@TomAugspurger
Copy link
Contributor

@jreback are you OK with merging this as is? I'll fix the git conflict on merge.

@jreback
Copy link
Contributor

jreback commented May 1, 2018

yep this is fine
if we need follow ups after RC that is fine

@TomAugspurger
Copy link
Contributor

K. @bashtage I'm going to fix the conflict and then merge. A followup PR would be welcome.

@TomAugspurger TomAugspurger merged commit 93e7123 into pandas-dev:master May 1, 2018
@bashtage bashtage deleted the strl-support branch June 2, 2018 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Stata read_stata, to_stata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_stata: Fixed width strings in Stata .dta files are limited to 244 (or fewer)
4 participants