-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
to_stata: Fixed width strings in Stata .dta files are limited to 244 (or fewer) #16450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
cc @bashtage does this sound right to you? @raffamaiden I'm not very familar with stata's file format. Is it possible that the file-format is limited to 244 characters, while Stata itself can handle datasets with names > 244 characters in memory? |
That is correct. There is no support for the writing the variable length strings that was added in recent Stata versions, and so writing stata is limited to 244 ascii-like (latin-1, 8 byte fixed) characters. Longer strings were added in Stata 13 and there were further changes to written strings in Stata 14 (unicode only). Reading longer strings works fine thanks to @kshedden . Little reason to support these IMO since reading is more important than writing. |
Thanks. @raffamaiden do you have a need for this? If you're willing to implement the code to support writing longer strings, I'm sure it'd be accepted, but as @bashtage says it'll be pretty low priority on our end. |
Accepting this would mean users of older versions of STATA could not read
exported files.
…On Tue, 23 May 2017 at 14:15, Tom Augspurger ***@***.***> wrote:
Thanks.
@raffamaiden <https://github.com/raffamaiden> do you have a need for
this? If you're willing to implement the code to support writing longer
strings, I'm sure it'd be accepted, but as @bashtage
<https://github.com/bashtage> says it'll be pretty low priority on our
end.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16450 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFU5RRuX6SMB59HGuHhbUs35_nEQ-W-eks5r8tv0gaJpZM4NjU4U>
.
|
Oh, yeah didn't think about that. I'm going to close this for now then. We can revisit it in the future once compat with older versions of Stata is less important. |
Well I can save the file in Excel format and have STATA convert it to dta by running a STATA script in batch mode I still think it would be a nice feature to have You may add an argument to the to_stata() function asking the file format version you want to save into. Maybe have the lowest possible file format version set by default, and change the error message to something like "This file version doesn't support strings longer than 244 characters. Save in a more recent file format version, thus breaking compatibility with older versions of STATA" As to implement it by myself, i'm quite new with pandas and have no idea of the internal file structure of stata file. Maybe I can give it a try in the summer if I can ask support to pandas developers |
Ok. Most of that would start here Feel free to post here if / when you want to take it up. |
Giving this item a bump. Stata 15 has been out a while now. More information here : https://www.youtube.com/watch?time_continue=89&v=y6KZvm1oXAk Stata data type specs here: https://www.stata.com/help.cgi?data+types I'm a fan of the ideas from @raffam Would be a helpful feature for those that use both pandas and Stata. |
Thanks for bumping. Let's re-open it. Are there numbers on which versions of Stata are actually used? Should we care at all about anything older than Stata 15?
This also seems sensible. If it isn't too much additional effort to implement and maintain, then that's best. Otherwise, it's best to just make a clean break. |
My thought is that we should look at being compatible with Stata 14 and onward. Many still use 14. I believe there are a handful of folks using 13. Though it might not be necessary to forgo reverse compatibility. Without having given this a deep dive I would suspect that since the "strL" data type with 2billion characters has been around since Stata 13, it might be a possible to adjust the crosswalk at about line 1781 here without losing Stata 13 reverse compatibility. Also, perhaps the solution would be to modify |
Reading support for these is already in. As the saying goes, PRs are welcome for writing. |
The solution is more complicated than a cast. strL are stored in a different place in a lookup table. This is efficient but makes encoding them more complex than just changing a column type. Essentially one has to build up a strL dictionary and then write it at the end of the dta file in a particular (and idiosyncratic) format. Writing Stata is probably lower priority than reading Stata since CSV is an easy method to move data to Stata. Reading is useful since there are many published dta files on the web, and without a reader these are inaccessible. |
I took a further look at this and encoding strl's isn't that hard. The harder part of implementing this is that the current version supported in 114/115. The minimum for strl is 117. 117 makes a lot of changes the file format. The biggest change is the addition of tags surrounding regions. For example the header in 117 looks like
while the header in 114 is a 109 byte blob described as:
Adding StrL's requires a substantial rewrite of StataWriter, especially
See: https://www.stata.com/help.cgi?dta_114 |
A mostly working implementation is here: master...bashtage:strl-support I noticed one larger potential advantage of this -- when writing largish data files with strings > 8 characters StrLs can reduce file size significantly if there are many repeated values. They can also reduce files size when writing sparse strings again as long as the maximum string length is > 8 characters (this happens since blank strings are replaced with an 8 bute uinteger). |
Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450
Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450
Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450
Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450
Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450
Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450
Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450
Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450
Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450
* ENH: Add class to write dta format 117 files Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes #16450
Code Sample, a copy-pastable example if possible
Problem description
Raises the following error:
However this restriction seems to not exists in STATA, as the Excel file can be imported correctly
Open STATA, import the Excel file
Now we can get the type of data in column 'A', and as you can see, it's str250. So STATA can store string longer than 244 characters
A str250 %250s A
Expected Output
File gets exported with the correct format and without problems
Output of
pd.show_versions()
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.5.5
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.5.3
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1
The text was updated successfully, but these errors were encountered: