to_stata: Fixed width strings in Stata .dta files are limited to 244 (or fewer) #16450

ghost · 2017-05-23T08:08:50Z

Code Sample, a copy-pastable example if possible

import pandas
frame = pandas.DataFrame({'A':['h'*250,'hi','hola']})
frame.to_excel("text.xlsx", index=False)
frame.to_stata("test.dta")

Problem description

Raises the following error:

ValueError:
Fixed width strings in Stata .dta files are limited to 244 (or fewer)
characters. Column 'A' does not satisfy this restriction.
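
To find out which columns would trip this check before calling to_stata, a rough sketch along these lines may help. The helper name `columns_over_limit` is made up for illustration and is not part of pandas, and it assumes the limit is counted in characters, as the error message states.

```python
import pandas as pd


def columns_over_limit(frame, limit=244):
    """Return {column: longest string length} for string columns above `limit`.

    Hypothetical helper for illustration; not a pandas function.
    """
    too_long = {}
    for name in frame.columns:
        column = frame[name]
        if not pd.api.types.is_string_dtype(column):
            continue  # numeric, datetime, etc. are not affected by the limit
        lengths = column.dropna().astype(str).str.len()
        if len(lengths) and int(lengths.max()) > limit:
            too_long[name] = int(lengths.max())
    return too_long


frame = pd.DataFrame({"A": ["h" * 250, "hi", "hola"], "B": [1, 2, 3]})
oversized = columns_over_limit(frame)
```

Here `oversized` maps each offending column name to the length of its longest string, which is enough to decide whether to truncate the column or export it some other way.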

However, this restriction does not seem to exist in STATA itself, since the Excel file can be imported correctly.
Open STATA and import the Excel file:

import excel "C:\data\tesi\software\text.xlsx", sheet("Sheet1") firstrow clear

Now we can check the type of the data in column 'A': as you can see, it is str250, so STATA can store strings longer than 244 characters.

. describe A

A str250 %250s A

Expected Output

File gets exported with the correct format and without problems

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.5.5
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.5.3
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1

TomAugspurger · 2017-05-23T12:27:40Z

cc @bashtage does this sound right to you?

@raffamaiden I'm not very familiar with Stata's file format. Is it possible that the file format is limited to 244 characters, while Stata itself can handle datasets with names > 244 characters in memory?

bashtage · 2017-05-23T13:12:22Z

That is correct. There is no support for writing the variable-length strings that were added in recent Stata versions, so writing Stata files is limited to 244 ASCII-like (latin-1, 8-bit fixed-width) characters. Longer strings were added in Stata 13, and there were further changes to written strings in Stata 14 (Unicode only).

Reading longer strings works fine thanks to @kshedden .

Little reason to support these IMO since reading is more important than writing.

TomAugspurger · 2017-05-23T13:15:21Z

Thanks.

@raffamaiden do you have a need for this? If you're willing to implement the code to support writing longer strings, I'm sure it'd be accepted, but as @bashtage says it'll be pretty low priority on our end.

bashtage · 2017-05-23T13:19:35Z

Accepting this would mean users of older versions of STATA could not read exported files.

TomAugspurger · 2017-05-23T13:32:50Z

Oh, yeah didn't think about that.

I'm going to close this for now then. We can revisit it in the future once compat with older versions of Stata is less important.

ghost · 2017-05-23T14:07:54Z

Well I can save the file in Excel format and have STATA convert it to dta by running a STATA script in batch mode

I still think it would be a nice feature to have

You may add an argument to the to_stata() function asking the file format version you want to save into.

Maybe have the lowest possible file format version set by default, and change the error message to something like "This file version doesn't support strings longer than 244 characters. Save in a more recent file format version, thus breaking compatibility with older versions of STATA"
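
The proposal above could be sketched roughly as follows. Everything here is hypothetical: `check_string_width` and `FIXED_WIDTH_LIMIT` are invented names, not pandas internals, and treating any version missing from the table as unrestricted is an assumption made only to keep the example short.

```python
# Hypothetical sketch: none of these names exist in pandas.
FIXED_WIDTH_LIMIT = {114: 244}  # 244 is the limit quoted in the error message


def check_string_width(column, longest, version=114):
    """Raise if `longest` exceeds the fixed-width limit assumed for `version`."""
    limit = FIXED_WIDTH_LIMIT.get(version)  # None means "no limit assumed"
    if limit is not None and longest > limit:
        raise ValueError(
            "Column {!r} holds strings of up to {} characters, but file format "
            "version {} doesn't support strings longer than {} characters. "
            "Save in a more recent file format version; note that this breaks "
            "compatibility with older versions of Stata.".format(
                column, longest, version, limit
            )
        )


# The default (oldest) format rejects the 250-character column from the report.
try:
    check_string_width("A", 250)
    message = None
except ValueError as err:
    message = str(err)
```

Defaulting to the oldest format keeps existing files readable everywhere, while the error text tells the user exactly which option to change.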

As for implementing it myself, I'm quite new to pandas and have no idea about the internal structure of Stata files.

Maybe I can give it a try in the summer, if I can ask the pandas developers for support.

TomAugspurger · 2017-05-23T14:13:10Z

Ok. Most of that would start here

Feel free to post here if / when you want to take it up.

adamrossnelson · 2018-04-20T03:04:13Z

Giving this item a bump. Stata 15 has been out a while now.
Also, since Stata 13 there has been a supported "strL" data type.
This data type can accommodate 2,000,000,000 characters of text or binary data.

More information here : https://www.youtube.com/watch?time_continue=89&v=y6KZvm1oXAk

Stata data type specs here: https://www.stata.com/help.cgi?data+types

I'm a fan of the ideas from @raffam

Would be a helpful feature for those that use both pandas and Stata.
Count me in as a collaborator when it is time to make this improvement.

TomAugspurger · 2018-04-20T03:08:33Z

Thanks for bumping. Let's re-open it.

Are there numbers on which versions of Stata are actually used? Should we care at all about anything older than Stata 15?

You may add an argument to the to_stata() function asking the file format version you want to save into.

This also seems sensible. If it isn't too much additional effort to implement and maintain, then that's best. Otherwise, it's best to just make a clean break.

adamrossnelson · 2018-04-20T04:55:01Z

My thought is that we should look at being compatible with Stata 14 and onward. Many still use 14. I believe there are a handful of folks using 13.

Though it might not be necessary to forgo reverse compatibility. Without having given this a deep dive, I would suspect that since the "strL" data type with 2 billion characters has been around since Stata 13, it might be possible to adjust the crosswalk at about line 1781 here without losing Stata 13 reverse compatibility.

Also, perhaps the solution would be to modify def _cast_to_stata_types(data): line 491 here so that it will explicitly cast string data types?

bashtage · 2018-04-20T06:57:19Z

Reading support for these is already in. As the saying goes, PRs are welcome for writing.

bashtage · 2018-04-20T07:02:21Z

The solution is more complicated than a cast. strL are stored in a different place in a lookup table. This is efficient but makes encoding them more complex than just changing a column type. Essentially one has to build up a strL dictionary and then write it at the end of the dta file in a particular (and idiosyncratic) format.
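
As a rough illustration of the lookup idea described above, the sketch below splits a column into per-row keys plus a table of unique strings. It is deliberately simplified: the function name and the integer keys are made up for this example, and the real dta format identifies each strL differently and writes the table in its own binary layout.

```python
def build_strl_table(values):
    """Split a string column into per-row keys plus a table of unique strings.

    Simplified illustration only; this is not the on-disk dta layout.
    """
    table = {}  # string -> key, in first-seen order
    keys = []   # one key per row, stored in place of the string itself
    for value in values:
        if value not in table:
            table[value] = len(table) + 1  # keys start at 1 in this sketch
        keys.append(table[value])
    return keys, table


keys, table = build_strl_table(["hola", "h" * 250, "hola", "hi"])
```

The data section would then hold only `keys`, and the contents of `table` would be written once, after the data.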

Writing Stata is probably lower priority than reading Stata since CSV is an easy method to move data to Stata. Reading is useful since there are many published dta files on the web, and without a reader these are inaccessible.

bashtage · 2018-04-20T15:14:42Z

I took a further look at this, and encoding strLs isn't that hard. The harder part of implementing this is that the version currently supported is 114/115, while the minimum for strL is 117.

117 makes a lot of changes to the file format. The biggest change is the addition of tags surrounding regions. For example, the header in 117 looks like

        <header>
            <release>117</release>
            <byteorder>MSF</byteorder>
            <K>0002</K>
            <N>00000001</N>
            <label>00</label>
            <timestamp>1110 Jul 2013 14:23</timestamp>
        </header>

while the header in 114 is a 109 byte blob described as:

        Contents            Length    Format    Comments
        -----------------------------------------------------------------------
        ds_format                1    byte      contains 114 = 0x72
        byteorder                1    byte      0x01 -> HILO, 0x02 -> LOHI
        filetype                 1    byte      0x01
        unused                   1    byte      0x00
        nvar (number of vars)    2    int       encoded per byteorder
        nobs (number of obs)     4    int       encoded per byteorder
        data_label              81    char      dataset label, \0 terminated
        time_stamp              18    char      date/time saved, \0 terminated
        -----------------------------------------------------------------------
        Total                  109
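
To make the difference between the two headers concrete, here is a minimal sketch that builds both with the standard `struct` module. The function names are invented for this example. The 114 layout follows the table above; the 117 field widths (2-byte K, 4-byte N, 1-byte length prefixes for label and timestamp) are my reading of the hex digits in the example header and should be checked against the dta_117 help page.

```python
import struct


def header_114(nvar, nobs, label=b"", timestamp=b"", little_endian=True):
    """Pack the fixed 109-byte header used by format 114."""
    order = "<" if little_endian else ">"
    return struct.pack(
        order + "BBBBhi81s18s",
        114,                        # ds_format
        2 if little_endian else 1,  # byteorder: 0x02 -> LOHI, 0x01 -> HILO
        1,                          # filetype
        0,                          # unused
        nvar,
        nobs,
        label[:80],       # struct pads with \0, so the field stays terminated
        timestamp[:17],   # same for the 18-byte time stamp
    )


def header_117(nvar, nobs, label=b"", timestamp=b"", little_endian=True):
    """Build the tagged header used by format 117 (field widths assumed)."""
    order = "<" if little_endian else ">"
    return b"".join([
        b"<header>",
        b"<release>117</release>",
        b"<byteorder>" + (b"LSF" if little_endian else b"MSF") + b"</byteorder>",
        b"<K>" + struct.pack(order + "H", nvar) + b"</K>",
        b"<N>" + struct.pack(order + "I", nobs) + b"</N>",
        b"<label>" + struct.pack("B", len(label)) + label + b"</label>",
        b"<timestamp>" + struct.pack("B", len(timestamp)) + timestamp
        + b"</timestamp>",
        b"</header>",
    ])


old = header_114(nvar=2, nobs=1)
new = header_117(nvar=2, nobs=1, timestamp=b"10 Jul 2013 14:23",
                 little_endian=False)
```

The 114 header always has the same length, while the 117 header varies with the label and timestamp, which is part of why the writer methods listed below need reworking.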

Adding StrL's requires a substantial rewrite of StataWriter, especially

_write_header
_write_descriptors
_write_variable_labels
_write_value_labels

See:

https://www.stata.com/help.cgi?dta_114
https://www.stata.com/help.cgi?dta_117

bashtage · 2018-04-26T07:41:33Z

A mostly working implementation is here:

master...bashtage:strl-support

I noticed one larger potential advantage of this: when writing largish data files with strings > 8 characters, StrLs can reduce file size significantly if there are many repeated values. They can also reduce file size when writing sparse strings, again as long as the maximum string length is > 8 characters (this happens because blank strings are replaced with an 8-byte unsigned integer).
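
A back-of-the-envelope comparison shows where the savings come from. This is only an estimate under stated assumptions: one byte per character, an 8-byte key per row for strL columns, and a per-string overhead of 16 bytes in the strL table, which is a placeholder value rather than a figure taken from the format specification.

```python
def fixed_width_bytes(values):
    """Every row is padded to the width of the longest string."""
    width = max(len(v) for v in values)
    return width * len(values)


def strl_bytes(values, record_overhead=16):
    """8-byte key per row, plus each distinct non-blank string stored once.

    `record_overhead` is an assumed placeholder, not a documented constant.
    """
    unique = {v for v in values if v}  # blank strings need no table entry
    table = sum(record_overhead + len(v) for v in unique)
    return 8 * len(values) + table


# Sparse column: 1,000 copies of one 40-character string and 9,000 blanks.
values = ["x" * 40] * 1000 + [""] * 9000
fixed = fixed_width_bytes(values)
strl = strl_bytes(values)
```

With strings of 8 characters or fewer the fixed-width layout is never larger per row than the 8-byte key, which matches the "> 8 characters" condition above.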

TomAugspurger added the IO Stata read_stata, to_stata label May 23, 2017

TomAugspurger closed this as completed May 23, 2017

TomAugspurger reopened this Apr 20, 2018

bashtage added a commit to bashtage/pandas that referenced this issue Apr 27, 2018

ENH: Add class to write dta format 117 files

9717de4

Add export for dta 117 files which add support for long strings Refactor StataWriter to simplify new writer closes pandas-dev#16450

bashtage mentioned this issue Apr 27, 2018

ENH: Add class to write dta format 117 files #20844

Merged

4 tasks

bashtage added further commits to bashtage/pandas that referenced this issue, each titled "ENH: Add class to write dta format 117 files":

d3e7634 (Apr 28, 2018)
92c90c7 (Apr 28, 2018)
a10a993 (Apr 28, 2018)
927beda (Apr 28, 2018)
e87e64c (Apr 29, 2018)

jreback added this to the 0.23.0 milestone Apr 29, 2018

bashtage added further commits to bashtage/pandas that referenced this issue, each titled "ENH: Add class to write dta format 117 files":

b179c64 (Apr 30, 2018)
1baeb46 (Apr 30, 2018)
d54541a (May 1, 2018)

TomAugspurger closed this as completed in #20844 May 1, 2018
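
With the 117 writer merged under the 0.23.0 milestone, the original example should work by passing `version=117` to `to_stata`, and `convert_strl` can be used to force a column to be written as strL. The following is a minimal sketch assuming pandas 0.23 or later; the temporary path is only for illustration, and the column is created with an explicit object dtype to avoid depending on how a given pandas version infers string columns.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"A": pd.Series(["h" * 250, "hi", "hola"], dtype=object)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.dta")

    # Format 117 accepts the 250-character column as a fixed-width string.
    frame.to_stata(path, version=117, write_index=False)
    as_fixed = pd.read_stata(path)

    # Alternatively, force column A to be written as a strL.
    frame.to_stata(path, version=117, write_index=False, convert_strl=["A"])
    as_strl = pd.read_stata(path)

longest_fixed = int(as_fixed["A"].str.len().max())
longest_strl = int(as_strl["A"].str.len().max())
```

Files written this way cannot be opened by Stata releases older than 13, which is the compatibility trade-off raised earlier in the thread.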
