BUG: to_stata() creates corrupt output file when value_labels is specified and version is 118 or greater #46750

Closed
2 of 3 tasks
cmjcharlton opened this issue Apr 12, 2022 · 1 comment · Fixed by #47199
Labels
Bug IO Stata read_stata, to_stata Needs Triage Issue that has not been reviewed by a pandas team member

@cmjcharlton
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(
    {
        "fully_labelled": [1, 2, 3, 3, 1],
    }
)

value_labels = {
    "fully_labelled": {1: "one", 2: "two", 3: "three"},
}

df.to_stata("test.dta", value_labels=value_labels, version=118)

Issue Description

When attempting to open the file created by the above code in Stata 17 MP, the following error message is returned:

.dta file corrupt
    The file unexpectedly ended before it should have.
r(612);

This also occurs if version is set to 119.

I believe this is because only 33 bytes are allocated for the label name instead of the required 129. The encoding defaults to latin-1 and is not switched to utf-8 to match the specified version.
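To make the size mismatch concrete, here is a minimal standard-library sketch of how a fixed-width, null-padded label-name field could differ between the two encodings. The helper name `padded_label_name` is hypothetical and this is not the pandas implementation; it only assumes the 33-byte and 129-byte widths mentioned above (32 or 128 bytes plus a terminator).

```python
def padded_label_name(name: str, encoding: str) -> bytes:
    """Hypothetical helper: null-pad a label name to a fixed field width."""
    # Assumption from the report: 32 bytes + terminator for latin-1,
    # 128 bytes + terminator for utf-8 (format versions 118 and above).
    width = 128 if encoding.lower() in ("utf-8", "utf8") else 32
    # Simplification: truncating encoded bytes could split a multi-byte
    # character; a real writer would need to handle that.
    raw = name.encode(encoding)[:width]
    return raw + b"\x00" * (width + 1 - len(raw))


latin1_field = padded_label_name("fully_labelled", "latin-1")
utf8_field = padded_label_name("fully_labelled", "utf-8")

print(len(latin1_field), len(utf8_field))  # 33 129
```

If a writer emits the shorter field into a file whose format version expects the longer one, every later offset is 96 bytes short, which would be consistent with Stata reporting that the file ended before it should have.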

If I change the line:

svl = StataNonCatValueLabel(colname, labels)

to:

svl = StataNonCatValueLabel(colname, labels, self._encoding)

in the function _prepare_non_cat_value_labels() in /pandas/io/stata.py

and run the example again then the file created appears to open successfully with the value labels intact.

Expected Behavior

No error message is returned, the data is loaded into Stata and the value labels are correctly applied.
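For anyone without Stata to hand, a rough way to check this is to round-trip the value labels through pandas itself. This is only a sketch: it writes to an in-memory buffer rather than test.dta, reads the labels back with `StataReader.value_labels()`, and assumes that pandas reading its own output is a reasonable proxy for Stata accepting the file.

```python
from io import BytesIO

import pandas as pd
from pandas.io.stata import StataReader

df = pd.DataFrame({"fully_labelled": [1, 2, 3, 3, 1]})
value_labels = {"fully_labelled": {1: "one", 2: "two", 3: "three"}}

# Write to an in-memory buffer so nothing touches disk.
buf = BytesIO()
df.to_stata(buf, value_labels=value_labels, version=118, write_index=False)
buf.seek(0)

# Read the value-label table back with pandas' own reader.
with StataReader(buf) as reader:
    round_tripped = reader.value_labels()

print(round_tripped == value_labels)
```

On a build that passes the encoding through to the value label, I would expect the comparison to hold. On an affected build the reader may raise or return different labels instead.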

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.10.4.final.0
python-bits : 64
OS : Windows
OS-release : 2012ServerR2
Version : 6.3.9600
machine : AMD64
processor : Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 1.4.2
numpy : 1.21.5
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 62.1.0
Cython : 0.29.28
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.1
numba : 0.55.1
numexpr : 2.8.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.35
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@cmjcharlton cmjcharlton added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 12, 2022
@simonjayhawkins simonjayhawkins added the IO Stata read_stata, to_stata label Apr 12, 2022
@bbo2adwuff

I just came across the same error when using value_labels, although Stata 17 MP gave a different error message:

. use data.dta
Segmentation fault

Nevertheless, the proposed fix of adding self._encoding works here as well.
