BUG: StataWriterUTF8 is needlessly strict when converting variable names #47276

eirki · 2022-06-07T19:48:09Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.DataFrame({"mø": [1, 2, 3, 4]}).to_stata("data.dta", version=119)

#.venv/lib/python3.9/site-packages/pandas/io/stata.py:2491: InvalidColumnName: 
#Not all pandas column names were valid Stata variable names.
#The following replacements have been made:
#
#    mø   ->   m_
#
#If this is not what you expect, please make sure you have Stata-compliant
#column names in your DataFrame (strings only, max 32 characters, only
#alphanumerics and underscores, no Stata reserved words)
#
#  warnings.warn(ws, InvalidColumnName)

pd.read_stata("data.dta")

#    index  m_
# 0      0   1
# 1      1   2
# 2      2   3
# 3      3   4

Issue Description

StataWriterUTF8 replaces every character in the range 128 <= ord(c) < 256 with an underscore, but only characters with 128 <= ord(c) < 192 or ord(c) in {215, 247} need to be replaced. The rest (ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ) are perfectly valid in variable names in Stata version >= 118.
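To make the ranges concrete, here is a small sketch that enumerates the characters between code points 128 and 255 that fall outside the "must be replaced" set described above. The helper name `must_replace` is made up for illustration and is not part of pandas.

```python
def must_replace(c: str) -> bool:
    """Hypothetical helper: True for the code points this issue says are invalid."""
    code = ord(c)
    # 128..191 are control and symbol characters; 215 is the multiplication
    # sign and 247 is the division sign.
    return 128 <= code < 192 or code in {215, 247}


# Everything else in 128..255 is a letter that could be kept in a variable name.
allowed = "".join(chr(i) for i in range(128, 256) if not must_replace(chr(i)))
print(allowed)
print(len(allowed))
```

The resulting string has 62 characters and starts at À (192) and ends at ÿ (255), skipping × and ÷.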

Expected Behavior

The characters ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ should not be replaced in variable names when saving with StataWriterUTF8. I am happy to submit a pull request.
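A minimal sketch of the expected behaviour, assuming a standalone function rather than the actual StataWriterUTF8 code. The name `sanitize_name` is hypothetical, and leaving characters at or above code point 256 untouched is an assumption, since this issue only discusses the range below 256.

```python
def sanitize_name(name: str) -> str:
    """Hypothetical helper: replace only characters that are invalid in Stata >= 118."""
    out = []
    for c in name:
        code = ord(c)
        if code < 128:
            # ASCII: keep letters, digits and underscore only.
            keep = c == "_" or c.isalnum()
        elif code < 256:
            # Latin-1 supplement: keep letters, drop 128..191, × (215) and ÷ (247).
            keep = code >= 192 and code not in {215, 247}
        else:
            # Assumption: characters beyond Latin-1 are left as they are.
            keep = True
        out.append(c if keep else "_")
    return "".join(out)


print(sanitize_name("mø"))
print(sanitize_name("a×b"))
```

With this rule the column name from the reproducible example, mø, is kept unchanged, while a×b becomes a_b.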

Installed Versions

INSTALLED VERSIONS
------------------
commit : 4bfe3d0
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-33-generic
Version : #34-Ubuntu SMP Wed May 18 13:34:26 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.2
setuptools : 59.6.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : 1.1.6
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None


eirki added the Bug and Needs Triage labels Jun 7, 2022

eirki added a commit to eirki/pandas that referenced this issue Jun 9, 2022

BUG: StataWriterUTF8 removing some valid characters in column names (pandas-dev#47276)

4a9db3d

eirki mentioned this issue Jun 9, 2022

BUG: StataWriterUTF8 removing some valid characters in column names (#47276) #47297

Merged


jreback added this to the 1.5 milestone Jun 10, 2022

jreback added the IO Stata label and removed the Needs Triage label Jun 10, 2022

jreback closed this as completed in #47297 Jun 10, 2022

jreback pushed a commit that referenced this issue Jun 10, 2022

BUG: StataWriterUTF8 removing some valid characters in column names (#47276) (#47297)

05f85a4

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this issue Jul 13, 2022

BUG: StataWriterUTF8 removing some valid characters in column names (pandas-dev#47276) (pandas-dev#47297)

aba825a
