BUG: read_stata always uses 'utf8' #21244
Comments
Perfectly solved the problem I was having, thank you. |
I am still having issues with this. I'm using a 118 Stata file, and I'm getting the same |
Having the same issue just today. Changing line 1339 so that _null_terminate decodes with 'latin-1' instead of self._encoding worked for me:
def _null_terminate(self, s):
    # have bytes not strings, so must decode
    s = s.partition(b"\0")[0]
    return s.decode('latin-1')  # instead of s.decode(self._encoding) |
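The effect of that change can be tried without pandas. The following is a minimal standalone sketch, not the pandas source: `null_terminate` is a free function modeled on the method quoted above, and its `encoding` parameter is an assumption added for illustration.

```python
def null_terminate(s: bytes, encoding: str = "latin-1") -> str:
    """Cut a byte field at the first NUL and decode what is left."""
    # Keep only the bytes before the first NUL terminator.
    s = s.partition(b"\0")[0]
    return s.decode(encoding)


# A latin-1 encoded string field followed by NUL padding (made-up sample data).
field = "Montréal".encode("latin-1") + b"\0\0\0"

print(null_terminate(field))  # Montréal
```

Passing `"utf-8"` as the encoding for the same bytes raises `UnicodeDecodeError`, which is the failure reported in this thread.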
Can this bug please be reopened? |
If you have a self-contained example that reproduces this on master, please open a new issue. |
Thanks, this fixed my issue. Not sure why this issue is closed while the problem is still around, even though the issue doesn't contain the dataset to reproduce this. Problem description seems quite clear to me. |
@harmbuisman It was hard to produce a dataset that has this characteristic since it can only be produced due to a bug in Stata. Stata incorrectly writes latin-1 encoded 117 format files with latin-1 encoding when saving as 118. This doesn't happen if a new file is created and then saved to 118 format. |
If this bug can be reproduced using master, please make sure you share a datafile (it could be a small extract from a larger file, as long as the extract still reproduces the issue), so that the structure of the file can be inspected. |
The file 196slers1967to2016_20180908.dta has this problem. |
@Larz60p You should probably let Harvard know that their platform is providing files that do not conform to the Stata dta file format spec. |
Hi, I am having the same issue. When I exported the Stata file to a CSV file and read it with pd.read_csv("file csv", encoding="latin-1"), it worked. But when I passed the same argument to pd.read_stata("file dta", encoding="latin-1"), I got a "FutureWarning encoding is..." message. Even when I tried the suggestions above (including the _null_terminate change), nothing changed. |
What version is the DTA file you are creating? |
stata 16 |
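The format release of a .dta file can be checked from its first bytes rather than inferred from the Stata version that wrote it. The sketch below rests on two assumptions about the file layout that are not stated in this thread: that format 117 and later start with the literal text `<stata_dta><header><release>` followed by a three-digit release, and that older formats store the release number in the first byte. `dta_release` is a hypothetical helper, not a pandas function.

```python
def dta_release(head: bytes) -> int:
    """Return the dta format release given the first bytes of a file."""
    marker = b"<stata_dta><header><release>"
    if head.startswith(marker):
        # Assumed newer layout: three ASCII digits follow the marker.
        digits = head[len(marker):len(marker) + 3]
        return int(digits.decode("ascii"))
    # Assumed older layout: the first byte is the release number itself.
    return head[0]


print(dta_release(b"<stata_dta><header><release>118</release>"))  # 118
print(dta_release(bytes([114, 2, 1, 0])))  # 114
```

For a file on disk, reading the first 31 bytes in binary mode and passing them to the helper is enough under these assumptions.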
Can you share the dta file so I can take a look?
On Wed, Mar 11, 2020, 21:00 leolovethewayyoulie wrote:
What version is the DTA file you are creating?
stata 16
I checked this file and found that its encoding is "ISO-8859-1".
I have already exported the dta to csv, and reading that with the encoding argument worked.
But the problem with encoding in read_stata is
"C:\Users\USER\Anaconda3\lib\site-packages\ipykernel_launcher.py:1:
FutureWarning: the 'encoding' keyword is deprecated and will be removed in
a future version. Please take steps to stop the use of 'encoding'
"""Entry point for launching an IPython kernel."
:(
|
FWIW "ISO-8859-1" is latin-1. |
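This can be confirmed with Python's standard `codecs` module, which resolves both spellings to the same codec. A quick check, independent of pandas:

```python
import codecs

iso_name = codecs.lookup("ISO-8859-1").name
latin_name = codecs.lookup("latin-1").name

# Both aliases resolve to the same canonical codec name.
print(iso_name == latin_name)  # True

# The same byte therefore decodes to the same character under either alias.
print(b"\xe9".decode("ISO-8859-1") == b"\xe9".decode("latin-1"))  # True
```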
Sure, but since the file is really large, I might send it by email. Can I have your email address? I will send my csv along with it. |
Yep, so what I'm trying to say is that the dta file is encoded as "latin-1", since the csv file exported from this dta file can be read with encoding "ISO-8859-1". In other words, here is my situation:
|
You could also share it via Dropbox or Google Drive with [email protected] |
I have sent you my data through google drive |
AFAICT pandas reads the file correctly. You get a warning that the file does not have the correct format. This warning is correct, since this is a Stata DTA 118 file, which must be UTF-8 encoded per Stata's dta documentation. However, it is latin-1 encoded. This happens when an older dta file is loaded into Stata and then saved in 118 format. If you think this should be fixed, you should contact Stata since this is their bug. |
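One way to get the behavior described here (a successful read plus a warning) is to try the encoding the format requires and fall back to latin-1 when that fails. The sketch below only illustrates that idea and is not the pandas implementation; `decode_field` is a hypothetical helper.

```python
import warnings


def decode_field(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode with the expected encoding, falling back to latin-1 with a warning."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        warnings.warn(
            f"bytes are not valid {encoding}; falling back to latin-1",
            UnicodeWarning,
        )
        # latin-1 maps every byte value to a character, so this cannot raise.
        return raw.decode("latin-1")


print(decode_field("Å".encode("utf-8")))    # valid UTF-8, no warning
print(decode_field("Å".encode("latin-1")))  # invalid UTF-8, warns and falls back
```

The fallback always produces a string, but the result is only correct if the bytes really are latin-1, which is why the warning matters.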
Works in pandas 1.0.1. |
Okie, I'll install pandas 1.0.1 to try |
|
Haha, thank you so much dude, |
Code Sample, a copy-pastable example if possible
This raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte.
OK. So the file isn't a utf8 one. Even though the StataReader doesn't specify any Unicode support, I then try to open it with a latin-1 encoding:
This raises the same exception at exactly the same place (still utf-8).
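The same class of error can be reproduced in plain Python, without the original data file. This is a minimal illustration with made-up bytes, not the reporter's code: a latin-1 encoded string is decoded as utf8.

```python
raw = "£5".encode("latin-1")  # the pound sign is the single byte 0xa3 in latin-1

try:
    raw.decode("utf8")
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(reason)  # invalid start byte

# Decoding the same bytes as latin-1 succeeds.
print(raw.decode("latin-1"))  # £5
```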
Problem description
This is a problem because it appears that read_stata doesn't honour the encoding argument. I think this line introduced a bug. The StataReader doesn't manage any other type of data than ascii or latin-1. Changing the line 1338 of the pandas.io.stata module:
to:
Seemed to make everything work and I could read the data from the given file.
Even better, changing it to the following:
also seems to have made it work.
I believe though, that if you want to make this work with Unicode too you'd have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.
Expected Output
The file should be correctly read and parsed
Output of pd.show_versions()
pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None