UnicodeDecodeError for Stata file #25960
You could probably also leverage |
def _set_encoding(self):
    """
    Set string encoding which depends on file version
    """
    if self.format_version < 118:
        self._encoding = 'latin-1'
    else:
        self._encoding = 'utf-8'

@WillAyd from the above method, it seems that |
Not familiar enough with Stata to guess, but from reading through the linked PR it looks like this was intentional, to make the encoding strict based on the file version. @bashtage might have thoughts. |
According to Stata these should be UTF-8.
|
This file is a 118 format file. |
This character is actually in the b'\xa4'.decode('utf-8') does not work if it maps to |
It is a 2-byte encoding in utf-8. The correctly formatted string should be
or just
|
Thank you. |
It is not obvious to me. Stata does some undocumented things w.r.t. latin-1 encodings. This block would make it work.
|
Further looking says yes, this is invalid unicode. In particular, 0xA4 is an invalid Unicode encoding, so this qualifies as an undocumented Stata "feature" |
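To make the point about 0xA4 concrete, here is a small Python check (a sketch; the variable names are illustrative and not taken from pandas). It shows that the lone byte is rejected by the UTF-8 decoder, that latin-1 maps it to U+00A4, and that the UTF-8 encoding of that character is the two bytes C2 A4.

```python
# A lone 0xA4 byte, as found in the problem file's value label.
raw = b"\xa4"

# Strict UTF-8 decoding rejects it, because 0xA4 cannot start a sequence.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# latin-1 maps every byte to the code point with the same value.
as_latin1 = raw.decode("latin-1")

# The same character, correctly encoded as UTF-8, takes two bytes.
utf8_bytes = as_latin1.encode("utf-8")

print(decoded_ok, repr(as_latin1), utf8_bytes)
```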
I can confirm that this works on my end too. I think I'm starting to get a little more understanding of this. |
I fully thing happens for these values: Correct unicode This is read as |
@hudcap That is correct. When a byte leads with 1 the first byte must be |
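For reference, the UTF-8 byte layout can be sketched as a small classifier. The function name `utf8_byte_role` is hypothetical and exists only for this illustration; the byte ranges are the standard UTF-8 ones.

```python
def utf8_byte_role(byte: int) -> str:
    """Classify a single byte value (0-255) by its role in UTF-8."""
    if byte <= 0x7F:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if byte <= 0xBF:
        return "continuation"  # 10xxxxxx: only valid after a lead byte
    if 0xC2 <= byte <= 0xDF:
        return "lead-2"        # 110xxxxx: starts a two-byte sequence
    if 0xE0 <= byte <= 0xEF:
        return "lead-3"        # 1110xxxx: starts a three-byte sequence
    if 0xF0 <= byte <= 0xF4:
        return "lead-4"        # 11110xxx: starts a four-byte sequence
    return "invalid"           # 0xC0, 0xC1 and 0xF5-0xFF never occur in UTF-8


roles = {hex(b): utf8_byte_role(b) for b in (0x41, 0xA4, 0xC2)}
print(roles)
```

Under this classification 0xA4 is a continuation byte, so it is only valid directly after a lead byte such as 0xC2.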
I would support a patch that uses a try/except as above, since this seems like a real Stata issue. I resaved the file using Stata 14.2 and got the same incorrect format. |
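A minimal sketch of such a try/except fallback, written as a standalone helper. The name `decode_with_fallback` is hypothetical and this is not the code that went into pandas; it only illustrates the idea of trying UTF-8 first and falling back to latin-1.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode as UTF-8; if that fails, reinterpret the bytes as latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 accepts every byte value, so this branch cannot raise.
        return raw.decode("latin-1")


print(decode_with_fallback(b"\xc2\xa4"))  # well-formed UTF-8
print(decode_with_fallback(b"\xa4"))      # stray latin-1 byte
```

The snippet quoted in this thread falls back to `'unicode_escape'` instead. That codec also interprets backslash escape sequences in the data, so latin-1 is used here on the assumption that the bytes should be taken literally.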
What if a string has these invalid Unicode encodings as well as valid UTF-8? |
If they are not latin1 or unicode then it will error, which is the right
decision since we don't know what the right value is.
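The mixed case can be checked by hand with a few lines of Python. The sketch below assumes the fallback is applied to the whole byte string and uses latin-1; both are assumptions made for illustration, not a description of any pandas patch. Under those assumptions the strict UTF-8 decode raises, while the latin-1 pass does not raise but turns the valid C2 A4 pair into two separate characters, so the result needs to be inspected rather than trusted.

```python
# A valid UTF-8 pair (C2 A4) followed by a stray latin-1 byte (A4).
mixed = b"\xc2\xa4" + b"\xa4"

try:
    mixed.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Whole-string latin-1 decoding maps each byte to one character.
fallback_text = mixed.decode("latin-1")

print(strict_ok, [hex(ord(ch)) for ch in fallback_text])
```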
…On Tue, Apr 2, 2019, 18:48 Yehuda Davis ***@***.***> wrote:
It is not obvious to me. Stata does some undocumented things w.r.t.
latin-1 encodings.
This block would make it work.
try:
    return s.decode('utf-8')
except UnicodeDecodeError:
    return s.decode('unicode_escape')
What if a string has these invalid Unicode encodings as well as valid
UTF-8?
Is there any way to handle that?
|
Should I make a PR? |
When you open this file in Stata, the bad character shows up as � (U+FFFD), so Stata also doesn't correctly read it (although it does read it). I also cannot create a file that uses Latin-1 -- Stata correctly always writes unicode C2 A4 for the offending character, and pandas reads it correctly. I wonder if this dataset was dumped using some other program. It seems that this file is not a valid Stata file. Ultimately it probably doesn't make sense to provide ad hoc paths that might support damaged files when the current implementation seems to do a good job with correctly formatted Stata files. |
Good point -- I completely agree. Given that the file is offered in many different formats, it's reasonable to assume that some other program created it. I'll correct the file before importing it. |
Add a fall back decode path that allows improperly formatted Stata files written in 118 format but using latin-1 encoded strings to be read. Closes pandas-dev#25960
Stata can produce these files, even though it is probably a bug. You can reproduce it by opening a 117 file with latin-1 characters (ord>127) and then saving it as 118. This is how the file above was produced. The original large dataset is 117 and can be read fine. I made an example file in Stata and have added a fix in #25967 |
* ENH: Allow poorly formatted stata files to be read
  Add a fall back decode path that allows improperly formatted Stata files written in 118 format but using latin-1 encoded strings to be read
  closes #25960
* MAINT: Refactor decode
  Refactor decode and null terminate to use file encoding
Code Sample, a copy-pastable example if possible
mwe.dta
available here: mwe.zip
This file is a derivative of The Supreme Court Database
Problem description
The command raises
I traced the error to a value label containing that byte.
This is a follow-up for #21244 and #23736
Changing line 1334 of pandas.io.stata from … to … allows me to read in the file.
Expected Output
The file should be correctly read and parsed.
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 40.6.2
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None