-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix handling of encoding for the StataReader #21244 #21246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a1efb5d
to
b291a30
Compare
Codecov Report
@@ Coverage Diff @@
## master #21246 +/- ##
=======================================
Coverage 91.84% 91.84%
=======================================
Files 153 153
Lines 49538 49538
=======================================
Hits 45499 45499
Misses 4039 4039
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible include / and/or generate a utf-8 encoded file for testing?
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -146,7 +146,8 @@ MultiIndex | |||
I/O | |||
^^^ | |||
|
|||
- | |||
- :func:`pandas.read_stata` now honours the ``encoding`` parameter, and supports the 'utf-8' | |||
encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding now.
@bashtage can you have a look |
You need to add a small Stata produce dta file that can replicate the issue and that this or fixes. Then please add a test that reads this dta file. It needs to be produced by Stata and not pandas. |
118 does support utf8. But this needs to be tested using a real Stata file with both fixed width utf8 strings and StrL utf8 characters, and numbers. |
I see. I thought that https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/data/stata14_118.dta contains Unicode, since this test Unfortunately I don't own a copy of the software and it's neither in my field to work with stata files as a Scientist/Statistician. :) |
It possible the standard strings might be latin-1 not unicode. Are you sure
that the reader doesn't already read utf8 without passing an encoding? The
dta 118 spec says all strings are unicode and so this suggests that the
encoding is always utf8 irrespective of the one passed in. If so, then the
reader shouldn't be changed and a note should be added that encoding is
ignored for 118 files.
More generally I think reader should remove encoding and always use latin-1
for dta <118 and utf8 for 118.
…On Wed, May 30, 2018, 13:14 Adrian Castravete ***@***.***> wrote:
I see. I thought that
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/data/stata14_118.dta
contains Unicode, since the test passes with unicode characters. And the
TestStata.test_read_dta18 test from
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_stata.py
has a check for that.
Though indeed I do not know if the generated file was made by Stata or not.
Unfortunately I don't own a copy of the software and it's neither in my
*field* to work with stata files as a Scientist/Statician. :)
My use case is for a converter for a simple view of the file.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#21246 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFU5Rc-jrlC4-5xNhTVXzwoz0Zpbx14eks5t3n8KgaJpZM4URdVo>
.
|
When I tried encoding the string with unicode and then decoding it with As for the If the encoding is to be determined via the version number, then this argument should also be removed and everything else like documentation or tests should be changed to reflect this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think the encoding should be changd form settable to not settable. This will require a deprecation in StataReader. Users should not be encouraged to set encoding in read_stata
since it isn't really settable (either latin-1 or utf-8)
@@ -37,7 +37,8 @@ | |||
from pandas.util._decorators import deprecate_kwarg | |||
|
|||
VALID_ENCODINGS = ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1', | |||
'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1') | |||
'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not in general valid. The set of valid depends on the reader. For < 118 it is
('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1',
'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1')
for 118 it is 'utf-8', 'utf8'
.
@@ -1335,7 +1336,7 @@ def _calcsize(self, fmt): | |||
|
|||
def _decode(self, s): | |||
s = s.partition(b"\0")[0] | |||
return s.decode('utf-8') | |||
return s.decode(self._encoding or self._default_encoding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting... this is the line that was causing all the problems with my converter. So _decode
should only be used in >= 118, right?
@@ -99,9 +99,9 @@ def setup_method(self, method): | |||
|
|||
self.stata_dates = os.path.join(self.dirpath, 'stata13_dates.dta') | |||
|
|||
def read_dta(self, file): | |||
def read_dta(self, file, encoding='latin-1'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use None
as encoding so it can be overridden by the reader depending on the dta version.
# Legacy default reader configuration | ||
return read_stata(file, convert_dates=True) | ||
return read_stata(file, convert_dates=True, encoding=encoding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
I took a look at the dta spec and it is stricter than pandas enforces. dta < 118 claim to use ASCII only although Stata internally displays and works with latin-1. dta 118 is utf-8 only. |
I see. I will continue with the added suggestions. |
@adrian-castravete Master just got updated with a fix for this. If you have a chance could you try it out with your dta file? It should always use the correct encoding (automatically) now. If it fails, we might need to look into your dta file. |
I think this has been resolved in master. |
deprecated encoding by #21400 (and bug is already fixed in master) |
read_stata
always uses 'utf8' #21244git diff upstream/master -u -- "*.py" | flake8 --diff