BUG: Fix handling of encoding for the StataReader #21244 #21246

adrian-castravete · 2018-05-29T13:08:47Z

closes BUG: read_stata always uses 'utf8' #21244
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

codecov · 2018-05-30T10:04:13Z

Codecov Report

Merging #21246 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #21246   +/-   ##
=======================================
  Coverage   91.84%   91.84%           
=======================================
  Files         153      153           
  Lines       49538    49538           
=======================================
  Hits        45499    45499           
  Misses       4039     4039

Flag	Coverage Δ
#multiple	`90.24% <ø> (ø)`	⬆️
#single	`41.87% <ø> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jreback

Is it possible to include and/or generate a utf-8 encoded file for testing?

jreback · 2018-05-30T10:46:14Z

doc/source/whatsnew/v0.24.0.txt

@@ -146,7 +146,8 @@ MultiIndex
 I/O
 ^^^

-
+- :func:`pandas.read_stata` now honours the ``encoding`` parameter, and supports the 'utf-8'
+  encoding.


add the issue number

Adding now.

jreback · 2018-05-30T10:47:12Z

@bashtage can you have a look

bashtage · 2018-05-30T10:51:21Z

You need to add a small Stata-produced dta file that replicates the issue and that this PR fixes. Then please add a test that reads this dta file. It needs to be produced by Stata and not pandas.

bashtage · 2018-05-30T10:52:15Z

118 does support utf8. But this needs to be tested using a real Stata file with both fixed width utf8 strings and StrL utf8 characters, and numbers.

adrian-castravete · 2018-05-30T11:13:47Z

I see. I thought that https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/data/stata14_118.dta contains Unicode, since this test TestStata.test_read_dta18 from https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_stata.py has a check for that and it passes.
Though indeed I don't know if the generated file was made by Stata or not.

Unfortunately I don't own a copy of the software, and working with Stata files as a scientist/statistician isn't my field. :)
My use case is for a converter for a simple view of the file.

bashtage · 2018-05-30T11:41:30Z

It's possible the standard strings might be latin-1, not Unicode. Are you sure that the reader doesn't already read utf8 without passing an encoding? The dta 118 spec says all strings are Unicode, which suggests that the encoding is always utf8 irrespective of the one passed in. If so, then the reader shouldn't be changed and a note should be added that encoding is ignored for 118 files. More generally, I think the reader should remove encoding and always use latin-1 for dta < 118 and utf8 for 118.
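The version-based rule proposed above could be sketched roughly as follows. The helper name `default_encoding_for_dta` is hypothetical and is not part of pandas; treating versions above 118 the same as 118 is an assumption, since the thread only discusses 118.

```python
def default_encoding_for_dta(version: int) -> str:
    """Hypothetical helper: pick the text encoding from the dta format version."""
    # Assumption: 118 and anything newer is UTF-8; older formats are read as latin-1.
    return "utf-8" if version >= 118 else "latin-1"


# How an older (< 118) file would store the character, under the latin-1 rule.
raw = "Ü".encode("latin-1")
print(raw.decode(default_encoding_for_dta(117)))
```

With a rule like this the caller never passes an encoding, which is the direction the comment argues for.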

adrian-castravete · 2018-05-30T11:49:04Z

When I tried encoding the string as Unicode and then decoding it as latin-1, the string was different: it showed two weird characters instead of the expected Ü.
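A small reproduction of that effect, assuming "encoding with unicode" here means UTF-8:

```python
text = "Ü"
utf8_bytes = text.encode("utf-8")        # two bytes: 0xC3 0x9C
garbled = utf8_bytes.decode("latin-1")   # latin-1 maps each byte to its own character
print(len(utf8_bytes), len(garbled), repr(garbled))
```

Decoding the same bytes as UTF-8 gives back the single character, so the mismatch comes purely from the choice of decoder.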

As for the encoding argument: I agree if the specs say that 118 is always Unicode then it would make sense to add the necessary ifs where applicable. I've seen parts of the code where the encoding gets set to latin-1.

If the encoding is to be determined via the version number, then this argument should also be removed and everything else like documentation or tests should be changed to reflect this.

bashtage

Overall I think the encoding should be changed from settable to not settable. This will require a deprecation in StataReader. Users should not be encouraged to set encoding in read_stata since it isn't really settable (either latin-1 or utf-8).
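One possible shape for such a deprecation, using only the standard library `warnings` module. The function `read_dta_sketch` and its warning text are hypothetical and do not reflect the actual pandas signature or the change that was eventually merged.

```python
import warnings


def read_dta_sketch(path, encoding=None):
    """Hypothetical reader entry point; not the pandas signature."""
    if encoding is not None:
        # Warn callers who still pass the argument, then ignore its value.
        warnings.warn(
            "encoding is deprecated and ignored; it is chosen from the dta version",
            FutureWarning,
            stacklevel=2,
        )
    # A real reader would open `path` and decode using the file's version here.
    return path
```

Defaulting to `None` lets the function tell "not passed" apart from any explicit value, so only callers who still set the argument see the warning.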

bashtage · 2018-05-30T12:51:57Z

pandas/io/stata.py

@@ -37,7 +37,8 @@
 from pandas.util._decorators import deprecate_kwarg

 VALID_ENCODINGS = ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1',
-                   'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1')
+                   'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1',


These are not valid in general. The set of valid encodings depends on the reader. For < 118 it is

('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1', 'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1')

for 118 it is 'utf-8', 'utf8'.
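A rough sketch of a version-dependent check built from the two lists above. The names `LEGACY_ALIASES`, `UTF8_ALIASES` and `is_valid_encoding` are hypothetical, and normalising aliases through `codecs.lookup` is an illustrative choice rather than what `pandas/io/stata.py` does; applying the 118 list to later versions is also an assumption.

```python
import codecs

# Aliases quoted in the review comment, grouped by dta version.
LEGACY_ALIASES = ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1',
                  'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1')
UTF8_ALIASES = ('utf-8', 'utf8')


def is_valid_encoding(encoding: str, version: int) -> bool:
    """Hypothetical check: is `encoding` acceptable for this dta version?"""
    allowed = UTF8_ALIASES if version >= 118 else LEGACY_ALIASES
    try:
        canonical = codecs.lookup(encoding).name
    except LookupError:
        return False  # not a codec Python knows about
    # Compare canonical codec names so that every spelling of an alias matches.
    return canonical in {codecs.lookup(alias).name for alias in allowed}
```

For example, `is_valid_encoding('L1', 117)` is accepted while `is_valid_encoding('utf-8', 117)` is rejected.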

bashtage · 2018-05-30T12:52:28Z

pandas/io/stata.py

@@ -1335,7 +1336,7 @@ def _calcsize(self, fmt):

    def _decode(self, s):
        s = s.partition(b"\0")[0]
-        return s.decode('utf-8')
+        return s.decode(self._encoding or self._default_encoding)


This should not be changed.

Interesting... this is the line that was causing all the problems with my converter. So _decode should only be used in >= 118, right?
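To illustrate why a hard-coded `'utf-8'` in this step can fail on older files, here is a standalone sketch of the same trim-then-decode logic. `decode_fixed_width` is a hypothetical stand-in, not the `StataReader._decode` method itself, and the sample bytes are made up for the example.

```python
def decode_fixed_width(raw: bytes, encoding: str) -> str:
    """Hypothetical stand-in for the `_decode` step shown in the diff."""
    # Fixed-width fields are padded with null bytes; keep what precedes the first one.
    trimmed = raw.partition(b"\0")[0]
    return trimmed.decode(encoding)


field = "Über".encode("latin-1") + b"\0\0\0"   # latin-1 bytes, null padded
print(decode_fixed_width(field, "latin-1"))
try:
    decode_fixed_width(field, "utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 decode failed:", exc.reason)
```

The latin-1 byte for Ü is not a valid UTF-8 sequence on its own, so the UTF-8 decode raises instead of returning text.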

bashtage · 2018-05-30T12:53:32Z

pandas/tests/io/test_stata.py

@@ -99,9 +99,9 @@ def setup_method(self, method):

        self.stata_dates = os.path.join(self.dirpath, 'stata13_dates.dta')

-    def read_dta(self, file):
+    def read_dta(self, file, encoding='latin-1'):


Should use None as encoding so it can be overridden by the reader depending on the dta version.

bashtage · 2018-05-30T12:53:41Z

pandas/tests/io/test_stata.py

        # Legacy default reader configuration
-        return read_stata(file, convert_dates=True)
+        return read_stata(file, convert_dates=True, encoding=encoding)


bashtage · 2018-05-30T13:10:29Z

I took a look at the dta spec and it is stricter than what pandas enforces. Formats below 118 claim to use ASCII only, although Stata internally displays and works with latin-1. dta 118 is utf-8 only.

adrian-castravete · 2018-05-30T14:49:52Z

I see. I will continue with the added suggestions.

bashtage · 2018-06-06T15:51:47Z

@adrian-castravete Master just got updated with a fix for this. If you have a chance could you try it out with your dta file? It should always use the correct encoding (automatically) now. If it fails, we might need to look into your dta file.

bashtage · 2018-06-12T09:54:55Z

I think this has been resolved in master.

jreback · 2018-06-12T11:02:58Z

deprecated encoding by #21400 (and bug is already fixed in master)

adrian-castravete force-pushed the master branch 2 times, most recently from a1efb5d to b291a30 Compare May 30, 2018 09:17

adrian-castravete force-pushed the master branch from b291a30 to 2968c59 Compare May 30, 2018 10:21

jreback added Unicode Unicode strings IO Stata read_stata, to_stata labels May 30, 2018

jreback requested changes May 30, 2018

BUG: Fix handling of encoding for the StataReader pandas-dev#21244

57c24f8

adrian-castravete force-pushed the master branch from 2968c59 to 57c24f8 Compare May 30, 2018 11:19

bashtage reviewed May 30, 2018

jreback closed this Jun 12, 2018

jreback added this to the No action milestone Jun 12, 2018

hudcap mentioned this pull request Apr 2, 2019

UnicodeDecodeError for Stata file #25960

Closed

eirki mentioned this pull request Jun 2, 2022

BUG: StataWriter value_label encoding #47199

Merged

4 tasks
