ENH: Allow poorly formatted stata files to be read #25967

bashtage · 2019-04-02T22:51:31Z

Add a fallback decode path that allows improperly formatted Stata
files, written in the 118 format but using latin-1 encoded strings,
to be read.
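
The fallback idea can be sketched as below. This is only a minimal illustration, not the code in `pandas/io/stata.py`: `decode_with_fallback` is a hypothetical helper name, and the warning text is made up for the example.

```python
import warnings


def decode_with_fallback(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode ``raw`` with ``encoding``; fall back to latin-1 on failure.

    Hypothetical helper for illustration only, not the pandas implementation.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this decode cannot
        # fail, but the result may be wrong if the data used another encoding.
        warnings.warn(
            "Bytes are not valid {}; decoding as latin-1 instead. "
            "Verify that the returned strings are correct.".format(encoding),
            UnicodeWarning,
            stacklevel=2,
        )
        return raw.decode("latin-1")
```

Well-formed UTF-8 input takes the first branch and emits no warning; only bytes that fail to decode reach the latin-1 path.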

closes #25960 (UnicodeDecodeError for Stata file)
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

codecov · 2019-04-02T23:29:11Z

Codecov Report

Merging #25967 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25967      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         175      175              
  Lines       52550    52550              
==========================================
- Hits        48266    48261       -5     
- Misses       4284     4289       +5

Flag	Coverage Δ
#multiple	`90.39% <ø> (ø)`	⬆️
#single	`41.9% <ø> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/gbq.py	`75% <0%> (-12.5%)`	⬇️
pandas/core/frame.py	`96.79% <0%> (-0.12%)`	⬇️
pandas/util/testing.py	`90.61% <0%> (-0.11%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4814a28...733f7a9.

codecov · 2019-04-02T23:29:14Z

Codecov Report

Merging #25967 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25967      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         175      175              
  Lines       52550    52550              
==========================================
- Hits        48266    48262       -4     
- Misses       4284     4288       +4

Flag	Coverage Δ
#multiple	`90.39% <ø> (ø)`	⬆️
#single	`41.89% <ø> (-0.08%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/gbq.py	`75% <0%> (-12.5%)`	⬇️
pandas/core/frame.py	`96.79% <0%> (-0.12%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4814a28...27a173d.

gfyoung · 2019-04-03T09:27:23Z

pandas/tests/io/test_stata.py

+has been incorrectly encoded by Stata or some other software. You should verify
+the string values returned are correct."""
+        with pytest.warns(UnicodeWarning, match=msg):
+            encoded = read_stata(self.dta_encoding_118)


Use tm.assert_produces_warning. Our implementation also checks stacklevel.

Actually, this already checks the stacklevel.

Refactor decode and null terminate to use file encoding

jreback · 2019-04-04T12:26:30Z

lgtm. @bashtage if you can get check_stacklevel to work, that would be great.

bashtage · 2019-04-04T12:50:37Z

I switched to tm.assert_produces_warning. Doesn't this automatically check the stacklevel?

https://github.com/pandas-dev/pandas/pull/25967/files#diff-ee04a1f9b23ca162f9592eef0caf0dbdR1621

jreback · 2019-04-04T12:51:49Z

actually this does check, thanks @bashtage
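
For readers unfamiliar with what a stacklevel check verifies, here is a rough standard-library sketch. It does not reproduce how `tm.assert_produces_warning` is implemented; `read_values` and `warning_points_at_caller` are hypothetical names, and the sketch compares line numbers only.

```python
import inspect
import warnings


def read_values():
    # Hypothetical library function. stacklevel=2 attributes the warning to
    # the line that called read_values(), not to this warnings.warn() line.
    warnings.warn("strings may be incorrectly decoded", UnicodeWarning, stacklevel=2)
    return ["a", "b"]


def warning_points_at_caller():
    """Return True if the recorded warning is attributed to the call site."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        call_line = inspect.currentframe().f_lineno + 1
        read_values()  # must stay directly below the call_line assignment
    return caught[0].lineno == call_line
```

With `stacklevel=1` the recorded line would be the `warnings.warn` call inside `read_values`, so the check would return False.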

bashtage mentioned this pull request Apr 2, 2019

UnicodeDecodeError for Stata file #25960

Closed

WillAyd reviewed Apr 3, 2019

pandas/io/stata.py (outdated)

WillAyd added the "IO Stata" (read_stata, to_stata) label Apr 3, 2019

gfyoung reviewed Apr 3, 2019

pandas/io/stata.py (outdated)

ENH: Allow poorly formatted stata files to be read

2aff757

Add a fallback decode path that allows improperly formatted Stata files, written in the 118 format but using latin-1 encoded strings, to be read.

closes pandas-dev#25960

bashtage force-pushed the latin-1-fallback branch 2 times, most recently from 3f711fc to ddc806f Compare April 3, 2019 08:39

gfyoung reviewed Apr 3, 2019

MAINT: Refactor decode

27a173d

Refactor decode and null terminate to use file encoding

bashtage force-pushed the latin-1-fallback branch from ddc806f to 27a173d Compare April 3, 2019 09:56

jreback added this to the 0.25.0 milestone Apr 4, 2019

jreback merged commit 435e2b5 into pandas-dev:master Apr 4, 2019

jorisvandenbossche mentioned this pull request Apr 5, 2019

BUG: read_stata always uses 'utf8' #21244

Closed

bashtage deleted the latin-1-fallback branch December 19, 2019 22:29
