ENH: Allow poorly formatted stata files to be read #25967

Merged (2 commits) on Apr 4, 2019

Conversation

bashtage (Contributor) commented Apr 2, 2019

Add a fallback decode path so that improperly formatted Stata files,
written in the 118 format but containing latin-1 encoded strings, can
still be read.

closes #25960
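
The fallback described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code merged in this PR: the helper name `decode_stata_string` and the warning text are assumptions made for the example. It strips the null terminator, tries the declared file encoding first, and only on failure decodes as latin-1 while emitting a `UnicodeWarning`.

```python
import warnings


def decode_stata_string(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode a null-terminated string from a dta file.

    Tries the declared file encoding first; if that fails, falls back to
    latin-1 and warns, since the file was probably written incorrectly.
    """
    # Keep only the bytes before the first null terminator.
    raw = raw.partition(b"\0")[0]
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Illustrative message only; not the wording used by pandas.
        warnings.warn(
            f"A string could not be decoded using {encoding}; "
            "falling back to latin-1. Verify the returned values.",
            UnicodeWarning,
            stacklevel=2,
        )
        # latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")
```

Because latin-1 accepts any byte sequence, the fallback always returns a string, which is why a warning asking the user to verify the values seems appropriate here.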

codecov bot commented Apr 2, 2019

Codecov Report

Merging #25967 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25967      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         175      175              
  Lines       52550    52550              
==========================================
- Hits        48266    48261       -5     
- Misses       4284     4289       +5
Flag        Coverage Δ
#multiple   90.39% <ø> (ø) ⬆️
#single     41.9% <ø> (-0.07%) ⬇️

Impacted Files           Coverage Δ
pandas/io/gbq.py         75% <0%> (-12.5%) ⬇️
pandas/core/frame.py     96.79% <0%> (-0.12%) ⬇️
pandas/util/testing.py   90.61% <0%> (-0.11%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4814a28...733f7a9.

codecov bot commented Apr 2, 2019

Codecov Report

Merging #25967 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25967      +/-   ##
==========================================
- Coverage   91.84%   91.84%   -0.01%     
==========================================
  Files         175      175              
  Lines       52550    52550              
==========================================
- Hits        48266    48262       -4     
- Misses       4284     4288       +4
Flag        Coverage Δ
#multiple   90.39% <ø> (ø) ⬆️
#single     41.89% <ø> (-0.08%) ⬇️

Impacted Files         Coverage Δ
pandas/io/gbq.py       75% <0%> (-12.5%) ⬇️
pandas/core/frame.py   96.79% <0%> (-0.12%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4814a28...27a173d.

WillAyd added the IO Stata (read_stata, to_stata) label Apr 3, 2019
Add a fallback decode path so that improperly formatted Stata files,
written in the 118 format but containing latin-1 encoded strings, can
still be read.

closes pandas-dev#25960
bashtage force-pushed the latin-1-fallback branch 2 times, most recently from 3f711fc to ddc806f, April 3, 2019 08:39
has been incorrectly encoded by Stata or some other software. You should verify
the string values returned are correct."""
with pytest.warns(UnicodeWarning, match=msg):
    encoded = read_stata(self.dta_encoding_118)
Member


Use tm.assert_produces_warning. Our implementation also checks stacklevel.

Contributor


actually this already checks
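
For readers unfamiliar with what a stacklevel check verifies, here is a rough standard-library sketch. It is not pandas' implementation of `tm.assert_produces_warning`. The function names are hypothetical, and because the example lives in a single file it compares line numbers, whereas a check across library and test code would more naturally compare file names. The idea is that a warning raised with the right `stacklevel` is attributed to the caller's line rather than to the line inside the library that called `warnings.warn`.

```python
import inspect
import warnings


def read_with_fallback(stacklevel: int = 2) -> None:
    """Hypothetical library function that warns on behalf of its caller."""
    warnings.warn("latin-1 fallback used", UnicodeWarning, stacklevel=stacklevel)


def warning_points_at_call_site(stacklevel: int = 2) -> bool:
    """Return True if the recorded warning is attributed to the calling line."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # The call happens on the line immediately after this one.
        call_line = inspect.currentframe().f_lineno + 1
        read_with_fallback(stacklevel)
    return len(caught) == 1 and caught[0].lineno == call_line
```

With `stacklevel=2` the warning is reported at the call site, so the check passes. With `stacklevel=1` it is reported at the `warnings.warn` line inside `read_with_fallback`, so the check fails.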

Refactor decode and null terminate to use file encoding
jreback added this to the 0.25.0 milestone Apr 4, 2019
jreback (Contributor) commented Apr 4, 2019

LGTM. @bashtage, if you can get check_stacklevel to work, that would be great.

bashtage (Contributor, Author) commented Apr 4, 2019

I switched to tm.assert_produces_warning. Doesn't this automatically check the stacklevel?

https://github.com/pandas-dev/pandas/pull/25967/files#diff-ee04a1f9b23ca162f9592eef0caf0dbdR1621

jreback merged commit 435e2b5 into pandas-dev:master Apr 4, 2019
jreback (Contributor) commented Apr 4, 2019

actually this does check, thanks @bashtage

bashtage deleted the latin-1-fallback branch December 19, 2019 22:29
Labels
IO Stata read_stata, to_stata
Development

Successfully merging this pull request may close these issues.

UnicodeDecodeError for Stata file
4 participants