BUG: Enforce correct encoding in stata #15768

Closed
bashtage wants to merge 2 commits into master from bashtage/limit-stata-encoding

Conversation

bashtage
Contributor

Ensure StataReader and StataWriter have the correct encoding.
Standardized default encoding to 'latin-1'

closes #15723
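
A minimal sketch of the kind of check described above, assuming a module-level tuple of accepted encodings. The contents of `VALID_ENCODINGS`, the helper name `validate_encoding`, and the error message wording are illustrative assumptions, not the code in this PR:

```python
# Hypothetical list of accepted encodings; the tuple used in pandas may
# contain more aliases than shown here.
VALID_ENCODINGS = ("ascii", "latin-1")


def validate_encoding(encoding="latin-1"):
    """Return the encoding unchanged if it is accepted, else raise ValueError."""
    if encoding not in VALID_ENCODINGS:
        raise ValueError(
            "Unsupported encoding {!r}; expected one of {}".format(
                encoding, ", ".join(VALID_ENCODINGS)
            )
        )
    return encoding


# The default is accepted; anything outside the list is rejected early,
# before any file is read or written.
validate_encoding()            # returns 'latin-1'
try:
    validate_encoding("utf-8")
except ValueError as exc:
    print(exc)
```

Validating in the constructor means a bad `encoding` argument fails immediately rather than producing a mis-encoded file later.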

@bashtage bashtage changed the title from "BIG: Enforce correc encoding in stata" to "BUG: Enforce correct encoding in stata" Mar 21, 2017
Ensure StataReader and StataWriter have the correct encoding.
Standardized default encoding to 'latin-1'

closes pandas-dev#15723
@bashtage bashtage force-pushed the limit-stata-encoding branch from f549481 to 2f02697 Compare March 21, 2017 17:20
@jreback
Contributor

jreback commented Mar 21, 2017

@bashtage side issue. We finally have 32-bit daily wheels.

https://travis-ci.org/MacPython/pandas-wheels/jobs/213101390

This one came up in testing (I am working on fixing the rest). Unfortunately there is no easy way to test this, as it has to be on master. But if you have a fix, I can merge it.

/venv/local/lib/python2.7/site-packages/pandas/tests/io/test_stata.py:247: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/venv/local/lib/python2.7/site-packages/pandas/tests/io/test_stata.py:88: in read_dta
    return read_stata(file, convert_dates=True)
/venv/local/lib/python2.7/site-packages/pandas/io/stata.py:173: in read_stata
    data = reader.read()
/venv/local/lib/python2.7/site-packages/pandas/io/stata.py:1526: in read
    data = self._insert_strls(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <pandas.io.stata.StataReader object at 0xede213ec>
data =       x    y            z
0   1.0  abc   4294967299
1   3.0  cba   8589934595
2  93.0       12884901891
    def _insert_strls(self, data):
        if not hasattr(self, 'GSO') or len(self.GSO) == 0:
            return data
        for i, typ in enumerate(self.typlist):
            if typ != 'Q':
                continue
>           data.iloc[:, i] = [self.GSO[k] for k in data.iloc[:, i]]
E           KeyError: 4294967299
/venv/local/lib/python2.7/site-packages/pandas/io/stata.py:1626: KeyError

@jreback jreback added IO Stata read_stata, to_stata Error Reporting Incorrect or improved errors from pandas Unicode Unicode strings labels Mar 21, 2017

@jreback jreback left a comment

Trivial comments. After it's green you can push (and just ping me to merge).

    def __init__(self, encoding='latin-1'):

        if encoding not in VALID_ENCODINGS:
            raise ValueError('Unknown encoding. Only latin-1 and ascii '

extra space before ascii

@@ -1276,3 +1276,9 @@ def test_out_of_range_float(self):
original.to_stata(path)
tm.assertTrue('ColumnTooBig' in cm.exception)
tm.assertTrue('infinity' in cm.exception)

def test_invalid_encoding(self):
original = self.read_csv(self.csv3)

can you add the commit number as a comment

@jreback jreback added this to the 0.20.0 milestone Mar 21, 2017
@bashtage
Contributor Author

@jreback I rolled a fix for the 32bit issue into this PR. I can separate if you prefer.

@jreback
Contributor

jreback commented Mar 21, 2017

@bashtage totally fine here. thanks! ping on green.

@codecov

codecov bot commented Mar 21, 2017

Codecov Report

Merging #15768 into master will decrease coverage by 0.01%.
The diff coverage is 90.9%.

@@            Coverage Diff             @@
##           master   #15768      +/-   ##
==========================================
- Coverage   91.01%   90.99%   -0.02%     
==========================================
  Files         143      143              
  Lines       49377    49384       +7     
==========================================
- Hits        44941    44938       -3     
- Misses       4436     4446      +10

Impacted Files          Coverage Δ
pandas/io/stata.py      93.47% <90.9%> (-0.05%) ⬇️
pandas/io/gbq.py        25% <0%> (-58.34%) ⬇️
pandas/core/frame.py    97.86% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 163d18e...8278be7.

Fix use of 64-bit integers as keys in general string objects (GSO) by
wrapping in strings when used as dictionary keys
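
The idea in this commit message can be sketched roughly as below: convert each key to a string both when the lookup table is built and when it is queried, so the lookup no longer depends on how the platform represents large integers. The function names and the key-to-string pairing are hypothetical, chosen only to mirror the values printed in the traceback above; this is not the code in `pandas/io/stata.py`:

```python
def build_gso(entries):
    """Build a lookup table keyed by str(key) instead of the raw integer."""
    return {str(key): value for key, value in entries}


def insert_strls(column, gso):
    """Replace each integer reference in a column with its stored string."""
    # str(k) must match the conversion used in build_gso.
    return [gso[str(k)] for k in column]


# Illustrative data only: keys wider than 32 bits, as in the failing test.
gso = build_gso([
    (4294967299, "abc"),
    (8589934595, "cba"),
    (12884901891, ""),
])
resolved = insert_strls([4294967299, 8589934595, 12884901891], gso)
print(resolved)
```

The important point is symmetry: the same conversion is applied on insertion and on lookup, so two integer objects that represent the same value always map to the same dictionary key.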
@bashtage bashtage force-pushed the limit-stata-encoding branch from a79a515 to 8278be7 Compare March 21, 2017 19:14
@jreback jreback closed this in 1c9d46a Mar 21, 2017
@jreback
Contributor

jreback commented Mar 21, 2017

thanks! always a pleasure @bashtage

@jreback
Contributor

jreback commented Mar 21, 2017

@bashtage looks like this fixed the 32-bit error as well! https://travis-ci.org/MacPython/pandas-wheels/jobs/213622884

thanks (of course there are other errors, which I am working on...)

mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
Ensure StataReader and StataWriter have the correct encoding.
Standardized default encoding to 'latin-1'

closes pandas-dev#15723

Author: Kevin Sheppard <[email protected]>

Closes pandas-dev#15768 from bashtage/limit-stata-encoding and squashes the following commits:

8278be7 [Kevin Sheppard] BUG: Fix limited key range on 32-bit platofrms
2f02697 [Kevin Sheppard] BUG: Enforce correct encoding in stata
@bashtage bashtage deleted the limit-stata-encoding branch April 22, 2018 21:12

Successfully merging this pull request may close these issues.

ERR: validate encoding on to_stata