ERR: validate encoding on to_stata #15723


Closed
ozak opened this issue Mar 17, 2017 · 15 comments
Labels
Error Reporting (Incorrect or improved errors from pandas), IO Stata (read_stata, to_stata), Unicode (Unicode strings)
Milestone

Comments

@ozak

ozak commented Mar 17, 2017

It seems pandas on Python 3.5 has a problem with encoding. For example, the following generates a corrupt output file:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([1,2,3,4]), columns=['var1'])
df1.to_stata('corrupt.dta', write_index=False, encoding='utf8')

while

df1.to_stata('not-corrupt.dta', write_index=False)

generates a correct file. I imagine this is due to the encoding option and the difference in how Python 2 and Python 3 treat it, which breaks compatibility of scripts across Python versions. It would be nice if the option were ignored on Python 3, unless the error is caused by something else.

@jreback
Contributor

jreback commented Mar 17, 2017

In [8]: df1 = pd.DataFrame(np.array([1,2,3,4]), columns=['var1'])
   ...: df1.to_stata('corrupt.dta', write_index=False, encoding='latin1')

In [9]: pd.read_stata('corrupt.dta')
Out[9]: 
   var1
0     1
1     2
2     3
3     4

doc-string

Signature: df1.to_stata(fname, convert_dates=None, write_index=True, encoding='latin-1', byteorder=None, time_stamp=None, data_label=None, variable_labels=None)
Docstring:
A class for writing Stata binary dta files from array-like objects

Parameters
----------
fname : str or buffer
    String path of file-like object
convert_dates : dict
    Dictionary mapping columns containing datetime types to stata
    internal format to use when writing the dates. Options are 'tc',
    'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer
    or a name. Datetime columns that do not have a conversion type
    specified will be converted to 'tc'. Raises NotImplementedError if
    a datetime column has timezone information
write_index : bool
    Write the index to Stata dataset.
encoding : str
    Default is latin-1. Unicode is not supported
byteorder : str
    Can be ">", "<", "little", or "big". default is `sys.byteorder`
time_stamp : datetime
    A datetime to use as file creation date.  Default is the current
    time.
dataset_label : str
    A label for the data set.  Must be 80 characters or smaller.
variable_labels : dict
    Dictionary containing columns as keys and variable labels as
    values. Each label must be 80 characters or smaller.

I would say this is technically correct: passing a Unicode encoding is invalid. But I think we should simply reject these, rather than actually write an invalid file. Want to do a PR for this? (I am not sure which encodings Stata can actually support. Any idea?)

@jreback added the Difficulty Intermediate, Error Reporting, IO Stata, and Unicode labels Mar 17, 2017
@jreback added this to the Next Major Release milestone Mar 17, 2017
@jreback changed the title from "Pandas generates corrupt Stata files in python 3.5 on OSX" to "ERR: validate encoding on to_stata" Mar 17, 2017
@ozak
Author

ozak commented Mar 17, 2017

I am a bit confused now. From the docs it should fail in both python 2 and 3, but in python 2

df1.to_stata('corrupt.dta', write_index=False, encoding='utf8')

generates the correct file. So, I would think it may be better to just ignore the option in python 3, but keep it in python 2.

I think stata 14 now supports UTF8 as a default, but it actually may be more general, not 100% sure (see here). I'll try to find some time to write a PR. I'll see what the code actually does. I'll let you know.

@jreback
Contributor

jreback commented Mar 17, 2017

cc @bashtage any ideas here?

@bashtage
Contributor

There is no UTF8 in Stata, only ASCII and the simple 8-bit encoding Latin-1.

@bashtage
Contributor

By "in Stata" I mean in to_stata. Adding UTF8 is a major effort since the current format is all fixed-width, and I don't see much of a case for making the effort.
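
To illustrate the fixed-width point, here is a minimal sketch. It is not the code pandas uses in to_stata: the helper `pad_fixed_width` and the 8-byte field size are assumptions made up for this example. It shows that Latin-1 always takes one byte per character, while UTF-8 may take more, so a character count no longer tells you whether a string fits its field.

```python
def pad_fixed_width(text, width, encoding):
    """Encode text and null-pad it to exactly `width` bytes (hypothetical helper)."""
    raw = text.encode(encoding)
    if len(raw) > width:
        raise ValueError(
            f"{text!r} needs {len(raw)} bytes in {encoding}, field holds {width}"
        )
    return raw.ljust(width, b"\x00")


name = "áÖáÖáÖáÖ"  # 8 characters

# Latin-1: one byte per character, so 8 characters fit an 8-byte field.
latin1_field = pad_fixed_width(name, 8, "latin-1")
print(len(latin1_field))

# UTF-8: each of these characters takes two bytes, so the same text overflows.
try:
    pad_fixed_width(name, 8, "utf-8")
except ValueError as exc:
    print(exc)
```

A writer built around fixed byte widths would have to measure encoded length everywhere it currently measures characters, which is one way to read the "major effort" remark.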

@jreback
Contributor

jreback commented Mar 17, 2017

@bashtage so we should then validate that encoding is one of 'ascii', 'latin1', or None, and reject everything else.
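
A minimal sketch of what such a check could look like, assuming the allowed set is exactly ASCII and Latin-1 as discussed here. The function name `validate_stata_encoding` is hypothetical and is not taken from pandas. It uses the standard library's `codecs.lookup` so that spellings such as 'latin1' and 'latin-1' resolve to the same codec.

```python
import codecs

# Assumption: only ASCII and Latin-1 are acceptable, per the discussion above.
# Canonical names are computed rather than hard-coded.
_ALLOWED_CODECS = {codecs.lookup(name).name for name in ("ascii", "latin-1")}


def validate_stata_encoding(encoding=None):
    """Return a canonical codec name, or raise ValueError (hypothetical helper)."""
    if encoding is None:
        encoding = "latin-1"  # treat None as the documented default
    try:
        canonical = codecs.lookup(encoding).name
    except LookupError:
        raise ValueError(f"Unknown encoding: {encoding!r}") from None
    if canonical not in _ALLOWED_CODECS:
        raise ValueError(
            f"Encoding {encoding!r} is not supported; use 'ascii' or 'latin-1'"
        )
    return canonical


for candidate in (None, "latin1", "ascii", "utf8"):
    try:
        print(candidate, "->", validate_stata_encoding(candidate))
    except ValueError as exc:
        print(candidate, "-> rejected:", exc)
```

Running the check before any bytes are written means an unsupported encoding fails loudly instead of producing a file that cannot be read.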

@ozak
Author

ozak commented Mar 17, 2017

So why doesn't it generate a corrupt file in python 2, but does in python 3? I have the same pandas version in both, so it may not be pandas specific?

@jreback
Contributor

jreback commented Mar 17, 2017

Because it's actually encoding it in PY3 with the passed-in encoding (utf8), rather than the default of latin1.

@ozak
Author

ozak commented Mar 17, 2017

So in PY2 it is not encoded as UTF8 even when giving the option? How is it saved then? What in pandas is affecting the IO to Stata that corrupts the file? Given that Stata14 uses UTF8 as the default it should not have an issue opening UTF8 encoded files.

@ozak
Author

ozak commented Mar 17, 2017

Just noticed one more thing

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([u'á',u'Ö']), columns=['var1'])
df1.to_stata('corrupt3.dta', write_index=False, encoding='utf8')

# Read the file back and compare it with the original frame
df = pd.read_stata('corrupt3.dta')
df == df1

generates a usable file in both PY2 and PY3. Still, the data is wrong as seen in the example.
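
One possible explanation for the mismatch, offered as an assumption rather than a confirmed diagnosis: if the strings are written as UTF-8 bytes but read back as Latin-1, every non-ASCII character comes back as two different characters. The sketch below uses only the standard library and does not call pandas, so both the write side and the read side are stand-ins.

```python
original = ["á", "Ö"]

# Write side (assumed): encode with the user-supplied encoding.
written = [s.encode("utf-8") for s in original]

# Read side (assumed): decode with the format's default, Latin-1.
read_back = [b.decode("latin-1") for b in written]

for before, after in zip(original, read_back):
    print(repr(before), "->", repr(after), "equal:", before == after)
```

The file is structurally readable in this scenario, yet an element-wise comparison with the original data is False for every non-ASCII value.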

@jreback
Contributor

jreback commented Mar 17, 2017

So in PY2 it is not encoded as UTF8 even when giving the option?

Not really sure it is actually encoding as utf8 internally. Lots of things in PY2 are wonky; it probably just happens to work.

@jreback
Contributor

jreback commented Mar 17, 2017

In any event, it seems you have a bunch of test cases! It should be easy enough to simply validate the encoding that is passed and raise if it's not valid.

@ozak
Author

ozak commented Mar 17, 2017

Indeed.

I guess my example shows that PY2 is not encoding at all.

bashtage added a commit to bashtage/pandas that referenced this issue Mar 21, 2017
Ensure StataReader and StataWriter have the correct encoding.
Standardized default encoding to 'latin-1'

closes pandas-dev#15723
@jreback modified the milestones: 0.20.0, Next Major Release Mar 21, 2017
mattip pushed a commit to mattip/pandas that referenced this issue Apr 3, 2017
Ensure StataReader and StataWriter have the correct encoding.
Standardized default encoding to 'latin-1'

closes pandas-dev#15723

Author: Kevin Sheppard

Closes pandas-dev#15768 from bashtage/limit-stata-encoding and squashes the following commits:

8278be7 [Kevin Sheppard] BUG: Fix limited key range on 32-bit platforms
2f02697 [Kevin Sheppard] BUG: Enforce correct encoding in stata
@TiesdeKok

I would like to report that with the current encoding options it is not possible for Chinese users to interact with Pandas due to their Stata files containing Chinese characters. This has been reported by the user @lianyonghui123 on my ipystata package: TiesdeKok/ipystata#28
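
A small sketch of why the two permitted encodings cannot hold such data. The sample string is chosen for illustration only, and nothing here calls pandas: it simply tries each codec from the standard library and records whether the text can be represented.

```python
text = "中文"  # sample Chinese text, chosen for illustration

results = {}
for encoding in ("ascii", "latin-1", "utf-8"):
    try:
        results[encoding] = len(text.encode(encoding))
    except UnicodeEncodeError:
        results[encoding] = None  # the codec has no representation for this text

for encoding, size in results.items():
    status = f"{size} bytes" if size is not None else "cannot encode"
    print(f"{encoding}: {status}")
```

Only UTF-8 succeeds, which is the encoding the validation rejects.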

@bashtage
Contributor

bashtage commented Sep 5, 2017 via email


4 participants